Back to index

4.16.0-rc.1

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.15.36

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Proposed title of this feature request

Add support to OpenShift Telemetry to report the provider that has been added via "platform: external"

What is the nature and description of the request?

OpenShift 4.14 added support for a new platform called "external", which lets partners enable and support their own integrations with OpenShift rather than relying on Red Hat to develop and support them.

When deploying OpenShift using "platform: external" we don't currently have the ability to identify the provider on which the platform has been deployed, which is key for the product team to analyze demand and other metrics.
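For illustration, a minimal sketch of how the provider is identified at install time with the external platform (assuming the platform.external install-config fields introduced in OpenShift 4.14; the provider name is a placeholder):

# values are placeholders
platform:
  external:
    platformName: my-partner-cloud
    cloudControllerManager: External

Telemetry would then report this provider name so product management can break adoption down per partner platform.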

Why does the customer need this? (List the business requirements)

OpenShift Product Management needs this information to analyze adoption of these new platforms, as well as other metrics specific to these platforms, to help us make product development decisions.

List any affected packages or components.

Telemetry for OpenShift

 

There is some additional information in the following Slack thread --> https://redhat-internal.slack.com/archives/CEG5ZJQ1G/p1698758270895639

As an OpenShift admin I want to leverage /dev/fuse in unprivileged containers so that I can integrate cloud storage into OpenShift applications in a secure, efficient, and scalable manner. This approach simplifies application architecture and allows developers to interact with cloud storage as if it were a local filesystem, all while maintaining strong security practices.

Epic Goal

  • Give users the ability to mount /dev/fuse into a pod by default with the `io.kubernetes.cri-o.Devices` annotation
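A minimal pod sketch using the annotation named above (treating the annotation value as a comma-separated list of device paths such as "/dev/fuse", which is an assumption here):

apiVersion: v1
kind: Pod
metadata:
  name: fuse-builder
  annotations:
    # hypothetical usage: request /dev/fuse inside the unprivileged container
    io.kubernetes.cri-o.Devices: "/dev/fuse"
spec:
  containers:
  - name: builder
    image: quay.io/example/builder:latest
    command: ["sleep", "infinity"]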

Why is this important?

  • It's the first in a series of steps that allow users to run unprivileged containers within containers
  • It also gives access to faster builds within containers

Scenarios

  1. As a developer on OpenShift, I would like to run builds within containers in a performant way

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

Reduce the resource footprint of LVMS in terms of CPU, memory, image count, and image size by collapsing the current assortment of containers and deployments into a small number of highly integrated ones.

Goals (aka. expected user outcomes)

  • Reduce the resource footprint while keeping the same functionality
  • Provide seamless migration for existing customers

Requirements (aka. Acceptance Criteria):

  • Resource reduction:
    • Reduce memory consumption
    • Reduce CPU consumption (stressed, idle, requested)
    • Reduce container count (to reduce API/scheduler/CRI-O load)
    • Reduce container image sizes (to speed up deployments)
  • The new version must be functionally equivalent: no current feature/function is dropped
  • Day-1 operations (installation, configuration) must be the same
  • Day-2 operations (updates, config changes, monitoring) must be the same
  • Seamless migration for existing customers
  • Special care must be taken with MicroShift, as it uses LVMS in a special way.

Questions to Answer (Optional):

  1. Do we need some sort of DP/TP release, making it an opt-in feature for customers to try?

Out of Scope

tbd

Background

The idea was created during a ShiftWeek project. Potential savings / reductions are documented here: https://docs.google.com/presentation/d/1j646hJDVNefFfy1Z7glYx5sNBnSZymDjCbUQVOZJ8CE/edit#slide=id.gdbe984d017_0_0

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Resource Requirements should be added/updated to the documentation in the requirements section.

Interoperability Considerations

Interoperability with MicroShift is a challenge, as it allows far more detailed configuration of TopoLVM through direct access to lvmd.conf.

 

Feature Overview (aka. Goal Summary)

Streamline and secure CLI interactions, improve backend validation processes, and refine user guidance and documentation for the hcp command and related functionalities.

Use Cases :

  • Backend systems robustly handle validations, especially API-address fetching and logic previously managed by the CLI.
  • Users can use the --render option independently of cluster connections, suitable for initial configurations.
  • Clear guidance and documentation are available for configuring API server addresses and understanding platform-specific behaviors.
    • Clarify which API server address is required for different platforms and configurations.
    • Ensure the correct selection of nodes for API server address derivation.

 

  • Improve the description of the --api-server-address flag; be concrete about which API server address is needed
  • Add a note (plus the above) to let folks know they need to set it if they don’t want to be “connected” to a cluster.

Future:

  • Improve APIServerAddress Selection?
  • Add documentation about the restriction to have the management cluster being standalone

As a customer, I would like to deploy OpenShift on OpenStack using the IPI workflow, where my control plane would have 3 machines and each machine would use a root volume (a Cinder volume attached to the Nova server) and also an attached ephemeral disk using local storage, which would only be used by etcd.

As this feature will be TechPreview in 4.15, this will only be implemented as a day 2 operation for now. This might or might not change in the future.

 

We know that etcd requires storage with strong performance capabilities, and currently a root volume backed by Ceph has difficulty providing these capabilities.

Attaching local storage to the machine and mounting it for etcd would solve the performance issues that we saw when customers were using Ceph as the backend for the control-plane disks.

Gophercloud already supports creating a server with multiple ephemeral disks:

https://github.com/gophercloud/gophercloud/blob/master/openstack/compute/v2/extensions/bootfromvolume/doc.go#L103-L151

 

We need to figure out how we want to address that in CAPO, probably involving a new API that would later be used in OpenShift (MAPO, and probably the installer).
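As a purely hypothetical illustration of what such an API could look like on the machine provider spec (field names like additionalBlockDevices, sizeGiB, and storage.type are assumptions, not a settled design):

# hypothetical shape only; not a settled API
providerSpec:
  value:
    rootVolume:
      volumeType: fast
      diskSize: 100
    additionalBlockDevices:
    - name: etcd
      sizeGiB: 10
      storage:
        type: Local

The attached local device would then be mounted for etcd (e.g. via a MachineConfig) as the day-2 step mentioned above.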

We'll also have to update the OpenStack Failure Domain in CPMS.

 

ARO (Azure) has conducted some benchmarks and is now recommending putting etcd on a separate data disk:

https://docs.google.com/document/d/1O_k6_CUyiGAB_30LuJFI6Hl93oEoKQ07q1Y7N2cBJHE/edit

Also interesting thread: https://groups.google.com/u/0/a/redhat.com/g/aos-devel/c/CztJzGWdsSM/m/jsPKZHSRAwAJ

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • Log linking is a feature that lets a container access the logs of other containers in the pod. To be usable in supported configurations, this feature has to be enabled by default in CRI-O
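A hedged sketch of how a pod could opt in, assuming the io.kubernetes.cri-o.LinkLogs annotation used by CRI-O's log-linking work, where the annotation value names an emptyDir volume into which the pod's container logs are linked (the exact semantics and paths here are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: log-forwarder-example
  annotations:
    # assumption: the annotation value names the emptyDir volume that receives the linked logs
    io.kubernetes.cri-o.LinkLogs: "logs"
spec:
  containers:
  - name: app
    image: quay.io/example/app:latest
  - name: forwarder
    image: quay.io/example/log-forwarder:latest
    volumeMounts:
    - name: logs
      mountPath: /var/log/pods-linked
  volumes:
  - name: logs
    emptyDir: {}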

Why is this important?

  • Allow this feature for most configurations without a support exception

Scenarios

  1. As a cluster admin, I would like to easily configure a log forwarder for containers in a pod

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Provide a way to automatically recover a cluster with expired etcd server and peer certificates.

Goals (aka. expected user outcomes)

 

A cluster has etcd serving, peer, and serving-metrics certificates that are expired. There should be a way to either trigger certificate rotation or have a process that automatically does the rotation.

Requirements (aka. Acceptance Criteria):

Deliver rotation and recovery requirements from OCPSTRAT-714 

 

Epic Goal*

Provide a way to automatically recover a cluster with expired etcd server and peer certs

 
Why is this important? (mandatory)

Currently, the EtcdCertSigner controller, which is part of the CEO (cluster-etcd-operator), renews the aforementioned certificates roughly every 3 years. However, if the cluster is offline for a period longer than the certificates' validity, then upon restarting the cluster the controller won't be able to renew the certificates, since the operator won't be running at all.

We have scenarios where the customer, partner, or service delivery needs to recover a cluster that is offline, suspended, or shutdown, and as part of the process requires a supported way to force certificate and key rotation or replacement.

See the following doc for more use cases of when such clusters need to be recovered:
https://docs.google.com/document/d/198C4xwi5td_V-yS6w-VtwJtudHONq0tbEmjknfccyR0/edit

Required to enable emergency certificate rotation.
https://issues.redhat.com/browse/API-1613
https://issues.redhat.com/browse/API-1603

 
Scenarios (mandatory) 

A cluster has etcd serving, peer and serving-metrics certificates that are expired. There should be a way to either trigger certificate rotation or have a process that automatically does the rotation.
This does not cover the expiration of etcd-signer certificates at this time.
That will be covered under https://issues.redhat.com/browse/ETCD-445

 
Dependencies (internal and external) (mandatory)

While the etcd team will implement the automatic recovery for the etcd certificates, other control-plane operators will be handling their own certificate recovery.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs team
  • QE - etcd qe
  • PX - 
  • Others -

Acceptance Criteria (optional)

When an OpenShift etcd cluster that has expired etcd server and peer certs is restarted, it is able to regenerate those certs.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing -  Having an e2e test that puts a cluster into the expired certs failure mode and forces it to recover.
  • Documentation - Docs that explain the cert recovery procedure
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Given the scope creep of the work required to enable an offline cert rotation (or an automated restore), we are going to rely on online cert rotation to ensure that etcd certs don't expire during a cluster shutdown/hibernation.

Slack thread for background:
https://redhat-internal.slack.com/archives/C851TKLLQ/p1712533437483709?thread_ts=1712526244.614259&cid=C851TKLLQ

The estimated maximum shutdown period is 9 months. The refresh rate for the etcd certs can be increased so that there are always, e.g., 10 months left on the cert validity in the worst case, i.e., we shut down right before the controller does its rotation.

Feature Overview (aka. Goal Summary)  

Add support for Johannesburg, South Africa (africa-south1) in GCP

Goals (aka. expected user outcomes)

As a user I'm able to deploy OpenShift in Johannesburg, South Africa (africa-south1) in GCP and this region is fully supported

Requirements (aka. Acceptance Criteria):

A user can deploy OpenShift in GCP Johannesburg, South Africa (africa-south1) using all the supported installation tools for self-managed customers.
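For illustration, a minimal install-config.yaml fragment selecting the new region (project ID, base domain, and cluster name are placeholders):

apiVersion: v1
baseDomain: example.com
metadata:
  name: za-cluster
platform:
  gcp:
    projectID: my-gcp-project
    region: africa-south1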

The support of this region is backported to the previous OpenShift EUS release.

Background

Google Cloud has added support for a new region in their public cloud offering, and this region needs to be supported for OpenShift deployments like any other region.

Documentation Considerations

The information about the new region needs to be added to the documentation so that it is documented as supported.

Epic Goal

  • Validate the Installer can deploy OpenShift on Johannesburg, South Africa (africa-south1) in GCP
  • Create the corresponding OCPBUGs if there is any issue while validating this

Feature Overview (aka. Goal Summary)  

Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP

Goals (aka. expected user outcomes)

As a user I'm able to deploy OpenShift in Dammam, Saudi Arabia, Middle East (me-central2) region in GCP and this region is fully supported

Requirements (aka. Acceptance Criteria):

A user can deploy OpenShift in GCP Dammam, Saudi Arabia, Middle East (me-central2) region using all the supported installation tools for self-managed customers.

The support of this region is backported to the previous OpenShift EUS release.

Background

Google Cloud has added support for a new region in their public cloud offering, and this region needs to be supported for OpenShift deployments like any other region.

Documentation Considerations

The information about the new region needs to be added to the documentation so that it is documented as supported.

Epic Goal

  • Validate the Installer can deploy OpenShift on Dammam, Saudi Arabia, Middle East (me-central2) region in GCP
  • Create the corresponding OCPBUGs if there is any issue while validating this

Feature Overview

With this feature, MCE becomes an additional operator that can be enabled at cluster creation for both the Assisted Installer SaaS and disconnected installations with the Agent-based installer.

Currently 4 operators have been enabled for the Assisted Service SaaS create cluster flow: Local Storage Operator (LSO), OpenShift Virtualization (CNV), OpenShift Data Foundation (ODF), Logical Volume Manager (LVM)

The Agent-based installer doesn't leverage this framework yet.

Goals

When a user performs the creation of a new OpenShift cluster with the Assisted Installer (SaaS) or with the Agent-based installer (disconnected), provide the option to enable the multicluster engine (MCE) operator.

The cluster deployed can add itself to be managed by MCE.

Background, and strategic fit

Deploying an on-prem cluster 0 easily is a key operation for the rest of the OpenShift infrastructure.

While MCE/ACM are strategic in the lifecycle management of OpenShift, including the provisioning of all the clusters, deploying the first cluster where MCE/ACM are hosted, along with other tools supporting the rest of the clusters (GitOps, Quay, log centralisation, monitoring...), must be easy and have a high success rate.

The Assisted Installer and the Agent-based installers cover this gap and must present the option to enable MCE to keep making progress in this direction.

Assumptions

MCE engineering is responsible for adding the appropriate definition as an olm-operator-plugin.

See https://github.com/openshift/assisted-service/blob/master/docs/dev/olm-operator-plugins.md for more details

Feature Goal

As an OpenShift administrator I want to deploy OpenShift clusters with Assisted Installer that have the Multicluster Engine Operator (MCE) enabled with support for managing bare metal clusters.

As an OpenShift administrator I want to have the bare metal clusters deployed with the Assisted Installer managed by MCE, i.e. MCE managing its local cluster.

Definition of Done

  • Assisted Installer allows enabling the MCE operator and configuring it with the Infrastructure Operator
  • The deployed clusters can deploy bare metal clusters from MCE successfully
  • The deployed clusters are managed by MCE
  • Documentation exists.

Feature Origin

MCE is strategic to OpenShift adoption in different scenarios. For Edge use cases it has ZTP to automate the provisioning of OpenShift clusters from a central cluster (hub cluster). MCE is also key for lifecycle management of OpenShift clusters. MCE is also available with the OpenShift subscriptions to every customer.

Additionally MCE will be key in the deployment of Hypershift, so it serves a double strategic purpose. 

Lastly, day-2 operations on newly deployed clusters (without the need to manage multiple clusters), can be covered with MCE too.

We expect MCE to enable our customers to grow their OpenShift installation-base more easily and manage their lifecycle.

Reasoning

When enabling the MCE operator in the Assisted Installer we need to add the required storage with the installation to be able to use the Infrastructure Operator to create bare metal/vSphere/Nutanix clusters.

 
Automated storage configuration workflows
The Infrastructure Operator, a dependency of MCE to deploy bare metal, vSphere and Nutanix clusters, requires storage.

There are multiple scenarios:

  • Install with ODF:
    • ODF is the ideal storage for clusters but requires an additional subscription.
    • When selected along with MCE, it will be configured as the storage required by the Infrastructure Operator, and the Infrastructure Operator will be deployed along with MCE.
  • Install on SNO
    • If the user also chooses ODF, then ODF is used for the Infrastructure Operator
    • If ODF isn't configured, then LVMS is enabled and the Infrastructure Operator will use it.
  • User doesn't install ODF or a SNO cluster
    • They have to choose their storage and then install the Infrastructure Operator as a day-2 operation, and we don't configure storage with the installation (details of this need to be reviewed).

 

Note from planning: alternatively, we can use a new feature in install-config that allows enabling some operators as day-2 and let the user configure it this way.

This feature will follow up OCPBU-186, Image mirroring by tags.

OCPBU-186 implemented the new ImageDigestMirrorSet and ImageTagMirrorSet APIs and their rollout through the MCO.

This feature will update the components using ImageContentSourcePolicy to use ImageDigestMirrorSet.

The list of the components: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing.
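For reference, a minimal ImageDigestMirrorSet that expresses the same intent as an equivalent ICSP (registry names are placeholders):

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - source: registry.redhat.io/openshift4
    mirrors:
    - mirror.example.com/openshift4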

 

Migrate OpenShift Components to use the new Image Digest Mirror Set (IDMS)

This doc lists the OpenShift components that currently use ICSP: https://docs.google.com/document/d/11FJPpIYAQLj5EcYiJtbi_bNkAcJa2hCLV63WvoDsrcQ/edit?usp=sharing

Plan for ImageDigestMirrorSet Rollout
Epic: https://issues.redhat.com/browse/OCPNODE-521

4.13: Enable ImageDigestMirrorSet, both ICSP and ImageDigestMirrorSet objects are functional

  • Document that ICSP is being deprecated and will be unsupported by 4.17 (to allow for EUS to EUS upgrades)
  • Reject write to both ICSP and ImageDigestMirrorSet on the same cluster

4.14: Update OpenShift components to use IDMS

4.17: Remove support for ICSP within MCO

  • Error out if an old ICSP object is used

As an OpenShift developer, I want an --idms-file flag so that I can fetch image info from an alternative mirror if --icsp-file gets deprecated.

BU Priority Overview

Create custom roles for GCP with minimal set of required permissions.

Goals

Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.

State of the Business

Some of the service accounts that CCO creates, e.g. a service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined GCP roles that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.

Execution Plans

TBD

 

These are phase 2 items from CCO-188

Moving items from other teams that need to be committed to for 4.13 in order for this work to complete.

Epic Goal

  • Request to build a list of specific permissions needed to run OpenShift on GCP. Components grant roles, but we need more granularity; custom roles now allow the ability to do this, compared to when the permissions capabilities were originally written for GCP.

Why is this important?

  • Some of the service accounts that CCO creates, e.g. a service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined GCP roles that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Ingress Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
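As an illustration only (a hedged sketch, not any operator's actual manifest), a CredentialsRequest that uses the new permissions field would look roughly like this, here reusing the permissions from the sample output above:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: example-gcp-credentials
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: example-cloud-credentials
    namespace: openshift-example-operator
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    # predefined roles replaced by fine-grained permissions
    predefinedRoles: []
    permissions:
    - iam.roles.get
    - iam.roles.list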

 

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Storage Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

 

Evaluate if any of the GCP predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

 

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cloud Controller Manager Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

 

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster CAPI Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

Evaluate if any of the GCP predefined roles in the credentials request manifest of machine api operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Network Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

Update GCP Credentials Request manifest of the Cluster Network Operator to use new API field for requesting permissions.

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Image Registry Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.

The new GCP provider spec for credentials request CR is as follows:

type GCPProviderSpec struct {
   metav1.TypeMeta `json:",inline"`
   // PredefinedRoles is the list of GCP pre-defined roles
   // that the CredentialsRequest requires.
   PredefinedRoles []string `json:"predefinedRoles"`
   // Permissions is the list of GCP permissions required to
   // create a more fine-grained custom role to satisfy the
   // CredentialsRequest.
   // When both Permissions and PredefinedRoles are specified
   // service account will have union of permissions from
   // both the fields
   Permissions []string `json:"permissions"`
   // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
   // have the necessary services enabled
   // +optional
   SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
} 

We can use the following command to check the permissions associated with a GCP predefined role:

gcloud iam roles describe <role_name>

The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.

[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer

Feature Overview

As an Infrastructure Administrator, I want to deploy OpenShift on Nutanix distributing the control plane and compute nodes across multiple regions and zones, forming different failure domains.

As an Infrastructure Administrator, I want to configure an existing OpenShift cluster to distribute the nodes across regions and zones, forming different failure domains.

Goals

Install OpenShift on Nutanix using IPI / UPI in multiple regions and zones.

Requirements (aka. Acceptance Criteria):

  • Ensure Nutanix IPI can successfully be deployed with ODF across multiple zones (like we do with vSphere, AWS, GCP & Azure)
  • Ensure zonal configuration in Nutanix using UPI is documented and tested
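A hypothetical install-config sketch of what zonal Nutanix configuration could look like, following the vSphere pattern referenced below (field names such as failureDomains, prismElement, and subnetUUIDs are assumptions):

# hypothetical sketch; field names and UUIDs are placeholders
platform:
  nutanix:
    failureDomains:
    - name: fd-pe1
      prismElement:
        name: pe1
        uuid: 00000000-0000-0000-0000-000000000001
      subnetUUIDs:
      - 11111111-1111-1111-1111-111111111111
    - name: fd-pe2
      prismElement:
        name: pe2
        uuid: 00000000-0000-0000-0000-000000000002
      subnetUUIDs:
      - 22222222-2222-2222-2222-222222222222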

vSphere Implementation

This implementation would follow the same idea as what has been done for vSphere. The following are the main references for the vSphere work:

https://github.com/openshift/enhancements/blob/master/enhancements/installer/vsphere-ipi-zonal.md 

 

Existing vSphere documentation

https://docs.openshift.com/container-platform/4.13/installing/installing_vsphere/installing-vsphere-installer-provisioned-customizations.html#configuring-vsphere-regions-zones_installing-vsphere-installer-provisioned-customizations

https://docs.openshift.com/container-platform/4.13/post_installation_configuration/post-install-vsphere-zones-regions-configuration.html

Epic Goal

Nutanix Zonal: Multiple regions and zones support for Nutanix IPI and Assisted Installer

Note

 

Goal:

Enable and support Multus CNI for microshift.

Background:

Customers with advanced networking requirements need to be able to attach additional networks to a pod, e.g. for high-performance requirements using SR-IOV, complex VLAN setups, etc.

Requirements:

  1. Opt-in approach: customers can add Multus if needed, e.g. by installing/adding the "microshift-networking-multus" rpm package to their installation.
  2. If possible, it would be good to be able to add Multus to an existing installation. If that requires a restart/reboot, that is acceptable. If not possible, it has to be clearly documented.
  3. It is acceptable that once Multus has been added to an installation, it cannot be removed. If removal can be implemented easily, that would be good. If not possible, then it has to be clearly documented.
  4. Regarding additional networks:
    1. As part of the MVP, the bridge plugin must be fully supported.
    2. As a stretch goal, the macvlan and ipvlan plugins should be supported.
    3. Other plugins, esp. host-device and SR-IOV, are out of scope for the MVP, but will be added with a later version.
    4. Multiple additional networks need to be configurable, e.g. two different bridges leading to two different networks, each consumed by different pods.
  5. IPv6 with the bridge plugin: secondary NICs passed to a container via the bridge plugin should work with IPv6, if the consuming pod supports IPv6. See also "Out of scope".
  6. Regarding IPAM CNI support for IP address provisioning, static and DHCP must be supported (see the sketch after this list).
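A minimal sketch of an additional bridge network with static IPAM, as referenced in items 4 and 6 above (names and addresses are placeholders):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-static
  namespace: default
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "type": "bridge",
      "bridge": "br-test",
      "ipam": {
        "type": "static",
        "addresses": [
          { "address": "192.168.112.10/24" }
        ]
      }
    }

A pod would then attach to it with the annotation k8s.v1.cni.cncf.io/networks: bridge-static.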

Documentation:

  1. In the existing "networking" book, we need a new chapter "4. Multiple networks". It can re-use a lot content from OCP doc "https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/understanding-multiple-networks.html", but needs an extra chapter in the beginning "Installing support for multiple networks"{}

Testing:

  1. A simple "smoke test" that multus can be added, and a 2nd nic added to a pod (e.g. using host device) is sufficient. No need to replicate all the multus tests from OpenShift, as we assume that if it works there, it works with MicroShift.

Customer Considerations:

  • This document contains the MVP requirements of a MicroShift EAP customer that need to be considered.

Out of scope:

  • Other plugins, esp. host-device and SR-IOV, are out of scope for the MVP, but will be added with a later version.
  • IPv6 support with OVN-K. That is in scope of feature OCPSTRAT-385.

 

Epic Goal

  • Provide optional Multus CNI for MicroShift

Why is this important?

  • Customers need to add extra interfaces directly to Pods

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

It should include all CNIs (bridge, macvlan, ipvlan, etc.) - if we decide to support something, we'll just update scripting to copy those CNIs

It needs to include IPAMs: static, dynamic (DHCP), host-local (we might just not copy it to host)

RHEL9 binaries only to save space

 

Feature Overview (aka. Goal Summary)  

Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.

Goals (aka. expected user outcomes)

  • Simplify the operators with a unified code pattern
  • Expose metrics from control-plane components
  • Use proper RBACs in the guest cluster
  • Scale the pods according to HostedControlPlane's AvailabilityPolicy
  • Add proper node selector and pod affinity for mgmt cluster pods

Requirements (aka. Acceptance Criteria):

  • OCP regression tests work in both standalone OCP and HyperShift
  • Code in the operators looks the same
  • Metrics from control-plane components are exposed
  • Proper RBACs are used in the guest cluster
  • Pods scale according to HostedControlPlane's AvailabilityPolicy
  • Proper node selector and pod affinity is added for mgmt cluster pods

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal*

Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and the possibility of errors.
 
Why is this important? (mandatory)

An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
 
Scenarios (mandatory) 

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.

The functionality and behavior should be the same as the existing operator; however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md

 

CI should catch the most obvious errors; however, we need to test features that we do not cover in CI, like:

  • custom CA bundles
  • cluster-wide proxy
  • custom encryption keys used in install-config.yaml
  • government cluster
  • STS
  • SNO
  • and others

Our CSI driver YAML files are mostly copy-paste from the initial CSI driver (AWS EBS?).

As an OCP engineer, I want the YAML files to be generated, so we can easily keep consistency among the CSI drivers and make them less error-prone.

It should have no visible impact on the resulting operator behavior.

Feature Overview (aka. Goal Summary)  

Enable support to bring your own encryption key (BYOK) for OpenShift on IBM Cloud VPC.

Goals (aka. expected user outcomes)

As a user I want to be able to provide my own encryption key when deploying OpenShift on IBM Cloud VPC, so that the cluster infrastructure objects (VM instances and storage objects) can use that user-managed key to encrypt the information.

Requirements (aka. Acceptance Criteria):

The Installer will provide a mechanism to specify a user-managed key that will be used to encrypt the data on the virtual machines that are part of the OpenShift cluster as well as any other persistent storage managed by the platform via Storage Classes.
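A purely hypothetical install-config sketch of how such a key might be supplied (the bootVolume/encryptionKey field names and the key CRN format are assumptions for illustration only):

# hypothetical field names for illustration only
controlPlane:
  platform:
    ibmcloud:
      bootVolume:
        encryptionKey: "crn:v1:bluemix:public:kms:us-south:a/<account-id>:<instance-id>:key:<key-id>"
compute:
- name: worker
  platform:
    ibmcloud:
      bootVolume:
        encryptionKey: "crn:v1:bluemix:public:kms:us-south:a/<account-id>:<instance-id>:key:<key-id>"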

Background

This feature is a required component for IBM's OpenShift replatforming effort.

Documentation Considerations

The feature will be documented as usual to guide the user while using their own key to encrypt the data on the OpenShift cluster running on IBM Cloud VPC

 

Epic Goal

  • Review and support the IBM engineering team while enabling BYOK support for OpenShift on IBM Cloud VPC

Why is this important?

  • As part of the replatforming work IBM is doing for their OpenShift managed service, this feature is key to that work

Scenarios

  1. The installer will allow the user to provide their own key information, to be used to encrypt the VMs' storage and any storage object managed by OpenShift StorageClass objects

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator (in csi-operator)
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Goal

Hardware RAID support on Dell, Supermicro and HPE with Metal3.

Why is this important

Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3.

Dell, Supermicro and HPE, which are the most common hardware platforms we find in our customers' environments, are the main target.

Goal

Hardware RAID support on Dell with Metal3.

Why is this important

Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there's been work at Fujitsu for configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3 for Dell, which is one of the most common hardware platforms we find in our customers' environments.

Before implementing generic support, we need to understand the implications of enabling an interface in Metal3 to allow it on multiple hardware types.

Scope questions

While rendering BMO in https://issues.redhat.com/browse/METAL-829 the node cpu_arch was hardcoded to x86_64

 

We should use bmh.Spec.Architecture instead to be more future-proof.

Feature Overview

Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms.

Goals

Extend the existing features while deploying OpenShift on IBM Cloud.

Background, and strategic fit

This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.

 

Epic Goal

  • Enable installation of disconnected clusters on IBM Cloud. This epic will track associated work.

Why is this important?

Scenarios

  1. Install a disconnected cluster on IBM Cloud.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

User Story:

A user currently is not able to create a disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and private clusters does exist on IBM Cloud, but support for overriding IBM Cloud service endpoints does not, which is required for disconnected support to function (i.e., to reach IBM Cloud private endpoints).

Description:

IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.

The Image Registry components will need to allow all API calls to IBM Cloud Services to be directed to these endpoint values, in order to communicate in environments where the public or default IBM Cloud Service endpoint is not available.

The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how the majority of components consume cluster-specific configuration (Ingress, MAPI, etc.). It is structured as follows:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cos
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
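
For reference, a component (or an administrator debugging one) can read these overrides directly from the cluster. A minimal example, assuming the structure shown above:

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.ibmcloud.serviceEndpoints}'
# prints the list of {name, url} override entries, or nothing if no overrides are set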

The CCM currently relies on updates to the openshift-cloud-controller-manager/cloud-conf ConfigMap to override its required IBM Cloud service endpoints, for example:

data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
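
For reference, the effective CCM configuration (including any endpoint overrides) can be inspected with a command like the following; the namespace, ConfigMap name, and data key are taken from the description above:

$ oc get configmap cloud-conf -n openshift-cloud-controller-manager -o jsonpath='{.data.config}'
# prints the INI-style config shown above, including any *EndpointOverride keys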

Acceptance Criteria:

The installer validates and injects user-provided endpoint overrides into the cluster deployment process, and the Image Registry components use the specified endpoints and start up properly.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

With ovn-ic we have multiple actors (zones) setting status on some CRs. We need to make sure individual zone statuses are reported and then optionally merged into a single status.

Why is this important?

Without this change, zones will overwrite each other's statuses.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

The MCO does not reap any old rendered MachineConfigs. Each time a user or controller applies a new MachineConfig, new rendered configs are created for each affected MachineConfigPool. Over time, this leads to a very large number of rendered configs, which are a UX annoyance and could also contribute to space and performance issues in etcd.

Goals (aka. expected user outcomes)

Administrators should have a simple way to set a maximum number of rendered configs to retain, but a minimum should also be enforced, as there are many cases where support or engineering needs to look back at previous configs.


This story involves implementing the main "deletion" command for rendered MachineConfigs under the oc adm prune subcommand. It will only support deleting rendered MCs that are not in use.

It would support the following options to start:

  • Pool name, which indicates the owner of the rendered MCs to target. This argument is not required. If not set, all pools will be evaluated.
  • Count, which describes a max number of rendered MCs to delete, oldest first. This argument is not required.
  • Confirm flag, which will cause the prune to take place. If not set, the command will be run in dry run mode. This argument is not required.

Important: If the admin specifies options that select rendered MCs that are in use by an MCP, those MCs should not be deleted. In such cases, the output should indicate why each rendered MC has been skipped over for deletion.

Sample user workflow:

$ oc adm prune renderedmachineconfigs --pool-name=worker
# dry-run mode: lists the unused rendered MCs for the worker pool that would be deleted

$ oc adm prune renderedmachineconfigs --count=10 --pool-name=worker
# dry-run mode: lists the 10 oldest unused rendered MCs for the worker pool that would be deleted

$ oc adm prune renderedmachineconfigs --count=10 --pool-name=worker --confirm
# actually deletes the rendered configs selected by the above options

 

< High-Level description of the feature ie: Executive Summary >

Goals

Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.

Requirements

Requirement | Notes | Is MVP
Discover new offerings in Home Dashboard |  | Y
Access details outlining value of offerings |  | Y
Access step-by-step guide to install offering |  | N
Allow developers to easily find and use newly installed offerings |  | Y
Support air-gapped clusters |  | Y
    • (Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

Out of scope

Discovering solutions that are not available for installation on cluster

Dependencies

No known dependencies

Background, and strategic fit

 

Assumptions

None

 

Customer Considerations

 

Documentation Considerations

Quick Starts 

What does success look like?

 

QE Contact

 

Impact

 

Related Architecture/Technical Documents

 

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment

Problem:

Cluster admins need to be guided to install RHDH on the cluster.

Goal:

Enable admins to discover RHDH, be guided through installing it on the cluster, and verify its configuration.

Why is it important?

RHDH is a key multi-cluster offering for developers. This will enable customers to self-discover and install RHDH.

Acceptance criteria:

  1. Show RHDH card in Admin->Dashboard view
  2. Enable link to RHDH documentation from the card
  3. Quick start to install RHDH operator
  4. Guided flow to installation and configuration of operator from Quick Start
  5. RHDH UI link in top menu
  6. Successful log in to RHDH

Dependencies (External/Internal):

RHDH operator

Design Artifacts:

Exploration:

Note:

Description of problem:
The OpenShift Console quick start that promotes RHDH was written in generic terms and doesn't include information on how to use the CRD-based installation.

We removed this specific information because the operator wasn't ready at that time. As soon as the RHDH operator is available in OperatorHub, we should update the quick start with more detailed information.

This should include a simple CR example and some information on how to customize the base URL or colors (see the illustrative sketch below).
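
For illustration only, a minimal sketch of such a CR. The group/version, namespace, and spec fields below are assumptions made for this sketch and should be replaced with whatever the RHDH operator documentation specifies:

apiVersion: rhdh.redhat.com/v1alpha1     # assumed group/version for the RHDH operator CRD
kind: Backstage
metadata:
  name: developer-hub
  namespace: rhdh-operator               # hypothetical namespace
spec:
  application:
    appConfig:
      configMaps:
      - name: my-rhdh-app-config         # hypothetical ConfigMap carrying app-config.yaml (base URL, branding/colors)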

Version-Release number of selected component (if applicable):
4.15

How reproducible:
Always

Steps to Reproduce:
Navigate to Quick Starts and select the "Install Red Hat Developer Hub (RHDH) with an Operator" quick start.

Actual results:
The RHDH Operator Quick start exists but is written in a generic way.

Expected results:
The RHDH Operator Quick start should contain some more specific information.

Additional info:
Initial PR: https://github.com/openshift/console-operator/pull/806

Description of problem:
The OpenShift Console quick starts promote RHDH but also include Janus IDP information.

The Janus IDP quick start should be removed, along with all other information about Janus IDP.

Version-Release number of selected component (if applicable):
4.15

How reproducible:
Always

Steps to Reproduce:
Navigate to Quick Starts and select the "Install Red Hat Developer Hub (RHDH) with an Operator" quick start.

Actual results:

  1. The RHDH Operator Quick start contains some information and links to Janus IDP.
  2. The Janus IDP Quick start exists and is similar to the RHDH one.

Expected results:

  1. The RHDH Operator Quick start must not contain information about Janus IDP.
  2. The Janus IDP Quick start should be removed

Additional info:
Initial PR: https://github.com/openshift/console-operator/pull/806

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
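
As a purely illustrative sketch of what porting a single rule looks like (the table, chain, and port below are made up for the example and are not an actual OpenShift rule):

# iptables form
iptables -A FORWARD -p tcp --dport 9000 -j DROP

# equivalent native nftables form
nft add table inet example
nft add chain inet example forward '{ type filter hook forward priority 0 ; }'
nft add rule inet example forward tcp dport 9000 drop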

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (e.g., per-container rather than per-node, since the per-node warning always fires because OCP itself is using iptables). This could also help provide improved visibility of the issue to other components that aren't sure whether they need to take action and migrate to nftables.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x it will generate a single log message during node startup on every OpenShift node. There are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10 iptables will just no longer work at all. Neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

Template:

 

Networking Definition of Planned

Epic Template descriptions and documentation 

 

Epic Goal

  • OCP needs to detect when customer workloads are making use of iptables, and present this information to the customer (e.g. via alerts, metrics, insights, etc)
  • The RHEL 9 kernel logs a warning if iptables is used at any point anywhere in the system, but this is not helpful because OCP itself still uses iptables, so the warning is always logged.
  • We need to avoid false positives due to OCP's own use of iptables in pod namespaces (e.g. the rules to block access to the MCS). Porting those rules to nftables sooner rather than later is one solution.

Why is this important?

  • iptables will not exist in RHEL 10, so if customers are depending on it, they need to be warned.
  • Contrariwise, we are getting questions from customers who are not using iptables in their own workload containers, who are confused about the kernel warning. Clearer messaging should help reduce confusion here.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

 

Feature Overview (aka. Goal Summary)

Consolidated Enhancement of HyperShift/KubeVirt Provider Post GA

This feature aims to provide a comprehensive enhancement to the HyperShift/KubeVirt provider integration post its GA release.

By consolidating CSI plugin improvements, core improvements, and networking enhancements, we aim to offer a more robust, efficient, and user-friendly experience.

Goals (aka. expected user outcomes)

  • User Persona: Cluster service providers / SRE
  • Functionality:
    • Expanded CSI capabilities.
    • Improved core functionalities of the KubeVirt Provider
    • Enhanced networking capabilities.

Goal

Post GA quality of life improvements for the HyperShift/KubeVirt core

User Stories

Non-Requirements

Notes

  • Any additional details or decisions made/needed

Done Checklist

Who What Reference
DEV Upstream roadmap issue (or individual upstream PRs) <link to GitHub Issue>
DEV Upstream documentation merged <link to meaningful PR>
DEV gap doc updated <name sheet and cell>
DEV Upgrade consideration <link to upgrade-related test or design doc>
DEV CEE/PX summary presentation label epic with cee-training and add a <link to your support-facing preso>
QE Test plans in Polarion <link or reference to Polarion>
QE Automated tests merged <link or reference to automated tests>
DOC Downstream documentation merged <link to meaningful PR>

Currently there is no option to influence the placement of the VMs of a hosted cluster with the KubeVirt provider. The existing NodeSelector in HostedCluster influences only the pods in the hosted control plane namespace.

The goal is to introduce a new field in the .spec.platform.kubevirt stanza of NodePool for a node selector, propagate it to the VirtualMachineSpecTemplate, and expose it in the hypershift and hcp CLIs (see the illustrative sketch below).
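
A rough sketch of what this could look like on a NodePool. The nodeSelector field name and placement are assumptions for illustration; the actual API shape is decided by the implementation:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example
  namespace: clusters
spec:
  clusterName: example
  replicas: 2
  platform:
    type: KubeVirt
    kubevirt:
      nodeSelector:                      # hypothetical field, propagated to the VirtualMachineSpecTemplate
        kubernetes.io/hostname: infra-node-1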

Epic Goal

  • Add an API extension for North-South IPsec.
  • Close gaps from SDN-3604, mainly around upgrade.
  • Add telemetry.

Why is this important?

  • Without an API, customers are forced to use the MCO. This brings a set of limitations (mainly a reboot per change, and the fact that config is shared across each pool, so per-node configuration is not possible).
  • A better upgrade solution will give us the ability to support a single host-based implementation.
  • Telemetry will give us more information on how widely IPsec is used.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry

Dependencies (internal and external)

  1.  

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

oc, the OpenShift CLI, needs to be as close to feature parity as we can get without the built-in OAuth server and its associated user and group management. This will enable scripts, documentation, blog posts, and knowledge base articles to function across all form factors, and across the same form factor with different configurations.

Goals (aka. expected user outcomes)

CLI users and scripts should be usable in a consistent way regardless of the token issuer configuration.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

oc login needs to work without the embedded oauth server

 
Why is this important? (mandatory)

We are removing the embedded oauth-server, and we currently rely on a special OAuthClient to make our login flows functional.

This allows documentation, scripts, etc to be functional and consistent with the last 10 years of our product.

This may require vendoring entire CLI plugins.  It may require new kubeconfig shapes.

 

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:

Separate oidc certificate authority and cluster certificate authority.

Version-Release number of selected component (if applicable):

oc 4.16 / 4.15

How reproducible:

Always

Steps to Reproduce:

1. Launch an HCP external OIDC cluster. The external OIDC provider is Keycloak. The Keycloak server is created outside of the cluster and its serving certificate is not trusted; its CA is separate from any of the cluster's CAs.

2. Test oc login
$ curl -sSI --cacert $ISSUER_CA_FILE $ISSUER_URL/.well-known/openid-configuration | head -n 1
HTTP/1.1 200 OK

$ oc login --exec-plugin=oc-oidc --issuer-url=$ISSUER_URL --client-id=$CLI_CLIENT_ID --extra-scopes=email,profile --callback-port=8080 --certificate-authority $ISSUER_CA_FILE
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): n

error: The server uses a certificate signed by unknown authority. You may need to use the --certificate-authority flag to provide the path to a certificate file for the certificate authority, or --insecure-skip-tls-verify to bypass the certificate check and use insecure connections.

Actual results:

2. oc login with --certificate-authority pointing to $ISSUER_CA_FILE fails.

The reason is that oc login communicates not only with the OIDC server but also with the test cluster's kube-apiserver, which is also self-signed. More handling is needed for the --certificate-authority flag, i.e. the test cluster's kube-apiserver CA and $ISSUER_CA_FILE need to be combined:
$ grep certificate-authority-data $KUBECONFIG | grep -Eo "[^ ]+$" | base64 -d > hostedcluster_kubeconfig_ca.crt

$ cat $ISSUER_CA_FILE hostedcluster_kubeconfig_ca.crt > combined-ca.crt
$ oc login --exec-plugin=oc-oidc --issuer-url=$ISSUER_URL --client-id=$CLI_CLIENT_ID --extra-scopes=email,profile --callback-port=8080 --certificate-authority combined-ca.crt
Please visit the following URL in your browser: http://localhost:8080

Expected results:

For step 2, per the https://redhat-internal.slack.com/archives/C060D1W96LB/p1711624413149659?thread_ts=1710836566.326359&cid=C060D1W96LB discussion, separate the trust, for example:

$ oc login api-server --oidc-certificate-authority=$ISSUER_CA_FILE [--certificate-authority=hostedcluster_kubeconfig_ca.crt]

The [--certificate-authority=hostedcluster_kubeconfig_ca.crt] should be optional if it is included in $KUBECONFIG's certificate-authority-data already.

Description of problem:
Introduce the --issuer-url flag in oc login.

Version-Release number of selected component (if applicable):

[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249

How reproducible:

Always

Steps to Reproduce:

1. Launch fresh HCP cluster.

2. Log in to https://entra.microsoft.com. Register an application and configure it properly.

3. Prepare variables.
HC_NAME=hypershift-ci-267920
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
AUDIENCE=7686xxxxxx
ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
CLIENT_ID=7686xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret

4. Configure HC without oauthMetadata.
[xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG

[xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration: 
    authentication: 
      oauthMetadata:
        name: ''
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"

Wait for the pods to be recreated:
[xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
certified-operators-catalog-7ff9cffc8f-z5dlg          1/1     Running   0          5h44m
kube-apiserver-6bd9f7ccbd-kqzm7                       5/5     Running   0          17m
kube-apiserver-6bd9f7ccbd-p2fw7                       5/5     Running   0          15m
kube-apiserver-6bd9f7ccbd-fmsgl                       5/5     Running   0          13m
openshift-apiserver-7ffc9fd764-qgd4z                  3/3     Running   0          11m
openshift-apiserver-7ffc9fd764-vh6x9                  3/3     Running   0          10m
openshift-apiserver-7ffc9fd764-b7znk                  3/3     Running   0          10m
konnectivity-agent-577944765c-qxq75                   1/1     Running   0          9m42s
hosted-cluster-config-operator-695c5854c-dlzwh        1/1     Running   0          9m42s
cluster-version-operator-7c99cf68cd-22k84             1/1     Running   0          9m42s
konnectivity-agent-577944765c-kqfpq                   1/1     Running   0          9m40s
konnectivity-agent-577944765c-7t5ds                   1/1     Running   0          9m37s

5. Check console login and oc login.
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}
Check console login: it succeeds, and the console's upper right correctly shows the user name oidc-user-test:xxia@redhat.com.

Check oc login:
$ rm -rf ~/.kube/cache/oc/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

Actual results:

Console login succeeds. oc login fails.

Expected results:

oc login should also succeed.

Additional info:

Feature Overview (aka. Goal Summary)

Enable a "Break Glass Mechanism" in ROSA (Red Hat OpenShift Service on AWS) and other OpenShift cloud-services in the future (e.g., ARO and OSD) to provide customers with an alternative method of cluster access via short-lived certificate-based kubeconfig when the primary IDP (Identity Provider) is unavailable.

Goals (aka. expected user outcomes)

  • Enhance cluster reliability and operational flexibility.
  • Minimize downtime due to IDP unavailability or misconfiguration.
  • The primary personas here are OpenShift Cloud Services Admins and SREs as part of the shared responsibility.
  • This will be an addition to the existing ROSA IDP capabilities.

Requirements (aka. Acceptance Criteria)

  • Enable the generation of short-lived client certificates for emergency cluster access.
  • Ensure certificates are secure and conform to industry standards.
  • Functionality to invalidate short-lived certificates in case of an exploit.

Better UX

  • User Interface within OCM to facilitate the process.
  • SHOULD have audit capabilities.
  • Minimal latency when generating and using certificates (to reduce time without access to cluster).

 Use Cases (Optional)

  • A customer's IDP is down, but they successfully use the break-glass feature to gain cluster access.
  • SREs use their own break-glass feature to perform critical operations on a customer's cluster.

Questions to Answer (Optional)

  • What is the lifetime of generated certificates? 7 days life and 1 day rotation?
  • What security measures are in place for certificate generation and storage?
  • What are the audit requirements?

Out of Scope

  • Replacement of primary IDP functionality.
  • Use of the break-glass mechanism for routine operations (i.e., this is an emergency/contingency mechanism)

 Customer Considerations

  • The feature is not a replacement for the primary IDP.
  • Customers must understand the security implications of using short-lived certificates.

Documentation Considerations

  • How-to guides for using the break-glass mechanism.
  • FAQs addressing common concerns and troubleshooting.
  • Update existing ROSA IDP documentation to include this new feature.

Interoperability Considerations

  • Compatibility with existing ROSA, OSD (OpenShift Dedicated), and ARO (Azure Red Hat OpenShift) features.
  • Interoperability tests should include scenarios where both IDP and break-glass mechanism are engaged simultaneously for access.

Goal

  • Be able to provide the customer of managed OpenShift with a cluster-admin kubeconfig that allows them to access the cluster in the event that their identity provider (IdP or external OIDC) becomes unavailable or misconfigured.

Why is this important?

  • Managed OpenShift customers need to be able to access/repair their clusters without RH intervention (i.e. opening a ticket) in the case of an identity provider outage or misconfiguration.

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. Coordination with OCM to make the kubeconfig available to customers upon request

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)

As part of the deprecation progression of the openshift-sdn CNI plug-in, remove it as an install-time option for new 4.15+ release clusters.
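
In practice this means the networking stanza of install-config.yaml for a new 4.15+ cluster must specify (or default to) OVNKubernetes. An illustrative example using the documented default CIDRs:

networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16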

Goals (aka. expected user outcomes)

The openshift-sdn CNI plug-in is sunsetting according to the following progression:

  1. deprecation notice delivered at 4.14 (Release Notes, What's Next presentation)
  2. removal as an install-time option at 4.15+
  3. removal as an option and EOL support at 4.17 GA

Requirements (aka. Acceptance Criteria):

  • The openshift-sdn CNI plug-in will no longer be an install-time option for newly installed 4.15+ clusters across installation options.
  • Customer clusters currently using openshift-sdn that upgrade to 4.15 or 4.16 with openshift-sdn will remain fully supported.
  • EUS customers using openshift-sdn on an earlier release (e.g. 4.12 or 4.14) will still be able to upgrade to 4.16 and receive full support of the openshift-sdn plug-in.

Questions to Answer (Optional):

  • Will clusters using openshift-sdn and upgrading from earlier versions to 4.15 and 4.16 still be supported?
    • YES
  • My customer has a hard requirement for the ability to install openshift-sdn 4.15 clusters. Are there any exceptions to support that?
    • Customers can file a Support Exception for consideration, and the reason for the requirement (expectation: rare) must be clarified.

Out of Scope

Background

All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.

Documentation Considerations

  • Product Documentation updates to reflect the install-time option change.

Epic Goal

The openshift-sdn CNI plug-in is sunsetting according to the following progression:

  1. deprecation notice delivered at 4.14 (Release Notes, What's Next presentation)
  2. removal as an install-time option at 4.15+
  3. removal as an option and EOL support at 4.17 GA

Why is this important?

All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only. 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The openshift-sdn CNI plug-in will no longer be an install-time option for newly installed 4.15+ clusters across installation options.
  • Customer clusters currently using openshift-sdn that upgrade to 4.15 or 4.16 with openshift-sdn will remain fully supported.
  • EUS customers using openshift-sdn on an earlier release (e.g. 4.12 or 4.14) will still be able to upgrade to 4.16 and receive full support of the openshift-sdn plug-in.

Open questions::

  • Will clusters using openshift-sdn and upgrading from earlier versions to 4.15 and 4.16 still be supported?
    • YES
  • My customer has a hard requirement for the ability to install openshift-sdn 4.15 clusters. Are there any exceptions to support that?
    • Customers can file a Support Exception for consideration, and the reason for the requirement (expectation: rare) must be clarified.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Have OpenShiftSDN (the openshift-sdn CNI plug-in) no longer be an option for networkType, making OVNKubernetes the only supported value for the network

so that I can achieve

  • The removal of the openshift-sdn CNI plug-in at install-time for 4.15+

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

The openshift-sdn CNI plug-in is sunsetting according to the following progression:

  1. deprecation notice delivered at 4.14 (Release Notes, What's Next presentation)
  2. removal as an install-time option at 4.15+
  3. removal as an option and EOL support at 4.17 GA

Why is this important?

All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only. 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The openshift-sdn CNI plug-in will no longer be an install-time option for newly installed 4.15+ clusters across installation options.
  • Customer clusters currently using openshift-sdn that upgrade to 4.15 or 4.16 with openshift-sdn will remain fully supported.
  • EUS customers using openshift-sdn on an earlier release (e.g. 4.12 or 4.14) will still be able to upgrade to 4.16 and receive full support of the openshift-sdn plug-in.

Open questions::

  • Will clusters using openshift-sdn and upgrading from earlier versions to 4.15 and 4.16 still be supported?
    • YES
  • My customer has a hard requirement for the ability to install openshift-sdn 4.15 clusters. Are there any exceptions to support that?
    • Customers can file a Support Exception for consideration, and the reason for the requirement (expectation: rare) must be clarified.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Address technical debt around self-managed HCP deployments, including but not limited to

  • Include CA ConfigMaps in the trusted bundle for both the CPO and Ignition Server, improving trust and security.
  • Create dual stack clusters through CLI with or without default values, ensuring flexibility and user preference in network management.
  • Utilize CLI commands to disable default sources, enhancing customizability.
  • Benefit from less intrusive remote write failure modes.
  • ...

Goal

  • Address all the tasks we didn't finish for the GA
  • Collect and track all missing topics for self-managed and agent provider

Users are encountering an issue when attempting to "Create hostedcluster on BM+disconnected+ipv6 through MCE". The issue is related to the default setting of `--enable-uwm-telemetry-remote-write` being true, which means that in the default disconnected case the remote-write endpoint configured in the ConfigMap for UWM, e.g.:

  minBackoff: 1s
  url: https://infogw.api.openshift.com/metrics/v1/receive

is not reachable.

So we should look into reporting the issue and remediating it, rather than fataling on it, for disconnected scenarios.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

In MCE 2.4, we currently document disabling `--enable-uwm-telemetry-remote-write` if the hosted control plane feature is used in a disconnected environment.

https://github.com/stolostron/rhacm-docs/blob/lahinson-acm-7739-disconnected-bare-[…]s/hosted_control_planes/monitor_user_workload_disconnected.adoc

Once this Jira is fixed, that documentation needs to be removed; users will no longer need to disable `--enable-uwm-telemetry-remote-write`. The HO is expected to handle `--enable-uwm-telemetry-remote-write` failures gracefully and continue to be operational.

The CLI cannot create dual-stack clusters with the default values. We need to create the proper flags to enable the HostedCluster to be a dual-stack one using the default values.

Epic Goal*

There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:

https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md

For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:

https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files

The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.

We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
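
For context, this is the kind of cluster-wide setting the CSI driver operators would observe; a minimal sketch of the apiservers.config.openshift.io/cluster resource with a Custom profile (the cipher list and TLS version shown are illustrative, not a recommendation):

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-AES128-GCM-SHA256
      - ECDHE-RSA-AES128-GCM-SHA256
      minTLSVersion: VersionTLS12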

 
Why is this important? (mandatory)

This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".

It will reduce support calls from customers and backport requests when the recommended defaults change.

It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.

 
Scenarios (mandatory) 

As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.

 
Dependencies (internal and external) (mandatory)

None, the changes we depend on were already implemented.

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation - 
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Goal:

As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.

 

Problem:

While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints that prohibit customers from using those DNS hosting services on public cloud providers.

 

Why is this important:

  • Provides customers with the flexibility to leverage their own custom managed ingress DNS solutions already in use within their organizations.
  • Required for regions like AWS GovCloud in which many customers may not be able to use the Route53 service (available only to commercial customers) for either internal or ingress DNS.
  • An OpenShift-managed internal DNS solution ensures cluster operation and that nothing breaks during updates.

 

Dependencies (internal and external):

 

Prioritized epics + deliverables (in scope / not in scope):

  • Ability to bootstrap cluster without an OpenShift managed internal DNS service running yet
  • Scalable, cluster (internal) DNS solution that's not dependent on the operation of the control plane (in case it goes down)
  • Ability to automatically propagate DNS record updates to all nodes running the DNS service within the cluster
  • Option for connecting cluster to customers ingress DNS solution already in place within their organization

 

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

 

Open questions:

 

Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • At this point in the feature, we would have a working in-cluster CoreDNS pod capable of resolving API and API-Int URLs.

This Epic details the work required to augment this CoreDNS pod to also resolve the *.apps URLs. In addition, it includes changes to prevent the Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • https://github.com/openshift/api/pull/1685 introduced updates that allow the LB IPs to be added to GCPPlatformStatus along with the state of DNS for the cluster.
  • Update cluster-ingress-operator to add the Ingress LB IPs when DNSType is `ClusterHosted`.
  • In this state, within https://github.com/openshift/api/blob/master/operatoringress/v1/types.go, set the DNSManagementPolicy to Unmanaged within the DNSRecordSpec when the DNS manifest has customer-managed DNS enabled (a sketch of the resulting DNSRecord follows below).
  • With the DNSManagementPolicy set to Unmanaged, the IngressController should not try to configure DNS records.

This requires/does not require a design proposal.
This requires/does not require a feature gate.
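
For illustration, a sketch of the resulting DNSRecord when DNS is customer managed. The values are made up; the field names follow the operatoringress/v1 types referenced above:

apiVersion: ingress.operator.openshift.io/v1
kind: DNSRecord
metadata:
  name: default-wildcard
  namespace: openshift-ingress-operator
spec:
  dnsName: '*.apps.example.com.'
  recordType: A
  recordTTL: 30
  targets:
  - 203.0.113.10
  dnsManagementPolicy: Unmanaged    # the ingress operator skips cloud DNS configuration for this record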

 


Append an Infra CR, with only the GCP PlatformStatus field set with the LB IPs (and no other fields, especially not the Spec), at the end of the bootstrap Ignition. The theory is that when the Infra CR is applied from the bootstrap Ignition, the infra manifest is applied first. As we progress through the other assets in the Ignition files, the Infra CR appears again but with only the LB IPs set. That way it will update the existing Infra CR already applied to the cluster.
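
A hedged sketch of that trailing manifest, assuming the GCPPlatformStatus additions from openshift/api#1685; the nested field names and example IPs below are illustrative and should be confirmed against the merged API:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: GCP
    gcp:
      cloudLoadBalancerConfig:            # assumed field name from the API PR
        dnsType: ClusterHosted
        clusterHosted:
          apiLoadBalancerIPs:
          - 203.0.113.5
          apiIntLoadBalancerIPs:
          - 10.0.0.4
          ingressLoadBalancerIPs:
          - 203.0.113.6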

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.


Currently the behavior is to use the default security group only if no security groups were specified for a NodePool. This makes it difficult to implement additional security groups in ROSA, because there is no way to know the default security group at cluster creation. By always appending the default security group, any security groups specified on the NodePool become additional security groups.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

This outcome tracks the overall CoreOS Layering story as well as the technical items needed to converge CoreOS with RHEL image mode. This will provide operational consistency across the platforms.

ROADMAP for this Outcome: https://docs.google.com/document/d/1K5uwO1NWX_iS_la_fLAFJs_UtyERG32tdt-hLQM8Ow8/edit?usp=sharing
 

 

 

 

Note: phase 2 target is tech preview.

Feature Overview

In the initial delivery of CoreOS Layering, administrators are required to provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or an enterprising administrator with some knowledge of OCP Builds could potentially set one up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

Description of problem:

When opting into on-cluster builds on both the worker and control plane MachineConfigPools, the maxUnavailable value on the MachineConfigPools is not respected when the newly built image is rolled out to all of the nodes in a given pool.
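
For reference, maxUnavailable is set per pool; a minimal illustration for the worker pool (the value shown is just an example):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  maxUnavailable: 1    # at most one worker node should be cordoned/drained at a time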

Version-Release number of selected component (if applicable):

    

How reproducible:

Sometimes reproducible. I'm still working on figuring out what conditions need to be present for this to occur.

Steps to Reproduce:

    1. Opt an OpenShift cluster in on-cluster builds by following these instructions: https://github.com/openshift/machine-config-operator/blob/master/docs/OnClusterBuildInstructions.md
    2. Ensure that both the worker and control plane MachineConfigPools are opted in.    

Actual results:

Multiple nodes in both the control plane and worker MachineConfigPools are drained and cordoned simultaneously, irrespective of the maxUnavailable value. This is particularly problematic for control plane nodes, since draining more than one control plane node at a time can cause etcd issues; in addition, PDBs (Pod Disruption Budgets) can make the config change take substantially longer or block completely.

I've mostly seen this issue affect control plane nodes, but I've also seen it impact both control plane and worker nodes.

Expected results:

I would have expected the new OS image to be rolled out in a similar fashion as new MachineConfigs are rolled out. In other words, a single node (or nodes up to maxUnavailable for non-control-plane nodes) is cordoned, drained, updated, and uncordoned at a time.

Additional info:

I suspect the bug may be someplace within the NodeController since that's the part of the MCO that controls which nodes update at a given time. That said, I've had difficulty reliably reproducing this issue, so finding a root cause could be more involved. This also seems to be mostly confined to the initial opt-in process. Subsequent updates seem to follow the original "rules" more closely.

Description of problem:


In clusters with OCB functionality enabled, sometimes the machine-os-builder pod is not restarted when we update the imageBuilderType.

What we have observed is that the pod is restarted if a build is running, but it is not restarted if we are not building anything.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-09-12-195514

How reproducible:

Always

Steps to Reproduce:

1. Create the configuration resources needed by the OCB functionality.

To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType

 oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'


2. Create an infra pool and label it so that it can use the OCB functionality

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


 oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=

3. Wait for the build pod to finish.

4. Once the build has finished and has been cleaned up, update the imageBuilderType to use the "custom-pod-builder" type.

oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'


Actual results:


We waited for one hour, but the pod was never restarted.

$ oc get pods |grep build
machine-os-builder-6cfbd8d5d-xk6c5           1/1     Running   0          56m

$ oc logs machine-os-builder-6cfbd8d5d-xk6c5 |grep Type
I0914 08:40:23.910337       1 helpers.go:330] imageBuilderType empty, defaulting to "openshift-image-builder"

$ oc get cm on-cluster-build-config -o yaml |grep Type
  imageBuilderType: custom-pod-builder


Expected results:


When we update the imageBuilderType value, the machine-os-builder pod should be restarted.

Additional info:


Description of problem:


In a cluster with a pool using OCB functionality, if we update the imageBuilderType value while an openshift-image-builder pod is building an image, the build fails.


It can fail in 2 ways:

1. The running pod that is building the image is removed, and we get a failed build reporting "Error (BuildPodDeleted)".
2. The machine-os-builder pod is restarted but the build pod is not removed; the build is then never removed.


Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         154m    Cluster version is 4.14.0-0.nightly-2023-09-12-195514

How reproducible:


Steps to Reproduce:

1. Create the needed resources to make OCB functionality work (on-cluster-build-config configmap, the secrets and the imageSpec)

We reproduced it using imageBuilderType=""

oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'


2. Create an infra pool and label it so that it can use OCB functionality

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


 oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=



3. Wait until the triggered build has finished.

4. Create a new MC to trigger a new build. This one, for example:


apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/test-file.test



5. Just after a new build pod is created, configure the on-cluster-build-config configmap to use the "custom-pod-builder" imageBuilderType

oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'



Actual results:


We have observed 2 behaviors after step 5:


1. The machine-os-builder pod is restarted and the build is never removed.

build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   10 seconds ago
NAME                                                              READY   STATUS              RESTARTS   AGE
pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build   1/1     Running             0          12s
pod/machine-config-controller-5bdd7b66c5-dl4hh                    2/2     Running             0          90m
pod/machine-config-daemon-5wbw4                                   2/2     Running             0          90m
pod/machine-config-daemon-fqr8x                                   2/2     Running             0          90m
pod/machine-config-daemon-g77zd                                   2/2     Running             0          83m
pod/machine-config-daemon-qzmvv                                   2/2     Running             0          83m
pod/machine-config-daemon-w8mnz                                   2/2     Running             0          90m
pod/machine-config-operator-7dd564556d-mqc5w                      2/2     Running             0          92m
pod/machine-config-server-28lnp                                   1/1     Running             0          89m
pod/machine-config-server-5csjz                                   1/1     Running             0          89m
pod/machine-config-server-fv4vk                                   1/1     Running             0          89m
pod/machine-os-builder-6cfbd8d5d-2f7kd                            0/1     Terminating         0          3m26s
pod/machine-os-builder-6cfbd8d5d-h2ltd                            0/1     ContainerCreating   0          1s

NAME                                                                             TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   12 seconds ago




2. The build pod is removed and the build fails with Error (BuildPodDeleted):

NAME                                                                             TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   10 seconds ago
NAME                                                              READY   STATUS        RESTARTS   AGE
pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build   1/1     Terminating   0          12s
pod/machine-config-controller-5bdd7b66c5-dl4hh                    2/2     Running       0          159m
pod/machine-config-daemon-5wbw4                                   2/2     Running       0          159m
pod/machine-config-daemon-fqr8x                                   2/2     Running       0          159m
pod/machine-config-daemon-g77zd                                   2/2     Running       8          152m
pod/machine-config-daemon-qzmvv                                   2/2     Running       16         152m
pod/machine-config-daemon-w8mnz                                   2/2     Running       0          159m
pod/machine-config-operator-7dd564556d-mqc5w                      2/2     Running       0          161m
pod/machine-config-server-28lnp                                   1/1     Running       0          159m
pod/machine-config-server-5csjz                                   1/1     Running       0          159m
pod/machine-config-server-fv4vk                                   1/1     Running       0          159m
pod/machine-os-builder-6cfbd8d5d-g62b6                            1/1     Running       0          2m11s

NAME                                                                             TYPE     FROM         STATUS    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Running   12 seconds ago

.....



NAME                                                                             TYPE     FROM         STATUS                    STARTED          DURATION
build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855   Docker   Dockerfile   Error (BuildPodDeleted)   17 seconds ago   13s




Expected results:

Updating the imageBuilderType while a build is running should not leave the OCB functionality in a broken state.


Additional info:


Must-gather files are provided in the first comment in this ticket.

Description of problem:


MachineConfigs that use ignition 3.4.0 with kernelArguments are not currently allowed by the MCO.

In on-cluster build pools, when we create a 3.4.0 MC with kernelArguments, the pool is not degraded.

No new rendered MC is created either.

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-06-065940
 

How reproducible:

Always
 

Steps to Reproduce:


1. Enable on-cluster build in the "worker" pool
2. Create a MC using 3.4.0 ignition version with kernelArguments

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2023-09-07T12:52:11Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mco-tc-66376-reject-ignition-kernel-arguments-worker
  resourceVersion: "175290"
  uid: 10b81a5f-04ee-4d7b-a995-89f319968110
spec:
  config:
    ignition:
      version: 3.4.0
    kernelArguments:
      shouldExist:
      - enforcing=0

Actual results:

The build process is triggered and a new image is built and deployed.

The pool is never degraded.
 

Expected results:


MCs with ignition 3.4.0 kernelArguments are not currently allowed. The MCP should be degraded, reporting a message similar to this one (this is the error reported if we deploy the MC in the master pool, which is a normal pool):

oc get mcp -o yaml
....
 - lastTransitionTime: "2023-09-07T12:16:55Z"
    message: 'Node sregidor-s10-7pdvl-master-1.c.openshift-qe.internal is reporting:
      "can''t reconcile config rendered-master-57e85ed95604e3de944b0532c58c385e with
      rendered-master-24b982c8b08ab32edc2e84e3148412a3: ignition kargs section contains
      changes"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

 

Additional info:



When the image is deployed (it shouldn't be), the kernel argument enforcing=0 is not present:
sh-5.1# cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/vmlinuz-5.14.0-284.30.1.el9_2.x86_64 ostree=/ostree/boot.0/rhcos/05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/0 ignition.platform.id=gcp console=tty0 console=ttyS0,115200n8 root=UUID=95083f10-c02f-4d94-a5c9-204481ce3a91 rw rootflags=prjquota boot=UUID=0440a909-3e61-4f7c-9f8e-37fe59150665 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1

 

There are a few situations in which a cluster admin might want to trigger a rebuild of their OS image in addition to situations where cluster state may dictate that we should perform a rebuild. For example, if the custom Dockerfile changes or the machine-config-osimageurl changes, it would be desirable to perform a rebuild in that case. To that end, this particular story covers adding the foundation for a rebuild mechanism in the form of an annotation that can be applied to the target MachineConfigPool. What is out of scope for this story is applying this annotation in response to a change in cluster state (e.g., custom Dockerfile change).

 

Done When:

  • BuildController is aware of and recognizes a special annotation on layered MachineConfigPools (e.g., machineconfiguration.openshift.io/rebuildImage).
  • Upon recognizing that a MachineConfigPool has this annotation, BuildController will clear any failed build attempts, delete any failed builds and their related ephemeral objects (e.g., rendered Dockerfile / MachineConfig ConfigMaps), and schedule a new build to be performed.
  • This annotation should be removed when the build process completes, regardless of outcome. In other words, whether the build succeeds or fails, the annotation should be removed.
  •  [optional] BuildController keeps track of the number of retries for a given MachineConfigPool. This can occur via another annotation, e.g., machineconfiguration.openshift.io/buildRetries=1 . For now, this can be a hard-coded value (e.g., 5), but in the future, this could be wired up to an end-user facing knob. This annotation should be cleared upon a successful rebuild. If the retry limit is reached, then we should degrade.
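As a rough illustration only (the annotation name above is a proposal and may change), an admin could trigger a rebuild on a layered pool with something like:

oc annotate mcp/worker machineconfiguration.openshift.io/rebuildImage=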

The current OCB approach is a private, MCO-only API. Making it a public API would introduce the following benefits:

1. Transparent update information linked with the proposed MachineOSUpdater API
2. Follow the MCO migration to openshift/api. We should not have private APIs anymore in the MCO. Especially if the feature is publicly used.
3. Consolidate build information into one place that both the MCO and other users can pull from

the general proposal of changes here are as follows:

1. Move global build settings to the ControllerConfig object or to this object. These include `finalImagePushSecret` and `finalImagePullspec`.
2. Create a MachineOSBuild CRD, which will include a Dockerfile field, the MachineConfig to build from, etc.
3. Add these fields to MCP as well. Rather than thinking of this as two sources of truth, you can view the MCP fields as triggers to create or modify an existing MachineOSBuild object. This is similar to the mechanism that OpenShift BuildV1 uses with its BuildConfigs and Builds CRDs; the BuildConfig houses all of the necessary configs and a new Build is created with those configs. One does not need a BuildConfig to do a build, but one can use a BuildConfig to launch multiple builds.

Making these changes will enforce a system for builds rather than the appendage that the build API is currently to the MCO. The aim here is visibility rather than hidden operations.
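Purely as an illustration of the proposal above, a MachineOSBuild object might look roughly like the following sketch; the apiVersion, field names, and values are hypothetical and only reflect the fields mentioned in points 1 and 2:

apiVersion: machineconfiguration.openshift.io/v1alpha1   # hypothetical
kind: MachineOSBuild
metadata:
  name: worker-build-example
spec:
  desiredMachineConfig: rendered-worker-example           # MachineConfig to build from
  containerfile: |
    FROM configs AS final
    RUN echo "example customization" > /etc/example
  finalImagePullspec: quay.io/example/os-image:latest
  finalImagePushSecret: example-push-secret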

Description of problem:

When an MCP has the on-cluster-build functionality enabled and a valid imageBuilderType is configured in the on-cluster-build configmap, updating this configmap later with an invalid imageBuilderType does not degrade the machine-config ClusterOperator.

 

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         3h56m   Cluster version is 4.14.0-0.nightly-2023-09-12-195514

 

How reproducible:

Always
 

Steps to Reproduce:

1. Create a valid OCB configmap, and 2 valid secrets. Like this:

apiVersion: v1
data:
  baseImagePullSecretName: mco-global-pull-secret
  finalImagePullspec: quay.io/mcoqe/layering
  finalImagePushSecretName: mco-test-push-secret
  imageBuilderType: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2023-09-13T15:10:37Z"
  name: on-cluster-build-config
  namespace: openshift-machine-config-operator
  resourceVersion: "131053"
  uid: 1e0c66de-7a9a-4787-ab98-ce987a846f66

3. Label the "worker" MCP in order to enable the OCB functionality in it.

$ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

4. Wait for the machine-os-builder pod to be created and for the build to finish. Just wait for the pods; do not wait for the MCPs to be updated. As soon as the build pod has finished the build, go to step 5.

5. Patch the on-cluster-build configmap to use an invalid imageBuilderType

 oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "fake"}}'


Actual results:

The machine-os-builder pod crashes
$ oc get pods
NAME                                         READY   STATUS             RESTARTS        AGE
machine-config-controller-5bdd7b66c5-6l7sz   2/2     Running            2 (45m ago)     63m
machine-config-daemon-5ttqh                  2/2     Running            0               63m
machine-config-daemon-l95rj                  2/2     Running            0               63m
machine-config-daemon-swtc6                  2/2     Running            2               57m
machine-config-daemon-vq594                  2/2     Running            2               57m
machine-config-daemon-zrf4f                  2/2     Running            0               63m
machine-config-operator-7dd564556d-9smk4     2/2     Running            2 (45m ago)     65m
machine-config-server-9sxjv                  1/1     Running            0               62m
machine-config-server-m5sdl                  1/1     Running            0               62m
machine-config-server-zb2hr                  1/1     Running            0               62m
machine-os-builder-6cfbd8d5d-t6g8w           0/1     CrashLoopBackOff   6 (3m11s ago)   9m16s


But the machine-config ClusterOperator is not degraded


$ oc get co machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.14.0-0.nightly-2023-09-12-195514   True        False         False      63m     


 

Expected results:


The machine-config ClusterOperator should become degraded when an invalid imageBuilderType is configured.

 

Additional info:


If we configure an invalid imageBuilderType directly (not by patching/editing the configmap), then the machine-config CO is degraded, but when we edit the configmap it is not.

A link to the must-gather file is provided in the first comment in this issue


PS: If we wait for the MCPs to be updated in step 4, the machine-os-builder pod is not restarted with the new "fake" imageBuilderType, but the machine-config CO is not degraded either, even though it should be. Does that make sense?

 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Goal

We want to add a new CLI option named `--all-images` to the `oc adm must-gather` command.
The oc client will scan all the CSVs on the cluster looking for the `operators.openshift.io/must-gather-image=pullUrlOfTheImage` annotation, building a list of must-gather images to be used to collect logs for all the products installed on the cluster; the client will then execute all of them and aggregate the collection results.
Operator authors that want to opt in to this mechanism should explicitly annotate their CSV with the `operators.openshift.io/must-gather-image` annotation.
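Sketching the intended flow with the annotation and flag named above (resource names and the image URL are placeholders):

# Operator authors opt in by annotating their CSV with the pull URL of their must-gather image
oc annotate csv <operator-csv-name> -n <operator-namespace> operators.openshift.io/must-gather-image=<pullUrlOfTheImage>

# Cluster admins then collect logs for every opted-in product in a single pass
oc adm must-gather --all-images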

User Stories

  • As a Cluster Admin, I want to execute `oc adm must-gather` only once, collecting all the logs for all the products deployed on my cluster, without the need to explicitly specify a long (and error-prone) list of images
  • As a Red Hat support engineer, I want to be able to suggest to customers a single, simple command to collect must-gather logs for different products in a single pass, reducing the request ping-pong game and thus the case resolution time.

Non-Requirements

  • List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

  • Any additional details or decisions made/needed

Placeholder feature for ccx-ocp-core maintenance tasks.

This is epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.

Description of problem:

    The operator panics in a HyperShift hosted cluster with OVN when networking obfuscation is enabled:
 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 858 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x26985e0?, 0x454d700})
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0010d67e0?})
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x26985e0, 0x454d700})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/insights-operator/pkg/anonymization.getNetworksFromClusterNetworksConfig(...)
	/go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:292
github.com/openshift/insights-operator/pkg/anonymization.getNetworksForAnonymizer(0xc000556700, 0xc001154ea0, {0x0, 0x0, 0x0?})
	/go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:253 +0x202
github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).readNetworkConfigs(0xc0005be640)
	/go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:180 +0x245
github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).AnonymizeMemoryRecord.func1()
	/go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:354 +0x25
sync.(*Once).doSlow(0xc0010d6c70?, 0x21a9006?)
	/usr/lib/golang/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/lib/golang/src/sync/once.go:65
github.com/openshift/insights-operator/pkg/anonymization.(*Anonymizer).AnonymizeMemoryRecord(0xc0005be640, 0xc000cf0dc0)
	/go/src/github.com/openshift/insights-operator/pkg/anonymization/anonymizer.go:353 +0x78
github.com/openshift/insights-operator/pkg/recorder.(*Recorder).Record(0xc00075c4b0, {{0x2add75b, 0xc}, {0x0, 0x0, 0x0}, {0x2f38d28, 0xc0009c99c0}})
	/go/src/github.com/openshift/insights-operator/pkg/recorder/recorder.go:87 +0x49f
github.com/openshift/insights-operator/pkg/gather.recordGatheringFunctionResult({0x2f255c0, 0xc00075c4b0}, 0xc0010d7260, {0x2adf900, 0xd})
	/go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:157 +0xb9c
github.com/openshift/insights-operator/pkg/gather.collectAndRecordGatherer({0x2f50058?, 0xc001240c90?}, {0x2f30880?, 0xc000994240}, {0x2f255c0, 0xc00075c4b0}, {0x0?, 0x8dcb80?, 0xc000a673a2?})
	/go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:113 +0x296
github.com/openshift/insights-operator/pkg/gather.CollectAndRecordGatherer({0x2f50058, 0xc001240c90}, {0x2f30880, 0xc000994240?}, {0x2f255c0, 0xc00075c4b0}, {0x0, 0x0, 0x0})
	/go/src/github.com/openshift/insights-operator/pkg/gather/gather.go:89 +0xe5
github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Gather.func2(0xc000a678a0, {0x2f50058, 0xc001240c90}, 0xc000796b60, 0x26f0460?)
	/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:206 +0x1a8
github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Gather(0xc000796b60)
	/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:222 +0x450
github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).periodicTrigger(0xc000796b60, 0xc000236a80)
	/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:265 +0x2c5
github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Run.func1()
	/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:161 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007d7c0?, {0x2f282a0, 0xc0012cd800}, 0x1, 0xc000236a80)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001381fb0?, 0x3b9aca00, 0x0, 0x0?, 0x449705?)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xabfaca?, 0x88d6e6?, 0xc00078a360?)
	/go/src/github.com/openshift/insights-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/insights-operator/pkg/controller/periodic.(*Controller).Run
	/go/src/github.com/openshift/insights-operator/pkg/controller/periodic/periodic.go:161 +0x1ea

Version-Release number of selected component (if applicable):

    

How reproducible:

    Enable networking obfuscation for the Insights Operator and wait for gathering to happen in the operator. You will see the above stacktrace. 

Steps to Reproduce:

    1. Create a HyperShift hosted cluster with OVN
    2. Enable networking obfuscation for the Insights Operator
    3. Wait for data gathering to happen in the operator
    

Actual results:

    operator panics

Expected results:

    there's no panic

Additional info:

    

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The PowerVS IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing PowerVS Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The OpenStack Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing OpenStack Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Goal

  • Create cluster and OpenStackCluster resource for the install-config.yaml
  • Create OpenStackMachine
  • Remove terraform dependency for OpenStack

Why is this important?

  • To have a CAPO cluster functionally equivalent to the installer

Scenarios


  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Rebase Installer onto the development branch of cluster-api-provider-openstack to provide CI signal to the CAPO maintainers.

Right now, when trying an installation with this work (https://github.com/openshift/installer/pull/7939), the bootstrap machine is not getting deleted. We need to ensure it is gone once bootstrapping is finalized.

Essentially: bring the upstream-master branch of shiftstack/cluster-api-provider-openstack under the github.com/openshift organisation.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The IBM Cloud VPC IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing IBM Cloud VPC Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Feature Overview (aka. Goal Summary)

Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the # of hosted clusters and nodepools.

But we are still missing other metrics to be able to make correct inference about what we see in the data.

Goals (aka. expected user outcomes)

  • Provide Metrics to highlight # of Nodes per NodePool or # of Nodes per cluster
  • Make sure the error between what appears in CMO via `install_type` and what we report as # Hosted Clusters is minimal.

Use Cases (Optional):

  • Understand product adoption
  • Gauge Health of deployments
  • ...

 

Overview

Today we have hypershift_hostedcluster_nodepools as a metric exposed to provide information on the # of nodepools used per cluster. 

 

Additional NodePools metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.

In addition to knowing how many nodepools per hosted cluster, we would like to expose the knowledge of the nodepool size.

 

This will help inform our decision making and provide some insights on how the product is being adopted/used.

Goals

The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules: 

  • hypershift_nodepools_size
  • hypershift_nodepools_available_replicas

Requirements

The implementation involves updates to the following GitHub repositories; for reference, similar PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
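As a rough sketch, such recording rules for the NodePool metrics above could be expressed in a PrometheusRule manifest along these lines (rule names, labels, and the namespace are illustrative, not the final implementation):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hypershift-nodepool-telemetry-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: hypershift.nodepools.rules
    rules:
    - record: cluster:hypershift_nodepools_size:sum
      expr: sum by (namespace, name) (hypershift_nodepools_size)
    - record: cluster:hypershift_nodepools_available_replicas:sum
      expr: sum by (namespace, name) (hypershift_nodepools_available_replicas)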

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The GCP IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing GCP Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provision GCP infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing GCP Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

The bootstrap machine never contains a public IP address. When the publish strategy is set to External, the bootstrap machine should contain a public IP address.

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:

I want the installer to create the service accounts that would be assigned to control plane and compute machines, similar to what is done in terraform now. 

Acceptance Criteria:

Description of criteria:

  • Control plane and compute service accounts are created with appropriate permissions (see details)
  • Service accounts are attached to machines
  • Skip creation of control-plane service accounts when they are specified in the install config

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

I want to create the public and private DNS records using one of the CAPI interface SDK hooks.

Acceptance Criteria:

Description of criteria:

  • Vanilla install: create public DNS record, private zone, private record
  • Shared VPC: check for existing private DNS zone
  • Publish Internal: create private zone, private record
  • Custom DNS scenario: create no records, or zone

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • The value for the DNS record for the CAPI-provisioned LB can be grabbed from the cluster manifest
  • The value for the SDK-provisioned LB will need to be handled within our hook

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Create the GCP Infrastructure controller in /pkg/clusterapi/system.go.
It will be based on the AWS controller in that file, which was added in https://github.com/openshift/installer/pull/7630.

User Story:

As a (user persona), I want to be able to:

  • Create the GCP cluster manifest for CAPI installs

so that I can achieve

  • The manifests in <assets-dir>/cluster-api will be applied to bootstrap ignition and the files will find their way to the machines.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

When installing on GCP, I want control-plane (including bootstrap) machines to bootstrap using ignition.

I want bootstrap ignition to be secured so that sensitive data is not publicly available.

Acceptance Criteria:

Description of criteria:

  • Control-plane machines pull ignition (boot successfully)
  • Bootstrap ignition is not public (typically signed url)
  • Service account is not required for signed url (stretch goal)
  • Should be labeled (with owned and user tags)

(optional) Out of Scope:

Destroying bootstrap ignition can be handled separately.

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

As an installer user, I want my GCP credentials used for the install to be used by the CAPG controller when provisioning resources.

Acceptance Criteria: 

  • Users can authenticate using the service account from ~/.gcp/osServiceAccount.json
  • users can authenticate with default application credentials
  • The docs team is updated on whether existing credential methods will continue to work (specifically environment variables); see the official docs and the example below
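For example, pointing Application Default Credentials at the installer's service account file (the environment variable is the standard ADC variable; the path is the installer's usual location mentioned above):

export GOOGLE_APPLICATION_CREDENTIALS=~/.gcp/osServiceAccount.json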

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

I want to create a load balancer to provide split-horizon DNS for the cluster.

Acceptance Criteria:

Description of criteria:

  • In a vanilla install, we have both ext and int load balancers
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

https://issues.redhat.com/browse/CORS-3217 covers the upstream changes to CAPG needed to add disk encryption. In addition, changes will be needed in the installer to set the GCPMachine disk encryption based on the machinepool settings.

Notes on the required changes are at https://docs.google.com/document/d/1kVgqeCcPOrq4wI5YgcTZKuGJo628dchjqCrIrVDS83w/edit?usp=sharing

 

Once the upstream changes from CORS-3217 have been accepted:

  • plumb through existing encryption key fields
  • vendor updated CAPG
  • update infrastructure-components.yaml (CRD definitions) if necessary

When a GCP cluster is created using CAPI, upon destroy the addresses associated with the apiserver LoadBalancer are not removed. For example, here are addresses left over after previous installations:

$ gcloud compute addresses list --uri | grep bfournie
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-gn6g7-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-h96j2-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-k7fdj-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nh4z5-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nls2h-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-qrhmr-apiserver

Here is one of the addresses:

$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
address: 34.107.255.76
addressType: EXTERNAL
creationTimestamp: '2024-04-15T15:17:56.626-07:00'
description: ''
id: '2697572183218067835'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-27kzq-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
status: RESERVED
[bfournie@bfournie installer-patrick-new]$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
address: 34.149.208.133
addressType: EXTERNAL
creationTimestamp: '2024-03-27T09:35:00.607-07:00'
description: ''
id: '1650865645042660443'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-6jrwz-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
status: RESERVED


User Story:

I want to destroy the load balancers created by capg

Acceptance Criteria:

Description of criteria:

  • destroy deletes all capg related resources
  • backwards compatible: continues to destroy terraform clusters
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • CAPG creates: global proxy load balancers
  • our destroy code only looks for regional passthrough load balancers
  • need to:
    • update destroy code to not look at regional load balancers (only)
    • destroy tcp proxy resource, which is not created in passthrough load balancers, so is not currently deleted
  • Engineering detail 2

This requires/does not require a design proposal.
This requires/does not require a feature gate.

When GCP workers are created, they are not able to pull ignition over the internal subnet because it is not allowed by the firewall rules created by CAPG. The allow-<infraID>-cluster rule allows all TCP traffic for the <infraID>-node and <infraID>-control-plane tags, but the workers that are created have the <infraID>-worker tag.

We need to either add the worker tag to this firewall rule or add the node tag to the workers. We should decide on a general approach to CAPG firewall rules; one possible workaround is sketched below.
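A sketch of that workaround, adding the worker tag to the existing rule (note that gcloud replaces the full target-tag list on update, so the existing tags must be repeated; the rule name uses the placeholder from above):

gcloud compute firewall-rules update allow-<infraID>-cluster \
    --target-tags=<infraID>-node,<infraID>-control-plane,<infraID>-worker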

When testing GCP using the CAPG provider (not Terraform) in 4.16, it was found that the master VM instances were not distributed across instance groups but were all assigned to the same instance group.

Here is a (partial) CAPG install vs. an installation completed using Terraform. The CAPG installation (bfournie-capg-test-5ql8j) has VMs all placed in us-east1-b:

$ gcloud compute instances list | grep bfournie
bfournie-capg-test-5ql8j-bootstrap     us-east1-b     n2-standard-4               10.0.0.4     34.75.212.239  RUNNING
bfournie-capg-test-5ql8j-master-0      us-east1-b     n2-standard-4               10.0.0.5                    RUNNING
bfournie-capg-test-5ql8j-master-1      us-east1-b     n2-standard-4               10.0.0.6                    RUNNING
bfournie-capg-test-5ql8j-master-2      us-east1-b     n2-standard-4               10.0.0.7                    RUNNING
bfournie-test-tf-pdrsw-master-0        us-east4-a     n2-standard-4               10.0.0.4                    RUNNING
bfournie-test-tf-pdrsw-worker-a-vxjbk  us-east4-a     n2-standard-4               10.0.128.2                  RUNNING
bfournie-test-tf-pdrsw-master-1        us-east4-b     n2-standard-4               10.0.0.3                    RUNNING
bfournie-test-tf-pdrsw-worker-b-ksxfg  us-east4-b     n2-standard-4               10.0.128.3                  RUNNING
bfournie-test-tf-pdrsw-master-2        us-east4-c     n2-standard-4               10.0.0.5                    RUNNING
bfournie-test-tf-pdrsw-worker-c-jpzd5  us-east4-c     n2-standard-4               10.0.128.4                  RUNNING

User Story:

As a (user persona), I want to be able to:

  • Ensure that the correct path is chosen when the cluster should use a CAPI install, i.e. no longer use Terraform for GCP installations when the user has selected the CAPI featureSet.

so that I can achieve

  • CAPI gcp installation

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

When using the CAPG provider, the ServiceAccounts created by the installer for the master and worker nodes do not have the role bindings added correctly.

For example, this query shows that the SA for the master nodes has no role bindings (an illustrative example of the kind of binding that is missing is shown after the output):

$ gcloud projects get-iam-policy openshift-dev-installer --flatten="bindings[].members" --format='table(bindings.role)' --filter='bindings.members:bfournie-capg-test-lk5t5-m@openshift-dev-installer.iam.gserviceaccount.com'
$
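For reference, a binding of the missing kind would normally be added with something like the following (the role shown is only illustrative; the actual roles are whichever ones the installer is supposed to grant):

gcloud projects add-iam-policy-binding openshift-dev-installer \
    --member='serviceAccount:bfournie-capg-test-lk5t5-m@openshift-dev-installer.iam.gserviceaccount.com' \
    --role='roles/compute.instanceAdmin.v1'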

Now that https://issues.redhat.com/browse/CORS-3447 provides the ability to override the APIServer instance group to be compatible with MAPI, we need to set the override in the installer when the internal LoadBalancer is created.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision AWS infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The AWS IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing AWS Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Provision AWS infrastructure without the use of Terraform

Why is this important?

  • This is a key piece in producing a terraform-free binary for ROSA. See parent epic for more details.

Scenarios

  1. The new provider should aim to provide the same results as the existing AWS terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

 

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Issue:

  • The Kubernetes API on OpenShift requires advanced health check configuration to ensure graceful termination[1] works correctly. The health check protocol must be HTTPS, testing the path /readyz.
    • Currently CAPA supports the standard configuration (health check probe timers) and a path when HTTP or HTTPS is selected.
  • The MCS listener/target must use an HTTPS health check with custom probe periods too.

Steps to reproduce:

  • Create CAPA cluster
  • Check the Target Groups created to route traffic to API (6443), for both Load Balancers (int and ext)

Actual results:

  • TCP health check w/ default probe config

Expected results:

  • HTTPS health check evaluating /readyz path
  • Check parameters: 2 checks at an interval of 10s, with a 10s timeout
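For reference, the expected configuration roughly corresponds to the following AWS CLI call against the API target groups (the target group ARN is a placeholder):

aws elbv2 modify-target-group \
    --target-group-arn <api-target-group-arn> \
    --health-check-protocol HTTPS \
    --health-check-path /readyz \
    --healthy-threshold-count 2 \
    --unhealthy-threshold-count 2 \
    --health-check-interval-seconds 10 \
    --health-check-timeout-seconds 10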

 

References:

When using Wavelength Zones, networks are CIDR'd differently than in vanilla installs. Ensure Wavelength support.

Because of the assumption that subnets have auto-assign public IPs turned on, which is how CAPA configures the subnets it creates, supplying your own VPC where that is not the case causes the bootstrap node to not get a public IP and therefore not be able to download the release image (no internet connection).

The bootstrap node needs a public IP because the public subnets are connected only to the internet gateway, which does not provide NAT.

When installconfig.controlPlane.platform.aws.metadataService is set, the metadata service is correctly configured for control plane machines.
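For example, an install-config snippet along these lines (the authentication value shown is just one of the allowed settings):

controlPlane:
  platform:
    aws:
      metadataService:
        authentication: Required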

CAPA creates 4 security groups:

$ aws ec2 describe-security-groups --region us-east-2 --filters "Name = group-name, Values = *rdossant*" --query "SecurityGroups[*].[GroupName]" --output text
rdossant-installer-03-tvcbd-lb
rdossant-installer-03-tvcbd-controlplane
rdossant-installer-03-tvcbd-apiserver-lb
rdossant-installer-03-tvcbd-node

Given that the maximum number of SGs in a network interface is 16, we should update the max number validation in the installer:

https://github.com/openshift/installer/blob/master/pkg/types/aws/validation/machinepool.go#L66

Patrick says:

I think we want to update this to cap the user limit to 10 additional security groups:

More context: https://redhat-internal.slack.com/archives/C68TNFWA2/p1697764210634529?thread_ts=1697471429.293929&cid=C68TNFWA2

Private hosted zone and cross-account shared VPC work when installconfig.platform.aws.hostedZone is specified.
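For example, an install-config along these lines (the zone ID is a placeholder for an existing private hosted zone):

platform:
  aws:
    hostedZone: <existing-private-hosted-zone-id>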

Goal:

  • As an OCP developer, I would like to deploy OpenShift using the installer/CAPA to provision the infrastructure, setting the health check parameters to satisfy the existing terraform implementation

Issue:

  • The API listeners are created exposing different health check attributes than the terraform implementation. There is a bug to fix so that the correct path is used when the health check protocol is HTTP or HTTPS (CORS-3289).
  • The MCS listener, created using the AdditionalListeners option from the *LoadBalancer, requires advanced health check configuration to satisfy the terraform implementation. Currently CAPA sets the health check to standard TCP, or to the default path ("/") when using the HTTP or HTTPS protocol.

Steps to reproduce:

  • Create CAPA cluster
  • Check the Target Groups created to route traffic to MCS (22623), for internal Load Balancers

Actual results:

  • TCP or HTTP/S health check w/ default probe config

Expected results:

  • API listeners with target health check using HTTPS evaluating /healthz path
  • MCS listener with target health check using HTTPS evaluating /healthz path
  • Check parameters: 2 checks at an interval of 10s, with a 10s timeout

 

References:

Security group IDs are added to control plane nodes when installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs is specified.
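For example (the security group ID is a placeholder):

controlPlane:
  platform:
    aws:
      additionalSecurityGroupIDs:
      - sg-0123456789abcdef0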

User Story:

Destroy all bootstrap resources created through the new non-terraform provider.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • implement custom endpoint support if it's still needed.

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

When installconfig.platform.aws.userTags is specified, all taggable resources should have the specified user tags (see the example snippet after the list).

  • Manifests:
    • cluster
    • machines
  • Non-capi provisioned resources:
    • IAM roles
    • Load balancer resources
    • DNS resources (private zone is tagged)
  • Check compatibility of how CAPA provider is tagging bootstrap S3 bucket
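A minimal install-config example of such user tags (keys and values are illustrative):

platform:
  aws:
    userTags:
      team: example-team
      cost-center: "1234"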

 

User Story:

As a (user persona), I want to be able to:

  • make sure ignition respects proxy configuration

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Use cases to ensure:

  • When installconfig.controlPlane.platform.aws.zones is specified, control plane nodes are correctly placed in those zones (see the hedged sketch after this list).
  • When installconfig.controlPlane.platform.aws.zones isn't specified, the control plane nodes are correctly balanced across the zones available in the region, avoiding single-zone placement when possible.
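
A hedged install-config.yaml fragment pinning the control plane zones (zone names are placeholders):

controlPlane:
  name: master
  replicas: 3
  platform:
    aws:
      zones:
        - us-east-1a
        - us-east-1b
        - us-east-1c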

CAPA shows

 

I0312 18:00:13.602972     109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="9cda22be-5acd-4670-840f-8a6708437385" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-2"
I0312 18:00:13.608919     109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="1ed0ad52-ffc1-4b62-97e4-876f8e8c3242" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-0"
[...]
E0312 18:04:25.282967     109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g==
 >
E0312 18:04:25.284197     109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g==
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="7fac94a1-772a-4c7b-a631-5ef7fc015d5b"
E0312 18:04:25.286152     109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4=
 >
E0312 18:04:25.287353     109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4=
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="b6c792ad-5519-48d5-a994-18dda76d8a93"
E0312 18:04:25.291383     109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g==
 >
E0312 18:04:25.292132     109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g==
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-1" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-1" reconcileID="92e1f8ed-b31f-4f75-9083-59aad15efe79"
E0312 18:04:25.679859     109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg=
 >
E0312 18:04:25.680663     109 controller.go:329] "Reconciler error" err=<
    deleting bootstrap data object: deleting S3 object: NotFound: Not Found
        status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg=
 > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="9e436c67-aca0-409c-9179-0ce4cccce9ad"

This happens even though we are not creating S3 buckets for the master nodes, and it is preventing the bootstrap process from finishing.

 

iam role is correctly attached to control plane node when installconfig.controlPlane.platform.aws.iamRole is specified
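
A hedged install-config.yaml fragment for this criterion (the role name is a placeholder):

controlPlane:
  name: master
  platform:
    aws:
      iamRole: my-preexisting-master-role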

The scheme check[1] in the LB reconciliation is hardcoded to check only the primary Load Balancer, so subnets are always filtered based on the primary's scheme, ignoring additional Load Balancers ("SecondaryControlPlaneLoadBalancer").

How to reproduce:

  • Create a cluster w/ secondaryLoadBalancer w/ internet-facing
  • Check the subnets for the secondary load balancers: "ext" (public) API Load Balancer's subnets

Actual results:

  • Private subnets attached to the SecondaryControlPlaneLoadBalancer

Expected results:

  • Public subnets attached to the SecondaryControlPlaneLoadBalancer

 

References:

 

User Story:

As a (user persona), I want to be able to:

  • Deploy AWS cluster with IPI with minimum customizations without terraform

Acceptance Criteria:

Description of criteria:

  • Install complete, e2e pass
  • Production-ready
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

  • Through this epic, we will update our CI to use a UPI workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of terraform in our deployments.

Why is this important?

  • There is an active initiative in openshift to remove terraform from the openshift installer.

Acceptance Criteria

  • All tasks within the epic are completed.

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

As a multiarch CI-focused engineer, I want to create a workflow in `openshift/release` that will enable creating the backend nodes for a cluster installation.

The customer has escalated the following issues where ports don't have TLS support. This feature request lists all the component ports raised by the customer.

Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit

https://access.redhat.com/solutions/5437491

Feature Overview (aka. Goal Summary)

Enhance the Dynamic plugin with capabilities similar to the Static page. Add new control and security related enhancements to the Static page.

Goals (aka. expected user outcomes)

The Dynamic plugin should list pipelines similar to the current static page.

The Static page should allow users to override task and sidecar task parameters.

The Static page should allow users to control tasks that are setup for manual approval.

The TSSC security and compliance policies should be visible in Dynamic plugin.

Requirements (aka. Acceptance Criteria):

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

As a DevOps Engineer, I want to add manual approval points in my pipeline so that the pipeline pauses at that point and waits for a manual approval before continuing execution. Manual approvals are commonly used for approving deploying to production or modeling activities that are not automated (e.g. manual testing) in the pipeline.

Acceptance Criteria

  • The user defines a manual approval point in their pipeline (see the sketch after this list)
  • PipelineRuns for the user's pipeline pause at the point where manual approval is defined and wait for approval
  • Once the user approves the PipelineRun, the PipelineRun continues execution
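
A minimal sketch of what such a pipeline could look like, assuming the manual approval gate is referenced as a Tekton custom task; the apiVersion, kind, and params shown here are assumptions for illustration, not the confirmed schema:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-and-deploy
spec:
  tasks:
    - name: build
      taskRef:
        name: build-image            # regular Task (placeholder)
    - name: wait-for-approval        # pauses the PipelineRun until approved or rejected
      runAfter:
        - build
      taskRef:
        apiVersion: openshift-pipelines.org/v1alpha1   # assumed group/version for the approval custom task
        kind: ApprovalTask
      params:
        - name: approvers            # assumed parameter name
          value:
            - alice
            - bob
    - name: deploy
      runAfter:
        - wait-for-approval
      taskRef:
        name: deploy-app             # placeholder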

Description

As a user, I want a list of all approvals needed for all my pipeline runs. From this page, I can approve or reject if I am an approver for the pipelines.

Acceptance Criteria

  1. Create a new tab named "Approvals" alongside the existing PL, PLR, and Repo tabs on the Pipelines page.
  2. The new tab should render a list table with Search and Filter options. Filtering will be done based on the status of approvals.
  3. The approvals page should list all the PLRs in that ns that require approval.
  4. Create the Approve/Reject Modals 
  5. Find the way to fetch the current user and check whether the current user is one of the approvers.
  6. Approve/Reject the ApprovalTask with the Modal if the current user is listed as one of the approvers.

Additional Details:

Description

As a developer, I want to remove the feature of introducing the ApprovalTasks into the Developer Console from the Console Repository as it will be shipped as a Dynamic plugin.

Acceptance Criteria

  1. Delete all the files related to the Approvals feature
  2. Ensure the CustomRunNode remains there.
  3. Fix the Node for the Embedded Pipelines

Additional Details:

Description

As a user, I need to properly distinguish between a classic Task and a CustomTask in the Pipeline Topology once I have created a Pipeline using the YAML view.

Also, I need to see the details of the CustomTask on hovering over the node.

Acceptance Criteria

  1. Design the new CustomTask node
  2. Add the new icons to the console and use them as the Status Icons

Additional Details:

Description of problem:

Update the Pipeline/PipelineRun List and Details Pages to acknowledge Custom Task 

Version-Release number of selected component (if applicable):

4.16.0    

How reproducible:

Always when Custom Task is used

Steps to Reproduce:

    1.Create a Pipeline with a CustomTask like the Approval Task
    2.Check the Tasks List in the Pipeline/Run Details Page
    3.Check the Progress Bar in the PipelineRun List page
    

Actual results:

The CustomTask is not recognized: the page either throws undefined or always shows the Task as pending.

Expected results:

Just like normal Tasks, Custom Tasks should be fully integrated into the mentioned pages.

Additional info:

    

Description

As a developer, you need to look into the recent changes proposed by the UX and apply those changes.

Acceptance Criteria

  1. Update the columns of the Approval List table as per the new design
  2. Add the Approvals Page to the Admin Perspective
  3. Add the new sections in the Approval Details page as per the design
  4. Resolve any other NIT comments made on the previous PR.

Additional Details:

Problem:

With the OpenShift Pipelines operator 1.2x we added support for a dynamic console plugin to the operator. In the first version it is only responsible for the Dashboard and the Pipeline/Repository Metrics tab. We want to move more and more code to the dynamic plugin and later remove it from the console repository.

Goal:

Non-goal

  • We don't plan to change the "Import from Git" flow as part of this Epic.
  • We don't want to remove the code from the console repository yet, so that the console still supports the Pipeline feature with an older Pipelines Operator (one that doesn't have the Pipeline Plugin).

Why is it important?

  • We want to migrate to dynamic plugins so that it's easier to align changes in the console with changes in the operator.
  • In the mid-term this will also reduce the code size and build times of the console project.

Acceptance criteria:

  • The console support for Pipelines (Tekton resources) should still work without and with the new dynamic Pipelines Plugin
  • The following detail pages should be copied* to the dynamic plugin
    1. Pipeline detail page (support for v1beta1 and v1)
    2. PipelineRun detail page (support for v1beta1 and v1)
    3. ClusterTask detail page (support for v1beta1 and v1)
    4. Task detail page (support for v1beta1 and v1)
    5. TaskRun detail page (support for v1beta1 and v1)
    6. EventListener (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
    7. ClusterTriggerBinding (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
    8. TriggerBinding (support only v1alpha1 in the static plugin, should support v1beta1 in the dynamic plugin as well)
  • The following list pages should be copied to the dynamic plugin:
    1. Pipeline
    2. PipelineRun
    3. ClusterTask
    4. Task
    5. TaskRun
  • We don't want to copy these deprecated resources:
    1. PipelineResource
    2. Condition
  • E2e tests for all list pages should run on the new dynamic pipeline plugin

Dependencies (External/Internal):

  • Changes will happen in the new dynamic pipeline-plugin github.com/openshift-pipelines/console-plugin, but we don't see any external dependencies

Design Artifacts:

Exploration:

Note:

Description

As a user,

Acceptance Criteria

  1. Create a flag for list pages to hide them in the console if they are present in the dynamic plugin

Additional Details:

Description of problem:

    Multiple Output tabs are present on the PipelineRun details page if the dynamic Pipelines console plugin is enabled.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Feature Overview (aka. Goal Summary)

Enable developers to perform actions in a faster and better way than before.

Goals (aka. expected user outcomes)

Developers will be able to reduce the time or clicks spent on the UI to perform specific set of actions.

Requirements (aka. Acceptance Criteria):

Search->Resources option should show 5 of the recently searched resources across all sessions by the user.

The recently searched resources should be clearly visible and separated from the rest.

Pinning resources capability should be removed.

Getting Started menu on Add page can be collapsed and expanded

Use Cases (Optional):

 

Questions to Answer (Optional):

 

Out of Scope

 

Background

 

Customer Considerations

 

Documentation Considerations

 

Interoperability Considerations

__

Problem:

The Getting Started menu on the Add page cannot be restored after users click the "X" symbol to hide it.

Goal:

Users should be able to collapse and expand the Getting Started menu on Add Page.

Why is it important?

The current behavior causes confusion.

Use cases:

  1. <case>

Acceptance criteria:

  1. The Getting Started menu on Add page can be collapsed by clicking on the screen.
  2. The collapsed Getting Started page can be expanded by clicking on the screen.
  3. When collapsed, it is visibly available for someone to expand again.

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As of now, the Getting Started section can be hidden and shown again using a button, but users can also dismiss that button and it will not come back, which is confusing. Replace the hide/show button with an expandable section, similar to the Functions list page.

Acceptance Criteria

  1. Update getting Started section to use expandable section in Add page
  2. Update getting Started section to use expandable section in Cluster tab in overview page in Admin perspective
  3. Updated the test cases

Additional Details:

Check Functions list page in Dev perspective for the design

Problem:

Developers have to repeatedly search for resources and pin them separately in order to view details of a resource that they have seen in the past.

Goal:

Provide developers with the ability to see the last 5 resources that they have seen in the past so they can quickly view their details without any further actions.

Why is it important?

Provides a better user experience for developers when using the console.

Use cases:

  1. <case>

Acceptance criteria:

  1. Same as listed in OCPSTRAT-1024

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want the console to remember the resources I have recently searched so that I don't have to type the names of the same resources I use frequently in the Search Page.

Acceptance Criteria

  1. Modify the Search dropdown to add a new section to display the recently searched items in the order they were searched.

Additional Details:

Phase 2 Deliverable:

GA support for a generic interface for administrators to define custom reboot/drain suppression rules. 

Epic Goal

  • Allow administrators to define which machineconfigs won't cause a drain and/or reboot.
  • Allow administrators to define which ImageContentSourcePolicy/ImageTagMirrorSet/ImageDigestMirrorSet won't cause a drain and/or reboot
  • Allow administrators to define alternate actions (typically restarting a system daemon) to take instead.
  • Possibly (pending discussion) add a switch that allows the administrator to choose a kexec "restart" instead of a full hardware reset via reboot.

Why is this important?

  • There is a demonstrated need from customer cluster administrators to push configuration settings and restart system services without restarting each node in the cluster. 
  • Customers are modifying or adding ICSP/ITMS/IDMS objects post day 1.
  • (kexec - we are not committed on this point yet) Server class hardware with various add-in cards can take 10 minutes or longer in BIOS/POST. Skipping this step would dramatically speed-up bare metal rollouts to the point that upgrades would proceed about as fast as cloud deployments. The downside is potential problems with hardware and driver support, in-flight DMA operations, and other unexpected behavior. OEMs and ODMs may or may not support their customers with this path.

Scenarios

  1. As a cluster admin, I want to reconfigure sudo without disrupting workloads.
  2. As a cluster admin, I want to update or reconfigure sshd and reload the service without disrupting workloads (see the hedged sketch after this list).
  3. As a cluster admin, I want to remove mirroring rules from an ICSP, ITMS, or IDMS object without disrupting workloads, because the scenario in which this might lead to non-pullable images at an undefined later point in time doesn't apply to me.
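
A hedged sketch of what such a rule could look like for scenario 2, assuming the GA API lands as a nodeDisruptionPolicy section on the cluster MachineConfiguration object; the field names below are assumptions for illustration only:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:                    # assumed field name
    files:
      - path: /etc/ssh/sshd_config.d/99-custom.conf
        actions:
          - type: Reload                   # reload a systemd unit instead of draining/rebooting
            reload:
              serviceName: sshd.service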

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Follow up epic to https://issues.redhat.com/browse/MCO-507, aiming to graduate the feature from tech preview and GA'ing the functionality.

This status was added as a legacy field and isn't currently used for anything, nor should it be there. We'd like to remove it so that:

  1. we get rid of unused fields
  2. we can use the proper API condition types for statuses

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.

This feature will be used to track all the CAPI preparation work that is common for all the supported providers

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Day 0 Cluster Provisioning
  • Compatibility with existing workflows that do not require a container runtime on the host

Why is this important?

  • This epic would maintain compatibility with existing customer workflows that do not have access to a management cluster and do not have the dependency of a container runtime

Scenarios

  1. openshift-install running in customer automation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Right now the CAPI providers will run indefinitely. We need to stop the installer if installs fail, based on either a timeout or more sophisticated analysis.

Fit provisioning via the CAPI system into the infrastructure.Provider interface

so that:

  • code path to selecting an infrastructure provider is simplified
  • enabling/disabling infrastructure providers via feature gates is centralized in platform package

User Story:

 I want hack/build.sh to embed the kube-apiserver and etcd dependencies in openshift-install without making external network calls so that ART/OSBS can build the installer with CAPI dependencies.

Acceptance Criteria:

Description of criteria:

  • dependencies are not obtained over the internet
  • gated by OPENSHIFT_INSTALL_CLUSTER_API env var
  • should work when building for various architectures

(optional) Out of Scope:

Engineering Details:

  • Currently the dependencies are obtained through the sync_envtest function in build-cluster-api.sh
  • Cluster API provider dependencies are vendored and built here

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Extract the needed binaries during the Installer container image build and copy them to an appropriate location so they can be used by CAPI.

Write CAPI manifests to disk during create manifests so that they can be user edited and users can also provide their own set of manifests. In general, we think of manifests as an escape hatch that should be used when a feature is missing from the install config, and users accept the degraded user experience of editing manifests in order to achieve non-install-config-supported functionality.

Acceptance criteria:

Manifests should be generated correctly (and applied correctly to the control plane):

  • when create manifests is run before create cluster
  • only when create cluster is run

 

There is some WIP for this, but there are issues with the serialization/deserialization flow when writing the GVK in the manifests.

User Story:

As a CAPI install user, I want to be able to:

  • gather bootstrap logs
    • automatically after bootstrap failure during install
    • when running `gather bootstrap` command

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Define CAPI provider interface for grabbing IP addresses
  • Handle loading manifests from disk
  • Confirm that IP addresses are specified in manifests
  • Point 3

(optional) Out of Scope:

This is intended to be platform agnostic. If there is a common way for obtaining ip addresses from capi manifests, this should be sufficient. Otherwise, this should enable other platforms to implement their specific logic.

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

The 100.88.0.0/14 IPv4 subnet is currently reserved for the transit switch in OVN-Kubernetes for east west traffic in the OVN Interconnect architecture. We need to make this value configurable so that users can avoid conflicts with their local infrastructure.  We need to support this config both prior to installation and post installation (day 2).

This epic will include stories for the upstream ovn-org work, getting that work downstream, an api change, and a cno change to consume the new api

The scope of this card is to track the work to get the required pieces for the transit switch subnet into CNO, so that users can apply custom configurations to the transit switch subnet on both day 0 (install) and day 2 (post-install).

This card will complement https://issues.redhat.com/browse/SDN-4156 

You can create the cluster-bot cluster with Ben's PR and do CNO changes locally and test them out.
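
A hedged sketch of the day-2 configuration, assuming the new knob is exposed under ovnKubernetesConfig on the cluster Network operator config (the exact field name, shown here as internalTransitSwitchSubnet, and the subnet value are assumptions):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      ipv4:
        internalTransitSwitchSubnet: 100.99.0.0/16   # assumed field name; overrides the 100.88.0.0/14 default

For day 0, the same stanza could presumably be supplied via a cluster-network-03-config.yml manifest generated before create cluster.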

Feature Overview

Allow setting custom tags to machines created during the installation of an OpenShift cluster on vSphere.

Why is this important

Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.

Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Ensuring appropriate tagging is added to all OCP nodes in VMware ensures those troubleshooting, reporting or auditing can easily identify and filter Openshift node VMs.

For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.
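
A hedged sketch of how this might surface through the Machine API, assuming the vSphere providerSpec gains a tagIDs list (the field name and tag URN value are assumptions/placeholders):

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-worker-0
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: VSphereMachineProviderSpec
          tagIDs:                                              # assumed field name
            - urn:vmomi:InventoryServiceTag:<tag-uuid>:GLOBAL  # pre-created vSphere tag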

Epic Goal

  • Customers need to assign additional tags to machines that are reconciled via the machine API.

Why is this important?

  • Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.

Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Ensuring appropriate tagging is added to all OCP nodes in VMware ensures those troubleshooting, reporting or auditing can easily identify and filter Openshift node VMs.

For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Add ControlPlaneMachineSet for vSphere

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Promote vSphere control plane machinesets from tech preview to GA

Why is this important?

Scenarios

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Promotion PRs collectively pass payload testing

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

As an OpenShift administrator, when I run must-gather on a large cluster the archive tends to be gigabytes in size, which fills up my master node. I want the ability to define the timestamp or duration around when the problem happened so that we can gather targeted logs that take less space.

Goals (aka. expected user outcomes)

have "--since and --until " option in mustgather 

 

Epic Goal*

As an OpenShift administrator, when I run must-gather on a large cluster the archive tends to be gigabytes in size, which fills up my master node. I want the ability to define the timestamp or duration around when the problem happened so that we can gather targeted logs that take less space.

 
Why is this important? (mandatory)
Reduce must-gather size

 
Scenarios (mandatory) 

Must-gather running over a cluster with many logs can produce tens of GBs of data in cases where only a few MBs are needed. Such a huge must-gather archive takes too long to collect and too long to upload, which makes the process impractical.
 
Dependencies (internal and external) (mandatory)

The default must-gather images need to implement this functionality. Custom images will be asked to implement the same.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - workloads team
  • Documentation - docs team
  • QE - workloads QE team
  • PX - 
  • Others -

Acceptance Criteria (optional)

Must-gather contains only logs from the requested time interval

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

rotated-pod-logs does not interact with since/since-time. This causes inspect (and must-gather) outputs to considerably increase in size, even when users attempt to filter it by using time constraints.

Suggested solution:

  • parse the date from the log filename and only gather the logs when they are within the provided since or since-time flags

Acceptance criteria:

  • The `oc adm inspect` --since and --since-time flags work as expected even when used in conjunction with --rotated-pod-logs

Must-gather currently does not allow customers to limit the amount of data collected, which can lead to collecting tens of GBs even when only a limited set of data (e.g. from the last 2 days) is required, in some cases ending with a master node down.

Suggested solution:

  • pass envs such as MUST_GATHER_SINCE_TIME (Only return logs after a specific date (RFC3339)) or MUST_GATHER_SINCE (Only return logs newer than a relative duration like 5s, 2m, or 3h).

Acceptance criteria:

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest in using OpenShift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider, which is in Tech Preview in OpenShift 4.15 and earlier. Going forward, we are recommending installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.
  • There will be no upgrades from OpenShift 4.14 / OpenShift 4.15 to OpenShift 4.16.
  • For OpenShift 4.16, we will instead offer customers the ability to install OpenShift on Alibaba Cloud with the agnostic platform (platform=none) using the Assisted Installer as Tech Preview - see OCPSTRAT-1149 for details. Note: we cannot remove IPI (Tech Preview) install method until we provide the Assisted Installer (Tech Preview) solution.

 

Impacted areas based on CI:

  • alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
  • alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
  • cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
  • cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
  • cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
  • machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  1. Update cloud.redhat.com (Hybrid Cloud Console) to remove Alibaba support.
  2. Update/removal from Alibaba IPI installer code.
  3. Update release notes with removal information and no updates/upgrades offered from OpenShift 4.14 or OpenShift 4.15.
  4. Remove Alibaba installation instructions from Openshift documentation including updating OpenShift Container Platform 4.x Tested Integrations.

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Self-managed
Classic (standalone cluster) Classic (standalone)
Hosted control planes N/A
Multi node, Compact (three node), or Single node (SNO), or all All
Connected / Restricted Network All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All
Operator compatibility N/A
Backport needed (list applicable versions) N/A
Other (please specify) N/A

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • Remove any Alibaba-related operator
  • Remove Alibaba-related driver
  • Remove Alibaba related code from CSO, CCM, and other repos
  • Remove container images
  • Remove related CI jobs
  • Remove documentation
  • Stop building images in ART pipeline
  • Archive github repositories

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we are recommending installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest in using OpenShift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Per OCPSTRAT-1042, we are removing Alicloud IPI/UPI support in 4.16 and removing code in 4.16. This epic track the necessary actions to remove the Alicloud Disk CSI driver 

 
Why is this important? (mandatory)

Since we are removing Alicloud as a supported provider, we need to clean up all the storage artefacts related to Alicloud.

  

Dependencies (internal and external) (mandatory)

Alicloud IPI/UPI support removal must be confirmed. IPI/UPI code should be removed.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • Remove operator
  • Remove driver
  • Remove Alibaba related code from CSO.
  • Remove container images
  • Remove related CI jobs
  • Remove documentation
  • Stop building images in ART pipeline
  • Archive github repositories

Feature Overview (aka. Goal Summary)

Add support for snapshots into kubevirt-csi when the underlying infra csi driver supports snapshots.

Goals (aka. expected user outcomes)

  • Users of kubevirt-csi should be able to use csi snapshots within their HCP Kubevirt cluster if the underlying infra storageclass mapped to kubevirt-csi supports snapshots.

Requirements (aka. Acceptance Criteria):

  • snapshot support (see the hedged sketch after this list)
  • downstream ci
  • documentation
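
A minimal sketch of taking a snapshot from the guest cluster once this lands; the VolumeSnapshotClass and PVC names are placeholders, and the example assumes the kubevirt-csi StorageClass maps to an infra StorageClass whose driver supports snapshots:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-snapshot
spec:
  volumeSnapshotClassName: kubevirt-csi-snapclass   # placeholder class backed by kubevirt-csi
  source:
    persistentVolumeClaimName: my-data-pvc          # PVC provisioned by kubevirt-csi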

 

Goal

Add support for snapshots into kubevirt-csi when the underlying infra csi driver supports snapshots.

User Stories

  • As an HCP KubeVirt cluster admin, I would like to be able to create snapshots if the underlying infra cluster CSI driver mapped to my kubevirt-csi StorageClass supports snapshots.

Non-Requirements

  • Scale testing should be handled in the future after this lands as part of the ongoing HCP scale testing effort.

Notes

  • Any additional details or decisions made/needed

Done Checklist

Who What Reference
DEV Upstream roadmap issue (or individual upstream PRs) <link to GitHub Issue>
DEV Upstream documentation merged <link to meaningful PR>
DEV gap doc updated <name sheet and cell>
DEV Upgrade consideration <link to upgrade-related test or design doc>
DEV CEE/PX summary presentation label epic with cee-training and add a <link to your support-facing preso>
QE Test plans in Polarion <link or reference to Polarion>
QE Automated tests merged <link or reference to automated tests>
DOC Downstream documentation merged <link to meaningful PR>

This task is to add infra snapshot support into upstream kubevirt-csi driver.

 

This should include unit and functional testing upstream.

This is a new section in the "Configuring storage for HCP OpenShift Virtualization" section of the ACM docs. https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#configuring-storage-kubevirt

 

There's a new feature landing in 4.16 that gives HCP Openshift Virtualization's kubevirt-csi component the ability to perform csi snapshots. We need this feature to be documented downstream.

The downstream documentation should follow closely to upstream hypershift documentation that is being introduced in this issue, https://issues.redhat.com/browse/CNV-36075 

Feature Overview (aka. Goal Summary)

Enable the OCP Console to send back user analytics to our existing endpoints in console.redhat.com. Please refer to doc for details of what we want to capture in the future:

Analytics Doc

Collect desired telemetry of user actions within OpenShift console to improve knowledge of user behavior.

Goals (aka. expected user outcomes)

OpenShift console should be able to send telemetry to a pre-configured Red Hat proxy that can be forwarded to 3rd party services for analysis.

Requirements (aka. Acceptance Criteria):

User analytics should respect the existing telemetry mechanism used to disable data being sent back

Need to update existing documentation with the user data we track from the OCP Console: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/about-remote-health-monitoring.html

Capture and send desired user analytics from OpenShift console to Red Hat proxy

Red Hat proxy to forward telemetry events to appropriate Segment workspace and Amplitude destination

Use existing setting to opt out of sending telemetry: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#opting-out-remote-health-reporting

Also, allow disabling just user analytics without affecting the rest of telemetry: add an annotation to the Console to disable only user analytics.

Update docs to show this method as well.
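
A hedged sketch of the annotation-based opt-out on the console operator config (the annotation key matches the one checked later in this document; treating "true" as the value is an assumption):

apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
  annotations:
    telemetry.console.openshift.io/DISABLED: "true"   # disables console user analytics only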

We will require a mechanism to store all the segment values
We need to be able to pass back orgID that we receive from the OCM subscription API call

Use Cases (Optional):

 

Questions to Answer (Optional):

 

Out of Scope

Sending telemetry from OpenShift cluster nodes

Background

Console already has support for sending analytics to segment.io in Dev Sandbox and OSD environments. We should reuse this existing capability, but default to http://console.redhat.com/connections/api for analytics and http://console.redhat.com/connections/cdn to load the JavaScript in other environments. We must continue to allow Dev Sandbox and OSD clusters a way to configure their own segment key, whether telemetry is enabled, segment API host, and other options currently set as annotations on the console operator configuration resource.

Console will need a way to determine the org-id to send with telemetry events. Likely the console operator will need to read this from the cluster pull secret.

Customer Considerations

 

Documentation Considerations

 

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a user, I want to opt-in or opt-out from the telemetry.

Whether the cluster prefers opt-in or opt-out should be configurable via SERVER_FLAGS.

Acceptance Criteria

  1. Prompt user to opt in to sending telemetry to Red Hat
  2. Prompt should include link to official RH doc about data collection, privacy, and processing.
  3. Provide settings in user preferences to opt out of sending telemetry
  4. Ensure that NO data is sent to the server before the user opts in or after they opt out.

Additional Details:

Based on a discussion with Ali Mobrem and Parag Dave yesterday, we want to hide the analytics option from the cluster configuration.

We keep the option for a cluster admin to activate the opt-in or opt-out banner in the UI.

For clarification: This is an undocumented feature that we keep for customers that might request such a feature.

Meeting notes: https://docs.google.com/document/d/11gxr_7kxMqm1zSJC5pPVHZhvLJthZGLH-QvqCqAyFdc/edit

Problem:

The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis. 

Goal:

Update console telemetry plugin to send data to the appropriate ingress point.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Update the URL to the Ingress point created for console.redhat.com
  2. Ensure telemetry data is flowing to the ingress point.

Dependencies (External/Internal):

Ingress point created for console.redhat.com

Design Artifacts:

Exploration:

Note:

We want to enable segment analytics by default on all (incl. self-managed) OCP clusters using a known segment API key and the console.redhat.com proxy. We'll still want to allow to honor the segment-related annotations on the console operator config for overriding these values.

Most likely the segment key should be defaulted in the console operator, otherwise we would need a separate console flag for disabling analytics. If the operator provides the key, then the console backend can use the presence of the key to determine when to enable analytics. We can likely change the segment URL and CDN default values directly in the console code, however.

ODC-7517 tracks disabling segment analytics when cluster telemetry is disabled, which is a separate change, but required for this work.

OpenShift UI Telemetry Implementation details
 
These three keys should have new default values:

  1. SEGMENT_API_KEY
  2. SEGMENT_API_HOST
  3. SEGMENT_JS_HOST OR SEGMENT_JS_URL

See https://github.com/openshift/console/blob/master/frontend/packages/console-telemetry-plugin/src/listeners/segment.ts

Defaults:

stringData:
  SEGMENT_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU
  SEGMENT_API_HOST: console.redhat.com/connections/api/v1
  SEGMENT_JS_HOST: console.redhat.com/connections/cdn

AC:

  • Add default values for telemetry annotations into a CM in the openshift-console-operator NS.
  • Add and update unit tests and add e2e tests

The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.

For that the telemetry-console-plugin must have options to configure where it loads the analytics.js and where to send the API calls (analytics events).

Description

As an administrator, I want to disable all telemetry on my cluster including UI analytics sent to segment.

We should honor the existing telemetry configuration so that we send no analytics when an admin opts out of telemetry. See the documentation here:

https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#insights-operator-new-pull-secret_opting-out-remote-health-reporting

Simon Pasquier

 yes this is the official supported way to disable telemetry though we also have a hidden flag in the CMO configmap that CI clusters use to disable telemetry (it depends if you want to push analytics for CI clusters).
CMO configmap is set to
data:
  config.yaml: |-
    telemeterClient:
      enabled: false
 

the CMO code that reads the cloud.openshift.com token:
https://github.com/openshift/cluster-monitoring-operator/blob/b7e3f50875f2bb1fed912b23fb80a101d3a786c0/pkg/manifests/config.go#L358-L386

Acceptance Criteria

  • No segment events are sent when
    1. Cluster is a CI cluster, which means at least one of these 2 conditions are met:
      • Cluster pull secret does not have "cloud.openshift.com" credentials [1]
      • Cluster monitoring config has 'telemeterClient: {enabled: false}' [2]
    2. Console operator config telemetry disabled annotation == true [3]
  • Add and update unit tests and also e2e

Additional Details:

Slack discussion https://redhat-internal.slack.com/archives/C0VMT03S5/p1707753976034809

 

# [1] Check cluster pull secret for cloud.openshift.com creds
oc get secret pull-secret -n openshift-config -o json | jq -r '.data.".dockerconfigjson"' | base64 -d | jq -r '.auths."cloud.openshift.com"'

# [2] Check cluster monitoring operator config for 'telemeterClient.enabled == false'
oc get configmap cluster-monitoring-config -n openshift-monitoring -o json | jq -r '.data."config.yaml"'

# [3] Check console operator config telemetry disabled annotation 
oc get console.operator.openshift.io cluster -o json | jq -r '.metadata.annotations."telemetry.console.openshift.io/DISABLED"' 

Epic Goal

  • Console should receive “organization.external_id”  from the OCM Subscription API call.
  • We need to store the ORG ID locally and pass it back to Segment.io to enhance the user analytics we are tracking

 

API the Console uses:  
const apiUrl = `/api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%27${clusterID}%27`;

Reference: Original Console PR

Why is this important?

High Level Feature Details can be found here

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently, we can get the organization ID from the OCM server by querying the subscription and adding the fetchOrganization=true query parameter, based on the comment.

We should be passing this ID as SERVER_FLAG.telemetry.ORGANIZATION_ID to the frontend, and as organizationId to Segment.io

Fetching should be done by the console-operator due to its RBAC permissions. Once the Organization_ID is retrieved, the console operator should set it in console-config.yaml, together with the other telemetry variables.
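
A hypothetical sketch of the resulting console-config.yaml fragment; the key placement is an assumption based on the SERVER_FLAG.telemetry reference above, and the ID value is illustrative:

telemetry:
  ORGANIZATION_ID: 1a2b3c4d                           # assumed key; exposed to the frontend as SERVER_FLAG.telemetry.ORGANIZATION_ID
  SEGMENT_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU   # existing default from the listeners config above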

 

AC: 

  • Update console-operator's main controller to check if the telemeter client is available on the cluster, which signals that it is a customer/production cluster
  • Consume the telemetry parameter ORG_ID and pass it as a parameter to Segment in the console

Feature Overview (aka. Goal Summary)

Explore providing a consistent developer experience within OpenShift and Kubernetes via OpenShift console.

Goals (aka. expected user outcomes)

Enable running OpenShift console on Kubernetes clusters and explore the user experience that can be consistent across clusters.

Requirements (aka. Acceptance Criteria):

Install and run OpenShift console on Kubernetes

Enable selection of menu items in Dev and Admin perspective

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

Fully working OpenShift console on Kubernetes.

Background

 

Customer Considerations

Provide a simple way to install and access OpenShift console on Kubernetes.

Documentation Considerations

None

Interoperability Considerations

N/A

Goal:

Enable running OpenShift console on Kubernetes clusters and explore the user experience that can be consistent across clusters.

Explore providing a consistent developer experience within OpenShift and Kubernetes via OpenShift console.

Out of Scope: The OpenShift console is fully working on Kubernetes.

Acceptance criteria:

  1. Install and run the OpenShift console on Kubernetes (provide a helm or yaml setup)
  2. General console should work, and users should see the Dev and Admin perspective (fix obvious crashes and hide features that depend on OpenShift)
  3. Explore and extend (when possible) the setup without authentication with one that supports a user login
  4. List missing (hidden) or non-working features and dependencies
  5. Check how well the console works with just the upstream operators. Recommended order:
    1. Knative instead of OpenShift Serverless
    2. Tekton instead of OpenShift Pipelines
    3. Shipwright instead of OpenShift Builds
    4. Operator Lifecycle Manager (OLM)
    5. Web Terminal Operator
    6. Topology export with Primer operator

Description of problem:
When running the console against a non-OpenShift/non-OKD cluster, the admin perspective works mostly fine, but the developer perspective has a lot of small issues.

Version-Release number of selected component (if applicable):
Almost all versions

How reproducible:
Always

Steps to Reproduce:
You need a non-OpenShift cluster. You can test this with any other Kubernetes distribution. I used kubeadm to create a local cluster, but it took some time until I could start a Pod locally. (That's the precondition.)

You might want to test kind or k3s instead.

To run the console on your local machine against a non-OpenShift k8s cluster on your local machine you can use this script: https://github.com/jerolimov/openshift/commit/e6fe0924807017ff1320cfc8d82bde23c162eba3

Actual results:

  1. The Add page, Topology and Search weren't shown in the navigation. The Add page is opened anyway when switching to the developer perspective.
  2. The Topology page and Search already work fine and are available when changing the URL manually.
  3. The Add page quick search doesn't find anything.
  4. The Add page contains actions like Import from Git that don't work.
  5. The namespace dropdown says "Namespaces" multiple times, but one "Projects" label is left over.
  6. The masthead navigation contains a Quick Start item that shows just a "No Quick Starts found" page

Expected results:

  1. Add page, Topology and Search navigation items should be shown.
  2. The Add page should not show the quick search button if it doesn't find anything
  3. The Add page should not show any broken actions (fixing them would be better, but this might be more complex)
  4. The namespace dropdown should only show Namespace labels.
  5. The masthead navigation should not show a Quick Start item if the Quick Start CRD is not installed

Additional info:

Feature Overview (aka. Goal Summary)  

Graduate the new PV access mode ReadWriteOncePod as GA.

Such a PV/PVC can be used only by a single pod on a single node, compared to the traditional ReadWriteOnce access mode, where a PV/PVC can be used on a single node by many pods.

Goals (aka. expected user outcomes)

The customers can start using the new ReadWriteOncePod access mode.

This new mode allows customers to provision and attach a PV and get the guarantee that it cannot be attached to another pod on the same node.

 

Requirements (aka. Acceptance Criteria):

This new mode should support the same operations as regular ReadWriteOnce PVs therefore it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.

 

Use Cases (Optional):

As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.
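
A minimal sketch of such a claim; the storage class name is illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-pvc
spec:
  accessModes:
    - ReadWriteOncePod      # only one pod on one node may use this volume
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp3-csi  # illustrative; any CSI storage class that supports the mode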

Background

We are getting this feature from upstream as GA. We need to test it and fully support it.

Customer Considerations

 

Check that there are no limitations / regressions.

Documentation Considerations

Remove tech preview warning. No additional change.

 

Interoperability Considerations

N/A

Epic Goal

Support the upstream feature "New RWO access mode" in OCP as GA, i.e. test it and have docs for it.

This is a continuation of STOR-1171 (Beta/Tech Preview in 4.14); now we just need to mark it as GA and remove all TechPreview notes from the docs.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. External: the feature is currently scheduled for GA in Kubernetes 1.29, i.e. OCP 4.16, but it may change before Kubernetes 1.29 GA.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set.

Users need to be able to update the BMC entries in the BareMetalHost objects after they have been created.

Use Cases

There are at least two scenarios in which this causes problems:

1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.

2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.

Currently it's not possible to modify the BMC address of a BareMetalHost after it has been set. It's understood that this was an initial design choice, and there's a webhook that prevents any modification of the object once it has been set for the first time. There are at least two scenarios in which this causes problems:

1. When deploying baremetal clusters with the Assisted Installer, Metal3 is deployed and BareMetalHost objects are created but with empty BMC entries. We can modify these BMC entries after the cluster has come up to bring the nodes "under management", but if you make a mistake like I did with the URI then it's not possible to fix that mistake.

2. BMC addresses may change over time - whilst it may appear unlikely, network readdressing, or changes in DNS names may mean that BMC addresses will need to be updated.

Thanks!

Currently, the baremetal operator does not allow the BMC address of a node to be updated after BMH creation. This ability needs to be added in BMO.
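
For reference, the BMC details live under spec.bmc on the BareMetalHost object; a minimal sketch with illustrative values (this feature is about allowing spec.bmc.address to be edited after creation):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  bmc:
    address: redfish-virtualmedia://192.168.111.1/redfish/v1/Systems/1   # field that should become editable
    credentialsName: worker-0-bmc-secret
  bootMACAddress: 00:11:22:33:44:55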

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as tech preview: provide docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.

Requirements (aka. Acceptance Criteria):

The feature should graduate to beta upstream to become TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

 

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs.
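
A sketch of the upstream alpha API from the linked KEP/blog; the names and class are illustrative:

apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: app-group-snapshot
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass   # illustrative class name
  source:
    selector:
      matchLabels:
        app: my-app   # all PVCs carrying this label are snapshotted together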

Out of Scope

CSI drivers development/support of this feature.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.

Epic Goal*

Create an OCP feature gate that allows customers and partners to use the VolumeGroupSnapshot feature while the feature is in alpha & beta upstream.

 
Why is this important? (mandatory)

Volume group snapshot is an important feature for ODF, OCP virt and backup partners. It requires driver support so partners need early access to the feature to confirm their driver works as expected before GA. The same applies to backup partners.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.

 
Dependencies (internal and external) (mandatory)

This depends on driver support; the feature gate will enable it in the drivers that support it (OCP-shipped drivers).

The feature gate should

  • Configure the snapshotter to start with the right parameter to enable VolumeGroupSnapshot 
  • Create the necessary CRDs
  • Configure the OCP shipped CSI driver
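
A sketch of enabling such a gate via the FeatureGate CR, assuming it is exposed under CustomNoUpgrade and named VolumeGroupSnapshot (the exact gate name is an assumption):

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
      - VolumeGroupSnapshot   # assumed gate name; note that enabling CustomNoUpgrade cannot be undone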

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - N/A
  • QE - STOR
  • PX - 
  • Others -

Acceptance Criteria (optional)

By enabling the feature gate, partners should be able to use the VolumeGroupSnapshot API. Non-OCP-shipped drivers may need to be configured.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

From apiserver audit logs:

customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshotcontents.groupsnapshot.storage.k8s.io" not found
customresourcedefinitions.apiextensions.k8s.io "volumegroupsnapshots.groupsnapshot.storage.k8s.io" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrolebinding" not found
clusterrolebindings.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrolebinding" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-controller-clusterrole" not found
clusterroles.rbac.authorization.k8s.io "csi-snapshot-webhook-clusterrole" not found

Feature Overview (aka. Goal Summary)

 

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

Goals (aka. expected user outcomes)

Customers can override the default (three) value and set it to a custom value.

Make sure we document (or link) the VMWare recommendations in terms of performances.

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

https://kb.vmware.com/s/article/1025279

Requirements (aka. Acceptance Criteria):

The setting can be easily configured by the OCP admin and the configuration is automatically applied. Test that the setting is indeed applied and the maximum number of snapshots per volume is indeed changed.

No change in the default

Use Cases (Optional):

As an OCP admin I would like to change the maximum number of snapshots per volume.

Out of Scope

Anything outside of 

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

Background

The default value can't be overridden; reconciliation prevents it.

Customer Considerations

Make sure the customers understand the impact of increasing the number of snapshots per volume.

https://kb.vmware.com/s/article/1025279

Documentation Considerations

Document how to change the value as well as a link to the best practice. Mention that there is a 32 hard limit. Document other limitations if any.

Interoperability Considerations

N/A

Epic Goal*

The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.

Possible future candidates:

  • configure EFS volume size monitoring (via driver cmdline arg.) - STOR-1422
  • configure OpenStack topology - RFE-11

 
Why is this important? (mandatory)

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

https://kb.vmware.com/s/article/1025279

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to configure the maximum number of snapshots per volume.
  2. As a user I would like to create more than 3 snapshots per volume

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1759)

2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)

3) Update vSphere operator to use the new snapshot options (STOR-1804)

4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)

  • prerequisite: add e2e test and demonstrate stability in CI (STOR-1838)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Enablement
  • Others -

Acceptance Criteria (optional)

Configure the maximum number of snapshots to a higher value. Check that the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
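
A sketch of what the opt-in could look like on the ClusterCSIDriver object; the driverConfig field name below (globalMaxSnapshotsPerBlockVolume) is an assumption based on this epic's API extension work, not a confirmed final API:

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  driverConfig:
    driverType: vSphere
    vSphere:
      globalMaxSnapshotsPerBlockVolume: 10   # assumed field name; must stay within the 32 hard limit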

Drawbacks or Risk (optional)

Setting this to a high value can introduce performance issues. This needs to be documented.

https://kb.vmware.com/s/article/1025279

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)

 

To support volume provisioning and usage in multi-zonal clusters, the deployment should match certain requirements imposed by CSI driver - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-4E5F8F65-8845-44EE-9485-426186A5E546.html

The requirements have slightly changed in 3.1.0 - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-162E7582-723B-4A0F-A937-3ACE82EAFD31.html

We need to ensure that the cluster is compliant with the topology requirements and, if not, vsphere-problem-detector should detect the invalid configuration and create warnings and alerts.

A patch to the vSphere CSI driver was added to accept both the old and new tagging methods in order to avoid regressions. A warning is thrown if the old way is used.

Goals (aka. expected user outcomes)

Ensure that if the customer's configuration is not compliant, OCP raises a warning. The goal is to validate the customer's config. Improve VPD to detect these misconfigurations.

Ensure the driver keeps working with the old configuration. Raise a warning if customers are still using the old tagging way.

Requirements (aka. Acceptance Criteria):

This feature should be able to detect any anomalies when customers are configuring vSphere topology.

Driver should work with both new and old way to define zones

Out of Scope

This epic is not about testing topology which is already supported.

Background

The vSphere CSI driver changed the way tags are applied to nodes and clusters. This feature ensures that the customer's config matches what is expected by the driver.

Customer Considerations

This will help customers get the guarantee that their configuration is compliant, especially for those who are used to the old way of configuring topology.

Documentation Considerations

Update the topology documentation to match the new driver requirements.

Interoperability Considerations

OCP on vSphere

Epic Goal*

To support volume provisioning and usage in multi-zonal clusters, the deployment should match certain requirements imposed by CSI driver - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-4E5F8F65-8845-44EE-9485-426186A5E546.html

The requirements have slightly changed in 3.1.0 - https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-162E7582-723B-4A0F-A937-3ACE82EAFD31.html

We need to ensure that the cluster is compliant with the topology requirements and, if not, vsphere-problem-detector should detect the invalid configuration and create warnings and alerts.

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

This is important because clusters could be misconfigured, and it will be tricky to detect whether volume provisioning is not working because of misconfiguration or because of some other error. Having a way to validate the customer's topology will ensure that we have the right topology.

We already have checks in VPD, but we need to enhance those checks to ensure we are compliant.

 
Scenarios (mandatory)

In 4.15: make cluster Upgradeable=False when:

  • Customer has deployed OCP in multizonal clusters and has forgotten to tag compute clusters.
  • Customer has deployed OCP in multizonal clusters and has tagged compute clusters as well as hosts.

4.16: mark the cluster degraded in the conditions above.

These scenarios will result in invalid cluster configuration.
 
Dependencies (internal and external) (mandatory)

  • No dependencies on other teams.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)

Support the SMB CSI driver through an OLM operator as tech preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either a Samba or Microsoft environment.

https://github.com/kubernetes-csi/csi-driver-smb

Goals (aka. expected user outcomes)

Customers can start testing connecting OCP to their backends exposing CIFS. This allows them to consume net-new volumes or to consume existing data produced outside OCP.

Requirements (aka. Acceptance Criteria):

The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.

Review and clearly define all driver limitations and corner cases.

Use Cases (Optional):

  • As an OCP admin, I want OCP to consume storage exposed via SMB/CIFS to capitalise on my existing infrastructure.
  • As a user, I want to consume external data stored on an SMB/CIFS backend.

Questions to Answer (Optional):

Review the different authentication methods.

Out of Scope

Windows containers support.

Only the storage class login/password authentication method is in scope. Other methods can be reviewed and considered for GA.
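
A sketch of the storage class login/password flow using the upstream csi-driver-smb conventions; the server, share, and secret names are illustrative:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: smb-example
provisioner: smb.csi.k8s.io
parameters:
  source: //smb-server.example.com/share                 # illustrative SMB/CIFS export
  csi.storage.k8s.io/provisioner-secret-name: smb-creds  # secret holding username/password
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: smb-creds
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
volumeBindingMode: Immediate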

Background

Customers are expecting to consume storage and possibly existing data via SMB/CIFS. As of today, vendor driver support for CIFS is really limited, whereas this protocol is widely used on premise, especially with MS/AD customers.

Customer Considerations

Need to understand what customers expect in terms of authentication.

How to extend this feature to windows containers.

Documentation Considerations

Document the operator and driver installation, usage capabilities and limitations.

Interoperability Considerations

Future: How to manage interoperability with windows containers (not for TP)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature Overview:

Hypershift-provisioned clusters, regardless of the cloud provider, support the proposed OLM-managed integration outlined in OCPBU-559 and OCPBU-560.

 

Goals 

There is no degradation in capability or coverage of OLM-managed operators supporting short-lived token authentication on clusters that are lifecycled via Hypershift.

 

Requirements:

  • the flows in OCPBU-559 and OCPBU-560 need to work unchanged on Hypershift-managed clusters
  • most likely this means that Hypershift needs to adopt the CloudCredentialOperator
  • all operators enabled as part of OCPBU-563, OCPBU-564, OCPBU-566 and OCPBU-568 need to be able to leverage short-lived authentication on Hypershift-managed clusters without being aware that they are on Hypershift-managed clusters
  • also OCPBU-569 and OCPBU-570 should be achievable on Hypershift-managed clusters

 

Background

Currently, Hypershift lacks support for CCO.

Customer Considerations

Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.

Documentation Considerations

If we are successful, no special documentation should be needed for this.

 

Outcome Overview

Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.

Success Criteria

CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.

 

Expected Results (what, how, when)

 

 

Post Completion Review – Actual Results

After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).

 

Feature Overview (aka. Goal Summary)  

The etcd CA must be rotatable both on-demand and automatically when expiry approaches.

Goals (aka. expected user outcomes)

 

  • Have a tested path for customers to rotate certs manually
  • We must have a tested path for auto rotation of certificates when certs need rotation due to age

 

Requirements (aka. Acceptance Criteria):

Deliver rotation and recovery requirements from OCPSTRAT-714 

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

To make better decisions on revision rollouts and when to re-certify leaf certificates, we should store the revision at which a given CA was rotated.

 

AC:

  • when a new CA is generated, store the static pod revision in the operator status

Testing in ETCD-512 revealed that CEO does not react to changes in the CA bundle or the client certificates.

The current mounts are defined here:
https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/manifests/0000_20_etcd-operator_06_deployment.yaml#L90-L106

A simple fix would be to watch the respective resources in a controller and exit the container on changes. This is how we did it with feature gates as well: (https://github.com/openshift/cluster-etcd-operator/blob/60b7665a26610a095722d3b12b2bb08dcae6965f/pkg/operator/starter.go#L174C1-L174C1)

If hot-reload is feasible we should take a look at it, but it seems to require a larger refactoring.

AC:

  • CEO needs to react (restart) when it detects changes in certificate related secrets
  • add an e2e testcase for it

Refactoring in ETCD-512 creates a rotation due to incompatible certificate creation processes. We should update the render [1] the same way the controller manages the certificates. Keep in mind that important information is always stored in annotations, which means we also need to update the manifest template itself (just exchanging file bytes isn't enough).

AC:

  • CEO should render the same certificates it would otherwise when the refactored CertSignerController runs
  • test that a fresh installation avoids re-creating certificates after the bootstrap phase

[1] https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/cmd/render/render.go#L347-L365

Given the manual rotation of signers for 4.16, we should add an alert that proactively tells customers to run the manual rotation procedure.

AC:

  • Add a metric that denotes how many days the signer certs are still valid
  • Add an alert over that metric (eg 300 days before expiry)
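
A sketch of what such an alert could look like; the metric name below is hypothetical and only illustrates the shape of the rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-signer-expiry
  namespace: openshift-etcd-operator
spec:
  groups:
    - name: etcd-certificates
      rules:
        - alert: EtcdSignerCAExpirationApproaching
          # hypothetical metric exposing how many days the signer certs are still valid
          expr: etcd_signer_ca_remaining_days < 300
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: The etcd signer CA expires in less than 300 days; run the manual rotation procedure.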

 

After merging ETCD-512, we need to ensure the certs are regenerated when the signer changes.

The current logic in library-go only changes when the bundle is updated, which is not a sufficient criterion for the etcd rotation.

Some initial take: https://github.com/openshift/library-go/pull/1674

discussion in: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1706889759638639

 

AC:

 

This spike explores using the library-go cert rotation utils in the etcd-operator to replace or augment the existing etcdcertsigner controller.

https://github.com/openshift/library-go/blob/master/pkg/operator/certrotation/client_cert_rotation_controller.go
https://github.com/openshift/cluster-etcd-operator/pull/1177

The goal of this spike is to evaluate if the library-go cert rotation util gives us rotation capabilities for the signer cert along with the peer and server certs.

There are a couple of issues to explore with the use of the library-go cert signer controller:

  • The etcd cluster is currently configured with a single CA for etcd's peer and server certs, whereas the library-go controller would require using different CAs for the peer and server certs.
  • We also need to consider how upgrades would be handled, i.e if we change to using two new CAs, would our new certsignercontroller handle that transparently?

Refactoring in ETCD-512 does not clean up certificates that are dynamically generated. Imagine you're recreating all your master nodes every day: we would create new peer/serving/metrics certificates for each node and never clean them up.

We should try to be conservative when cleaning them up, so keep them around for a certain retention period (7-10 days?) after the node went away.

AC:

  • CEO should clean up old-enough "node" certificates

Market Problem

  • As an OpenShift cluster administrator for a self-managed standalone OpenShift cluster, I want to use the Priority based expander for cluster-autoscaler to select instance types based on priorities assigned by a user to scaling groups.
    The configuration is based on the values stored in a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10: 
      - .*t2\.large.*
      - .*t3\.large.*
    50: 
      - .*m4\.4xlarge.*

Expected Outcomes

  • When creating a MachineSet, customers would like to define their preferred instance types with priorities so that if one is not available a fallback instance type option is available down the list.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md

Epic Goal

  • Allow users to choose the "priority" and "least-waste" expanders when creating a ClusterAutoscaler resource.

Why is this important?

  • The priority expander gives users the capability to instruct the cluster autoscaler about preferred node groups that it should expand. In this manner, a user can dictate to the cluster autoscaler which MachineSets should take priority when scaling up the cluster.
  • The least-waste expander gives users the capability to instruct the cluster autoscaler to choose instance sizes which will have the least amount of cpu and memory resource wastage when choosing for pending pods. It is a great supplement to the priority expander, giving a backup option when multiple instance sizes might work.

Scenarios

  1. As a user I would like the cluster autoscaler to prefer MachineSets that use spot instances as it will lower my costs. By using the priority expander, I can indicate to the cluster autoscaler which MachineSets should be utilized first when scaling up the cluster.
  2. As a user I want the autoscaler to prefer instance sizes which produce the least amount of wasted cpu and memory resources. By using the least-waste expander I can instruct the autoscaler to make this decision.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • New e2e tests added to exercise this functionality

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Is there any automation we want to add to the CAO to make this easier for users? (eg a new field on MachineAutoscaler to indicate a priority)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user, I would like to specify different expanders for the cluster autoscaler. Having an API field on the ClusterAutoscaler resource to specify the expanders and their precedence will solve this issue for me.

Background

The cluster autoscaler allows users to specify a list of expanders to use when creating new nodes. This list is expressed as a command line flag that takes multiple comma-separated options, eg "--expander=priority,least-waste,random".

We need to add a new field to the ClusterAutoscaler resource that allows users to specify an ordered list of expanders to use. We should limit values in that list to "priority", "least-waste", and "random" only. We should limit the length of the list to 3 items.
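
A sketch of how the new field could be expressed on the ClusterAutoscaler resource; the field name and value spellings are assumptions based on this story, not a confirmed API:

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  expanders:        # assumed field name; ordered list, maximum of 3 entries
    - Priority      # consult the cluster-autoscaler-priority-expander ConfigMap first
    - LeastWaste    # break ties by least wasted CPU and memory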

Steps

  • add a new API field for expander options
  • write unit tests to confirm behavior in CAO

Stakeholders

  • openshift eng

Definition of Done

  • user can express priority, least-waste, and random expanders through the ClusterAutoscaler resource
  • Docs
  • need docs on the expanders, what they are, how to select them
  • need docs describing how to configure the priority expander
  • Testing
  • we will add unit tests in this card, e2e tests will follow in another card

Feature Overview (aka. Goal Summary)  

Due to historical reasons, the etcd operator uses a static definition of an 8GB limit. Nowadays, customers with dense cluster configurations regularly reach the limit. OpenShift should provide a mechanism for increasing the database size while maintaining a validated configuration.

Goals (aka. expected user outcomes)

This feature aims to provide validated selectable sizes for the etcd database, allowing cluster admins to opt-in for larger sizes.

 

Requirements (aka. Acceptance Criteria):

Since using larger etcd database sizes may impact the defragmentation process, causing more noticeable transaction "pauses", this should be an opt-in configuration.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Yes
Classic (standalone cluster) Yes
Hosted control planes No
Multi node, Compact (three node), or Single node (SNO), or all Yes
Connected / Restricted Network Yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) Yes
Operator compatibility N/A
Backport needed (list applicable versions) N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) N/A

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal*

Provide a way to change the etcd database size limit from the default 8GB which is non-configurable.
https://etcd.io/docs/v3.5/dev-guide/limit/#storage-size-limit

https://github.com/openshift/cluster-etcd-operator/blob/dbfa8262a0a48846e8519425a937a7d4e9fd52e0/pkg/etcdenvvar/etcd_env.go#L42

This will likely be done through the API as a new field in the `cluster` `etcds.operator.openshift.io` CustomResource object. Similar to the etcd latency tuning profiles, which allow a selectable set of configurations, this limit should also be bound within reasonable limits or levels.
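
A sketch of how such a field could look on the cluster etcd operator object; the field name (backendQuotaGiB) is an assumption based on this epic, not a confirmed API:

apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  name: cluster
spec:
  backendQuotaGiB: 16   # assumed field name; opt-in increase from the 8 GiB default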

 
Why is this important? (mandatory)

Due to historical reasons, the etcd operator uses a static definition of an 8GB limit. Nowadays, customers with dense cluster configurations regularly reach the limit. OpenShift should provide a mechanism for increasing the database size while maintaining a validated configuration.

The 8GB limit was due to historical limitations of the bbolt store, which have been improved upon for a while, and there are examples and discussions upstream suggesting that the quota limit can be set higher.

See:

 

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I can increase the etcd database size via the openshift etcd API

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs team
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.

  • The version of the oc binary can help to identify if an issue could be caused because the oc version is different than the version of the cluster.
  • The logs generated by the oc adm must-gather command will help to identify if some information could not be collected, and also the exact image used (while the image used is currently in the directory name, I think it could be better to use a short directory name, especially for customers running oc on Windows, as an overly long directory name can cause issues there).

The version of the oc binary could be included in the oc adm must-gather output [1], and if it's 2 or more minor versions away from the running cluster, a warning should be shown.
 

1. Proposed title of this feature request

Include additional info into must-gather directory

2. What is the nature and description of the request?

Include the version of the oc binary used, and the logs generated by the command into the must-gather directory when running the oc adm must-gather command.

The version of the oc binary can help to identify if an issue could be caused because the oc version is different than the version of the cluster.
The logs generated by the oc adm must-gather command will help to identify if some information could not be collected, and also the exact image used

4. List any affected packages or components.

oc

[1] https://github.com/openshift/oc/blob/1d4a9afe8cf066b4c34018b3e0f4919d0157f091/pkg/cli/admin/mustgather/summary.go#L16-L45

Include logs generated by the command into the must-gather directory when running the oc adm must-gather command.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

The Azure File CSI driver currently lacks cloning and snapshot restore features. The goal of this feature is to support the cloning feature as technology preview. This will help support snapshot restore in a future release.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
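
A sketch of the cloning request as described above; the names, size, and storage class are illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-azurefile-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi   # illustrative; must match the origin PVC's storage class
  resources:
    requests:
      storage: 100Gi                # must be at least the origin volume size
  dataSource:
    kind: PersistentVolumeClaim
    name: origin-azurefile-pvc      # existing PVC to clone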

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

This feature only applies to OCP running on Azure / ARO and File CSI.

The usual CSI cloning CI must pass.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all although SNO is rare on Azure
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86
Operator compatibility Azure File CSI operator
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) ship downstream images built from the forked azcopy

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Restoring snapshots is out of scope for now.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Update the CSI capability matrix and any language that mentions that Azure File CSI does not support cloning.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Not impact but benefit Azure / ARO customers.

Epic Goal*

Azure File added support for cloning volumes, which relies on the azcopy command upstream. We need to fork azcopy so we can build and ship downstream images from the forked azcopy. The AWS driver does the same with efs-utils.

Upstream repo: https://github.com/Azure/azure-storage-azcopy

NOTE: using snapshots as a source is currently not supported: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/7591a06f5f209e4ef780259c1631608b333f2c20/pkg/azurefile/controllerserver.go#L732 

 

Why is this important? (mandatory)

This is required for adding Azure File cloning feature support.

 

Scenarios (mandatory) 

1. As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1757)

2) Fork upstream repo (STOR-1716)

3) Add ART definition for OCP Component (STOR-1755)

  • prerequisite: Onboard image with DPTP/CI (STOR-1752)
  • prerequisite: Perform a threat model assessment (STOR-1753)
  • prerequisite: Establish common understanding with Product Management / Docs / QE / Product Support (STOR-1753)
  • requirement: ProdSec Review (STOR-1756)

4) Use the new image as base image for Azure File driver (STOR-1794)

5) Ensure e2e cloning tests are in CI (STOR-1818)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - yes
  • Documentation - yes
  • QE - yes
  • PX - ???
  • Others - ART

 

Acceptance Criteria (optional)

The downstream Azure File driver image must include azcopy, and the cloning feature must be tested.

 

Drawbacks or Risk (optional)

No risks detected so far.

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview (aka. Goal Summary)  

Add support for standalone secondary networks for HCP kubevirt.

Advanced multus integration involves the following scenarios

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM

Goals (aka. expected user outcomes)

Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network. 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both self-managed
Classic (standalone cluster) na
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all na
Connected / Restricted Network yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86
Operator compatibility na
Backport needed (list applicable versions) na
UI need (e.g. OpenShift Console, dynamic plugin, OCM) na
Other (please specify) na

Documentation Considerations

ACM documentation should include how to configure secondary standalone networks.

 

This is a continuation of CNV-33392.

Multus Integration for HCP KubeVirt has three scenarios.

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM

Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by CNV-33392.

Items [1,2] are what this epic is tracking; we consider these advanced use cases.

When creating a cluster with a secondary network, ingress is broken because the created service cannot reach the VMs' secondary addresses.

We need to create a controller that manually creates and updates the service endpoints so they always point to the VMs' IPs.

When no default pod network is used, we need the LB mirroring that cloud-provider-kubevirt performs to create custom endpoints that map to the secondary network. This allows the LB service to route to the secondary network. Otherwise, the LB service cannot pass traffic when the pod network interface is not attached to the VM.

Now that the HyperShift KubeVirt provider has a way to expose secondary-network services by generating EndpointSlices, we should document it.

We also need to enable the --attach-default-network option in the "hcp" command-line tool.
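To make the target configuration concrete, here is a minimal sketch of a HostedCluster KubeVirt platform section that attaches only a secondary network. It assumes the additionalNetworks and attachDefaultNetwork fields of the KubeVirt platform spec; the NetworkAttachmentDefinition reference and names are placeholders.

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  platform:
    type: KubeVirt
    kubevirt:
      # Attach the VMs only to the secondary network (scenario 1 above).
      attachDefaultNetwork: false
      additionalNetworks:
        - name: my-namespace/my-secondary-net   # <namespace>/<NetworkAttachmentDefinition>, placeholder

The --attach-default-network option mentioned above would toggle the equivalent field from the hcp CLI.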

Feature Overview (aka. Goal Summary)  

The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.

Goals (aka. expected user outcomes)

Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.

Requirements (aka. Acceptance Criteria):

There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
 

Background

Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel

Documentation Considerations

Usual documentation will be required in case there are any new user-facing options available as a result of this feature.

Epic Goal

  • Implement support in the install config to receive a Public IPv4 Pool ID and create resources* which consume public IPs when the publish strategy is "External".
  • The implementation must cover the installer changes in Terraform to provision the infrastructure

*Resources which consume public IPv4: bootstrap, API public NLB, NAT gateways
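As a rough sketch of the intended user experience, a trimmed install-config fragment is shown below. It assumes the field is exposed as platform.aws.publicIpv4Pool; the pool ID is a placeholder and required fields such as pullSecret are omitted.

apiVersion: v1
baseDomain: example.com
metadata:
  name: byoip-cluster
publish: External
platform:
  aws:
    region: us-east-1
    # BYO public IPv4 pool; EIPs for the bootstrap node, public NLB, and NAT gateways
    # would be allocated from this pool instead of Amazon-provided addresses.
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0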

Why is this important?

  • The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. 

Scenarios

  1. As a customer with BYO Public IPv4 pools in AWS, I would like to install an OpenShift cluster on AWS consuming public IPs from my own CIDR blocks, so I can control the IPs used by the services I provide and will not be impacted by AWS public IPv4 charges
  2.  

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Will the terraform implementation be backported to supported releases to be used in CI for previous e2e test infra?
  2. Is there a method to use the pool by default for all Public IPv4 claims from a given VPC/workload? So the implementation doesn't need to create EIP and associations for each resource and subnet/zone.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

USER STORY:

  • As a customer with custom public IPv4 blocks present in AWS, I would like to install OpenShift clusters on AWS with the Public publish strategy, consuming public IPv4 addresses from my own pool, so I will not be impacted by additional public IPv4 charges
  •  

DESCRIPTION:

<!--

Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.

-->

Required:

  • e2e workflow with a presubmit job testing the new configuration
  •  

Nice to have:

...

ACCEPTANCE CRITERIA:

  • Installer PR reviewed, accepted and merged
  • e2e step testing the PR
    • Presubmit job testing that flow (frequency should be defined by Test platform team)
    •  

ENGINEERING DETAILS:

 

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Support EgressIP feature with ExternalTrafficPolicy=Local and External2Pod direct routing in OVNKubernetes.

Why is this important?

We see a lot of customers using Multi-Egress Gateway with EgressIP. 

Currently, connections which reach a pod via the OVN routing gateway are sent back via the EgressIP if one is associated with the specific namespace.

Multiple bugs have been reported by customers: 

https://issues.redhat.com/browse/OCPBUGS-16792 

https://issues.redhat.com/browse/OCPBUGS-7454

https://issues.redhat.com/browse/OCPBUGS-18400

This also resulted in RFEs being filed, as it was too complicated to fix via a bug.

https://issues.redhat.com/browse/RFE-4614

https://issues.redhat.com/browse/RFE-3944

This is observed by multiple customers using MetalLB and F5 load balancers. We haven't really tested this combination.

From the initial discussion, it looks like the fix is needed in OVN. We request the team to expedite this fix, given that a number of customers are hitting it.
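For reference, a minimal sketch of the combination this epic targets (names, labels, and addresses are placeholders): an EgressIP bound to a namespace whose pods are also exposed through a LoadBalancer Service with externalTrafficPolicy: Local.

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-sample
spec:
  egressIPs:
    - 192.0.2.10
  namespaceSelector:
    matchLabels:
      env: sample
---
apiVersion: v1
kind: Service
metadata:
  name: app-lb
  namespace: sample-ns
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # return traffic must not be re-routed via the EgressIP node
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 8080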

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

  1. OVN team has to do https://issues.redhat.com/browse/FDP-42 and only then can we consume that into OVNKubernetes
  2. Design discussions Doc: https://docs.google.com/document/d/1VgDuEhkDzNOjIlPtwfIhEGY1Odatp-rLF6Pmd7bQtt0/edit 

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Rename OpenShift update channels:

  • Reposition Conditional Update Risks as “Known Issues”, and reduce UX differentiation
  • Remove references to “supported but not recommended”  in UX and CLI
  • Conditional update experience improved in the UX

    Do not hide conditional updates. Better to show all the OCP versions in the UI and not hide them.

Epic Goal*

Rename “supported but not recommended” to “known issues”

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

oc adm upgrade --include-not-recommended today includes a "Supported but not recommended updates:" header when introducing those updates. It also renders the Recommended condition. OTA-1191 is about what to do with the --include-not-recommended flag. This ticket is about addressing the header and possibly about adjusting/contextualizing/something the Recommended condition type.

 

Here is the current output:

 

$ oc adm upgrade --include-not-recommended
Cluster version is 4.10.0-0.nightly-2021-12-23-153012

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/cincinnati-graph-for-targeted-edge-blocking-demo/cincinnati-graph.json
Channel: stable-4.10

Recommended updates:

  VERSION                   IMAGE
  4.10.0-always-recommended quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000

Supported but not recommended updates:

  Version: 4.10.0-conditionally-recommended
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to SomeChannelThing is unknown due to an evaluation failure: client-side throttling: only 16.3µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  On clusters with the channel set to 'buggy', this imaginary bug can happen. https://bug.example.com/b

  Version: 4.10.0-fc.2
  Image: quay.io/openshift-release-dev/ocp-release@sha256:85c6ce1cffe205089c06efe363acb0d369f8df7ad48886f8c309f474007e4faf
  Recommended: False
  Reason: ModifiedAWSLoadBalancerServiceTags
  Message: On AWS clusters for Services in the openshift-ingress namespace… This will not cause issues updating between 4.10 releases.  This conditional update is just a demonstration of the conditional update system. https://bugzilla.redhat.com/show_bug.cgi?id=2039339

 

 

Definition of done:

After this change, the output will look similar to the following:

 

$ oc adm upgrade --include-not-recommended
Cluster version is 4.10.0-0.nightly-2021-12-23-153012

Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/cincinnati-graph-for-targeted-edge-blocking-demo/cincinnati-graph.json
Channel: stable-4.10

Recommended updates:

  VERSION                   IMAGE
  4.10.0-always-recommended quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000


Updates with known issues:

  Version: 4.10.0-conditionally-recommended
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111
  Recommended: Unknown
  Reason: EvaluationFailed
  Message: Exposure to SomeChannelThing is unknown due to an evaluation failure: client-side throttling: only 16.3µs has elapsed since the last match call completed for this cluster condition backend; this cached cluster condition request has been queued for later execution
  On clusters with the channel set to 'buggy', this imaginary bug can happen. https://bug.example.com/b

  Version: 4.10.0-fc.2
  Image: quay.io/openshift-release-dev/ocp-release@sha256:85c6ce1cffe205089c06efe363acb0d369f8df7ad48886f8c309f474007e4faf
  Recommended: False
  Reason: ModifiedAWSLoadBalancerServiceTags
  Message: On AWS clusters for Services in the openshift-ingress namespace… This will not cause issues updating between 4.10 releases.  This conditional update is just a demonstration of the conditional update system. https://bugzilla.redhat.com/show_bug.cgi?id=2039339

 

 

This Feature covers the pending tasks from OCPBU-16 to be covered in openshift-4.14. 

Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.

Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.

Why is this important: Increased operational simplicity and scale flexibility of the cluster's control plane deployment.

 

See slack working group: #wg-ctrl-plane-resize

Epic Goal

  • Resolve the outstanding technical debt from the ControlPlaneMachineSet project

Why is this important?

  • We need to make sure the project is tested, documented and maintained going forward

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. OCPCLOUD-1372

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user, I'd like to be warned when I'm setting up a Control Plane Machine Set for my control plane machines on GCP and I don't have the TargetPools required for it to work correctly.

Background

We previously had a validating webhook on the CPMSO for GCP that would check whether the control plane machine provider config set on the CPMS had TargetPools, and error otherwise.

This unfortunately collides with GCP Private clusters, see https://issues.redhat.com/browse/OCPBUGS-6760.

As such we decided to remove the check until warnings are supported in controller-runtime.
Once those land we can re-add the check and throw a warning in those situations, to still inform the user without disrupting normal operations where that's not an issue.

See:  https://redhat-internal.slack.com/archives/C68TNFWA2/p1675105892589279.

Webhook warnings are available as of the 0.16.0 release of controller-runtime.
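For context, the check concerns the targetPools field inside the GCP providerSpec embedded in the ControlPlaneMachineSet. Below is a heavily trimmed sketch of the relevant fragment (the pool name is a placeholder and most CPMS and providerSpec fields are omitted); public clusters need the field, while GCP private clusters legitimately omit it, hence warning instead of rejecting.

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  template:
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            kind: GCPMachineProviderSpec
            # The field the removed webhook used to validate.
            targetPools:
              - my-cluster-api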

Steps

  • <Add steps to complete this card if appropriate>

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.

Phase 1 provided a tech preview for GCP.

In phase 2, GCP support goes to GA. Support for other IPI footprints is new and tech preview.

Requirements

This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589

We'll want to add some tests to make sure that managing boot images hasn't broken our existing functionality and that our new feature works. Proposed flow:

  • opt in an existing machineset to updates
  • update a GCP machineset with a dummy "bootimage"
  • wait for machineset to be reconciled
  • check if the machineset is restored to original value
  • if it matches -> success

1/30/24: Updated based on enhancement discussions

This is the MCO side of the changes. Once the API PR lands, the MSBIC should start watching for the new API object. 

It is also important to note that MachineSets that have an ownerReference should not be opted in to this mechanism, even if they are opted in via the API. See discussion here: https://github.com/openshift/enhancements/pull/1496#discussion_r1463386593

Done when:

  • user has a way to switch update mechanism on and off
  • MachineSets with an OwnerRef are ignored.

 

Update 3/26/24 - Moved ValidatingAdmissionPolicy bit into a separate story as that got a bit more involved.

This will be implemented via a global knob in the API. This is required in addition to the feature gate, as we expect customers to still want to toggle this feature when it leaves tech preview. A sketch of what opting in might look like follows the list below.

Done when:

  • a new API object is created for opting in machine resources

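A rough sketch of opting MachineSets in through the new global knob, assuming the managedBootImages/machineManagers field shapes from the enhancement; treat the exact names as illustrative.

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
      # Opt all MAPI MachineSets in to boot image updates; MachineSets that carry
      # an ownerReference are still ignored by the MSBIC, per the note above.
      - resource: machinesets
        apiGroup: machine.openshift.io
        selection:
          mode: All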
 

A ValidatingAdmissionPolicy should be implemented (via an MCO manifest) for changes to this new API object, so that the feature is not turned on in unsupported platforms. The only platform currently supported is GCP. The ValidatingAdmissionPolicy is Kubernetes-native and is behind its own feature gate, so this will have to be checked while applying these manifests. Here is what the YAML for these manifests would look like:

---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: "managed-bootimages-platform-check"
spec:
  failurePolicy: Fail
  paramKind:
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
  matchConstraints:
    resourceRules:
    - apiGroups:   ["operator.openshift.io"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["MachineConfiguration"]
  validations:
    - expression: "!has(object.spec.managedBootImages) || params.status.platformStatus.type == 'GCP'"
      message: "This feature is only supported on these platforms: GCP"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "managed-bootimages-platform-check-binding"
spec:
  policyName: "managed-bootimages-platform-check"
  validationActions: [Deny]
  paramRef:
    name: "cluster"     
    parameterNotFoundAction: "Deny"

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

In 4.15, before conducting the live migration, CNO checks whether a cluster is managed by the SD team. We need to remove this check to support unmanaged clusters.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

As a cluster administrator, I would like to migrate my existing (that does not currently use Azure AD Workload Identity) cluster to use Azure AD Workload Identity

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Many customers would like to migrate to Azure AD Workload Identity with minimal downtime, but they have numerous existing clusters and an aversion to supporting two concurrent operational models. Therefore they would like to migrate existing Azure clusters to Azure AD Workload Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Provide a documented method for migration to Azure AD Workload Identity for OpenShift 4.14+ with minimal downtime, and without customers having to start over with a new cluster using Azure AD Workload Identity and migrating their workloads over. If there is a risk of workload disruption or downtime, we will inform customers of this risk and have them accept it.

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  Self-managed
Classic (standalone cluster)  Classic
Hosted control planes  N/A
Multi node, Compact (three node), or Single node (SNO), or all   All
Connected / Restricted Network, or all  All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)   All applicable architectures
Operator compatibility  
Backport needed (list applicable versions)  4.14+
UI need (e.g. OpenShift Console, dynamic plugin, OCM) N/A
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

Spike to evaluate if we can provide an automated way to support migration to Azure Managed Identity (preferred), or alternatively a manual method (second option) for customers to perform the migration themselves that is documented and supported, or not at all.

This spike will evaluate, scope the level of effort (sizing), and make recommendation on next steps.

Feature request

Support migration to Azure Managed Identity

Feature description

Many customers would like to migrate to Azure Managed Identity but have numerous existing clusters and an aversion to supporting two concurrent operational requirements. Therefore they would like to migrate existing Azure clusters to Managed Identity in a safe manner after they have been upgraded to a version of OCP supporting that feature (4.14+).

Why?

Provide a uniform operational experience for all clusters running versions which support Azure Managed Identity without having to decommission long running clusters

Other considerations

  • Disruption to customer's workload.
  • Has to be closely coordinated with update effort to minimize disruption.
  • Tokenized operators and other layered products - work not yet done (OCP 4.15/4.16 plans) and has to be manually done for now and may not cover the full set.
  • If we grant this for Azure MI/WI, we will likely need to also do this for STS and GCP WIF.
  • If we grant this, would we do this for self-managed and managed OpenShift (ARO)?

The cloud-credential-operator repository documentation can be updated for installing and/or migrating a cluster with Azure workload identity integration.

  • The document currently uses manually numbered lists, which creates more work when we want to add/remove steps. Migrating the document to use dynamically numbered lists where each number is (1.) will remove this extra work in future changes. 
  • During installation, we can take advantage of the new(ish) --included parameter of the `oc adm release extract` command by creating an install-config prior to executing the command.
  • During migration, we can take advantage of the new(ish) `reboot-machine-config-pool` sub-command of `oc adm` to restart all of the pods on a cluster to reduce the risks. The sub-command restarts each node in series, resulting in highly available services being restarted one pod at a time.

The goal of this feature is to add support for:

  • telemetry
  • nmstate ipv6
  • nmstate net2net

Why is this important?

  • Without an API, customers are forced to use the MCO. This brings a set of limitations (mainly a reboot per change, and the fact that config is shared across each pool, so per-node configuration is not possible).
  • A better upgrade solution will give us the ability to support a single-host-based implementation.
  • Telemetry will give us more info on how widely IPsec is used.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry
  • improve ci and test coverage
     

Dependencies (internal and external)

  1.  nmstate tasks

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • telemetry
  • nmstate ipv6
  • nmstate net2net

Why is this important?

  • Without an API, customers are forced to use the MCO. This brings a set of limitations (mainly a reboot per change, and the fact that config is shared across each pool, so per-node configuration is not possible).
  • A better upgrade solution will give us the ability to support a single-host-based implementation.
  • Telemetry will give us more info on how widely IPsec is used.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Must allow for the possibility of offloading the IPsec encryption to a SmartNIC.

 

  • nmstate
  • k8s-nmstate
  • easier mechanism for cert injection (??)
  • telemetry
  • improve ci and test coverage
     

Dependencies (internal and external)

  1.  nmstate tasks

Related:

  • ITUP-44 - OpenShift support for North-South OVN IPSec
  • HATSTRAT-33 - Encrypt All Traffic to/from Cluster (aka IPSec as a Service)

Previous Work (Optional):

  1. SDN-717 - Support IPSEC on ovn-kubernetes
  2. SDN-3604 - Fully supported non-GA N-S IPSec implementation using machine config.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Base RHCOS for OpenShift 4.16 on RHEL 9.4 content.

Goals (aka. expected user outcomes)

New RHEL minors bring additional features and performance optimizations, and most importantly, new hardware enablement. Customers and partners need to be able to install on the latest hardware.

We want to start looking at testing OpenShift on RHCOS built from RHEL 9.4 packages. While it is currently possible to do most of that testing work using OKD-SCOS, only the x86_64 architecture is available there. We'll publish OCP release images and boot images that include RHCOS builds made out of CentOS Stream packages to prepare for the RHEL 9.4 release.

Summary of the steps:
1. Add manifests for a new rhel-9.4 variant in openshift/os, based on the existing manifests for rhel-9.2 and c9s. We'll use as many C9S packages as possible and re-use existing OpenShift-specific packages.
2. Update the staging pipeline configuration to add a 4.15-9.4 stream.
3. Trigger an RHCOS build and use it to replace an existing image from a nightly OCP release image. Upload this release image to the rhcos-devel namespace on registry.ci.openshift.org
4. Ask for as many people as possible to test this release image. Write an email to aos-devel, publish updated instructions in this Epic, publish the same instructions in the Slack channel.
5. Repeat steps 3 and 4 every two weeks

Discussed this with Michael Nguyen 
 
This would work and ensure that we don't promote a release publicly with 9.4 GA content prior to RHEL 9.4 GA on April 30th

  1. April 23 4.16.0-ec.6 nightly selected, RHCOS is based on public 9.4 Beta
  2. April 24-25 RHCOS switches from 9.4 Beta repos to 9.4 GA repos
  3. April 26 4.16.0-ec.6 promoted and 4.16 branched from master assuming green 4.17 nightlies
  4. April 30 - RHEL 9.4 GA
  5. May 03 first 4.16 RC on RHEL 9.4 GA

The thing we want to be careful about is checking on the selection of 4.16.0-ec.6, which should be tracked via https://issues.redhat.com/browse/FDN-623, but we could just reach out to the TRT team on #forum-ocp-release-oversight to confirm. This is because we don't want to switch to 9.4 GA content until the EC build that will be made public ahead of 9.4 GA has been selected.

Feature Overview (aka. Goal Summary)  

Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA

Goals (aka. expected user outcomes)

Remove the feature gate flag and make the feature accessible to all customers.

Requirements (aka. Acceptance Criteria):

Requires fixes to apiserver to handle etcd client retries correctly

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both yes
Classic (standalone cluster) yes
Hosted control planes no
Multi node, Compact (three node), or Single node (SNO), or all Multi node and compact clusters
Connected / Restricted Network Yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) Yes
Operator compatibility N/A
Backport needed (list applicable versions) N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) N/A
Other (please specify) N/A

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal*

Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA

https://github.com/openshift/api/pull/1538
https://github.com/openshift/enhancements/pull/1447

 
Why is this important? (mandatory)

Graduating the feature to GA makes it accessible to all customers and not hidden behind a feature gate.

As further outlined in the linked stories the major roadblock for this feature to GA is to ensure that the API server has the necessary capability to configure its etcd client for longer retries on platforms with slower latency profiles. See: https://issues.redhat.com/browse/OCPBUGS-18149

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an openshift admin I can change the latency profile of the etcd cluster without causing any downtime to the control-plane availability

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd
  • Documentation - etcd docs team
  • QE - etcd qe
  • PX - 
  • Others -

Acceptance Criteria (optional)

Once the cluster is installed, we should be able to change the default latency profile on the API to a slower one and verify that etcd is rolled out with the updated leader election and heartbeat timeouts. During this rollout there should be no disruption or unavailability to the control-plane.
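For reference, a minimal sketch of what changing the profile looks like once the feature gate is removed, assuming the controlPlaneHardwareSpeed field delivered under ETCD-456 on the cluster-scoped Etcd operator resource:

apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  name: cluster
spec:
  # "Standard" is the default; "Slower" relaxes the leader election and
  # heartbeat timeouts for higher-latency platforms.
  controlPlaneHardwareSpeed: Slower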

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation; tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA

Goals (aka. expected user outcomes)

Remove the feature gate flag and make the feature accessible to all customers.

Requirements (aka. Acceptance Criteria):

Requires fixes to apiserver to handle etcd client retries correctly

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both yes
Classic (standalone cluster) yes
Hosted control planes no
Multi node, Compact (three node), or Single node (SNO), or all Multi node and compact clusters
Connected / Restricted Network Yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) Yes
Operator compatibility N/A
Backport needed (list applicable versions) N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) N/A
Other (please specify) N/A

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview (aka. Goal Summary)  

Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enable customers to have clusters with large amount of worker nodes.

Goals (aka. expected user outcomes)

Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count exceeds the threshold and smaller cloud instances when it is below the threshold.

Requirements (aka. Acceptance Criteria):

 

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Managed
Classic (standalone cluster) N/A
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all N/A
Connected / Restricted Network Connected
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64 ARM
Operator compatibility N/A
Backport needed (list applicable versions) N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) N/A
Other (please specify)  

Questions to Answer (Optional):

Check with OCM and CAPI requirements to expose larger worker node count.

 

Documentation:

  • Design document detailing the autoscaling mechanism and configuration options
  • User documentation explaining how to configure and use the autoscaling feature.

Acceptance Criteria

  • Configure max-node size from CAPI
  • Management cluster nodes automatically scale up and down based on the hosted cluster's size.
  • Scaling occurs without manual intervention.
  • A set of "warm" nodes are maintained for immediate hosted cluster creation.
  • Resizing nodes should not cause significant downtime for the control plane.
  • Scaling operations should be efficient and have minimal impact on cluster performance.

 

Goal

  • Dynamically scale the serving components of control planes

Why is this important?

  • To be able to have clusters with large amount of worker nodes

Scenarios

  1. When a hosted cluster's worker node count increases past X, the serving components are moved to larger cloud instances
  2. When a hosted cluster's worker node count falls below a threshold, the serving components are moved to smaller cloud instances.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As the HyperShift scheduler, I want to be able to:

  • see the size of each HostedCluster as translated to "t-shirt" sizes

so that I can achieve

  • correct scheduling outcomes and warm-node reserve

As the HyperShift administrator, I want to be able to:

  • configure how the size of each HostedCluster is translated to "t-shirt" sizes
  • configure how often the size of each HostedCluster is translated to "t-shirt" sizes, both as the size gets larger and as it gets smaller

so that I can achieve

  • tuning for auto-scaling of my management cluster

Acceptance Criteria:

Description of criteria:

  • allow configuration for sizing buckets and sizing change throughput
  • document the configuration above
  • expose t-shirt sizes of hosted clusters

Engineering Details:

This requires a design proposal.
This does not require a feature gate.
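Since this requires a design proposal, the following is purely illustrative of the kind of configuration being asked for; the API group, kind name, and all field names are hypothetical. It shows sizing buckets keyed on node counts plus knobs limiting how quickly a HostedCluster may transition between t-shirt sizes.

apiVersion: scheduling.hypershift.openshift.io/v1alpha1   # hypothetical group/version
kind: ClusterSizingConfiguration
metadata:
  name: cluster
spec:
  sizes:                      # hypothetical: node-count buckets mapped to t-shirt sizes
    - name: small
      criteria:
        from: 0
        to: 10
    - name: medium
      criteria:
        from: 11
        to: 100
    - name: large
      criteria:
        from: 101
  transitionDelay:            # hypothetical: throttle how fast size labels may change
    increase: 30s
    decrease: 10m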

Description of problem

When provisioning an HCP on an MC with sizing enabled (that has no existing HCPs), the HCP install can get stuck trying to schedule the kube-apiserver for the hosted control plane. It seems that the placeholder deployment cannot be created because of an empty selector value in the NodeAffinity:

operator-56b7ccb598-4hqz4 operator {"level":"error","ts":"2024-04-23T13:35:42Z","msg":"Reconciler error","controller":"DedicatedServingComponentSchedulerAndSi
zer","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"dry3","namespace":"ocm-staging-2aqkcjamdtbcmjtp0lk1
il3vo9hfd4n1"},"namespace":"ocm-staging-2aqkcjamdtbcmjtp0lk1il3vo9hfd4n1","name":"dry3","reconcileID":"0772c093-ceef-46c1-a450-6bc8184ba633","error":"failed t
o ensure placeholder deployment: Deployment.apps \"ocm-staging-2aqkcjamdtbcmjtp0lk1il3vo9hfd4n1-dry3\" is invalid: spec.template.spec.affinity.nodeAffinity.re
quiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values: Required value: must be specified when `operator` is 'In' or 'No
tIn'","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-r
untime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/sr
c/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.
func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

This appears to be due to the In selector of the nodeAffinity being populated with an empty list in the source: https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/scheduler/dedicated_request_serving_nodes.go#L704

The value of unavailableNodePairs can be an empty list in the case that no HCPs exist on the cluster already, and therefore no nodes are labelled with both a cluster and a serving-pair label. In this case, the empty list is passed into the NodeAffinity and results in the error above.
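For reference, a valid form of the affinity term that the placeholder deployment's pod template needs; the API rejects an In operator with an empty values list, so either the list must be non-empty or the expression must be omitted. The label key and value below are placeholders.

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hypershift.openshift.io/serving-pair   # placeholder label key
                operator: In
                values:                                     # must contain at least one entry
                  - serving-pair-1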

Description of problem:

    HyperShift operator crashes with size tagging enabled and a clustersizingconfiguration that does not have an effects section under the size configuration.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

Always    

Steps to Reproduce:

    1. Install hypershift operator with size tagging enabled
    2. Create a hosted cluster with request serving isolation topology
    3.
    

Actual results:

    HyperShift operator crashes

Expected results:

    Cluster creation succeeds

Additional info:

    

Description of problem:

  When using cluster size tagging and related request serving node scheduler, if a cluster is deleted while in the middle of resizing its request serving pods, the placeholder deployment that was created for it is not cleaned up.

Version-Release number of selected component (if applicable):

   HyperShift main (4.16)

How reproducible:

   Always 

Steps to Reproduce:

    1. Setup a management cluster with request-serving machinesets
    2. Create a hosted cluster
    3. Add workers to the hosted cluster so that it changes size
    4. Delete the hosted cluster after it's tagged with the new size but before nodes for its corresponding placeholder pods are created.
    

Actual results:

    The placeholder deployment is never removed from the `hypershift-request-serving-autosizing-placeholder` namespace

Expected results:

    The placeholder deployment is removed when the cluster is deleted.

Additional info:

 

 

 

Description of problem:

  For large cluster sizes, non-request serving pods such as OVN, etcd, etc. require more resources. Because these pods live on the same nodes as other hosted clusters' non-request serving pods, we can run into resource exhaustion unless requests for these pods are properly sized.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Create large hosted cluster (200+ nodes)
    2. Observe resource usage of non-request-serving pods
    

Actual results:

    Resource usage is large, while resource requests remain the same

Expected results:

    Resource request grows corresponding to cluster size

Additional info:

    

User Story:

As a HyperShift controller in the management plane, I want to be able to:

  • read the capacity of a cluster measured in the count of nodes from a HostedControlPlane or HostedCluster

so that I can achieve

  • automation that reacts to the number of cluster workers changing at runtime

Acceptance Criteria:

Description of criteria:

  • the count of nodes can be read from the HostedCluster or HostedControlPlane status

Engineering Details:

This requires a design proposal.
This does not require a feature gate.

User Story:

When a HostedCluster is given a size label, it needs to ensure that request serving nodes exist for that size label and when they do, reschedule request serving pods to the appropriate nodes.

Acceptance Criteria:

Description of criteria:

  • HostedClusters reconcile the content of hypershift.openshift.io/request-serving-node-additional-selector label
    • Reschedule the request serving pods to run to instances matching the label

This does not require a design proposal.
This does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Allow supporting RWX block PVCs with kubevirt csi when the underlying infra storageclass supports RWX Block

Goals (aka. expected user outcomes)

Requirements (aka. Acceptance Criteria):

  • Allow users to create RWX Block PVCs within HCP KubeVirt guest clusters when the underlying infra storage class mapped to the guest supports RWX Block
  • Add presubmit and conformance periodics exercising HCP KubeVirt with RWX Block pvcs in guest.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both self-managed
Classic (standalone cluster) no
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Questions to Answer (Optional):

Out of Scope

Background

Customer Considerations

Documentation Considerations

This feature should be documented as a capability of HCP OpenShift Virtualization in the ACM HCP docs

Interoperability Considerations

 

Currently, kubevirt-csi is limited to ReadWriteOnce for the PVCs within the guest cluster. This is true even when the infra storage class supports RWX.

We should expand the ability for the guest cluster to use RWX Block when the underlying infra storage class supports RWX Block.

Currently, kubevirt-csi is limited to ReadWriteOnce for the PVCs within the guest cluster. This is true even when the infra storage class supports RWX.

We should expand the ability for the guest cluster to use RWX when the underlying infra storage class supports RWX.
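For reference, a guest-cluster PVC of the kind this feature enables; the storage class name is a placeholder for a kubevirt-csi class backed by an RWX Block-capable infra class.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-block
spec:
  accessModes:
    - ReadWriteMany        # requires the infra storage class to support RWX Block
  volumeMode: Block
  storageClassName: kubevirt-csi-infra-rwx   # placeholder
  resources:
    requests:
      storage: 50Gi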

Feature Overview

Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift

Prerequisite work (goals completed in OCPSTRAT-1122): complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.

Phase 1 & 2 covers implementing base functionality for CAPI.

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has since moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

So far we haven't tested this provider at all. We have to run it and spot if there are any issues with it.

Steps:

  • Try to run the vSphere provider on OCP
  • Try to create MachineSets
  • Try various features out and note down bugs
  • Create stories for resolving issues up stream and downstream

Outcome:

  • Create stories in epic of items for vSphere that need to be resolved
  •  

The vSphere provider is not present in the downstream operator; someone has to add it there.

This will include adding a tech preview e2e job in the release repo and going through the process described here: https://github.com/openshift/cluster-capi-operator/blob/main/docs/provideronboarding.md
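To make the "try to create MachineSets" step concrete, a heavily trimmed sketch of an upstream-style Cluster API MachineSet referencing a VSphereMachineTemplate is shown below. Names, the namespace, and the referenced bootstrap secret and template are placeholders; the exact shape on OCP will follow whatever the downstream operator ships.

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  name: vsphere-workers
  namespace: openshift-cluster-api
spec:
  clusterName: my-cluster
  replicas: 2
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: my-cluster
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: my-cluster
    spec:
      clusterName: my-cluster
      bootstrap:
        dataSecretName: worker-user-data        # placeholder
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: vsphere-worker-template           # placeholder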

 

Feature Overview (aka. Goal Summary)  

Enable selective management of HostedCluster resources via annotations, allowing hypershift operators to operate concurrently on a management cluster without interfering with each other. This feature facilitates testing new operator versions or configurations in a controlled manner, ensuring that production workloads remain unaffected.

Goals (aka. expected user outcomes)

  • Primary User Type/Persona: Cluster Service Providers
  • Observable Functionality: Administrators can deploy "test" and "production" hypershift operators within the same management cluster. The "test" operator will manage only those HostedClusters annotated according to a predefined specification, while the "production" operator will ignore these annotated HostedClusters, thus maintaining the integrity of existing workloads.

Requirements (aka. Acceptance Criteria):

  • The "test" operator must respond only to HostedClusters that carry a specific annotation defined by the HOSTEDCLUSTERS_SCOPE_ANNOTATION environment variable.
  • The "production" operator must ignore HostedClusters with the specified annotation.
  • The feature is activated via the ENABLE_HOSTEDCLUSTERS_ANNOTATION_SCOPING environment variable. When not set, the operators should behave as they currently do, without any annotation-based filtering.
  • Upstream documentation to describe how to set up and use annotation-based scoping, including setting environment variables and annotating HostedClusters appropriately.
  • The solution should not impact the core functionality of HCP for self-managed and cloud-services (ROSA/ARO)
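A minimal sketch of how this could be wired up; only the environment variable names come from the requirements above, while the annotation key, values, namespaces, and deployment name are illustrative assumptions:

$ # "Test" operator: enable scoping and define which annotation value it owns
$ oc -n hypershift-test set env deployment/operator \
    ENABLE_HOSTEDCLUSTERS_ANNOTATION_SCOPING=true \
    HOSTEDCLUSTERS_SCOPE_ANNOTATION=scope-test
$ # Mark a HostedCluster so that only the "test" operator reconciles it
$ oc -n clusters annotate hostedcluster example \
    hypershift.openshift.io/scope=scope-test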

 

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes Applicable
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

  • Testing new versions or configurations of hypershift operators without impacting existing HostedClusters.
  • Gradual rollout of operator updates to a subset of HostedClusters for performance and compatibility testing.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

  • Automatic migration of HostedClusters between "test" and "production" operators.
  • Non-IBM integrations (e.g., MCE)

Background

Current hypershift operator functionality does not allow for selective management of HostedClusters, limiting the ability to test new operator versions or configurations in a live environment without affecting production workloads.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Upstream only for now

Interoperability Considerations

  • The solution should not impact the core functionality of HCP for self-managed and cloud-services (ROSA/ARO)

As a mgmt cluster admin, I want to be able to run multiple hypershift-operators that operate on disjoint sets of HostedClusters.

Feature Overview (aka. Goal Summary)  

Create an Installer RHEL9-based build for FIPS-enabled OpenShift installations

Goals (aka. expected user outcomes)

As a user, I want to enable FIPS while deploying OpenShift on any platform that supports this standard, so the resultant cluster is compliant with FIPS security standards

Requirements (aka. Acceptance Criteria):

Provide a dynamically linked build of the Installer for RHEL 9 in the release payload

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes n/a
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network all
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility  
Backport needed (list applicable versions) no
UI need (e.g. OpenShift Console, dynamic plugin, OCM) OCM
Other (please specify)  

Documentation Considerations

Docs will need to guide users on which Installer binary to use for FIPS-enabled clusters

Epic Goal

  • Provide a dynamically-linked build of the installer for RHEL 9 in the release payload

Why is this important?

  • RHEL 9 is out and users are switching to it
  • When the host is in FIPS mode, a dynamically-linked build of the installer is required
  • RHEL 8 & 9 have different versions of OpenSSL, so there's no one build that can work on both.

Scenarios

  1. Installing a FIPS mode cluster with the installer running on RHEL 9 with FIPS enabled
  2. Installing a baremetal IPI cluster with the installer running on RHEL 9

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ART-7207

Open questions::

  1. Do we need to build for both RHEL 8 & RHEL 9, or could we just switch the build to RHEL 9 at this point?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently to use baremetal IPI, a user must retrieve the openshift-baremetal-installer binary from the release payload. Historically, this was due to it needing to dynamically link to libvirt. This is no longer the case, so we can make baremetal IPI available in the standard openshift-installer binary.

As a user, I want to know how to download and use the correct installer binary to install a cluster with FIPS mode enabled. If I use the wrong binary or don't have FIPS enabled, I need instructions at the point I am trying to create a FIPS-mode cluster.

Allow the user to do oc release extract --command=openshift-install-fips to obtain an installer binary that is FIPS-ready.
The binary extracted will be the same one as is extracted when the command is openshift-baremetal-install; this name is provided for convenience.
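For example (the release pullspec is illustrative, and the exact name of the extracted binary is an assumption):

$ oc adm release extract --command=openshift-install-fips \
    --registry-config=pull-secret.json \
    quay.io/openshift-release-dev/ocp-release:4.16.0-rc.1-x86_64
$ ./openshift-install-fips version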

As a user with no FIPS requirement, I want to be able to use the same openshift-installer binary on both RHEL 8 and RHEL 9, as well as other common Linux distributions.

User Story:

As a customer, I want to be able to:

  • Have installer binaries in the release payload based on RHEL9

so that I can achieve

Acceptance Criteria:

Description of criteria:

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

libvirt is not a supported platform for openshift-installer. Nonetheless, it appears in the platforms list (likely inadvertently?) when running the openshift-baremetal-installer binary because the code for it was enabled in order to link against libvirt.
Now that linking against libvirt is no longer required, there is no reason to continue shipping this unsupported code.

We will need to come up with a separate build tag to distinguish between the openshift-baremetal-install (dynamic) and openshift-install (static) builds. Currently these are distinguished by the libvirt tag.
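As a rough sketch of the distinction (the new tag name is hypothetical; paths and flags are illustrative):

$ # static build, as shipped for openshift-install today
$ go build -o openshift-install ./cmd/openshift-install
$ # dynamically linked build selected by a dedicated tag instead of "libvirt"
$ CGO_ENABLED=1 go build -tags fipscapable \
    -o openshift-baremetal-install ./cmd/openshift-install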

Feature Overview

Adding nodes to on-prem clusters in OpenShift is in general a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansions easier will let users add nodes often and fast, leading to a much improved UX.

This feature adds nodes to any on-prem clusters, regardless of their installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user, regardless of how the cluster was installed.

Goals and requirements

  • Users can install a host on day 2 using a bootable image to an OpenShift cluster.
  • At least platforms baremetal, vSphere, none and Nutanix are supported
  • Clusters installed with any installation method can be expanded with the image
  • Clusters don't need to run any special agent to allow the new nodes to join.

How this workflow could look

1. Create image:

$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker

2. Boot image

3. Check progress

$ oc adm add-node 

Consolidate options

An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide a much simpler experience (see "Why is this important" below). We have official and field-documented ways to do this that could be removed once this feature is in place, simplifying the experience, our docs and the maintenance of said official paths:

  • UPI: Adding RHCOS worker nodes to a user-provisioned infrastructure cluster
    • This feature will replace the need to use this method for the majority of UPI clusters. The current UPI method consists of many manual steps. The new method would replace it with a couple of commands and apply to probably more than 90% of UPI clusters.
  • Field-documented methods and asks
  • IPI:
    • There are instances where adding a node to a bare metal IPI-deployed cluster can't be done via its BMC. This new feature, while not replacing the day-2 IPI workflow, solves the problem for this use case.
  • MCE: Scaling hosts to an infrastructure environment
    • This method is the most time-consuming and in many cases overkill, but currently, along with the UPI method, it is one of the two options we can give to users.
    • We shouldn't need to ask users to install and configure the MCE operator and its infrastructure for single clusters, as it becomes a project even larger than the UPI method; MCE should be saved for when there's more than one cluster to manage.

With this proposed workflow we eliminate the need of using the UPI method in the vast majority of the cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.

In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.

This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.

Why is this important

This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).

Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.

Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.

Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.

Lastly, this problem is often brought up in the field, where redhatters working with customers have put different custom solutions and automations in place, adding to the inconsistent processes used to scale clusters.

Oracle Cloud Infrastructure

This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared their feedback, making the lack of cluster expansion a problem that Red Hat and Oracle are required to solve.

Existing work

We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.

Day 2 node addition with agent image.

Yet Another Day 2 Node Addition Commands Proposal

Enable day2 add node using agent-install: AGENT-682

 

Epic Goal

  • Cleanup/carryover work from AGENT-682 for the GA release

Why is this important?

  • Address all the required elements for GA, such as FIPS compliance. This will allow a smoother integration of the node-joiner into the oc tool, as planned in OCPSTRAT-784

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. https://issues.redhat.com/browse/AGENT-682

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Modify dev-scripts and add a new job that tries to add a node to an existing cluster (using the scripts)

Feature Overview (aka. Goal Summary)  

Red Hat allows the following roles for the system:anonymous user and the system:unauthenticated group:

$ oc get clusterrolebindings -o json | jq '.items[] | select(.subjects[]?.kind == "Group" and .subjects[]?.name == "system:unauthenticated") | .metadata.name' | uniq

Returns what unauthenticated users can do, which is the following:

"self-access-reviewers"

"system:oauth-token-deleters"

"system:openshift:public-info-viewer"

"system:public-info-viewer"

"system:scope-impersonation"

"system:webhooks"

Customers would like to minimize the allowed permissions to unauthenticated groups and users. 

It was determined after initial analysis that the following roles are necessary for OIDC access and version information and will not change:

"system:openshift:public-info-viewer"

"system:public-info-viewer"

 

Workaround available: Gating the access with policy engines 

Expected: Minimize the allowed roles for unauthenticated access 

Goals (aka. expected user outcomes)

Reduce use of cluster-wide permissions for the system:anonymous user and system:unauthenticated group for the following roles (an illustrative command follows the list below):

"self-access-reviewers"

"system:oauth-token-deleters"

"system:scope-impersonation"

"system:webhooks"

 

Requirements (aka. Acceptance Criteria):

Customers would like to minimize the allowed permissions to unauthenticated groups and users. 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster) Yes
Hosted control planes tbd
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

It was determined that following roles will not change

"system:openshift:public-info-viewer"

"system:public-info-viewer"

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Many security teams have flagged anonymous access on apiserver as a security risk. Reducing the permissions granted at cluster level helps in hardening access to apiserver. 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

This feature should not impact upgrade from previous versions. 

This feature will allow enabling the new functionality for fresh installs. 

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Impact to existing usecases should be documented 

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Red Hat allows the following roles for the system:anonymous user and the system:unauthenticated group:

$ oc get clusterrolebindings -o json | jq '.items[] | select(.subjects[]?.kind == "Group" and .subjects[]?.name == "system:unauthenticated") | .metadata.name' | uniq

Returns what unauthenticated users can do, which is the following:

"self-access-reviewers"

"system:oauth-token-deleters"

"system:openshift:public-info-viewer"

"system:public-info-viewer"

"system:scope-impersonation"

"system:webhooks"

Customers would like to minimize the allowed permissions to unauthenticated groups and users. 

Workaround available: Gating the access with policy engines 

Outcome: Minimize the allowed roles for unauthenticated access 

Goals of spike:

  1. Investigate impact of disabling the roles listed above for new and existing clusters
  2. Document risks and feasibility 

Feature description

Oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:

  • Manage complex air-gapped scenarios, providing support for the enclaves feature
  • Faster and more robust: introduces caching, it doesn’t rebuild catalogs from scratch
  • Improves code maintainability, making it more reliable and easier to add features and fixes, and includes a feature plugin interface
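For example, a sketch of the v2 invocation (file names, registry, and flag spellings are illustrative and may differ between builds):

$ # mirror to disk in a disconnected workflow
$ oc mirror --v2 -c imageset-config.yaml file://./mirror-archive
$ # later, push from disk to the target registry
$ oc mirror --v2 -c imageset-config.yaml --from file://./mirror-archive \
    docker://registry.example.com/mirror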

 

Continue scale testing and performance improvements for ovn-kubernetes

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Manage OpenShift Virtual Machines' IP addresses from within the SDN solution provided by OVN-Kubernetes.

Why is this important?

Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority+ is set by engineering
  • Epic must be Linked to a +Parent Feature
  • Target version+ must be set
  • Assignee+ must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Review, refine and harden the CAPI-based Installer implementation introduced in 4.16

Goals (aka. expected user outcomes)

From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.

Requirements (aka. Acceptance Criteria):

Review existing implementation, refine as required and harden as possible to remove all the existing technical debt

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Documentation Considerations

There should not be any user-facing documentation required for this work

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Continue to refine and harden aspects of CAPI-based Installs launched in 4.16

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Less verbose logging during stdout
  • bundle logs on failure
  • place post-install capi manifests in hidden dir
  • place envtest.kubeconfig in hidden dir

Epic Goal

  • Review design and development PRs that require feedback from NE team.

Why is this important?

  • Customer requires certificates to be managed by cert-manager on configured/newly added routes.

Acceptance Criteria

  • All PRs are reviewed and merged.

Dependencies (internal and external)

  1. CFE team dependency for addressing review suggestions.

Done Checklist

  • DEV - All related PRs are merged.

In the current version, the router does not support loading secrets directly; it loads the private key and certificate from the Route resource, which exposes these security artifacts. (A sketch of the intended usage follows the acceptance criteria below.)

 

Acceptance criteria :

  1. Support router to load secrets from secret reference.
  2. E2E testcases (Targeted for GA)
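A rough sketch of the intended usage, assuming the tech-preview externalCertificate field on the Route TLS configuration and an existing kubernetes.io/tls secret in the route's namespace (names are illustrative):

$ cat <<'EOF' | oc apply -f -
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: myapp
  namespace: demo
spec:
  host: myapp.apps.example.com
  to:
    kind: Service
    name: myapp
  tls:
    termination: edge
    externalCertificate:
      name: myapp-tls    # secret holding the private key and certificate
EOF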

Update cluster-ingress-operator to bootstrap router with required featuregates

 

The cluster-ingress-operator will propagate the relevant Tech-Preview feature gate down to the router. This feature gate will be added as a command-line argument called ROUTER_EXTERNAL_CERTIFICATE to the router and will not be user configurable.

 

Refer:

 

Acceptance criteria 

  • Introduce new cmdline arg on router to inject ExternalCertificate featuregate status
  • Dev test if injected env is available in the router pod
  • Update any affected UTs
  • Need to inject openshift featuregates
  • Update route validator to inject dependencies required for route validations from library-go changes
  • Update Invocation order of validations
  • Use openshift featuregates
  • Update route validator to inject dependencies required for route validations from library-go changes
  • Update Invocation order of validations

Bump Kubernetes in openshift-apiserver to 1.29.2 to unblock CFE-885.

 

Background:

We need to bump openshift/library-go with the latest commit into openshift/openshift-apiserver in order to vendor the Route validation changes done in https://github.com/openshift/library-go/pull/1625, but due to the kube version mismatch between library-go and openshift-apiserver there are some dependency issues.

library-go is at 1.29, but openshift-apiserver is still using 1.28.

References:

Goal:
Support enablement of dual-stack VIPs on existing clusters created as dual-stack but at a time when it was not possible to have both v4 and v6 VIPs at the same time.

Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").

We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.

We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.

For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.

We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.

There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.

This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.

Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.

Infrastructure.Spec will be modified by end-user. CNO needs to validate those changes and if valid, propagate them to Infrastructure.Status
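For example, a hedged sketch of what such a day-2 change could look like on a baremetal cluster; the field names under spec.platformSpec are assumptions based on the plural VIP fields discussed above:

$ oc patch infrastructure cluster --type=merge -p '
{"spec":{"platformSpec":{"baremetal":{
  "apiServerInternalIPs":["192.0.2.10","2001:db8::10"],
  "ingressIPs":["192.0.2.11","2001:db8::11"]}}}}'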

For clusters that are installed as fresh 4.15, o/installer will populate Infrastructure.Spec and Infrastructure.Status based on the install-config. However, for clusters that are upgraded, this code in o/installer will never run.

In order to have a consistent state at upgrade, we will make CNO propagate Status back to Spec when the cluster is upgraded to OCP 4.15.

As we have already done it when introducing multiple VIPs (API change that created plural field next to the singular), all the necessary code scaffolding is already in place.

TLDR: cluster-bootstrap and network-operator fight for the same resource using the `fieldManager` property of k8s objects. We need to take over everything that has been owned by cluster-bootstrap and manage it ourselves.

Feature Overview (aka. Goal Summary)  

The Agent Based installer is a clean and simple way to install new instances of OpenShift in disconnected environments, guiding the user through the questions and information needed to successfully install an OpenShift cluster. We need to bring this highly useful feature to the IBM Power and IBM zSystem architectures

 

Goals (aka. expected user outcomes)

Agent based installer on Power and zSystems should reflect what is available for x86 today.

 

Requirements (aka. Acceptance Criteria):

Able to use the agent based installer to create OpenShift clusters on Power and zSystem architectures in disconnected environments

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • The goal of this Epic is to enable Agent Based Installer for P/Z

Why is this important?

  • The Agent Based installer is a research Spike item for the Multi-Arch team during the 4.12 release and later

Scenarios
1. …

Acceptance Criteria

  • See "Definition of Done" below

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

As the multi-arch engineer, I would like to build an environment and deploy using Agent Based installer, so that I can confirm if the feature works per spec.

Entrance Criteria

  • (If there is research) research completed and proven that the feature could be done

Acceptance Criteria

  • “Proof” of verification (Logs, etc.)
  • If independent test code written, a link to the code added to the JIRA story

Feature Overview:

Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:

  • Azure Disk CSI driver
  • Azure File CSI driver
  • Azure File CSI driver operator

Value Statement:

This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about workloads, not the management stack needed to operate their clusters; this feature gets us closer to that goal.

Goals:

  1. Ability for customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  2. Ability to run cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster.
  3. Ability to run the driver DaemonSet in the hosted cluster.

Requirements:

  1. The feature must ensure that the CSI Stack for Azure is installed and running on management clusters with hosted control planes.
  2. The feature must allow customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  3. The feature must enable the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to run on the appropriate clusters.
  4. The feature must enable the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods to run in the management cluster.
  5. The feature must enable the driver DaemonSet to run in the hosted cluster.
  6. The feature must ensure security, reliability, performance, maintainability, scalability, and usability.

Use Cases:

  1. A customer wants to run their Azure infrastructure using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. They use this feature to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  2. A customer wants to use Azure storage without having to see/manage its stack, especially on a managed service. This would mean that we need to run the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster and the driver DaemonSet in the hosted cluster. 

Questions to Answer:

  1. What Azure-specific considerations need to be made when designing and delivering this feature?
  2. How can we ensure the security, reliability, performance, maintainability, scalability, and usability of this feature?

Out of Scope:

Non-CSI Stack for Azure-related functionalities are out of scope for this feature.

Workload identity authentication is not covered by this feature - see STOR-1748

Background

This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.

Documentation Considerations:

Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.

Interoperability Considerations:

This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".

 

 
Why is this important? (mandatory)

This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about workloads, not the management stack needed to operate their clusters; this feature gets us closer to that goal.

 
Scenarios (mandatory) 

When leveraging Hosted control planes, the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.

 
Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation -
  • QE - 
  • PX - 
  • Others -

 

Done - Checklist (mandatory)

As part of this epic, Engineers working on Azure Hypershift should be able to build and use Azure Disk storage on hypershift guests via developer preview custom build images.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".

 

 
Why is this important? (mandatory)

This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about workloads, not the management stack needed to operate their clusters; this feature gets us closer to that goal.

 
Scenarios (mandatory) 

When leveraging Hosted control planes, the Azure File CSI driver operator + Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.

 
Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation -
  • QE - 
  • PX - 
  • Others -

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

This enhances the EgressQoS CRD with status information and provides an implementation to update this field with relevant information when creating/updating an EgressQoS.
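A minimal sketch, assuming the k8s.ovn.org/v1 EgressQoS API; the spec fields follow the existing CRD, while the exact shape of the new status field is an assumption:

$ cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressQoS
metadata:
  name: default
  namespace: demo
spec:
  egress:
  - dscp: 46
    dstCIDR: 10.0.0.0/24
EOF
$ # once the status support lands, inspect what the controller reported
$ oc -n demo get egressqos default -o jsonpath='{.status}'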

Feature Overview (aka. Goal Summary)  

The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.

BYO Identity will help facilitate CLI-only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes. 

Goals (aka. expected user outcomes)

Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.

OpenShift OAuth server is still available as default option, with the ability to tune in the external OIDC provider as a Day-2 configuration.

Requirements (aka. Acceptance Criteria):

  1. The customer should be able to tie into RBAC functionality, similar to how it is closely aligned with OpenShift OAuth 
  2.  

Use Cases (Optional):

  1. As a customer, I would like to integrate my OIDC Identity Provider directly with the OpenShift API server.
  2. As a customer in multi-cluster cloud environment, I have both K8s and non-K8s clusters using my IDP and hence I need seamless authentication directly to the OpenShift/K8sAPI using my Identity Provider 
  3.  

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • This epic should contain all the followup stories that should land in 4.16
  • figure out if these followups will need to be backported

Why is this important?

  • Console and console-operator implementation will require additional followups - tests, refactoring, ...

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/CONSOLE-3804

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CONSOLE-3804

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview
This is a TechDebt and doesn't impact OpenShift Users.
As the autoscaler has become a key feature of OpenShift, there is the requirement to continue to expand its use, bringing all the features to all the cloud platforms and contributing to the community upstream. This feature is to track the initiatives associated with the Autoscaler in OpenShift.

Goals

  • Scale from zero available on all cloud providers (where available)
  • Required upstream work
  • Work needed as a result of rebase to new kubernetes version

Requirements

Requirement Notes isMvp?
vSphere autoscaling from zero   No
Upstream E2E testing   No 
Upstream adapt scale from zero replicas   No 
     

Out of Scope

n/a

Background, and strategic fit
Autoscaling is a key benefit of the Machine API and should be made available on all providers

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update docs to mention any change to where the features are available.

Epic Goal

  • Update the scale from zero autoscaling annotations on MachineSets to conform with the upstream keys, while also continuing to accept the openshift specific keys that we have been using.

Why is this important?

  • This change makes our implementation of the cluster autoscaler conform to the API that is described in the upstream community. This reduces the mental overhead for someone that knows kubernetes but is new to openshift.
  • This change also reduces the maintenance burden that we carry in the form of additional patches to the cluster autoscaler. By changing our controllers to understand the upstream annotations we are able to remove extra patches on our fork of the cluster autoscaler, making future maintenance easier and closer to the upstream source.

Scenarios

  1. A user is debugging a cluster autoscaler issue by examining the related MachineSet objects, they see the scale from zero annotations and recognize them from the project documentation and from upstream discussions. The result is that the user is more easily able to find common issues and advice from the upstream community.
  2. An openshift maintainer is updating the cluster autoscaler for a new version of kubernetes, because the openshift controllers understand the upstream annotations, the maintainer does not need to carry or modify a patch to support multiple varieties of annotation. This in turn makes the task of updating the autoscaler simpler and reduces burden on the maintainer.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Scale from zero autoscaling must continue to work with both the old openshift annotations and the newer upstream annotations.
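For illustration, the two annotation flavors on a MachineSet; the keys below are the commonly used upstream capacity annotations and the existing OpenShift ones, and the names/values are placeholders:

$ oc -n openshift-machine-api annotate machineset example-worker \
    capacity.cluster-autoscaler.kubernetes.io/cpu=4 \
    capacity.cluster-autoscaler.kubernetes.io/memory=16384Mi
$ # OpenShift-specific equivalents that must keep working:
$ #   machine.openshift.io/vCPU=4
$ #   machine.openshift.io/memoryMb=16384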

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - OpenShift code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - OpenShift documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - OpenShift build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - OpenShift documentation merged: <link to meaningful PR>

Please note: the changes described by this epic will happen in OpenShift controllers, and as such there is no "upstream" relationship in the same sense as the Kubernetes-based controllers.

User Story

As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.

Background

During the process of making the CAO recognize the annotations, we need to enable it to modify the machineset to have the new annotation. Similarly, we want the autoscaler to recognize both sets of annotations in the short term while we switch.

Steps

  • update CAO to enable annotations
  • write unit tests

Stakeholders

  • openshift cloud team

Definition of Done

  • CAO applies upstream annotations, leaves old annotations alone
  • Docs
  • upstream annotations referenced in product docs
  • Testing
  • unit testing of behavior

User Story

As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.

Background

as part of the effort to migrate to the upstream scale from zero annotations, we should add e2e tests which confirm the presence of the annotations. this can be an addition to our current scale from zero tests.

Steps

  • update current scale from zero test to include a scan for the upstream capacity annotations

Stakeholders

  • openshift eng

Definition of Done

  • tests check for annotations
  • Docs
  • n/a
  • Testing
  • this is all testing

Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.

Why is this important?

  • For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator.  Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
  • Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), implications for auto scaling, chargeback/showback use scenarios, etc.
  2. Disabled configuration must persist throughout cluster lifecycle including upgrades

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Links:

RFE: https://issues.redhat.com/browse/RFE-3395

Enhancement PR: https://github.com/openshift/enhancements/pull/1415

API PR: https://github.com/openshift/api/pull/1516

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950

Background

Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.

Epic Goal

Implement the ingress capability focusing on the HyperShift users.

Non-Goals

  • Fully implement the ingress capability on the standalone OpenShift.

Design

As described in the EP PR.

Why is this important?

  • For Managed OpenShift on AWS (ROSA), we use the AWS load balancer and don't need the Ingress operator. Disabling the Ingress Operator will reduce our resource consumption on infra nodes for running OpenShift on AWS.
  • Customers want to be able to disable the Ingress Operator and use their own component.

Scenarios

  1. ...

Acceptance Criteria

  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ingress Operator can be disabled on HyperShift.

  • Dependent operators and OpenShift components can tolerate the disabled ingress operator on HyperShift.

Dependencies (internal and external)

  1. The install-config and ClusterVersion API have been updated with the capability feature.
  2. The console operator.
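For reference, capabilities are selected at install time through the install-config capabilities stanza; a sketch assuming the new capability is surfaced there (capability names shown are illustrative):

$ cat <<'EOF' >> install-config.yaml
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - MachineAPI
  - Console
EOF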

Previous Work (Optional):

Open questions:

  1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal
The goal of this user story is to add the new (ingress) capability to the cluster operator's payload (manifests: CRDs, RBACs, deployment, etc.).

Out of scope

  • CRDs and operands created at runtime (assets: gateway CRDs, controllers, subscription, etc.)

Acceptance criteria

  • The ingress capability is known to the openshift installer.
  • The new capability does not introduce any new regression to the e2e tests.

Links

Context
HyperShift uses the cluster-version-operator (CVO) to manage the part of the ingress operator's payload, namely CRDs and RBACs. The ingress operator's deployment though is reconciled directly by the control plane operator. That's why HyperShift projects the ClusterVersion resource into the hosted cluster and the enabled capabilities have to be set properly to enable/disable the ingress operator's payload and to let users as well as the other operators be aware of the state of the capabilities.

Goal
The goal of this user story is to implement a new capability in the OpenShift API repository. Use this procedure as example.

Goal
The goal of this user story is to bump the openshift api which contains the ingress capability.

Acceptance criteria

  • The ingress capability is known by the cluster version operator.
  • The new capability does not introduce any new regression to the openshift CI test (test example).

Links

Epic Goal

  • Allow administrators to define which machineconfigs won't cause a drain and/or reboot.
  • Allow administrators to define which ImageContentSourcePolicy/ImageTagMirrorSet/ImageDigestMirrorSet won't cause a drain and/or reboot
  • Allow administrators to define what services need to be started or restarted after writing the new config file.

Why is this important?

  • There is a demonstrated need from customer cluster administrators to push configuration settings and restart system services without restarting each node in the cluster. 
  • Customers are modifying ICSP/ITMS/IDMS objects, or adding new ones, post day 1.

Scenarios

  1. As a cluster admin, I want to reconfigure sudo without disrupting workloads.
  2. As a cluster admin, I want to update or reconfigure sshd and reload the service without disrupting workloads.
  3. As a cluster admin, I want to remove mirroring rules from an ICSP, ITMS, IDMS object without disrupting workloads because the scenario in which this might lead to non-pullable images at a undefined later point in time doesn't apply to me.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This epic is another epic under the "reduce workload disruptions" umbrella. 

This is now updated to get us most of the way to MCO-200 (Admin-Defined reboot & drain), but not necessarily with all the final features in place.

This epic aims to create a reboot/drain policy object and an MCO-management apparatus for initial functionality with MachineConfig-backed updates, with a restricted set of actions for the user. We also need a reboot/drain policy object for ImageContentSourcePolicy, ImageTagMirrorSet and ImageDigestMirrorSet to avoid drains/reboots when admins use these APIs and have other ways of ensuring image integrity.
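
To make the intended shape concrete, here is a rough, hypothetical sketch of what such a policy object could look like; the kind, field names and action types below are illustrative assumptions only, not the final API:

  # Hypothetical example only: names and structure are placeholders for the API to be designed.
  apiVersion: machineconfiguration.openshift.io/v1alpha1
  kind: NodeDisruptionPolicy
  metadata:
    name: cluster
  spec:
    files:
      - path: /etc/ssh/sshd_config.d/90-custom.conf   # a change to this file should not drain or reboot
        actions:
          - type: Restart                             # restart a unit instead of rebooting the node
            restart:
              serviceName: sshd.service
    registries:
      - source: registry.example.com                  # ICSP/IDMS/ITMS-driven registries.conf changes
        actions:
          - type: None                                # apply the change without drain or reboot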

This mostly focuses on the user interface for defining reboot/drain policies. We will also need this for the layering "live apply" cases and bifrost-backed updates, to be implemented in a future update.

The MCO's reboot and drain rules are currently hard-coded in the machine-config-daemon here.

Node drains also occur even beyond OCP 4.9 when not just adding but also removing ICSP, ITMS, IDMS objects or single mirroring rules in their configuration, according to RFE-3667.

This causes at least three problems:

  • A user does not know what the rules are unless they read the code (the rules aren't visible to the user)
  • The controller can't see the rules to "pre-compute" the effect that a MachineConfig will have on a Node before that MachineConfig is delivered (which makes it hard for a user to know what will actually happen if they apply a config)
  • The only way for a template owner to mark their config as "does not require reboot" is to edit the MCD code

Done when:

  • A CRD is defined for post config action policies covering both MCO and ICSP/ITMS/IDMS APIs
  • The existing daemon rules are broken out into one of these resources
  • The reboot/drain policies are visible in the cluster (e.g. "oc get rebootpolicies")
  • The drain controller handles processing and validation of the user's policies (and could put the computed post-config actions in the machineconfig's and ICSP/ITMS/IDMS status or the custom image's metadata if layering)
  • A template owner has a procedure to mark that their template config changing does/does not require a reboot

Description of problem:

The MCO logic today allows users to not reboot when changing the
registries.conf file (through ICSP/IDMS/ITMS objects), but the MCO will
sometimes drain the node if the change is deemed "unsafe" (deleting a
mirror, for example).

This behaviour is very disruptive for some customers who wish to make all image registry changes non-disruptive. We will address this long term with admin-defined policies via the API, but we would like a backportable solution (as a support exception) for users in the meantime.

Version-Release number of selected component (if applicable):

    4.14->4.16

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Have the MCC validate the correctness of the user-provided spec, and render the final object into the status for the daemon to use.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

  • The goal of this epic is to enable monitoring on the Windows nodes created by the Windows Machine Config Operator (WMCO) in an OpenShift cluster.

Why is this important?

  • Monitoring is critical to identify issues with nodes, pods and containers running on the nodes. Monitoring enables users to make informed decisions, optimize performance, and strategically manage infrastructure costs. Implementation of this epic will ensure that we have a consistent user experience for monitoring across Linux and Windows.

Scenarios

Display console graphs for the following:

  1. Windows node metrics.
  2. Windows pod metrics.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated to test that the console graphs are present.
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. None

Previous Work :

  1. Enhancement: Monitoring-windows-nodes

Open questions::

      1. Specifics on the testing framework (adding tests to https://github.com/openshift/console) are pending story creation.

      2. Feasibility of https://issues.redhat.com/browse/WINC-530

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User or Developer story

As a WMCO user, I want filesystem graphs to be visible for Windows nodes.

Description

WMCO uses existing console queries to display graphs for Windows nodes. Changes to the filesystem queries on the console side (https://github.com/openshift/console/pull/7201) prevent filesystem graphs from being displayed for Windows nodes.

Engineering Details

The root cause is that windows_exporter does not return any value for the `mountpoint` field used in the console queries.

Acceptance Criteria

The filesystem graph is populated for Windows nodes in the OpenShift console.

Description

When inspecting a pod in the OCP console, the metrics tab shows a number of graphs. The networking graphs do not show data for Windows pods.

Engineering Details

Using windows exporter 0.24.0 I am getting `No datapoints found` for the query made by the console:
(sum(irate(container_network_receive_bytes_total{pod='win-webserver-685cd6c5cc-8298l'}[5m])) by (pod, namespace, interface)) + on(namespace,pod,interface) group_left(network_name) (pod_network_name_info)

There is data returned from the query `irate(container_network_receive_bytes_total{pod='windows-machine-config-operator-7c8bcc7b64-sjqxw'}[5m])`, which makes me believe the error is due to pod_network_name_info not having data for the Windows pods I am looking at.

I confirmed that by checking the namespace the workloads are deployed to via the query pod_network_name_info{namespace="openshift-windows-machine-config-operator"}: I only see metrics for the Linux pods in the namespace.

Looking into this it seems like these metrics are coming from https://github.com/openshift/network-metrics-daemon which runs on each Linux node, and creates a metric for applicable pods running on the node.

Acceptance Criteria

  • The pod network graphs shown in the OCP console are populated for Windows pods.

Executive Summary

Image and artifact signing is a key part of a DevSecOps model. The Red Hat-sponsored sigstore project aims to simplify signing of cloud-native artifacts and sees increasing interest and uptake in the Kubernetes community. This document proposes to incrementally invest in OpenShift support for sigstore-style signed images and be public about it. The goal is to give customers a practical and scalable way to establish content trust. It will strengthen OpenShift’s security philosophy and value-add in the light of the recent supply chain security crisis.

 

CRI-O

  1. Support customer image validation
  2. Support OpenShift release image validation

https://docs.google.com/document/d/12ttMgYdM6A7-IAPTza59-y2ryVG-UUHt-LYvLw4Xmq8/edit# 

 

 

Goal

The goals of this feature are:

  • Optimize and streamline the operations of the HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters.
  • Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Goal

We need to be able to install the HO with external DNS and create HCPs on AKS clusters

Why is this important?

  • AKS clusters will serve as the management clusters on ARO HCP.

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a HyperShift CLI user, I want to be able to:

  • include my pull secret when installing the HyperShift operator with an externalDNS setup

so that I can achieve

  • the ability to pull the Red Hat version of the externalDNS image.

Acceptance Criteria:

Description of criteria:

  • The ability to include the pull secret on a HyperShift operator installation is available in the HyperShift CLI

Out of Scope:

  • Any additional changes needed to install HCPs on AKS

Engineering Details:

  • N/A

This requires/does not require a design proposal.
This requires/does not require a feature gate.

The cloud-network-config-operator is being deployed on HyperShift with `runAsNonRoot` set to true. When HCP is deployed on non-OpenShift management clusters, such as AKS, this needs to be unset so the pod can run as root.

This is currently causing issues deploying this pod on HCP on AKS with the following error:

      state:
        waiting:
          message: 'container has runAsNonRoot and image will run as root (pod: "cloud-network-config-controller-59d4677589-bpkfp_clusters-brcox-hypershift-arm(62a4b447-1df7-4e4a-9716-6e10ec55d8fd)", container: hosted-cluster-kubecfg-setup)'
          reason: CreateContainerConfigError 
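
For context, the relevant knob is the standard Kubernetes securityContext on the deployment's pod template; a minimal illustrative fragment (the container name and image below are placeholders, not the actual manifest):

  # Illustrative Deployment pod template fragment; only runAsNonRoot is the point here.
  spec:
    template:
      spec:
        securityContext:
          runAsNonRoot: true   # must be left unset (or false) on non-OpenShift management clusters such as AKS,
                               # otherwise the kubelet rejects the container with CreateContainerConfigError
        containers:
          - name: cloud-network-config-controller                              # placeholder name
            image: example.invalid/cloud-network-config-controller:latest      # placeholder image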

User Story:

As a user of HyperShift on Azure, I want to be able to:

  • create HCP with externalDNS setup on AKS clusters

so that I can achieve

  • the ability to successfully setup HCPs on AKS for the ARO HCP effort.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation on setting up an HCP with externalDNS on AKS
  • Any code changes needed to successfully setup HCP with externalDNS on AKS has been peer reviewed and approved

Out of Scope:

  • Any private or publicPrivate topology setup.

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

CSI pods are failing to create on HCP on an AKS cluster.

% k get pods | grep -v Running
NAME                                                  READY   STATUS                       RESTARTS   AGE
csi-snapshot-controller-cfb96bff7-7tc94               0/1     CreateContainerConfigError   0          17h
csi-snapshot-webhook-57f9799848-mlh8k                 0/1     CreateContainerConfigError   0          17h 

The issue is:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  34m                    default-scheduler  Successfully assigned clusters-brcox-hypershift-arm/csi-snapshot-controller-cfb96bff7-7tc94 to aks-nodepool1-24902778-vmss000001
  Normal   Pulling    34m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83"
  Normal   Pulled     34m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:805280104b2cc8d8799b14b2da0bd1751074a39c129c8ffe5fc1b370671ecb83" in 3.036610768s (10.652709193s including waiting)
  Warning  Failed     32m (x12 over 34m)     kubelet            Error: container has runAsNonRoot and image will run as root (pod: "csi-snapshot-controller-cfb96bff7-7tc94_clusters-brcox-hypershift-arm(45ab89f2-9c00-4afa-bece-7846505edbfc)", container: snapshot-controller) 

User Story:

As a hub cluster admin, I want to be able to:

  • Detect when running in managed vs self-managed mode in Azure

so that I can prevent

  • Impossible usage of managed service components in self-managed use cases

Acceptance Criteria:

Description of criteria:

  • HyperShift controllers are aware of when they run in managed environment
  • HyperShift components have a single programmatic way to check in which mode they run
  • HyperShift deployment takes the mode in consideration when setting up its components

(optional) Out of Scope:

Implementing managed-only configs is out of scope for this story

Engineering Details:

  •  

This does not require a design proposal.
This does not require a feature gate.


Feature Overview

As of OpenShift 4.14, this functionality is Tech Preview for all platforms but OpenStack, where it is GA.  This Feature is to bring the functionality to GA for all remaining platforms.

Feature Description

Allow configuring control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the Ingress and API VIPs pointing to nodes in the separate subnets.

Goals

I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and nodes across multiple subnets.

I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that comes with the on-prem platforms.

Background, and strategic fit

Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.

Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.

Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.

While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.

Functionalities per Epic

 

Epic Control Plane with Multiple Subnets  Compute with Multiple Subnets Doesn't need external LB Built-in LB
NE-1069 (all-platforms)
NE-905 (all-platforms)
NE-1086 (vSphere)
NE-1087 (Bare Metal)
OSASINFRA-2999 (OSP)  
SPLAT-860 (vSphere)
NE-905 (all platforms)
OPNET-133 (vSphere/Bare Metal for AI/ZTP)
OSASINFRA-2087 (OSP)
KNIDEPLOY-4421 (Bare Metal workaround)
SPLAT-409 (vSphere)

Previous Work

Workers on separate subnets with IPI documentation

We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow

This procedure works on vSphere too, albeit without QE CI coverage and without documentation.

External load balancer with IPI documentation

  1. Bare Metal: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-post-installation-configuration.html#nw-osp-configuring-external-load-balancer_ipi-install-post-installation-configuration
  2. vSphere: https://docs.openshift.com/container-platform/4.11/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#nw-osp-configuring-external-load-balancer_installing-vsphere-installer-provisioned

Scenarios

  1. vSphere: I can define 3 or more networks in vSphere and distribute my masters and workers across them. I can configure an external load balancer for the VIPs.
  2. Bare metal: I can configure the IPI installer and the agent-based installer to place my control plane nodes and compute nodes on 3 or more subnets at installation time. I can configure an external load balancer for the VIPs.

Acceptance Criteria

  • Can place compute nodes on multiple subnets with IPI installations
  • Can place control plane nodes on multiple subnets with IPI installations
  • Can configure external load balancers for clusters deployed with IPI with control plane and compute nodes on multiple subnets
  • Can configure VIPs in an external load balancer routed to nodes on separate subnets and VLANs
  • Documentation exists for all the above cases

 

Upstream Kubernetes deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also giving users access to Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to change the global configuration to enforce the "restricted" pod security profile. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than "audit" and "warn".
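
For reference, these are the standard Pod Security Admission labels that the synchronization mechanism manages on a namespace, shown here with the values of the target global "restricted" posture (the namespace name is a placeholder):

  apiVersion: v1
  kind: Namespace
  metadata:
    name: example-namespace
    labels:
      pod-security.kubernetes.io/enforce: restricted
      pod-security.kubernetes.io/audit: restricted
      pod-security.kubernetes.io/warn: restricted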

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
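
Pinning is done with the openshift.io/required-scc annotation on the workload's pod template; a minimal sketch (the workload name, namespace, and SCC choice are examples only):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example-operator             # placeholder workload
    namespace: openshift-example       # placeholder platform namespace
  spec:
    template:
      metadata:
        annotations:
          openshift.io/required-scc: restricted-v2   # least-privileged SCC that satisfies the workload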

The following tables track progress.

Progress summary

# namespaces 4.18 4.17 4.16 4.15
monitored 82 82 82 82
fix needed 69 69 69 69
fixed 34 34 30 39
remaining 35 35 39 30
~ remaining non-runlevel 14 14 18 9
~ remaining runlevel (low-prio) 21 21 21 21
~ untested 2 2 2 82

Progress breakdown

# namespace 4.18 4.17 4.16 4.15
1 oc debug node pods #1763 #1816 #1818
2 openshift-apiserver-operator #573 #581
3 openshift-authentication #656 #675
4 openshift-authentication-operator #656 #675
5 openshift-catalogd #50 #58
6 openshift-cloud-credential-operator #681 #736
7 openshift-cloud-network-config-controller #2282 #2490 #2496  
8 openshift-cluster-csi-drivers     #170 #459 #484
9 openshift-cluster-node-tuning-operator #968 #1117
10 openshift-cluster-olm-operator #54 n/a
11 openshift-cluster-samples-operator #535 #548
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211
13 openshift-cluster-version     #1038 #1068
14 openshift-config-operator #410 #420
15 openshift-console #871 #908 #924
16 openshift-console-operator #871 #908 #924
17 openshift-controller-manager #336 #361
18 openshift-controller-manager-operator #336 #361
19 openshift-e2e-loki #56579 #56579 #56579 #56579
20 openshift-image-registry     #1008 #1067
21 openshift-ingress #1031      
22 openshift-ingress-canary #1031      
23 openshift-ingress-operator #1031      
24 openshift-insights     #915 #967
25 openshift-kni-infra #4504 #4542 #4539 #4540
26 openshift-kube-storage-version-migrator #107 #112
27 openshift-kube-storage-version-migrator-operator #107 #112
28 openshift-machine-api   #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443
29 openshift-machine-config-operator   #4219 #4384 #4393
30 openshift-manila-csi-driver #234 #235 #236
31 openshift-marketplace #578   #561 #570
32 openshift-metallb-system #238 #240 #241  
33 openshift-monitoring     #2335 #2420
34 openshift-network-console        
35 openshift-network-diagnostics #2282 #2490 #2496  
36 openshift-network-node-identity #2282 #2490 #2496  
37 openshift-nutanix-infra #4504 #4504 #4539 #4540
38 openshift-oauth-apiserver #656 #675
39 openshift-openstack-infra #4504 #4504 #4539 #4540
40 openshift-operator-controller #100 #120
41 openshift-operator-lifecycle-manager #703 #828
42 openshift-route-controller-manager #336 #361
43 openshift-service-ca #235 #243
44 openshift-service-ca-operator #235 #243
45 openshift-sriov-network-operator #754 #995 #999 #1003
46 openshift-storage        
47 openshift-user-workload-monitoring #2335 #2420
48 openshift-vsphere-infra #4504 #4542 #4539 #4540
49 (runlevel) default        
50 (runlevel) kube-system        
51 (runlevel) openshift-cloud-controller-manager        
52 (runlevel) openshift-cloud-controller-manager-operator        
53 (runlevel) openshift-cluster-api        
54 (runlevel) openshift-cluster-machine-approver        
55 (runlevel) openshift-dns        
56 (runlevel) openshift-dns-operator        
57 (runlevel) openshift-etcd        
58 (runlevel) openshift-etcd-operator        
59 (runlevel) openshift-kube-apiserver        
60 (runlevel) openshift-kube-apiserver-operator        
61 (runlevel) openshift-kube-controller-manager        
62 (runlevel) openshift-kube-controller-manager-operator        
63 (runlevel) openshift-kube-proxy        
64 (runlevel) openshift-kube-scheduler        
65 (runlevel) openshift-kube-scheduler-operator        
66 (runlevel) openshift-multus        
67 (runlevel) openshift-network-operator        
68 (runlevel) openshift-ovn-kubernetes        
69 (runlevel) openshift-sdn        

Workloads running in platform namespaces (openshift-, kube-, default) must have the required-scc annotation defined in order to pin a specific SCC (see AUTH-482 for more details). This task adds a monitor test that analyzes all such workloads and tests the existence of the "openshift.io/required-scc" annotation.

Openshift prefixed namespaces should all define their required PSa labels. Currently, the list of namespaces that are missing some or all PSa labels are the following:

namespace in review merged
openshift    
openshift-apiserver-operator PR  
openshift-cloud-credential-operator PR  
openshift-cloud-network-config-controller PR  
openshift-cluster-samples-operator PR  
openshift-cluster-storage-operator PR  
openshift-config OCPBUGS-28621 PR
openshift-config-managed PR  
openshift-config-operator PR  
openshift-console PR  
openshift-console-operator PR  
openshift-console-user-settings PR  
openshift-controller-manager PR   
openshift-controller-manager-operator PR  
openshift-dns-operator PR  
openshift-etcd-operator PR
openshift-host-network PR  
openshift-ingress-canary PR  
openshift-ingress-operator PR  
openshift-insights PR   
openshift-kube-apiserver-operator PR   
openshift-kube-controller-manager-operator PR   
openshift-kube-scheduler-operator PR   
openshift-kube-storage-version-migrator PR   
openshift-kube-storage-version-migrator-operator PR   
openshift-network-diagnostics PR   
openshift-node    
openshift-operator-lifecycle-manager PR   
openshift-operators  PR  
openshift-route-controller-manager PR   
openshift-service-ca PR  
openshift-service-ca-operator PR  
openshift-user-workload-monitoring PR   

Goal

As an OpenShift installer I want to update the firmware of the hosts I use for OpenShift on day 1 and day 2.

As an OpenShift installer I want to integrate the firmware update in the ZTP workflow.

Description

Firmware updates are required for BIOS, GPUs, NICs, and DPUs on hosts that will often be used as DUs in edge locations (commonly installed with ZTP).
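
As a rough illustration of the intended day-1/ZTP shape, loosely modeled on the metal3 HostFirmwareComponents resource; the field names, component values, and URL below are assumptions for illustration, not a confirmed API:

  # Illustrative sketch only; the exact API and supported components depend on the implementation.
  apiVersion: metal3.io/v1alpha1
  kind: HostFirmwareComponents
  metadata:
    name: worker-0                      # matches the BareMetalHost name (placeholder)
    namespace: openshift-machine-api
  spec:
    updates:
      - component: bios                 # component to update (e.g. bmc, bios)
        url: http://example.invalid/firmware/bios-v1.2.3.bin   # placeholder firmware image location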

Acceptance criteria

  • Firmware can be updated (upgrade/downgrade)
  • Existing firmware version can be checked

Out of Scope

  • Day 2 host firmware upgrade

 

As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by a cluster-admin to monitor the progress. The status command/API should also contain data that alerts users about potential issues which can make the update problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers have to dig deeper often to find the nodes for further debugging. 
    The ask has been to bubble up this on the update progress window.
  2. oc update status ?
    From the UI we can see the progress of the update. From oc cli we can see this from "oc get cvo"  
     But the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format, as well as provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

CVO story: persistence for "how long has this ClusterOperator been updating?". For a first pass, we can just hold this in memory in the CVO, and possibly expose it to our managers via the ResourceReconciliationIssues structure created in OTA-1159. But to persist between CVO restarts, we could have the CVO annotate the ClusterOperator with "here's when I first pre-created this ClusterOperator during update $KEY". Update keys are probably (a hash of?) (startTime, targetImage). The CVO would set the annotation when it pre-created the cluster operator for the release, unless an annotation already existed with the same update key. And the reconciling-mode CVO could clear off the annotations.

We are not sure yet whether we want to do this with CO annotations or OTA-1159 "result of work", we defer this decision to after we implement OTA-1159.

I ended up doing this via the OTA-1159 result-of-work route, and I just did the "expose" side.  If we want to cover persistence between CVO-container-restarts, we'd need a follow-up ticket for that.  The CVO is only likely to restart when machine-config is moving though, and giving that component a bit more time doesn't seem like a terrible thing, so we might not need a follow-up ticket at all.

 

Implementing RFE-928 would help a number of update-team use-cases:

  • OTA-368, OSDOCS-2427, and other tickets that are mulling over rendering update-related alerts when folks are trying to decide whether to launch an update.
  • OTA-1021's update status subcommand could use these to help admins discover and respond to update-related issues in their updating clusters.

The updates team is not well positioned to maintain oc access long-term; that seems like a better fit for the monitoring team (who maintain Alertmanager) or the workloads team (who maintain the bulk of oc). But we can probably hack together a proof-of-concept which we could later hand off to those teams, and in the meantime it would unblock our work on tech-preview commands consuming the firing-alert information.

The proof-of-concept could follow the following process:

  1. Get alertmanager URL from route alertmanager-main in openshift-monitoring namespace
  2. Use the $ALERTMANAGER/api/v1/alerts endpoint to get data about alerts (see https://github.com/prometheus/alertmanager#api)
  3. The endpoint is authenticated via bearer token, same as against apiserver (possible Role for this mentioned in MON-3396 OBSDA-530)

Definition of done:

$ OC_ENABLE_CMD_INSPECT_ALERTS=true oc adm inspect-alerts
...dump of firing alerts...

and a backing Go function that other subcommands like oc adm upgrade status can consume internally.

Add node status as mentioned in the sample output to "oc adm upgrade status".

With this output, users will be able to see the state of the nodes that are part of the master machine config pool and what version of RHCOS they are currently on. I am not sure if it is possible to also show corresponding OCP version. If possible we should display that as well.

Control Plane Node(s)
NAME                                ASSESSMENT        PHASE              VERSION   	 
ip-10-0-128-174.ec2.internal        Complete          Updated   	 4.12.16   
ip-10-0-142-240.ec2.internal        Progressing       Rebooting          -   		  
ip-10-0-137-108.ec2.internal        Outdated   	      Pending   	 4.12.1   	    

Definition of done:

  • Add code to get similar output as listed above. if a particular field/column needs more work, create separate Jira cards.

In https://github.com/openshift/oc/pull/1554 we scaffolded the status command with existing CVO condition message. We should stop printing this message once the standard command output relays this information well enough.

An update is in progress for 59m13s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available

Failing=True:
  Reason: ClusterOperatorNotAvailable
  Message: Cluster operator control-plane-machine-set is not available

The above is the CVO condition message; we should remove it once all the information it presents is surfaced in the actual new status output.

Definition of Done

None of this specially processed CVO / ClusterVersion condition content is emitted; the first section in the output is "= Control Plane =". We should only do this if any of its possible content is already surfaced via an assessment or a health insight.

Current state

= Control Plane =
...
Operator Status: 33 Total, 31 Available, 1 Progressing, 4 Degraded

Improvement opportunities

1. "33 Total" confuses people into thinking this number is a sum of the others (e.g. OCPBUGS-24643)
2. "Available" is useless most of the time: in happy path it equals total, in error path the "unavailable" would be more useful, we should not require the user to do a mental subtraction when we can just tell them
3. "1 Progressing" does not mean much, I think we can relay similar information by saying "1 upgrading" on the completion line (see OTA-1153)
4. "0 degraded" is not useful (and 0 unavailable would be too), we can hide it on happy path
5. somehow relay that operators can be both unavailable and degraded

Definition of Done

// Happy path, all is well
Operator status: 33 Healthy
// Two operators Available=False
Operator status: 31 Healthy, 2 Unavailable
// Two operators Available=False, one degraded
Operator status: 30 Healthy, 2 Unavailable, 1 Available but degraded
// Two operators Available=False, one degraded, one both => unavailable trumps degraded
Operator status: 29 Healthy, 3 Unavailable, 1 Available but degraded

Open questions

1. How to handle COs briefly unavailable / degraded? OTA-1087 PoC does not show insights about them if they are briefly down / degraded to avoid noise, so we can either be inconsistent between counts and health, or between "adm status" and "get co". => extracted to OTA-1175

oc adm upgrade status currently renders Progressing and Failing!=False directly, instead of feeding them in through updateInsight. OTA-1154 is removing those. But Failing has useful information about the cluster-version operator's direct dependents which isn't available via ClusterOperator, MachineConfigPools, or the other resources we consume. This ticket is about adding logic to assessControlPlaneStatus to convert Failing!=False into an updateInsight, so it can be rendered via the consolidated insights-rendering pathways, and not via the one-off printout.

Add node status as mentioned in the sample output to "oc adm upgrade status" in OpenShift Update Concepts

With this output, users will be able to see the state of the nodes that are not part of the master machine config pool and what version of RHCOS they are currently on. I am not sure if it is possible to also show corresponding OCP version. If possible we should display that as well.

=Worker Upgrade=

Worker Pool: openshift-machine-api/machineconfigpool/worker
Assessment: Admin required
Completion: 25% (Est Time Remaining: N/A - Manual action required)
Mode: Manual | Assisted [- No Disruption][- Disruption Permitted][- Scaling Permitted]
Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Node(s)
NAME					ASSESSMENT	PHASE		VERSION		EST		MESSAGE
ip-10-0-134-165.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot
ip-10-0-135-128.ec2.internal		Complete	Updated		4.12.16		-
ip-10-0-138-184.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot
ip-10-0-139-31.ec2.internal		Manual		N/A		4.12.1		?		Awaiting manual reboot

Definition of done:

  • Add code to get similar output as listed above. if a particular field/column needs more work, create separate Jira cards.
  • This output should be shown even when control plane already updated but workers are not yet (as of Jan 2024, the status command will say that the cluster is not upgrading in such case)
  • Output should take into account that there can be multiple worker pools
  • We should capture fixtures for the following to scenarios:
  • cluster that finished control plane update but is still upgrading worker nodes
  • cluster that finished control plane update but is still upgrading worker nodes and there’s a restrictive PDB that prohibits draining

As a PoC (proof of concept), warn about Available=False and Degraded=True ClusterOperators, smoothing flapping conditions (as discussed on a refinement call on Nov 29).

Following Justin's direction from the design ideas, make this easily pluggable for more "warnings".

Example Output

=Update Health= 
SINCE	        LEVEL 		        IMPACT 			MESSAGE
3h		Warning		        API Availability	High control plane CPU during upgrade
Resolved	Info			Update Speed		Pod is slow to terminate
4h		Info			Update Speed		Pod was slow to terminate
30m		Info			None			Worker node version skew
1h		Info			None			Update worker node

Expose 'Result of work' as structured JSON in a ClusterVersion condition (ResourceReconciliationIssues? ReconciliationIssues). This would allow oc to explain what the CVO is having trouble with, without needing to reach around the CVO and look at ClusterOperators directly (avoiding the risk of latency/skew between what the CVO thinks it's waiting on and the current ClusterOperator content). And we could replace the current Progressing string with more information, and iterate quickly on the oc side. Although if we find ways we like more to stringify this content, we probably want to push that back up to the CVO so all clients can benefit.

We want to do this work behind a feature gate (OTA-1169).

Definition of Done

  • All code added for this effort must be conditional on the UpgradeStatus feature gate (added in https://github.com/openshift/api/pull/1725 so we will need to bump o/api in CVO).
  • CVO will have a new .status.conditions item with type==ReconciliationIssues
  • When CVO encounters an "issue" (more details on issues below) while applying a resource, the condition value is True and message is a JSON with all issues encountered
  • Reason values can be invented as necessary
  • When no issue is encountered, condition is False
  • Reconciliation issue is when CVO cannot reach desired state for the given resource in the given sync loop - this will differ between upgrading and reconciling mode. In upgrading mode CVO needs to report things like waiting on ClusterOperators going out of Available=False/Degraded=True, waiting for deployments to go ready etc. In reconciling mode we should report failures to apply.

Note that a condition with a Message containing JSON instead of a human-readable message is against apimachinery conventions and is a VERY poor API method in general. The purpose of this story is simply to experiment with how such an API could look, and it will inform how we build it in the future. Do not worry too much about tech debt; this is exploratory code.
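
A rough sketch of how such a condition might render on ClusterVersion; only the condition type and the feature-gate requirement come from this story, and the JSON payload schema is entirely hypothetical:

  # Hypothetical illustration of the exploratory condition.
  status:
    conditions:
      - type: ReconciliationIssues
        status: "True"
        reason: ClusterOperatorsNotAvailable          # reason values can be invented as necessary
        message: '{"issues":[{"resource":"clusteroperator/control-plane-machine-set","issue":"waiting for Available=True"}]}'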

OpenShift Update Concepts proposes a --details option that should supply the insights with SOP/documentation links, but does not give an example of what the output would look like:

=Update Health= 
SINCE		LEVEL 		IMPACT 			MESSAGE
3h		Warning		API Availability	High control plane CPU during upgrade
Resolved	Info		Update Speed		Pod is slow to terminate
4h		Info		Update Speed		Pod was slow to terminate
30m		Info		None			Worker node version skew
1h		Info		None			Update worker node

Run with --details for additional description and links to online documentation.

Justin's design ideas contain a remediation struct for this which I like.

Definition of Done:

  • With --details, every insight will have two more fields - description and remediation URL (runbook, documentation...). Both mandatory.
  • Output will not be a table, but a oc describe -like tree output
  • For CO insights, we stop putting the (potentially long) CO condition message into the insight message, but place it in the new description field instead.
  • We can use the links to generic alert runbooks that someone asks us to do in alerts (OCPBUGS-14246)

Goal

As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.

As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.

Why is this important?

Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, depending on configuration, it can allow any device to get onto the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.

Acceptance Criteria

  • I can specify static IPs for node VMs at install time with IPI

Previous Work

Bare metal related work:

CoreOS Afterburn:

https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28

https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>


USER STORY:

As a system admin, I would like the static IP support for vSphere to use IPAddressClaims to provide IP address during installation so that after the install, the machines are defined in a way that is intended for use with IPAM controllers.

 

DESCRIPTION:

Currently the installer for vSphere directly sets the static IPs in the machine object YAML files. We would like to enhance the installer to create an IPAddress and IPAddressClaim for each machine, as well as update the machinesets to use addressesFromPools to request the IPAddress. Also, we should create a custom CRD that is the basis for the pool referenced in the addressesFromPools field.
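
As a sketch of the intended shape, the claims would follow the Cluster API IPAM contract, with the pool reference pointing at the custom pool CRD mentioned above; the group, kind, names, and API version below are placeholders/assumptions:

  apiVersion: ipam.cluster.x-k8s.io/v1beta1   # upstream Cluster API IPAM contract; exact version may differ
  kind: IPAddressClaim
  metadata:
    name: example-master-0-claim              # placeholder; one claim per machine network device
    namespace: openshift-machine-api
  spec:
    poolRef:                                  # points at the installer-defined pool CRD (placeholder group/kind)
      apiGroup: installer.example.io
      kind: IPPool
      name: static-ip-pool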

 

ACCEPTANCE CRITERIA:

After a static IP vSphere IPI install, the cluster should contain the machines, machinesets, CRD, IPAddresses and IPAddressClaims related to static IP assignment.

 

ENGINEERING DETAILS:

These changes should all be contained in the installer project. We will need to be sure to cover static IPs for zonal and non-zonal installs. Additionally, we need to have this work for all control-plane and compute machines.

We need to ensure that when a vSphere static IP IPI install is performed, the generated masters are treated as valid machines and are not recreated by the CPMS operator.

Feature Overview

Add authentication to the internal components of the Agent Installer so that the cluster install is secure.

Goals

  • Day1: Only allow agents booted from the same agent ISO to register with the assisted-service and use the agent endpoints
  • Day2: Only allow agents booted from the same node ISO to register with the assisted-service and use the agent endpoints
  • Only allow access to write endpoints to the internal services
  • Use authentication to read endpoints

 

Epic Goal

  • This epic scope was originally to encompass both authentication and authorization but we have split the expanding scope into a separate epic.
  • We want to add authorization to the internal components of Agent Installer so that the cluster install is secure. 

Why is this important?

  • The Agent Installer API server (assisted-service) has several methods for authorization but none of the existing methods are applicable to the Agent Installer use case.
  • During the MVP of Agent Installer we attempted to turn on the existing authorization schemes but found we didn't have access to the correct API calls.
  • Without proper authorization it is possible for an unauthorized node to be added to the cluster during install. Currently we expect this to happen by mistake rather than maliciously.

Brainstorming Notes:

Requirements

  • Allow only agents booted from the same ISO to register with the assisted-service and use the agent endpoints
  • Agents already know the InfraEnv ID, so if read access requires authentication then that is sufficient in some existing auth schemes.
  • Prevent access to write endpoints except by the internal systemd services
  • Use some kind of authentication for read endpoints
  • Ideally use existing credentials - admin-kubeconfig client cert and/or kubeadmin-password
  • (Future) Allow UI access in interactive mode only

 

Are there any requirements specific to the auth token?

  • Ephemeral
  • Limited to one cluster: Reuse the existing admin-kubeconfig client cert

 

Actors:

  • Agent Installer: example wait-for
  • Internal systemd: configurations, create cluster infraenv, etc
  • UI: interactive user
  • User: advanced automation user (not supported yet)

 

Do we need more than one auth scheme?

Agent-admin - agent-read-write

Agent-user - agent-read

Options for Implementation:

  1. New auth scheme in assisted-service
  2. Reverse proxy in front of assisted-service API
  3. Use an existing auth scheme in assisted-service

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. AGENT-60 Originally we wanted to just turn on local authorization for Agent Installer workflows. It was discovered this was not sufficient for our use case.

Open questions::

  1. Which API endpoints do we need for the interactive flow?
  2. What auth scheme does the Assisted UI use if any?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, when running agent create image, agent create pxe-files and agent create config iso commands, I want to be able to:

  • generate an ECDSA private key and save it to the asset store
  • generate an ECDSA public key and save it to the asset store
  • set the generated public/private keys into the appropriate env vars EC_PUBLIC_KEY_PEM, EC_PRIVATE_KEY_PEM
  • pass the env vars to assisted-service

so that I can achieve

  • authentication for the APIs.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

  • As a Cluster Administrator, I want to opt-out of certain operators at deployment time using any of the supported installation methods (UPI, IPI, Assisted Installer, Agent-based Installer) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a Cluster Administrator, I want to opt-in to previously-disabled operators (at deployment time) from UI (e.g. OCP Console, OCM, Assisted Installer), CLI (e.g. oc, rosa), and API.
  • As a ROSA service administrator, I want to exclude/disable Cluster Monitoring when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — since I get cluster metrics from the control plane. This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.
  • As a ROSA service administrator, I want to exclude/disable the Ingress Operator when I deploy OpenShift with HyperShift — using any of the supported installation methods including the ROSA wizard in OCM and rosa cli — as I want to use my preferred load balancer (i.e. AWS load balancer). This configuration should be persisted not only through initial deployment but also through cluster lifecycle operations like upgrades.

Goals

  • Make it possible for customers and Red Hat teams producing OCP distributions/topologies/experiences to enable/disable some CVO components while still keeping their cluster supported (see the install-config sketch below for the deployment-time form).
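
For example, at install time this selection is expressed through the capabilities stanza of install-config.yaml; a minimal sketch (capability names vary by release, and the ones shown are illustrative):

  # install-config.yaml fragment (illustrative)
  capabilities:
    baselineCapabilitySet: None               # opt out of all optional capabilities by default
    additionalEnabledCapabilities:            # then opt selected ones back in
      - Ingress
      - Console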

Scenarios

  1. This feature must consider the different deployment footprints including self-managed and managed OpenShift, connected vs. disconnected (restricted and air-gapped), supported topologies (standard HA, compact cluster, SNO), etc.
  2. Enabled/disabled configuration must persist throughout cluster lifecycle including upgrades.
  3. If there's any risk/impact of data loss or service unavailability (for Day 2 operations), the System must provide guidance on what the risks are and let user decide if risk worth undertaking.

Requirements

  • This section: a list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:

Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

  • CORS-1873 Installer to allow users to select OpenShift components to be included/excluded
  • OTA-555 Provide a way with CVO to allow disabling and enabling of operators
  • OLM-2415 Make the marketplace operator optional
  • SO-11 Make samples operator optional
  • METAL-162 Make cluster baremetal operator optional
  • OCPPLAN-8286 CI Job for disabled optional capabilities

Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators

Phase 3 (OpenShift 4.13): OCPBU-117

  • OTA-554 Make oc aware of cluster capabilities
  • PSAP-741 Make Node Tuning Operator (including PAO controllers) optional

Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)

  • OCPBU-352 Make Ingress Operator optional
  • CCO-186 ccoctl support for credentialing optional capabilities
  • MCO-499 MCD should manage certificates via a separate, non-MC path (formerly IR-230 Make node-ca managed by CVO)
  • CNF-5642 Make cluster autoscaler optional
  • CNF-5643 - Make machine-api operator optional
  • WRKLDS-695 - Make DeploymentConfig API + controller optional
  • CNV-16274 OpenShift Virtualization on the Red Hat Application Cloud (not applicable)
  • CNF-9115 - Leverage Composable OpenShift feature to make control-plane-machine-set optional
  • BUILD-565 - Make Build v1 API + controller optional
  • CNF-5647 Leverage Composable OpenShift feature to make image-registry optional (replaces IR-351 - Make Image Registry Operator optional)

Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)

  • OCPVE-634 - Leverage Composable OpenShift feature to make olm optional
  • CCO-419 (OCPVE-629) - Leverage Composable OpenShift feature to make cloud-credential  optional

Phase 6 (OpenShift 4.16): OCPSTRAT-731

Phase 7 (OpenShift 4.17): OCPSTRAT-1308

  • MON-3152 (OBSDA-242) Optional built-in monitoring
  • IR-400 - Remove node-ca from CIRO*
  • CNF-9116 Leverage Composable OpenShift feature to make machine-auto-approver optional
  • CCO-493 Make Cloud Credential Operator optional for remaining providers and topologies (non-SNO topologies)

References

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

 

 

Note: phase 2 target is tech preview.

Feature Overview

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

Feature Overview

Allow attaching an ISO image that will be used for configuration data on an already provisioned system using a BareMetalHost.

Feature Overview

Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.

Overarching Goal

Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature targets bare metal disconnected clusters.

Background

  • For a single node cluster effectively cut off from all other networking, update the cluster despite the lack of access to image registries, local or remote.
  • For multi-node clusters that could have a complete power outage, recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.
  • Allow cluster node(s) to boot without any access to a registry in case all the required images are pinned

 

Given enhancement - https://github.com/openshift/enhancements/pull/1481

Design Review Doc: https://docs.google.com/document/d/1-XuHN6_jvJMLULFwwAThfIcHqY32s32lU6m4bx7BiBE/edit

We want to allow the relevant APIs to pin images and make sure those don't get garbage collected.

Here is a summary of what will be required:

  1. CRI-O will need to be changed so that it doesn't remove pinned images, regardless of the version_file_persist setting.
  2. Add the new PinnedImageSet custom resource definition to the API (a minimal sketch follows this list).
  3. Initial proposal: #1609
  4. Add a new PinnedImageSetController to the machine-config-controller.
  5. Add the logic to pin and pull the images to the machine-config-daemon.
  6. Update the documentation of recovery procedures to explain that pinned images shouldn't be removed.
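
For illustration, a minimal PinnedImageSet sketch. The API group/version and field names below are assumptions based on the enhancement proposal rather than the final API, and the image pull spec is hypothetical:

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  name: example-pinned-images
spec:
  # Images listed here are pre-pulled and protected from garbage collection.
  pinnedImages:
  - name: quay.io/example/workload@sha256:1111111111111111111111111111111111111111111111111111111111111111  # hypothetical pull spec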

It is important that, when a new CRI-O pinned-image configuration is applied via machine config, the net result is a reload of the crio systemd unit rather than a node reboot.

Describes the work needed from the MCO team to take Pinned Image Sets to GA.

Similar to bug 1955300, but seen in a recent 4.11-to-4.11 update [1]:

: [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed expand_less 47m16s
1 unexpected clusteroperator state transitions during e2e test run

Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]

Not something that breaks the update, but still something that sounds pretty alarming, and which had ClusterOperatorDown over its 10m threshold [2]. In this case, the alert did not fire because the metric-serving CVO moved from one instance to another as the MCO was rolling the nodes, and we haven't attempted anything like [3] yet to guard ClusterOperatorDown against that sort of label drift.

Over the past 24h, seems like a bunch of hits across several versions, all for 17+ minutes:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]'
Feb 05 22:19:52.700 - 2038s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.nightly-s390x-2022-02-05-125308
Feb 05 18:57:26.413 - 1470s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-143255}]
Feb 05 22:00:03.973 - 1265s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]
Feb 05 15:17:47.103 - 1154s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
Feb 05 07:55:59.474 - 1162s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-04-190512}]
Feb 05 12:15:30.132 - 1178s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-063300}]
Feb 05 19:48:07.442 - 1588s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.ci-2022-02-05-173245}]

Feb 05 23:30:46.180 - 5629s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
Feb 05 19:02:16.918 - 1622s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 22:05:50.214 - 1663s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 22:54:19.037 - 6791s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.17
Feb 05 09:47:44.404 - 1006s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 20:20:47.845 - 1627s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 06 03:40:24.441 - 1197s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.9.19
Feb 05 23:28:33.815 - 5264s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.9.17}]
Feb 05 06:20:32.073 - 1261s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]
Feb 05 09:25:36.180 - 1434s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-04-213359}]

Feb 05 12:20:24.804 - 1185s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-075430}]
Feb 05 21:47:40.665 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-141309}]
Feb 06 04:41:02.410 - 1187s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci-2022-02-05-203247}]
Feb 05 09:18:04.402 - 1321s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 05 12:31:23.489 - 1446s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]

Feb 06 01:32:14.191 - 1011s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 06 04:57:35.973 - 1508s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]

Feb 05 09:16:49.005 - 1198s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-rc.1}]
Feb 05 22:44:04.061 - 1231s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 05 09:30:33.921 - 1209s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-04-094135}]
Feb 05 19:53:51.738 - 1054s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]
Feb 05 20:12:54.733 - 1152s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-132417}]

Feb 06 03:12:05.404 - 1024s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]
Feb 06 03:18:47.421 - 1052s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.nightly-2022-02-05-190244}]

Feb 05 12:15:03.471 - 1386s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 17:21:15.357 - 1087s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-04-143931}]
Feb 05 09:31:14.667 - 1632s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.10.0-0.okd-2022-02-05-081152}]
Feb 05 12:29:22.119 - 1060s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.okd-2022-02-05-101655
Feb 05 17:43:45.938 - 1380s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 06 02:35:34.300 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-011358-ci-op-xl025ywb-initial}]
Feb 06 06:15:23.991 - 1135s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-06-044734-ci-op-1xyd57n7-initial}]
Feb 05 09:25:22.083 - 1071s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.ci.test-2022-02-05-080202-ci-op-dl3w4ks4-initial}]

Breaking down by job name:

$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade (all) - 70 runs, 47% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 40 runs, 60% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 76 runs, 42% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 77 runs, 65% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 41 runs, 61% failed, 12% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 80 runs, 59% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 88 runs, 55% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 79 runs, 54% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 45 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 33 runs, 45% failed, 13% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 31 runs, 100% failed, 3% of failures match = 3% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Those impact percentages are just matches; this particular test-case is non-fatal.

The Available=False conditions also lack a 'reason', although they do contain a 'message', which is the same state we had back when I'd filed bug 1948088. Maybe we can pass through the Degraded reason around [4]?

Going back to the run in [1], the Degraded condition had a few minutes at RenderConfigFailed, while [4] only has a carve out for RequiredPools. And then the Degraded condition went back to False, but for reasons I don't understand we remained Available=False until 22:33, when the MCO declared its portion of the update complete:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/machine-config '
Feb 05 22:15:40.029 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.029 - 147s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.430 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:18:07.150 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.150 - 898s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.178 W clusteroperator/machine-config condition/Degraded status/False changed:
Feb 05 22:18:21.505 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Feb 05 22:33:04.574 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [{operator 4.11.0-0.nightly-2022-02-05-152519}]
Feb 05 22:33:04.584 W clusteroperator/machine-config condition/Upgradeable status/True changed:
Feb 05 22:33:04.931 I clusteroperator/machine-config versions: operator 4.11.0-0.nightly-2022-02-05-152519 -> 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:33:05.531 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.11.0-0.nightly-2022-02-05-211325
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088
[2]: https://github.com/openshift/cluster-version-operator/blob/06ec265e3a3bf47b599e56aec038022edbe8b5bb/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L87
[3]: https://github.com/openshift/cluster-version-operator/pull/643
[4]: https://github.com/openshift/machine-config-operator/blob/2add8f323f396a2063257fc283f8eed9038ea0cd/pkg/operator/status.go#L122-L126

Description of problem:

The MCO takes too long to update the node count for an MCP after the label that the MCP uses to match nodes is removed from a node.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Remove `node-role.kubernetes.io/worker=` label from any worker node.
~~~
# oc label node worker-0.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com node-role.kubernetes.io/worker-
~~~
2. Check MCP worker for correct node count.
~~~
# oc get mcp  worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      3              3                   3                     0                      5d7h
~~~
3. Check after 10-15 mins
~~~
# oc get mcp  worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      2              2                   2                     0                      5d7h
~~~

Actual results:

It took 10-15 mins for MCP to detect node removal.

Expected results:

It should detect the node removal as soon as the matching label is removed from the node.

Additional info:

 

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on GCP GA
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This section: a list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement                                               Notes                                                          isMvp?
CI - MUST be running successfully with test automation    This is a requirement for ALL features.                        YES
Release Technical Enablement                              Provide necessary release enablement details and documents.    YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2455 / CFE-719 work, where support for GCP tags and labels was delivered as TechPreview in 4.14; this effort makes it GA in 4.15. It involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

The installer creates the list of GCP resources below during the create cluster phase, and these resources should have the user-defined tags applied.

Resources List

Resource Terraform API
VM Instance google_compute_instance
Storage Bucket google_storage_bucket

Acceptance Criteria:

  • Code linting, validation and best practices adhered to
  • The list of GCP resources created by the installer should have the user-defined labels as well as the default OCP label (see the install-config sketch after this list).
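
As a hedged illustration of the user-facing input, an install-config snippet along these lines supplies the labels and tags the installer is expected to propagate. The field names reflect my understanding of the GCP install-config API and the values are placeholders; treat both as assumptions:

platform:
  gcp:
    projectID: example-project        # hypothetical project
    region: us-central1
    # User-defined labels/tags to be applied to installer-created resources.
    userLabels:
    - key: team
      value: ocp-dev
    userTags:
    - parentID: example-project
      key: cost-center
      value: "1234"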

The enhancement proposed for GCP tags support in OCP requires machine-api-provider-gcp to apply the userTags available in the status sub-resource of the Infrastructure CR to the GCP virtual machine resources it creates.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • UTs and e2e are added/updated

Currently the ROSA/ARO versions are not managed by the OTA team.
This Feature covers the engineering effort to move the responsibility for OCP version management in OSD, ROSA and ARO from SRE-P to OTA.

Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1

Here are some objectives :

  • Managed clusters would get update recommendations from the Red Hat hosted OSUS directly, without intermediate layers such as ClusterImageSet.
  • The new design should reduce the cost of maintaining versions for managed clusters including Hypershift hosted control planes.

Presentation from Jeremy Eder :

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

This epic is to transfer the responsibility of OCP version management in OSD, ROSA and ARO from SRE-P to OTA.

  • The responsibility of management of OCP versions available to all self-managed OCP customers lies with the OTA team.
    • This responsibility makes the OTA team the center of excellence around version health.
  • The responsibility of management of OCP versions available to managed customers lies with the SRE-P team.  Why do this project:
    • The SREP team took on version management responsibility out of necessity.  Code was written and maintained to serve a service-tailored "version list".  This code requires effort to maintain.
    • As we begin to sell more managed OCP, it makes sense to move this responsibility back to the OTA team as this is not an "SRE" focused activity.
    • As the CoE for version health, the OTA team has the most comprehensive overview of code health.
    • The OTA team would benefit by coming up to speed on managed OCP-specific lifecycles/policies as well as become aware of the "why" for various policies where they differ from self-managed OCP.

 

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

To deliver RFE-5160  without passing data through the customer-accessible ClusterVersion resource (which lives in the hosted Kubernetes API), the cluster-version operator should grow a new command-line switch that allows its caller to override ClusterVersion with a custom upstream.

Definition of done:

  • The CVO can be executed with --upstream-update-service https://example.com. When set, the CVO will ignore spec.upstream in ClusterVersion and instead use the value passed in via the command-line option.

Or similar knob to deliver RFE-5160, attaching the new parameter to OTA-1210's command-line flag.

Definition of done:

Management cluster admins can set spec.updateService on HostedCluster and have their status.version.availableUpdates and similar populated based on advice from that upstream. At no point in the implementation is the update service data pulled from anywhere accessible from the hosted cluster.
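
A rough sketch of what that might look like on a HostedCluster. The updateService field name is taken from the definition of done above; the namespace, release image, and OSUS URL are hypothetical, and other required HostedCluster fields are omitted:

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  # Hypothetical management-side OSUS endpoint; update advice is fetched from here,
  # never from anything reachable from the hosted cluster itself.
  updateService: https://osus.mgmt.example.internal/api/upgrades_info/graph
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64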

Epic Goal

  • Enable users to install a host as day2 using agent based installer

Why is this important?

  • Enable easy day2 installation without requiring additional knowledge from the user
  • Unified experience for day1 and day2 installation for the agent based installer
  • Unified experience for day1 and day2 installation for appliance workflow
  • Eliminate the requirement of installing MCE, which has high resource requirements (4 cores and 16GB RAM for a multi-node cluster; if the infrastructure operator is included, it requires storage as well)
  • Eliminate the requirement of nodes having a BMC available to expand bare metal clusters (see docs).
  • Simplify adding compute nodes based on the UPI method or other methods implemented in the field, such as WKLD-433 or other automations that try to solve this problem

Scenarios

  1. A user installed a day1 cluster with the agent-based installer and wants to add workers or replace failed nodes; currently the alternative is to install MCE or, if connected, use the SaaS.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. N/A

Previous Work (Optional):

ABI uses the assisted installer plus kube-api (or any other client that communicates with the service); all the building blocks related to day2 installation exist in those components.

The assisted installer can create an installed cluster and use it to perform day2 operations.

A doc that explains how it's done with kube-api 

Parameters that are required from the user:

  • Cluster name, domain name - used to get the URL that will be used to pull the ignition
  • Kubeconfig to day1 cluster 
  • Openshift version
  • Network configuration

Actions required from the user

  • Post reboot the user will need to manually approve the node on the day1 cluster

Implementation suggestion:

To keep a similar flow between day1 and day2, I suggest running the service on each node that the user is trying to add; it will create the cluster definition and start the installation, and after the first reboot it will pull the ignition from the day1 cluster.

Open questions::

  1. How should ABI detect an existing day1 cluster?
  2. Using a provided kubeconfig file? Yes.
  3. If so, should it be added to the config-image API (i.e. included in the ISO)? Question for the agent team.
  4. Should we add a new command for day2 in agent-installer-client? Question for the agent team.
  5. So this command would import the existing day1 cluster and add the day2 node; it means that every day2 node should run an assisted-service first. Yes, and those instances do not depend on each other.

It will need to create the container, set it up with an appropriate kubeconfig, extract and appropriately direct the output (the ISO plus any errors or streamed status), and delete the container again when complete. Initially this could be a script, but it could potentially be implemented directly in the code. It is to be distributed within the installer image.

The two commands, one for adding the nodes (ISO generation) and the other for monitoring the process, should be exposed by a new CLI tool (name to be defined) built using the installer source. This task will be used to add the main of the CLI tool and the two (empty) command entry points.

Deploy a command to generate a suitable ISO for adding a node to an existing cluster

A new workflow will be required to talk to assisted-service to import an existing cluster / add the node. New services could be required in the ignition assets to handle that properly.

Not all the required info is provided by the user (in reality, we want to minimize as much as possible the amount of configuration provided by the user). Some of the required info needs to be extracted from the existing cluster, or from the existing kubeconfig. A dedicated asset could be useful for such an operation.

The ignition asset currently assembles the ignition file with the required files and services to install a cluster. In the add-node case, this needs to be modified to support the new workflow.

Usually the manifests assets (ClusterImageSet / AgentPullSecret / InfraEnv / NMStateConfig / AgentClusterInstall / ClusterDeployment) depend on OptionalInstallConfig (or eventually a file in the asset dir, in the case of ZTP manifests). We'll need to change the assets code so that the required info can be retrieved from the ClusterInfo asset instead of OptionalInstallConfig. This may impact the asset framework itself.

Another approach could be to stick this info directly into OptionalInstallConfig, if possible

Feature Overview (aka. Goal Summary)  

As an OpenShift admin who wants to make my OCP cluster more secure and stable, I want to prevent anyone from scheduling their workloads on master nodes, so that master nodes only run OCP management related workloads.

 

Goals (aka. expected user outcomes)

Secure OCP master nodes by preventing customer workloads from being scheduled on them.

Anyone applying toleration(s) in a pod spec can unintentionally tolerate master taints which protect master nodes from receiving application workload when master nodes are configured to repel application workload. An admission plugin needs to be configured to protect master nodes from this scenario. Besides the taint/toleration, users can also set spec.NodeName directly, which this plugin should also consider protecting master nodes against.
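
To make the scenario concrete, a pod spec along these lines would land on control-plane nodes despite the master taints, which is exactly what the admission plugin needs to catch (all names and images here are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: sneaky-workload
spec:
  # Setting nodeName directly bypasses the scheduler, and therefore its taint handling, entirely.
  nodeName: master-0
  # A blanket toleration like this also tolerates the control-plane/master NoSchedule taints.
  tolerations:
  - operator: Exists
  containers:
  - name: app
    image: quay.io/example/app:latest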

Needed so we can provide this workflow to customers following the proposal at https://github.com/openshift/enhancements/pull/1583

 

Reference https://issues.redhat.com/browse/WRKLDS-1015

 

kube-scheduler pods are created by code residing in controllers provided by the kube-scheduler operator, so changes are required in that repo to add a toleration for the node-role.kubernetes.io/control-plane:NoExecute taint.

https://github.com/openshift/cluster-kube-scheduler-operator/blob/4be4e433eec566df60d6d89f09a13b706e93f2a3/bindata/assets/kube-scheduler/pod.yaml#L13
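
A minimal sketch of the kind of toleration described; where exactly it would be placed in the operator's pod.yaml is an assumption here:

tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoExecute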

The operator itself does not run in the control-plane nodes, but if that change is necessary it would be here: https://github.com/openshift/cluster-kube-scheduler-operator/blob/4be4e433eec566df60d6d89f09a13b706e93f2a3/manifests/0000_25_kube-scheduler-operator_06_deployment.yaml#L12

DoD

We need to ensure we have parity with OCP and support heterogeneous clusters

https://github.com/openshift/enhancements/pull/1014

Goal

Why is this important?

  • Necessary to enable workloads with different architectures in the same Hosted Clusters.
  • Cost savings brought by more cost effective ARM instances

Scenarios

  1. I have an x86 hosted cluster and I want to have at least one NodePool running ARM workloads (see the NodePool sketch after this list)
  2. I have an ARM hosted cluster and I want to have at least one NodePool running x86 workloads
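
For scenario 1, a hedged NodePool sketch. Field names such as spec.arch follow the stories later in this epic, but the exact schema, instance type, and multi-arch release image tag are assumptions, and other required fields are omitted:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: arm-workers
  namespace: clusters
spec:
  clusterName: example
  replicas: 2
  arch: arm64
  platform:
    type: AWS
    aws:
      instanceType: m6g.large            # hypothetical Graviton instance type
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-multi  # must be a multi-arch payload (tag is hypothetical)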

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides

Dependencies (internal and external)

  1. The management cluster must use a multi architecture payload image.
  2. The target architecture is in the OCP payload
  3. MCE has builds for the architecture used by the worker nodes of the management cluster

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. on HC creation enable a multi-arch flag which sets the right release image.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. on HC creation enable a multi-arch flag which sets the right release image.

Acceptance Criteria:

  • Upstream documentation
  • HyperShift CLI accomplishes design specified in the document in the Engineering Details section
  • HCP CLI accomplishes design specified in the document in the Engineering Details section

Out of Scope:

Engineering Details:

User Story:

As a user of multi-arch HyperShift, I would like a CEL validation to be added to the NodePool types to prevent the arch field from being changed from `amd64` when the platform is not supported (AWS is currently the only supported platform).

Acceptance Criteria:

Description of criteria:

  • A CEL rule is added to the arch field for NodePools which will error if a user tries to change the arch field from `amd64` on any platform other than AWS (a rough sketch follows this list).
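
A rough sketch of such a rule as an OpenAPI x-kubernetes-validations entry. The real HyperShift CRD layout, field paths, and message text are assumptions:

# Placed on the NodePool spec schema so both fields are visible to the rule.
x-kubernetes-validations:
- rule: "self.platform.type == 'AWS' || self.arch == 'amd64'"
  message: "arch may only be set to a value other than amd64 on AWS NodePools"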

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

  • As a user of HyperShift, I would like the CLI or HyperShift API to fail early when:
    • the release image is not multi-arch AND
    • the management cluster's CPU architecture is not the same as the NodePool's CPU architecture
      so that we can prevent a HostedCluster or a NodePool from being created that will have errors due to mismatches between the release image, the management cluster's CPU architecture, and the NodePool's CPU architecture.

Why is this important?

  • This improves the UX of using multi-arch HyperShift by preventing a user from utilizing the wrong release image when creating a HostedCluster or NodePool.

Scenarios

  1. Scenarios That Should Succeed
    1. Using a mgmt cluster of CPU arch 'A', create a HostedCluster with the CLI with a multi-arch release image
    2. Using a mgmt cluster of CPU arch 'A', create a HostedCluster with a yaml spec with a multi-arch release image
    3. Using a mgmt cluster of CPU arch 'A', create a NodePool with the CLI with a multi-arch release image
    4. Using a mgmt cluster of CPU arch 'A', create a NodePool with a yaml spec with a multi-arch release image
  2. Scenarios That Should Fail
    1. Using a mgmt cluster of CPU arch 'A', create a HostedCluster with the CLI with a single arch release image from CPU arch 'B'
    2. Using a mgmt cluster of CPU arch 'A', create a HostedCluster with a yaml spec with a single arch release image from CPU arch 'B'
    3. Using a mgmt cluster of CPU arch 'A', create a NodePool with the CLI with a single arch release image from CPU arch 'B'
    4. Using a mgmt cluster of CPU arch 'A', create a NodePool with a yaml spec with a single arch release image from CPU arch 'B'

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user of the HyperShift CLI, I would like the CLI to fail early if these conditions are all true:

  • the release image is not multi-arch
  • the management cluster's CPU architecture is not the same as the NodePool's CPU architecture

so that we can prevent a HostedCluster from being created that will have errors due to mismatches between the release image, the management cluster's CPU architecture, and the NodePool's CPU architecture.

Acceptance Criteria:

  • The HyperShift CLI fails to create a cluster when the release image is not multi-arch and the management cluster's CPU architecture does not match the NodePool's CPU architecture.
  • There is documentation providing information the CLI will fail when it meets the conditions above.

(optional) Out of Scope:

This should be done for the API as well but will be covered thru HOSTEDCP-1105.

Engineering Details:

  • HyperShift CLI currently defaults to a multi-arch image in version.go.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of the HCP CLI, I want to be able to specify a NodePool's arch via a flag so that I can easily create NodePools of different CPU architectures in AWS HostedClusters.

Acceptance Criteria:

  • The HCP CLI contains an `arch` flag for the create cluster command for AWS.
  • The HCP CLI contains an `arch` flag for the create nodepool command for AWS.
  • The HCP CLI contains a `multi-arch` flag for the create cluster command for AWS.
  • The only valid values for the arch flag are:
    • amd64
    • arm64

Out of Scope:

Other CPU arches are not being considered for the arch flag since they are unavailable in AWS.

Engineering Details:

  • N/A

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of HyperShift, I would like the API to fail early if these conditions are all true:

  • the release image is not multi-arch
  • the management cluster's CPU architecture is not the same as the NodePool's CPU architecture

so that we can prevent a HostedCluster from continuing to be created when it would have errors due to mismatches between the release image, the management cluster's CPU architecture, and the NodePool's CPU architecture.

Acceptance Criteria:

  • The HyperShift API fails to create a cluster when the release image is not multi-arch and the management cluster's CPU architecture does not match the NodePool's CPU architecture.
  • There is documentation providing information the API will fail when it meets the conditions above.

Out of Scope:

This should be done for the CLI as well but will be covered thru HOSTEDCP-1104.

Engineering Details:

  • HyperShift CLI currently defaults to a multi-arch image in version.go.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Technology Preview of the oc mirror enclaves support (Dev Preview was OCPSTRAT-765 in OpenShift 4.15).

Feature description

oc-mirror already focuses on mirroring content to disconnected environments for installing and upgrading OCP clusters.

This specific feature addresses use cases where mirroring is needed for several enclaves (disconnected environments), that are secured behind at least one intermediate disconnected network.

In this context, enclave users are interested in:

  • being able to mirror content for several enclaves, and centralizing it in a single internal registry. Some customers are interested in running security checks on the mirrored content, vetting it before allowing mirroring to downstream enclaves.
  • being able to mirror contents directly from the internal centralized registry to enclaves without having to restart the mirroring from the internet for each enclave
  • keeping the volume of data transferred from one network stage to the other to a strict minimum, avoiding transferring a blob or an image more than once from one stage to another.

 

This epic covers the work for RFE-3800 (includes RFE-3393 and RFE-3733) for mirroring operators and additional images

The full description / overview of the enclave support is best described here 

The design document can be found here 

 

Architecture Overview (diagram)

 

User Stories

All user stories are in the form :

  • Role (As a ...)
  • Goal (I want ..)
  • Reason (So that ..)

 

Overview

Consider this as part of the separate discussions and design of the upgrade path/introspection tool

Acceptance Criteria

  • All tests pass (80% coverage)
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation
  • Best effort (but better than v1)
  • Change the ImageSetConfiguration schema to accept a list of bundles (see the sketch after this list)
  • Get related images from the specific bundle names and packages stated in the ImageSetConfig (using the v1alpha3 version of the ISC spec)
  • Any other association (bundles) or filtering that is not package-based is not allowed
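
A hedged sketch of what bundle-level filtering might look like in an ImageSetConfiguration. The apiVersion, the exact field name and nesting for bundles, and the bundle/package names are assumptions and placeholders:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha3
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
    packages:
    - name: aws-load-balancer-operator
      # Hypothetical: restrict mirroring to specific bundles of this package.
      bundles:
      - name: aws-load-balancer-operator.v1.0.0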

 Tasks

  • Implement V2 versioning as discussed in the EP document
  • Implement code to filter  bundles according to ImageSetConfig criteria.
  • Implement unit tests

Acceptance Criteria

  • All tests pass (80% coverage)
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation

Tasks

  • Implement the scenario where the operator catalog is only one file instead of multiple files (catalog.json); see the comment here
  • Investigate if the operator catalog for IBM is all in one json only or multiple
  • Implement unit tests

Acceptance Criteria

  • All tests pass (80% coverage)
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation

Tasks

  • Implement V2 versioning as discussed in the EP document
  • Implement code to bulk (use concurrency) delete images functionality
  • Consider deleting relevant images in cache
  • Implement unit tests

Acceptance Criteria

  • All tests pass (80% coverage)
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation
  • Best effort (but better than v1)
  • Take into account targetCatalog, targetTag
  • Catalogs mirrored by tag should be pulled when a minor version is detected (we should not rely only on existence of the folder in the cache)

Tasks

  • Implement V2 versioning as discussed in the EP document
  • Implement code to filter catalogs , channels and bundles according to ImageSetConfig criteria.
  • Implement unit tests
  • TargetCatalog and TargetTag
  • Check index manifest and compare to the one on disk
  • Handle cases where /configs contains a single index.json vs /configs containing several folders (1 per package), with several index.json files.
  • Rebuild catalogs?
  • Redefine the default channel (OCPBUGS-7465, also requested by Verizon)

Overview

Ensure that the current v2 respects the v1 TargetCatalog and TargetTag fields (if set) for OCI catalogs and registry catalogs.

Also, TargetCatalog and TargetTag should not be mutually exclusive: both can be set together, as in the valid example below.

Invalid example:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    targetCatalog: abc/def:v5.5
    packages:
    - name: aws-load-balancer-operator

Valid example:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    targetCatalog: abc/def
    targetTag: v4.4
    packages:
    - name: aws-load-balancer-operator

Acceptance Criteria

  • All tests pass (80% coverage)
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation

Tasks

  • Implement V2 versioning as discussed in the EP document
  • Implement code to generate the IDMS and ITMS artifacts and audit data (CFE-817); a minimal IDMS example follows this list
  • keep in mind that IDMS generation might be needed for v1
  • Implement code to generate CatalogSource
  • Implement unit tests
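
For reference, a minimal example of the kind of IDMS artifact being generated; the mirror registry hostname is hypothetical:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: idms-operators
spec:
  imageDigestMirrors:
  - mirrors:
    - mirror.registry.example.com/redhat
    source: registry.redhat.io/redhat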

Acceptance Criteria

  • All unit tests pass (80% coverage)
  • Multi arch functionality works for catalogs and additional images for all mirroring scenarios
  • Documentation approved by docs team
  • All tests QE approved 
  • Well documented README/HOWTO/OpenShift documentation

Tasks

  • Implement multi arch (manifest list) functionality and code
  • Implement unit tests

Acceptance Criteria

  • Ensure all unit tests pass
  • Ensure code coverage is above 80%
  • Ensure golangci-lint has no errors
  • All tests QE approved 
  • Show the results in code base README

Tasks

  • Spike to assess current MVP gaps
  • Implement unit tests for v2

Acceptance Criteria

  • Create e2e test plan for MVP with QE sign-off
  • All e2e tests pass
  • Documentation approved by docs team
  • All tests QE approved
  • Well documented README/HOWTO/OpenShift documentation

Tasks

  • Implement e2e testing for MVP according to test plan
  • Document the e2e process

Feature Overview

In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform. 

When we used Terraform to provision the control plane, the Terraform deployment could eventually time out and report an error. The installer was monitoring the Terraform output and could pass the error on to the user, e.g.

level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h9m1s elapsed]
level=error
level=error msg=Error: could not inspect: inspect failed , last error was 'timeout reached while inspecting the node'
level=error
level=error msg=  with ironic_node_v1.openshift-master-host[2],
level=error msg=  on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg=  13: resource "ironic_node_v1" "openshift-master-host" {

Now that provisioning is managed by Metal³, we have nothing monitoring it for errors:

level=info msg=Waiting up to 1h0m0s (until 1:05AM UTC) for bootstrapping to complete...
level=debug msg=Bootstrap status: complete

By this stage the bootstrap API is up (and this is a requirement for BMO to do its thing). The installer is capable of monitoring the API for the appearance of the bootstrap complete ConfigMap, so it is equally capable of monitoring the BaremetalHost status. This should actually be an improvement on Terraform, as we can monitor in real time as the hosts progress through the various stages, and report on errors and retries.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision vSphere infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The vSphere IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing vSphere Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize the impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Provision vsphere infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing vSphere terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
level=info msg=Running process: vsphere infrastructure provider with args [-v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:37167 --webhook-port=38521 --webhook-cert-dir=/tmp/envtest-serving-certs-445481834 --leader-elect=false] and env [...] 

This log line may contain sensitive data (passwords, logins, etc.) and should be filtered.

Epic Goal

In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which the vSphere IPI installer uses Terraform. 

USER STORY:

The assisted installer should be able to use CAPI-based vSphere installs without requiring access to vCenter.

DESCRIPTION:

The installer makes calls to vCenter to determine the networks, which are required for CAPI-based installs, but vCenter access is not guaranteed in the assisted installer.

See:

https://github.com/openshift/installer/pull/7962/commits/2bfb3d193d375286d80e36e0e7ba81bb74559a9d#diff-c8a93a8c9fb0e5dbfa50e3a8aa1fbd253dd324fb981a01a69f6bf843305117cbR267

https://github.com/openshift/installer/pull/7962/commits/2bfb3d193d375286d80e36e0e7ba81bb74559a9d#diff-42f3fb5184e7ed95dc5efdff5ca11cb2b713ac266322a99c03978106c0984f22R127

which were lovingly lifted from this slack thread.

Required:

In cases where the installer calls vCenter to obtain values to populate manifests, the installer should leave the fields empty (or use a default value) if it is unable to access vCenter. It should produce partial manifests rather than throw an error.
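As a rough sketch of that behavior (the networkLister interface and function names below are hypothetical stand-ins, not the installer's real types), the lookup degrades to an empty value instead of failing:

~~~
package manifests

import (
	"context"

	"github.com/sirupsen/logrus"
)

// networkLister is a hypothetical stand-in for the vCenter client the
// installer uses to resolve port-group/network names.
type networkLister interface {
	ListNetworks(ctx context.Context, datacenter string) ([]string, error)
}

// resolveNetworks attempts the vCenter lookup but leaves the field empty when
// vCenter is unreachable, so a partial CAPI manifest can still be produced
// (e.g. for the assisted/agent-based flow).
func resolveNetworks(ctx context.Context, lister networkLister, datacenter string) []string {
	nets, err := lister.ListNetworks(ctx, datacenter)
	if err != nil {
		logrus.Warnf("unable to reach vCenter to resolve networks for datacenter %s, leaving field empty: %v", datacenter, err)
		return nil
	}
	return nets
}
~~~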

Nice to have:

...

ACCEPTANCE CRITERIA:

Continued compatibility with the agent installer, particularly producing CAPI manifests when access to vCenter fails.

ENGINEERING DETAILS:

<!--

Any additional information that might be useful for engineers: related
repositories or pull requests, related email threads, GitHub issues or
other online discussions, how to set up any required accounts and/or
environments if applicable, and so on.

-->

Description of problem:

    The installer downloads the RHCOS image to the local cache multiple times when using failure domains.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    time="2024-02-22T14:34:42-05:00" level=debug msg="Generating Cluster..."
time="2024-02-22T14:34:42-05:00" level=warning msg="FeatureSet \"CustomNoUpgrade\" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster."
time="2024-02-22T14:34:42-05:00" level=info msg="Creating infrastructure resources..."
time="2024-02-22T14:34:43-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:34:43-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:36:02-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:36:02-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:37:22-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:37:22-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:38:39-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:38:39-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:39:33-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: The name 'ngirard-dev-pr89z-rhcos-us-west-us-west-1a' already exists."

Expected results:

    should only download once
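For illustration, one way to guarantee a single download per image URL across all failure domains is to guard the cache lookup with a per-URL sync.Once. This is a sketch of the technique only, not the actual fix; the type and function names are made up:

~~~
package cache

import "sync"

// imageCache ensures the first caller for a given URL performs the download
// and every other failure domain reuses the cached local path.
type imageCache struct {
	mu    sync.Mutex
	calls map[string]*sync.Once
	paths map[string]string
	errs  map[string]error
}

func newImageCache() *imageCache {
	return &imageCache{
		calls: map[string]*sync.Once{},
		paths: map[string]string{},
		errs:  map[string]error{},
	}
}

// get downloads the OVA for url exactly once, no matter how many failure
// domains ask for it; download is a stand-in for the real cache function.
func (c *imageCache) get(url string, download func(string) (string, error)) (string, error) {
	c.mu.Lock()
	once, ok := c.calls[url]
	if !ok {
		once = &sync.Once{}
		c.calls[url] = once
	}
	c.mu.Unlock()

	once.Do(func() {
		path, err := download(url)
		c.mu.Lock()
		c.paths[url], c.errs[url] = path, err
		c.mu.Unlock()
	})

	c.mu.Lock()
	defer c.mu.Unlock()
	return c.paths[url], c.errs[url]
}
~~~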

Additional info:

    

WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3' 
DEBUG The file was found in cache: /home/jcallen/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing... 

 

<very long pause here with no indication>

 

 

Replace https://github.com/openshift/installer/tree/master/upi/vsphere with PowerCLI. Keep Terraform in place until PowerCLI installations are working.

  • Update the UPI image with PowerShell and PowerCLI
  • Backport the change through all supported releases
  • Update the installer repo with PowerCLI scripts
  • Update the installer repo to remove the UPI Terraform

 

Example of updates to be made to the UPI image:

~~~
FROM upi-installer-image
RUN curl https://packages.microsoft.com/config/rhel/8/prod.repo | tee /etc/yum.repos.d/microsoft.repo

RUN yum install -y powershell

RUN pwsh -Command 'Install-Module VMware.PowerCLI -Force -Scope CurrentUser'
~~~

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The Azure IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing Azure Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provision Azure infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing Azure Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Until we get dual load balancer support in CAPZ, we will need to create the external load balancer for public clusters.

In the ignition hook, we need to upload the bootstrap ignition data to the bootstrap storage account (created to hold the RHCOS image), then create an ignition stub containing the SAS link for the object.
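For reference, the stub is just a pointer config in the Ignition v3 "config.replace" shape. A minimal sketch follows; the function name and the example SAS URL are illustrative, and a real implementation would presumably use the vendored Ignition types rather than a hand-built map:

~~~
package ignition

import "encoding/json"

// stubIgnition builds a pointer config that tells the bootstrap node to fetch
// the real bootstrap ignition from the storage account via its SAS URL.
func stubIgnition(sasURL string) ([]byte, error) {
	stub := map[string]interface{}{
		"ignition": map[string]interface{}{
			"version": "3.2.0",
			"config": map[string]interface{}{
				"replace": map[string]interface{}{
					// e.g. https://<account>.blob.core.windows.net/<container>/bootstrap.ign?<sas-token>
					"source": sasURL,
				},
			},
		},
	}
	return json.Marshal(stub)
}
~~~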

The install config allows users to specify a `diskEncryptionSet` in machine pools.

 

CAPZ has existing support for disk encryption sets:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/38a5395e4320e116882734ffd420bf4a1f959ff2/api/v1beta1/types.go#L670

Note that CAPZ says the encryption set must belong to the same subscription, whereas our docs may not indicate that. We should point this out to the docs team.

Create private zone and DNS records in the resource group specified by baseDomainResourceGroupName. The records should be cleaned up with destroy cluster.

The ControlPlaneEndpoint will be available in the Cluster spec and can be used to populate the DNS records.  

Currently we create both A and CNAME records in different scenarios: https://github.com/openshift/installer/blob/master/data/data/azure/cluster/dns/dns.tf

 

Ideally we do this in the InfraReady hook, before machine creation, so that control plane machines can pull ignition immediately.
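The choice between an A and a CNAME record can be driven by the ControlPlaneEndpoint itself. A minimal sketch, assuming a hypothetical helper: an IP-address host gets an A record, a load-balancer hostname gets a CNAME, mirroring the split in the current Terraform dns.tf:

~~~
package dns

import "net"

// recordTypeFor decides whether the api/api-int records should be A or CNAME
// records based on the ControlPlaneEndpoint host reported by CAPZ.
func recordTypeFor(controlPlaneHost string) string {
	if net.ParseIP(controlPlaneHost) != nil {
		return "A"
	}
	return "CNAME"
}
~~~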

The `image` field in the AzureMachineSpec needs to point to an RHCOS image. For marketplace images, those images should already be available.

For non-marketplace images, we need to create an image for the users, using the VHD from the RHCOS stream.

The image could be created in the PreProvision hook: https://github.com/openshift/installer/blob/master/pkg/infrastructure/clusterapi/types.go#L26

Technically, it could also be done in the InfraAvailable hook, if that is needed.

User Story:

As a developer, I want to:

  • Create machine manifests for Azure CAPI implementation

so that I can achieve

  • Create control plane and bootstrap machines using CAPZ

Acceptance Criteria:

Description of criteria:

  • Control plane and bootstrap machines are created using CAPI

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Nutanix infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The Nutanix IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing Nutanix Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature.

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Based on this old feature request

https://issues.redhat.com/browse/RFE-1530

we do have impersonation in place for gaining access to other users' permissions via the console. But the only documentation we currently have is how to impersonate system:admin via the CLI; see

https://docs.openshift.com/container-platform/4.14/authentication/impersonating-system-admin.html

Please provide documentation for the console feature and the required prerequisites for the users/groups accordingly.

AC:

  • Create a ConsoleQuickStart CR that helps users with impersonate access understand the impersonation workflow and functionality
  • This quickstart should be available only to users with the impersonate permission
  • The created CR should be placed in the console-operator repo, where the default quickstarts are placed

 

More info on the impersonate access role - https://github.com/openshift/console/pull/13345/files

Implement a toast notification feature in the console UI to notify the user that their action to create/update a resource violated a warn policy, even though the request was allowed.
 

See the "Configure OpenShift Console to display warnings from apiserver when creating/updating resources" spike on how to reproduce the warn policy response in the `oc` CLI.

A.C.

Display a warning toast notification after a create/update action on a resource

RFE: https://issues.redhat.com/browse/RFE-1772

In OpenShift v3 we have been displaying the pod's last termination state, and we need to display the same info in v4. Here is the v3 HTML interpretation of the kubernetes-object-describe-container-state directive.

AC: Add the Last State logic to the pod's details page by rewriting the v3 implementation.

Add support for returning `response.headers` in the `consoleFetchCommon` function in the dynamic-plugin-sdk package.

 

Problem: 

The `consoleFetchCommon` function in the dynamic-plugin-sdk package lacks support for retrieving HTTP `response.headers`.

 

Justification:

The policy warning responses are visible in the `oc` CLI but not on the console UI. The customer wants similar behavior on the UI. The policy warning responses are returned in the HTTP `response.headers`, which the `consoleFetchCommon` function does not currently expose.

 

Proposed Solution

Add logic for extracting all `response.headers` along with `response.json` in the `consoleFetchCommon` function, using an `options` parameter or a similar mechanism.

 

A.C. 

Add an options parameter to `consoleFetchCommon` to conditionally return the full `response` or `response.json`, so that k8s functions like `k8sCreate` can consume either, preventing a breaking change for dynamic plugins.

Feature Overview (aka. Goal Summary)

A guest cluster can use an external OIDC token issuer.  This will allow machine-to-machine authentication workflows

Goals (aka. expected user outcomes)

A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that 

  1. allows fixing mistakes
  2. alerts the owner of the configuration that there is likely a misconfiguration (self-service)
  3. makes the distinction between product failure (the expressed configuration was not applied) and configuration failure (the expressed configuration was wrong) easy to determine
  4. makes cluster recovery possible in cases where the external token issuer is permanently gone
  5. allows (but might not require) removal of the existing oauth server

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

A guest cluster can use an external OIDC token issuer.  This will allow machine-to-machine authentication workflows

Goals (aka. expected user outcomes)

A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that 

  1. allows fixing mistakes
  2. alerts the owner of the configuration that there is likely a misconfiguration (self-service)
  3. makes the distinction between product failure (the expressed configuration was not applied) and configuration failure (the expressed configuration was wrong) easy to determine
  4. makes cluster recovery possible in cases where the external token issuer is permanently gone
  5. allows (but might not require) removal of the existing oauth server

Goal

  • Provide API for configuring external OIDC to management cluster components
  • Stop creating oauth server deployment
  • Stop creating oauth-apiserver
  • Stop registering oauth-apiserver backed apiservices
  • See what breaks next

Why is this important?

  • need API starting point for ROSA CLI and OCM
  • need cluster that demonstrates what breaks next for us to fix

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
$ oc logs --previous --timestamps -n openshift-console console-64df9b5bcb-8h8xk
2024-03-22T11:17:07.824396015Z I0322 11:17:07.824332       1 main.go:210] The following console plugins are enabled:
2024-03-22T11:17:07.824574844Z I0322 11:17:07.824558       1 main.go:212]  - monitoring-plugin
2024-03-22T11:17:07.824613918Z W0322 11:17:07.824603       1 authoptions.go:99] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
2024-03-22T11:22:07.828873678Z I0322 11:22:07.828819       1 main.go:634] Binding to [::]:8443...
2024-03-22T11:22:07.828982852Z I0322 11:22:07.828967       1 main.go:636] using TLS
2024-03-22T11:22:07.833771847Z E0322 11:22:07.833726       1 asynccache.go:62] failed a caching attempt: Get "https://keycloak-keycloak.apps.xxxx/realms/master/.well-known/openid-configuration": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-03-22T11:22:10.831644728Z I0322 11:22:10.831598       1 metrics.go:128] serverconfig.Metrics: Update ConsolePlugin metrics...
2024-03-22T11:22:10.848238183Z I0322 11:22:10.848187       1 metrics.go:138] serverconfig.Metrics: Update ConsolePlugin metrics: &map[monitoring:map[enabled:1]] (took 16.490288ms)
2024-03-22T11:22:12.829744769Z I0322 11:22:12.829697       1 metrics.go:80] usage.Metrics: Count console users...
2024-03-22T11:22:13.236378460Z I0322 11:22:13.236318       1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.580502ms)

The cause is that the HCCO is not copying the issuerCertificateAuthority configmap into the openshift-config namespace of the HC.
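For illustration, the kind of copy the HCCO would need to perform looks roughly like the client-go sketch below. The function name and the source namespace/name handling are assumptions; only the idea of copying the CA ConfigMap into openshift-config in the guest cluster comes from the bug itself:

~~~
package sync

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// copyIssuerCA copies the issuerCertificateAuthority ConfigMap referenced by
// the HostedCluster into the guest cluster's openshift-config namespace, which
// is where the console expects to find the trust bundle.
func copyIssuerCA(ctx context.Context, mgmt, guest kubernetes.Interface, srcNamespace, name string) error {
	src, err := mgmt.CoreV1().ConfigMaps(srcNamespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	dst := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "openshift-config"},
		Data:       src.Data,
	}
	_, err = guest.CoreV1().ConfigMaps("openshift-config").Create(ctx, dst, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		_, err = guest.CoreV1().ConfigMaps("openshift-config").Update(ctx, dst, metav1.UpdateOptions{})
	}
	return err
}
~~~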

Description of problem:

HCP does not honor the oauthMetadata field of hc.spec.configuration.authentication, making the console crash and oc login fail.

Version-Release number of selected component (if applicable):

HyperShift management cluster: 4.16.0-0.nightly-2024-01-29-233218
HyperShift hosted cluster: 4.16.0-0.nightly-2024-01-29-233218

How reproducible:

Always

Steps to Reproduce:

1. Install HCP env. Export KUBECONFIG:
$ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig

2. Create keycloak applications. Then get the route:
$ KEYCLOAK_HOST=https://$(oc get -n keycloak route keycloak --template='{{ .spec.host }}')
$ echo $KEYCLOAK_HOST
https://keycloak-keycloak.apps.hypershift-ci-18556.xxx
$ curl -sSk "$KEYCLOAK_HOST/realms/master/.well-known/openid-configuration" > oauthMetadata

$ cat oauthMetadata 
{"issuer":"https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"

$ oc create configmap oauth-meta --from-file ./oauthMetadata -n clusters --kubeconfig /path/to/management-cluster/kubeconfig
...

3. Set hc.spec.configuration.authentication:
$ CLIENT_ID=openshift-test-aud
$ oc patch hc hypershift-ci-18556 -n clusters --kubeconfig /path/to/management-cluster/kubeconfig --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: oauth-meta
      oidcProviders:
      - claimMappings:
          ...
        issuer:
          audiences:
          - $CLIENT_ID
          issuerCertificateAuthority:
            name: keycloak-oidc-ca
          issuerURL: $KEYCLOAK_HOST/realms/master
        name: keycloak-oidc-test
      type: OIDC
"

Check KAS indeed already picks up the setting:
$ oc logs -c kube-apiserver kube-apiserver-5c976d59f5-zbrwh -n clusters-hypershift-ci-18556 --kubeconfig /path/to/management-cluster/kubeconfig | grep "oidc-"
...
I0130 08:07:24.266247       1 flags.go:64] FLAG: --oidc-ca-file="/etc/kubernetes/certs/oidc-ca/ca.crt"
I0130 08:07:24.266251       1 flags.go:64] FLAG: --oidc-client-id="openshift-test-aud"
...
I0130 08:07:24.266261       1 flags.go:64] FLAG: --oidc-issuer-url="https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"
...

Wait about 15 mins.

4. Check COs and check oc login. Both show the same error:
$ oc get co | grep -v 'True.*False.*False'
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.16.0-0.nightly-2024-01-29-233218   True        True          False      4h57m   SyncLoopRefreshProgressing: Working toward version 4.16.0-0.nightly-2024-01-29-233218, 1 replicas available
$ oc get po -n openshift-console
NAME                        READY   STATUS             RESTARTS         AGE
console-547cf6bdbb-l8z9q    1/1     Running            0                4h55m
console-54f88749d7-cv7ht    0/1     CrashLoopBackOff   9 (3m18s ago)    14m
console-54f88749d7-t7x96    0/1     CrashLoopBackOff   9 (3m32s ago)    14m

$ oc logs console-547cf6bdbb-l8z9q -n openshift-console
I0130 03:23:36.788951       1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.059196ms)
E0130 06:48:32.745179       1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
E0130 06:53:32.757881       1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
...

$ oc login --exec-plugin=oc-oidc --client-id=openshift-test-aud --extra-scopes=email,profile --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

5. Check root cause, the configured oauthMetadata is not picked up well:
$ curl -k https://a6e149f24f8xxxxxx.elb.ap-east-1.amazonaws.com:6443/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}

Actual results:

As shown in steps 4 and 5 above, the configured oauthMetadata is not picked up correctly, causing the console and oc login to hit the errors.

Expected results:

The configured oauthMetadata should be picked up correctly, with no errors.

Additional info:

For oc, if I manually use `oc config set-credentials oidc --exec-api-version=client.authentication.k8s.io/v1 --exec-command=oc --exec-arg=get-token --exec-arg="--issuer-url=$KEYCLOAK_HOST/realms/master" ...` instead of using `oc login --exec-plugin=oc-oidc ...`, oc authentication works well. This means my configuration is correct.
$ oc whoami  
Please visit the following URL in your browser: http://localhost:8080
oidc-user-test:xxia@redhat.com

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:
In https://issues.redhat.com/browse/OCPBUGS-28625?focusedId=24056681&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24056681 , Seth Jennings states "It is not required to set the oauthMetadata to enable external OIDC".

Today, having a chance to try without setting oauthMetadata, I hit an oc login failure with the error:

$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

Console login can succeed, though.

Note: OCM QE also encounters this when using the ocm CLI to test ROSA HCP external OIDC. Whether the fix belongs in oc, HCP, or elsewhere (as a tester I'm not sure, TBH), it is worth fixing; otherwise oc login is affected.

Version-Release number of selected component (if applicable):

[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249

How reproducible:

Always

Steps to Reproduce:

1. Launch fresh HCP cluster.

2. Login to https://entra.microsoft.com. Register application and set properly.

3. Prepare variables.
HC_NAME=hypershift-ci-267920
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
AUDIENCE=7686xxxxxx
ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
CLIENT_ID=7686xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret

4. Configure HC without oauthMetadata.
[xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG

[xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration: 
    authentication: 
      oauthMetadata:
        name: ''
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"

Wait pods to renew:
[xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
certified-operators-catalog-7ff9cffc8f-z5dlg          1/1     Running   0          5h44m
kube-apiserver-6bd9f7ccbd-kqzm7                       5/5     Running   0          17m
kube-apiserver-6bd9f7ccbd-p2fw7                       5/5     Running   0          15m
kube-apiserver-6bd9f7ccbd-fmsgl                       5/5     Running   0          13m
openshift-apiserver-7ffc9fd764-qgd4z                  3/3     Running   0          11m
openshift-apiserver-7ffc9fd764-vh6x9                  3/3     Running   0          10m
openshift-apiserver-7ffc9fd764-b7znk                  3/3     Running   0          10m
konnectivity-agent-577944765c-qxq75                   1/1     Running   0          9m42s
hosted-cluster-config-operator-695c5854c-dlzwh        1/1     Running   0          9m42s
cluster-version-operator-7c99cf68cd-22k84             1/1     Running   0          9m42s
konnectivity-agent-577944765c-kqfpq                   1/1     Running   0          9m40s
konnectivity-agent-577944765c-7t5ds                   1/1     Running   0          9m37s

5. Check console login and oc login.
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}
Check console login, it succeeds, console upper right shows correctly user name oidc-user-test:xxia@redhat.com.

Check oc login:
$ rm -rf ~/.kube/cache/oc/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1

Actual results:

Console login succeeds. oc login fails.

Expected results:

oc login should also succeed.

Additional info:

Description of problem:

When issuerCertificateAuthority is set, kube-apiserver pod is CrashLoopBackOff.

Tried RCA debugging and found the cause: the path /etc/kubernetes/certs/oidc-ca/ca.crt is incorrect. The expected path is /etc/kubernetes/certs/oidc-ca/ca-bundle.crt.

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-03-13-061822

How reproducible:

    Always

Steps to Reproduce:

1. Create fresh HCP cluster.
2. Create keycloak as OIDC server exposed as a Route which uses cluster's default ingress certificate as the serving certificate.
3. Configure clients necessarily on keycloak admin UI.
4. Configure external OIDC:
$ oc create configmap keycloak-oidc-ca --from-file=ca-bundle.crt=router-ca/ca.crt --kubeconfig $MGMT_KUBECONFIG -n clusters

$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration:
    authentication:
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE_1
          - $AUDIENCE_2
          issuerCertificateAuthority:
            name: keycloak-oidc-ca
          issuerURL: $ISSUER_URL
        name: keycloak-oidc-server
        oidcClients:
        - clientID: $CONSOLE_CLIENT_ID
          clientSecret:
            name: $CONSOLE_CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"

5. Check that pods are renewed; the new kube-apiserver pod is CrashLoopBackOff:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp | tail -n 4
openshift-apiserver-65f8c5f545-x2vdf                  3/3     Running            0               5h8m
community-operators-catalog-57dd5886f7-jq25f          1/1     Running            0               4h1m
kube-apiserver-5d75b5b848-c9c8r                       4/5     CrashLoopBackOff   25 (3m9s ago)   107m

$ oc logs --timestamps -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -c kube-apiserver kube-apiserver-5d75b5b848-gk2t8
...
2024-03-18T09:11:14.836540684Z I0318 09:11:14.836495       1 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/certs/client-ca/ca.crt"
2024-03-18T09:11:14.837725839Z E0318 09:11:14.837695       1 run.go:74] "command failed" err="jwt[0].issuer.certificateAuthority: Invalid value: \"<omitted>\": data does not contain any valid RSA or ECDSA certificates"

Actual results:

5. New kube-apiserver pod is CrashLoopBackOff.

`oc explain` for issuerCertificateAuthority says the configmap data should use the ca-bundle.crt key. But I also tried using ca.crt as the configmap's data key and got the same result.

Expected results:

6. No CrashLoopBackOff.

Additional info:
Below is my RCA for the CrashLoopBackOff kube-apiserver pod.
Check whether the CA is a valid RSA certificate; it is valid:

$ openssl x509 -noout -text -in router-ca/ca.crt | grep -i rsa
        Signature Algorithm: sha256WithRSAEncryption
            Public Key Algorithm: rsaEncryption
    Signature Algorithm: sha256WithRSAEncryption

So, the CA certificate has no issue.
The pod logs above show that "/etc/kubernetes/certs/oidc-ca/ca.crt" is used. Double-checked the configmap:

$ oc get cm auth-config -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -o jsonpath='{.data.auth\.json}' | jq | ~/auto/json2yaml.sh
---
kind: AuthenticationConfiguration
apiVersion: apiserver.config.k8s.io/v1alpha1
jwt:
- issuer:
    url: https://keycloak-keycloak.apps..../realms/master
    certificateAuthority: "/etc/kubernetes/certs/oidc-ca/ca.crt"
...

Then debug the CrashLoopBackOff pod:

The used path /etc/kubernetes/certs/oidc-ca/ca.crt does not exist! The correct path should be /etc/kubernetes/certs/oidc-ca/ca-bundle.crt:

$ oc debug -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -c kube-apiserver kube-apiserver-5d75b5b848-gk2t8
Starting pod/kube-apiserver-5d75b5b848-gk2t8-debug-kpmlf, command was: hyperkube kube-apiserver --openshift-config=/etc/kubernetes/config/config.json -v2 --encryption-provider-config=/etc/kubernetes/secret-encryption/config.yaml
sh-5.1$ cat /etc/kubernetes/certs/oidc-ca/ca.crt
cat: /etc/kubernetes/certs/oidc-ca/ca.crt: No such file or directory
sh-5.1$ ls /etc/kubernetes/certs/oidc-ca/
ca-bundle.crt
sh-5.1$ cat /etc/kubernetes/certs/oidc-ca/ca-bundle.crt
-----BEGIN CERTIFICATE-----
MIIDPDCCAiSgAwIBAgIIM3E0ckpP750wDQYJKoZIhvcNAQELBQAwJjESMBAGA1UE
...

Description of problem:

Updating oidcProviders does not take effect. See details below.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-26-155043

How reproducible:

Always 

Steps to Reproduce:

1. Install a fresh HCP env and configure external OIDC as in steps 1 ~ 4 of https://issues.redhat.com/browse/OCPBUGS-29154 (referenced here to avoid repeating those steps).

2. Pods renewed:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
network-node-identity-68b7b8dd48-4pvvq               3/3     Running   0          170m
oauth-openshift-57cbd9c797-6hgzx                     2/2     Running   0          170m
kube-controller-manager-66f68c8bd8-tknvc             1/1     Running   0          164m
kube-controller-manager-66f68c8bd8-wb2x9             1/1     Running   0          164m
kube-controller-manager-66f68c8bd8-kwxxj             1/1     Running   0          163m
kube-apiserver-596dcb97f-n5nqn                       5/5     Running   0          29m
kube-apiserver-596dcb97f-7cn9f                       5/5     Running   0          27m
kube-apiserver-596dcb97f-2rskz                       5/5     Running   0          25m
openshift-apiserver-c9455455c-t7prz                  3/3     Running   0          22m
openshift-apiserver-c9455455c-jrwdf                  3/3     Running   0          22m
openshift-apiserver-c9455455c-npvn5                  3/3     Running   0          21m
konnectivity-agent-7bfc7cb9db-bgrsv                  1/1     Running   0          20m
cluster-version-operator-675745c9d6-5mv8m            1/1     Running   0          20m
hosted-cluster-config-operator-559644d45b-4vpkq      1/1     Running   0          20m
konnectivity-agent-7bfc7cb9db-hjqlf                  1/1     Running   0          20m
konnectivity-agent-7bfc7cb9db-gl9b7                  1/1     Running   0          20m

3. oc login can succeed:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://a4af9764....elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.You don't have any projects. Contact your system administrator to request a project.

4. Update HC by changing claim: email to claim: sub:
$ oc edit hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG
...
          username:
            claim: sub
...

Update is picked up:
$ oc get authentication.config cluster -o yaml
...
spec:
  oauthMetadata:
    name: tested-oauth-meta
  oidcProviders:
  - claimMappings:
      groups:
        claim: groups
        prefix: 'oidc-groups-test:'
      username:
        claim: sub
        prefix:
          prefixString: 'oidc-user-test:'
        prefixPolicy: Prefix
    issuer:
      audiences:
      - 76863fb1-xxxxxx
      issuerCertificateAuthority:
        name: ""
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
    name: microsoft-entra-id
    oidcClients:
    - clientID: 76863fb1-xxxxxx
      clientSecret:
        name: console-secret
      componentName: console
      componentNamespace: openshift-console
  serviceAccountIssuer: https://xxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267402
  type: OIDC
status:
  oidcClients:
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-02-28T10:51:17Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 76863fb1-xxxxxx
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
      oidcProviderName: microsoft-entra-id

4. Check pods again:
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
kube-apiserver-596dcb97f-n5nqn                       5/5     Running   0          108m
kube-apiserver-596dcb97f-7cn9f                       5/5     Running   0          106m
kube-apiserver-596dcb97f-2rskz                       5/5     Running   0          104m
openshift-apiserver-c9455455c-t7prz                  3/3     Running   0          102m
openshift-apiserver-c9455455c-jrwdf                  3/3     Running   0          101m
openshift-apiserver-c9455455c-npvn5                  3/3     Running   0          100m
konnectivity-agent-7bfc7cb9db-bgrsv                  1/1     Running   0          100m
cluster-version-operator-675745c9d6-5mv8m            1/1     Running   0          100m
hosted-cluster-config-operator-559644d45b-4vpkq      1/1     Running   0          100m
konnectivity-agent-7bfc7cb9db-hjqlf                  1/1     Running   0          99m
konnectivity-agent-7bfc7cb9db-gl9b7                  1/1     Running   0          99m

No new pods renewed.

5. Check login again; it does not use "sub", it still uses "email":
$ rm -rf ~/.kube/cache/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
Please visit the following URL in your browser: http://localhost:8080
Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.
  
You don't have any projects. Contact your system administrator to request a project.
$ cat ~/.kube/cache/oc/* | jq -r '.id_token' | jq -R 'split(".") | .[] | @base64d | fromjson'
...
{
...
  "email": "xxia@redhat.com",
  "groups": [
...
  ],
...
  "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M",
...

Actual results:

Steps 4 ~ 5: after editing the HC field value from "claim: email" to "claim: sub", even though `oc get authentication cluster -o yaml` shows the edited change is propagated:
1> Pods like kube-apiserver are not renewed.
2> After cleaning up ~/.kube/cache, `oc login ...` still prints 'Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer', i.e. it still uses the old claim "email" as the user name instead of the new claim "sub".

Expected results:

Steps 4 ~ 5: Pods like kube-apiserver should be renewed after an HC edit that changes the user claim. The login should show that the new claim is used as the user name.
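One common pattern for making such a change roll the pods is to hash the rendered authentication configuration and stamp it onto the Deployment's pod template, so any edit forces new pods. A minimal sketch; the annotation key and function name are illustrative and not necessarily how HyperShift implements it:

~~~
package rollout

import (
	"crypto/sha256"
	"encoding/hex"

	appsv1 "k8s.io/api/apps/v1"
)

// stampConfigHash writes a hash of the rendered authentication configuration
// onto the Deployment's pod template, so that any change to the config (for
// example switching the username claim from "email" to "sub") produces new
// pods on the next reconcile.
func stampConfigHash(deploy *appsv1.Deployment, renderedAuthConfig []byte) {
	sum := sha256.Sum256(renderedAuthConfig)
	if deploy.Spec.Template.Annotations == nil {
		deploy.Spec.Template.Annotations = map[string]string{}
	}
	deploy.Spec.Template.Annotations["component.hypershift.openshift.io/auth-config-hash"] = hex.EncodeToString(sum[:])
}
~~~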

Additional info:

    

 

https://redhat-internal.slack.com/archives/GQ0CU2623/p1692107036750429?thread_ts=1689276746.185269&cid=GQ0CU2623 

This card adds support for implementing ANP.Egress.Networks Peer in OVNKubernetes:

  1. vendoring in api from netpol api repo
  2. designing the ovnk pieces into the existing controller
  3. writing unit tests
  4. bringing in the conformance tests from upstream

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables, instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (e.g., per-container rather than per-node, which always fires because OCP itself is using iptables.) This could also help provide improved visibility of the issue to other components that aren't sure whether they need to take action and migrate to nftables.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x it will generate a single log message during node startup on every OpenShift node. There are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10 iptables will just no longer work at all. Neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

Template:

 

Networking Definition of Planned

Epic Template descriptions and documentation 

 

Epic Goal

  • Replace the random bits of iptables glue in ovn-kubernetes with exactly equivalent nftables versions

Why is this important?

  • iptables will not be supported in RHEL 10, so we need to replace all uses of it in OCP with nftables. See OCPSTRAT-873.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

 

Feature Overview (aka. Goal Summary)

When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.

Goals (aka. expected user outcomes)

An end user can use the OpenShift console without a notable difference in experience. This must eventually work on both hypershift and standalone, but hypershift is the first priority if it impacts delivery.

Requirements (aka. Acceptance Criteria):

  1. User can log in and use the console
  2. User can get a kubeconfig that functions on the CLI with matching oc
  3. Both of those work on hypershift
  4. both of those work on standalone.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • When installed with external OIDC, the clientID and clientSecret need to be configurable to match the external (and unmanaged) OIDC server

Why is this important?

  • Without a configurable clientID and secret, I don't think the console can identify the user.
  • There must be a mechanism to do this on both hypershift and openshift, though the API may be very similar.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • When the oauthclient API is not present, the operator must stop creating the oauthclient

Why is this important?

  • This is preventing the operator from creating its deployment

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Console needs to be able to authenticate against an external OIDC IDP. For that, the console-operator needs to configure it accordingly.

AC:

  • bump console-operator API pkg
  • sync oauth client secret from OIDC configuration for auth type OIDC
  • add OIDC config to the console configmap
  • add auth server CA to the deployment annotations and volumes when auth type OIDC
  • consume OIDC configuration in the console configmap and deployment
  • fix roles for oauthclients and authentications watching

Feature Overview

Users of the OpenShift Console leverage a streamlined, visual experience when discovering and installing OLM-managed operators in clusters that run on cloud providers with support for short-lived token authentication enabled. Users intuitively become aware when this is the case and are put on the happy path to configure OLM-managed operators with the necessary information to support GCP Workload Identity Federation (WIF).

 

Goals:

Customers do not need to re-learn how to enable GCP WIF authentication support for each and every OLM-managed operator that supports it. The experience is standardized and repeatable, so customers spend less time on initial configuration and more time implementing business value. The process is so easy that OpenShift is perceived as an enabler of an increased security posture.

 

Requirements:

  • based on OCPSTRAT-922, the installation and configuration experience for any OLM-managed operator using short-lived token authentication is streamlined using the OCP console in the form of a guided process that avoids misconfiguration or unexpected behavior of the operators in question
  • the OCP Console helps in detecting when the cluster itself is already using GCP WIF for core functionality
  • the OCP Console helps discover operators capable of GCP WIF authentication and their IAM permission requirements
  • the OCP Console has a filtering capability for operators capable of GCP WIF authentication in the main catalog tile view
  • the OCP Console drives the collection of the required information for GCP WIF authentication at the right stages of the installation process and stops the process when the information is not provided
  • the OCP Console implements this process with minimal differences across different cloud providers and is capable of adjusting the terminology depending on the cloud provider that the cluster is running on

 

Use Cases:

  • A cluster admin browses the OperatorHub catalog and looks at the details view of a particular operator; there they discover that the cluster is configured for GCP WIF
  • A cluster admin browsing the OperatorHub catalog content can filter for operators that support the GCP WIF flow described in OCPSTRAT-922
  • A cluster admin reviewing the details of a particular operator in the OperatorHub view can discover that this operator supports GCP WIF authentication
  • A cluster admin installing a particular operator can get information about the GCP IAM permission requirements the operator has
  • A cluster admin installing a particular operator is asked to provide GCP ServiceAccount that is required for GCP WIF prior to the actual installation step and is prevented from continuing without this information
  • A cluster admin reviewing an installed operator with support for GCP WIF can discover the related CredentialRequest object that the operator created in an intuitive way (not generically via related objects that have an ownership reference or as part of the InstallPlan)

Out of Scope

  • update handling and blocking in case of increased permission requirements in the next / new version of the operator
  • more complex scenarios with multiple IAM roles/service principals resulting in multiple CredentialRequest objects used by a single operator

 

Background

The OpenShift Console today provides little to no support for configuring OLM-managed operators for short-lived token authentication. Users are generally unaware whether their cluster runs on a cloud provider and is set up to use short-lived tokens for its core functionality, and they are not aware which operators support this by implementing the respective flows defined in OCPSTRAT-922.

Customer Considerations

Customers may or may not be aware of short-lived token authentication support. They need proper context and pointers to follow-up documentation explaining the general concept and the specific configuration flow the Console supports. It needs to become clear that the Console cannot automate 100% of the overall process and that some steps need to be run outside of the cluster/Console using cloud-provider-specific tooling.

Epic Goal

  • Transparently support old and new infrastructure annotations format delivered by OLM-packaged operators

Why is this important?

  • As part of OCPSTRAT-288 we are looking to improve the metadata quality of Red Hat operators in OpenShift
  • via PORTENABLE-525 we are defining a new metadata format that supports the aforementioned initiative with more robust detection of individual infrastructure features via boolean data types

Scenarios

  1. A user can use the OCP console to browse through the OperatorHub catalog and filter for all the existing and new annotations defined in PORTENABLE-525
  2. A user reviewing an operator's details can see the supported infrastructures transparently, regardless of whether the operator uses the new or the existing annotations format

Acceptance Criteria

  • the new annotation format is supported in operatorhub filtering and operator details pages
  • the old annotation format keeps being supported in operatorhub filtering and operator details pages
  • the console will respect both the old and the new annotations format
  • when an operator provides data for a particular feature in both the old and the new annotation format, the annotations in the newer format take precedence (see the sketch after this list)
  • the newer infrastructure features from PORTENABLE-525, tls-profiles and token-auth/*, do not have equivalents in the old annotation format, so evaluation does not need to fall back as described in the previous point
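
A minimal sketch of the intended precedence rule, expressed in Go for illustration (the console itself is TypeScript). The old-format annotation key shown here is an assumption; the new-format keys follow the features.operators.openshift.io/* convention named in PORTENABLE-525.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const (
	// Assumed key for the old JSON-list format; the exact key is defined by existing OperatorHub conventions.
	oldInfraFeaturesKey = "operators.openshift.io/infrastructure-features"
	// New boolean-style annotations, e.g. features.operators.openshift.io/disconnected: "true".
	newFeaturePrefix = "features.operators.openshift.io/"
)

// supportsFeature resolves whether a CSV advertises an infrastructure feature,
// preferring the new boolean annotations over the old JSON-list format.
func supportsFeature(annotations map[string]string, feature string) bool {
	// New format takes precedence when present.
	if v, ok := annotations[newFeaturePrefix+feature]; ok {
		return v == "true"
	}
	// Fall back to the old format: a JSON array of feature names.
	if raw, ok := annotations[oldInfraFeaturesKey]; ok {
		var features []string
		if err := json.Unmarshal([]byte(raw), &features); err == nil {
			for _, f := range features {
				if f == feature {
					return true
				}
			}
		}
	}
	return false
}

func main() {
	ann := map[string]string{
		oldInfraFeaturesKey:                 `["disconnected"]`,
		newFeaturePrefix + "disconnected":   "false", // new format wins
		newFeaturePrefix + "token-auth-gcp": "true",
	}
	fmt.Println(supportsFeature(ann, "disconnected"))   // false
	fmt.Println(supportsFeature(ann, "token-auth-gcp")) // true
}
```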

Dependencies (internal and external)

  1. none

Open Questions

  1. due to the non-intrusive nature of this feature, can we ship it in a 4.14.z patch release?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/tls-profiles, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.

AC: 

https://issues.redhat.com/browse/PORTENABLE-525 adds new annotations, including features.operators.openshift.io/token-auth-gcp, but console does not yet support this new annotation. We need to determine what changes are necessary to support this new annotation and implement them.

 

AC: 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • The goal of this epic is to capture all of the work and effort required to update the OpenShift control plane to upstream Kubernetes v1.29

Why is this important?

  • The rebase is a required process for every OCP release so that we can leverage all the new features implemented upstream

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Following epic captured the previous rebase work of k8s v1.28
    https://issues.redhat.com/browse/STOR-1425 

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

While monitoring the payload job failures, open a parallel openshift/origin bump.

Note: There is a high chance of job failures in the openshift/origin bump until the openshift/kubernetes PR merges, as we only update the tests and not the actual Kubernetes content.

 

The benefit of opening this PR before the ocp/k8s merge is to identify and fix issues beforehand.

Prev Ref: https://github.com/openshift/origin/pull/28097 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.28
  • target is 4.16 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.29
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

The goal of this epic is to upgrade all OpenShift and Kubernetes components that cloud-credential-operator uses to v1.29 which keeps it on par with rest of the OpenShift components and the underlying cluster version.

 

This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.16 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.29 which will keep it on par with rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platform with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.16/master branch.

Dependencies (internal and external)

  1. ART team creating the updated Go toolchain image required for the Go version bump that accompanies the Kubernetes 1.29 rebase.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic? Yes; updated below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

This story relates to this PR https://github.com/openshift/machine-config-operator/pull/4275

A new PR has been opened to investigate the issues found in the original PR (this is the link to the new PR): https://github.com/openshift/machine-config-operator/pull/4306

The original PR exceeded the watch request limits when merged. When discovered, the CNTO team needed to revert it. (see https://redhat-internal.slack.com/archives/C01CQA76KMX/p1711711538689689)

To investigate if exceeding the watch request limit was introduced from the API bump and its associated changes, or the kubeconfig changes, an additional PR was opened just for looking at removing the hardcoded values from the kubelet template, and payload tests were run against it: https://github.com/openshift/machine-config-operator/pull/4270. The payload tests passed, and it was concluded that the watch request limit issue was introduced in the portion of the PR that included the API bump and its associated changes.

It was discovered that the CNTO team was using an outdated form of openshift deps, so they were asked to bump. https://redhat-internal.slack.com/archives/CQNBUEVM2/p1712171079685139?thread_ts=1711712855.478249&cid=CQNBUEVM2

https://github.com/openshift/cluster-node-tuning-operator/pull/990
was opened in the past to address the kube bump (this just merged), and https://github.com/openshift/cluster-node-tuning-operator/pull/1022
was opened as well (still open)

CURRENT STATUS: waiting for https://github.com/openshift/cluster-node-tuning-operator/pull/1022 to merge so we can rerun payload tests against the open revert PR.

User or Developer story

As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.29 Kubernetes rebase so that MCO tracks the same Kubernetes version as the rest of the OpenShift 1.29 cluster.

Engineering Details

  • Update the go.mod, go.sum and vendor dependencies to point to the kube 1.29 libraries. This includes all direct Kubernetes-related libraries as well as openshift/api, openshift/client-go, openshift/library-go and openshift/runtime-utils

Acceptance Criteria:

  • All k8s.io related dependencies should be upgraded to 1.29.
  • openshift/api, openshift/client-go, openshift/library-go and openshift/runtime-utils should be upgraded to the latest commit from the master branch
  • All CI tests must be passing

This feature is now re-opened because we want to run z-rollback CI. This feature doesn't block the release of 4.17. This is not going to be exposed as a customer-facing feature and will not be documented within the OpenShift documentation. This is strictly going to be covered as a Red Hat Support guided solution with a KCS article providing guidance. A public-facing KCS will basically point to contacting Support for help on z-stream rollback; y-stream rollback is not supported.

NOTE:
Previously this was closed as "won't do" because we didn't have a plan to support y-stream and z-stream rollbacks in standalone OpenShift.
For single-node OpenShift please check TELCOSTRAT-160. The "won't do" decision was made after further discussion with leadership.
The e2e tests (https://docs.google.com/spreadsheets/d/1mr633YgQItJ0XhbiFkeSRhdLlk6m9vzk1YSKQPHgSvw/edit?gid=0#gid=0) have identified a few bugs that need to be resolved before the General Availability (GA) release. Ideally, these should be addressed in the final month before GA when all features are development complete. However, asking component teams to commit to fixing critical rollback bugs during this time could potentially delay the GA date.

------

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Red Hat Support assisted z-stream rollback from 4.16+

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Red Hat Support may, at their discretion, assist customers with z-stream rollback once it’s determined to be the best option for restoring a cluster to the desired state whenever a z-stream update compromises cluster functionality.

Engineering will take a “no regressions, no promises” approach, ensuring there are no major regressions between z-streams, but not testing specific combinations or addressing case-specific bugs.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Public Documentation (or KCS?) that explains why we do not advise unassisted z-stream rollback and what to do when a cluster experiences loss of functionality associated with a z-stream upgrade.
  • Internal KCS article that provides a comprehensive plan for troubleshooting and resolving issues introduced after applying a z-stream update, up to and including complete z-stream rollback.
  • Should include alternatives such as limited component rollback (single operator, RHCOS, etc) and workaround options
  • Should include incident response and escalation procedures for all issues incurred during application of a z-stream update so that even if rollback is performed we’re tracking resolution of defects with highest priority
  • Foolproof command to initiate z-stream rollback with Support’s approval, aka a hidden command that ensures we don’t typo the pull spec or initiate A->B->C version changes, only A->B->A
  • Test plan and jobs to ensure that we have high confidence in ability to rollback a z-stream along happy paths
  • Need not be tested on all platforms and configurations, likely metal or vSphere and one foolproof platform like AWS
  • Test should not monitor for disruption since it’s assumed disruption is tolerable during an emergency rollback provided we achieve availability at the end of the operation
  • Engineering agrees to fix bugs which inhibit rollback completion before the current master branch release ships, aka they’ll be filed as blockers for the current master branch release. This means bugs found after 4.N branches may not be fixed until the next release without discussion.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (applicable specific needs; N/A = not applicable):

  • Self-managed: all
  • Multi node, Compact (three node): all
  • Connected and Restricted Network: all
  • Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
  • Release payload only: all
  • Starting with 4.16, including all future releases: all

While this feature applies to all deployments we will only run a single platform canary test on a high success rate platform, such as AWS. Any specific ecosystems which require more focused testing should bring their own testing.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an admin who has determined that a z-stream update has compromised cluster functionality I have clear documentation that explains that unassisted rollback is not supported and that I should consult with Red Hat Support on the best path forward.

As a support engineer I have a clear plan for responding to problems which occur during or after a z-stream upgrade, including the process for rolling back specific components, applying workarounds, or rolling the entire cluster back to the previously running z-stream version.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Should we allow rollbacks whenever an upgrade doesn’t complete? No, not without fully understanding the root cause. If it’s simply a situation where workers are in process of updating but stalled, that should never yield a rollback without credible evidence that rollback will fix that.

Similar to our “foolproof command” to initiate rollback to the previous z-stream, should we also craft a foolproof command to override select operators to previous z-stream versions? Part of the goal of the foolproof command is to avoid the potential for moving to an unintended version; the same risk applies at the single-operator level, and though the impact would be smaller, it could still be catastrophic.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Non-HA clusters, Hosted Control Planes – those may be handled via separately scoped features

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Occasionally clusters either upgrade successfully and encounter issues after the upgrade or may run into problems during the upgrade. Many customers assume that a rollback will fix their concerns but without understanding the root cause we cannot assume that’s the case. Therefore, we recommend anyone who has encountered a negative outcome associated with a z-stream upgrade contact support for guidance.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

It’s expected that customers should have adequate testing and rollout procedures to protect against most regressions, i.e. roll out a z-stream update in pre-production environments where it can be adequately tested prior to updating production environments.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

This is largely a documentation effort, i.e. we should create either a KCS article or new documentation section which describes how customers should respond to loss of functionality during or after an upgrade.
KCS Solution : https://access.redhat.com/solutions/7083335 

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Given that we test as many upgrade configurations as possible and an upgrade may still encounter problems for whatever reason, we should not strive to comprehensively test all configurations for rollback success. We will only test a limited set of platforms and configurations necessary to ensure that we believe the platform is generally able to roll back a z-stream update.

Epic Goal

  • Validate z-stream rollbacks in CI starting with 4.10 by ensuring that a rollback completes unassisted and e2e testsuite passes
  • Provide internal documentation (private KCS article) that explains when this is the best course of action versus working around a specific issue
  • Provide internal documentation (private KCS article) that explains the expected cluster degradation until the rollback is complete
  • Provide internal documentation (private KCS article) outlining the process and any post rollback validation

Why is this important?

  • Even if upgrade success is 100% there's some chance that we've introduced a change which is incompatible with a customer's needs and they desire to roll back to the previous z-stream
  • Previously we've relied on backup and restore here, however due to many problems with time travel, that's only appropriate for disaster recovery scenarios where the cluster is either completely shut down already or it's acceptable to do so while also accepting loss of any workload state change (PVs that were attached after the backup was taken, etc)
  • We believe that we can reasonably roll back to a previous z-stream

Scenarios

  1. Upgrade from 4.10.z to 4.10.z+n
  2. oc adm upgrade rollback-z-stream – which will initially be a hidden command; it will look at the ClusterVersion history and roll back to the previous version if and only if that version is a z-stream away (see the validation sketch after this list)
  3. Rollback from 4.10.z+n to exactly 4.10.z, during which the cluster may experience degraded service and/or periods of service unavailability but must eventually complete with no further admin action
  4. Must pass 4.10.z e2e testsuite
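
A minimal sketch of the version check such a hidden command could perform, assuming simple x.y.z version strings taken from ClusterVersion history (the real oc implementation is not shown here):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseXYZ splits "4.10.34" into numeric components. Versions with pre-release
// or build suffixes are not handled in this sketch.
func parseXYZ(v string) (x, y, z int, err error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) != 3 {
		return 0, 0, 0, fmt.Errorf("unexpected version %q", v)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		if nums[i], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, fmt.Errorf("unexpected version %q: %w", v, err)
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// validateRollback allows only an A->B->A move within the same y-stream, i.e.
// rolling back from the current version to the immediately previous entry in
// ClusterVersion history when the two differ only in the z digit.
func validateRollback(current, previous string) error {
	cx, cy, cz, err := parseXYZ(current)
	if err != nil {
		return err
	}
	px, py, pz, err := parseXYZ(previous)
	if err != nil {
		return err
	}
	if cx != px || cy != py {
		return fmt.Errorf("refusing rollback: %s -> %s crosses a y-stream boundary", current, previous)
	}
	if cz == pz {
		return fmt.Errorf("refusing rollback: %s and %s are the same z-stream", current, previous)
	}
	return nil
}

func main() {
	fmt.Println(validateRollback("4.10.34", "4.10.30")) // <nil>
	fmt.Println(validateRollback("4.11.1", "4.10.30"))  // error: crosses a y-stream boundary
}
```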

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Fix all bugs listed here
    project = "OpenShift Bugs" AND affectedVersion in( 4.12, 4.14, 4.15) AND labels = rollback AND status not in (Closed ) ORDER BY status DESC

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. At least today we intend to only surface this process internally and work through it with customers actively engaged with support; where do we put that?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (Goal Summary):

Identify and fill gaps related to CI/CD for HyperShift-ARO integration. 

Goals (Expected User Outcomes):

  • Establish a consistent and automated testing pipeline that reduces the risk of regressions and improves code quality.
  • Enhance the user experience by ensuring that HyperShift's integration with Azure is thoroughly tested and reliable.

Requirements (Acceptance Criteria):

  • Develop and integrate an Azure-specific testing suite that can be run as part of the presubmit checks.
  • Implement a schedule for periodic conformance tests on ARO. 
  • Maintain documentation that guides contributors on how to write and run tests within the ARO ecosystem.

Use Cases

  • A developer submits a pull request to HyperShift’s codebase, which automatically triggers the Azure-specific presubmit tests.
  • Scheduled conformance tests run automatically at predetermined intervals, providing ongoing assurance of ARO's integration with Azure.

Acceptance Criteria:

Description of criteria:

  • Conformance tests run periodically for Azure

This does not require a design proposal.
This does not require a feature gate.

Currently, when the resource group is created it will be deleted after 24 hours. We can change this by setting an expiry tag on the resource group once it is created.

The ability to set this expiry tag when creating the resource group as part of the cluster creation command would be nice
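
A minimal sketch of setting such a tag at resource-group creation time with the Azure SDK for Go. The tag key ("expirationDate"), its format, and the TTL are assumptions for illustration; whatever cleanup tooling reaps CI resource groups defines the real convention.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"
)

// createResourceGroupWithExpiry creates (or updates) a resource group and tags
// it with an expiry timestamp so it is not reaped after the default 24 hours.
func createResourceGroupWithExpiry(ctx context.Context, subscriptionID, name, location string, ttl time.Duration) error {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		return err
	}
	client, err := armresources.NewResourceGroupsClient(subscriptionID, cred, nil)
	if err != nil {
		return err
	}
	expiry := time.Now().Add(ttl).UTC().Format(time.RFC3339)
	_, err = client.CreateOrUpdate(ctx, name, armresources.ResourceGroup{
		Location: to.Ptr(location),
		Tags: map[string]*string{
			"expirationDate": to.Ptr(expiry), // assumed tag key honored by the cleanup job
		},
	}, nil)
	return err
}

func main() {
	err := createResourceGroupWithExpiry(context.Background(), "<subscription-id>", "my-hc-rg", "eastus", 72*time.Hour)
	fmt.Println(err)
}
```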

We want to make sure ARO/HCP development happens while satisfying e2e expectations

Acceptance Criteria:

There's a running, blocking test for azure in presubmits

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • Discussed creating an Azure HC on our root CI cluster, which would then be used as our Azure management cluster. We would need to:
    • Manually create the Azure HC on the root CI cluster
    • Capture the manifests for that and put them in contrib
    • Configure the IDP
    • Set up credentials

This does not require a design proposal.
This does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Create ARO HCs off an ARO MGMT cluster

so that I can achieve

  • Better Dev flow
  • Enable pre-submit e2e testing for ARO dev work

Acceptance Criteria:

Description of criteria:

  • ARO mgmt cluster exists.
  • Documents to reproduce mgmt cluster env exist.

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.

Phase one: GCP tech preview

The boot image references are currently saved off in the MachineSet by the OpenShift installer and are thereafter unmanaged. This MachineSet object is not updated on an upgrade, so any node scaled up using it will boot with the original “install” boot image.

The “new” boot image references are available in configmap/coreos-bootimages in the MCO namespace. Here is the PR that implemented this; it is basically a CVO manifest that pulls from this file in the installer binary. Hence, they are updated on an upgrade. They can also be printed to the console with the following installer command: openshift-install coreos print-stream-json.

Implementing this portion should be as simple as iterating through each MachineSet and updating the disk image by cross-referencing the configmap, architecture, region and the platform used in the MachineSet. This is where the installer figures out the boot image during an install, so we could model a bit after this.

It looks like we have Machine API objects for every platform-specific providerSpec (formerly called providerConfig) we support here. We'd still have to special-case the actual image/AMI portion of this, but we should be able to leverage some of the work done in the installer (to generate MachineSets, for example, for GCP) to understand how the image reference is stored for every platform. A rough sketch of the reconciliation pass follows.
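
A rough sketch of such a sub-controller's reconciliation pass for GCP, using the dynamic client and unstructured access so it stays independent of the Machine API Go types. The MCO namespace, the "stream" data key, and the gcpBootImageFromStream helper are assumptions for illustration; they are not the actual MCO implementation.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// gcpBootImageFromStream would parse the CoreOS stream JSON stored in the
// coreos-bootimages configmap and return the GCP image for the given arch.
// Its implementation is intentionally omitted in this sketch.
func gcpBootImageFromStream(stream, arch string) (string, error) {
	return "", fmt.Errorf("not implemented in this sketch")
}

func reconcileGCPBootImages(ctx context.Context, cfg *rest.Config) error {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// The "golden" configmap published via a CVO manifest (data key assumed to be "stream").
	cm, err := cs.CoreV1().ConfigMaps("openshift-machine-config-operator").Get(ctx, "coreos-bootimages", metav1.GetOptions{})
	if err != nil {
		return err
	}
	image, err := gcpBootImageFromStream(cm.Data["stream"], "x86_64")
	if err != nil {
		return err
	}

	gvr := schema.GroupVersionResource{Group: "machine.openshift.io", Version: "v1beta1", Resource: "machinesets"}
	list, err := dyn.Resource(gvr).Namespace("openshift-machine-api").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range list.Items {
		ms := &list.Items[i]
		disks, found, err := unstructured.NestedSlice(ms.Object, "spec", "template", "spec", "providerSpec", "value", "disks")
		if err != nil || !found {
			continue // not a GCP-shaped providerSpec; skip
		}
		changed := false
		for _, d := range disks {
			disk, ok := d.(map[string]interface{})
			if !ok {
				continue
			}
			if disk["image"] != image {
				disk["image"] = image
				changed = true
			}
		}
		if !changed {
			continue
		}
		if err := unstructured.SetNestedSlice(ms.Object, disks, "spec", "template", "spec", "providerSpec", "value", "disks"); err != nil {
			return err
		}
		if _, err := dyn.Resource(gvr).Namespace("openshift-machine-api").Update(ctx, ms, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	if err := reconcileGCPBootImages(context.Background(), cfg); err != nil {
		panic(err)
	}
}
```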

Done when:

For MVP, the goal is to

  • add a new sub controller within the MCC. This subcontroller can be triggered by a listener on the machinesets and if any changes happen to the "golden" configmap mentioned above
  • We'll support GCP to start. I'll make a follow-up card for the other platforms, but I'm open to adding more here if needed! 

Feature Overview (Goal Summary)

This feature is dedicated to enhancing data security and implementing encryption best practices across control planes, etcd, and nodes for HyperShift with Azure. The objective is to ensure that all sensitive data, including secrets, is encrypted, thereby safeguarding against unauthorized access and ensuring compliance with data protection regulations.

 

User Story:

As a user of HCP on Azure, I would like to be able to pass a customer-managed key when creating a HC so that the disks for the VMs in the NodePool are encrypted.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of HCP on Azure, I want to be able to provide a DiskEncryptionSet ID to encrypt the OS disks for the VMs in the NodePool so that the data on the OS disks will be protected by encryption.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation added on what is needed for the Azure Key Vault and how to encrypt the OS disks through both the CLI and the CR spec.
  • HyperShift CLI lets a user provide a DiskEncryptionSet ID to encrypt the OS disk.
  • Ability to encrypt the OS disks through the HyperShift CLI.
  • Ability to encrypt the OS disks through the HC CR.
  • Any applicable unit tests.

Out of Scope:

N/A

Engineering Details:

User Story:

  • As a service provider/consumer I want to make sure secrets are encrypted with a key owned by the consumer

Acceptance Criteria:

Expose and propagate input for KMS secret encryption, similar to what we do for AWS.

https://github.com/openshift/hypershift/blob/90aa44d064f6fe476ba4a3f25973768cbdf05eb5/api/v1beta1/hostedcluster_types.go#L1765-L1790

 

See related discussion:

https://redhat-internal.slack.com/archives/CCV9YF9PD/p1696950850685729

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (Goal Summary)

This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.

Goal

  • All Azure API fields should have appropriate definitions as to their use and purpose.
  • All Azure API fields should have appropriate k8s cel added.

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As ARO/HCP provider, I want to be able to:

  • receive a customer's subnet id 

so that I can achieve

  • parse out the resource group name, VNet name, and subnet name for use in CAPZ (creating VMs) and the Azure CCM (setting up the Azure cloud provider); see the parsing sketch after this list.
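
A minimal sketch of parsing the customer-supplied subnet ID into its components, assuming the standard Azure resource ID layout (this is illustrative, not the HyperShift implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// parseSubnetID extracts the resource group, VNet, and subnet names from a
// full Azure subnet resource ID, e.g.
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
func parseSubnetID(id string) (resourceGroup, vnet, subnet string, err error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	// The ID is a flat list of alternating key/value path segments.
	kv := map[string]string{}
	for i := 0; i+1 < len(parts); i += 2 {
		kv[strings.ToLower(parts[i])] = parts[i+1]
	}
	resourceGroup, vnet, subnet = kv["resourcegroups"], kv["virtualnetworks"], kv["subnets"]
	if resourceGroup == "" || vnet == "" || subnet == "" {
		return "", "", "", fmt.Errorf("malformed subnet ID %q", id)
	}
	return resourceGroup, vnet, subnet, nil
}

func main() {
	rg, vnet, subnet, err := parseSubnetID("/subscriptions/123/resourceGroups/customer-rg/providers/Microsoft.Network/virtualNetworks/customer-vnet/subnets/nodes")
	fmt.Println(rg, vnet, subnet, err) // customer-rg customer-vnet nodes <nil>
}
```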

Acceptance Criteria:

Description of criteria:

  • Customer hosted cluster resources get created in a managed resource group.
  • Customer vnet is used in setting up the VMs.
  • Customer resource group remains unchanged.

Engineering Details:

  • Benjamin Vesel said the managed resource group and customer resource group would be under the same subscription id.
  • We are only supporting BYO VNET at the moment; we are not supporting the VNET being created with the other cloud resources in the managed resource group.

This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.

User Story:

As ARO/HCP provider, I want to be able to:

  • specify which resource group is going to be used for the resources in the customer side

so that I can achieve

  • control over where resources get allocated instead of getting defaults.

Acceptance Criteria:

Description of criteria:

  • Customer side resources get created in the API specified resource group.

Engineering Details:


 This requires a design proposal so OCM knows where to specify the resource group.
 This might require a feature gate in case we don't want it for self-managed.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user of ARO HCP, I want to be able to:

  • specify which subnet my NodePool belongs to

so that I can achieve

  • place different NodePools in different subnets.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Ability to specify subnet per NodePool

Out of Scope:

  • N/A

Engineering Details:

  • N/A

This requires/does not require a design proposal.
This requires/does not require a feature gate.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As HyperShift, I want to use the service principal as the identity type for CAPZ, so that the warning message about using a manual service principal in the capi-provider pod goes away.

Acceptance Criteria:

Description of criteria:

  • The identity type for CAPZ is set to service principal.
  • The warning messages about manual service principal in the capi-provider pod are gone.

Out of Scope:

  • N/A

Engineering Details:

  • Manual service principal authentication was deprecated and will be removed in the future; further details are available upstream.

This does not require a design proposal.
This does not require a feature gate.

User Story:

As a user of HyperShift, I want to be able to create Azure VMs with ephemeral disks, so that I can achieve higher IOPS.

Note: AKS also defaults to using them.

Acceptance Criteria:

Description of criteria:

  • Verify upstream documentation exists on how to setup Azure VMs with ephemeral disks
  • HyperShift documentation added on how to setup Azure VMs with ephemeral disks
  • Capability to create clusters with Azure VMs with ephemeral disks exists
  • Capability to create NodePools with Azure VMs with ephemeral disks exists

Out of Scope:

N/A

Engineering Details:

  • Upstream documentation exists in CAPZ here.

This does not require a design proposal.
This does not require a feature gate.

Feature Overview (aka. Goal Summary)

The feature is specifically designed to concentrate on network optimizations, particularly targeting improvements in how the network is configured and how access is managed using Cluster API (CAPI) for Azure (potentially running the control plane on AKS).

Goal

The current HCP implementation lets the Azure CCM pod create a load balancer (LB) and public IP address for guest cluster egress. The outbound SNAT is using default port allocation.

ARO HCP needs more control over how the LB is created and set up. Ideally, it would be nice to have CAPZ create and manage the LB. ARO HCP would also like the ability to utilize a LB with user-defined routing (UDR).

Why is this important?

Utilizing a LB for guest cluster egress is the better option cost wise and availability wise compared to NAT Gateway. NAT Gateways are more expensive and also zonal.

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a hosted cluster deployer, I want to be able to:

  • specify network security groups

so that I can achieve

  • Network flexibility to run my workloads

Acceptance Criteria:

  • Able to specify network security group when creating Azure cluster
  • Able to specify network security group when creating Azure infra

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Use a pre-existing network security group when creating an Azure cluster

so that I can achieve

  •  more network flexibility in my Hosted Control planes deployment

Acceptance Criteria:

Description of criteria:

  • Able to specify network security group when creating Azure cluster
  • Able to specify network security group when creating Azure infra

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

 Feature Overview (Goal Summary)

 This feature introduces automatic Etcd snapshot functionality for self-managed hosted control planes, expanding control and flexibility for users. Unlike managed hosted control planes, self-managed environments allow for customized configurations. This feature aims to enable users to leverage any S3-compatible storage for etcd snapshot storage, ensuring high availability and resilience for their OpenShift clusters.

Goals (Expected User Outcomes)

  • Primary User Persona: Cluster Service Providers 
  • User Benefit: Enhanced data protection and quicker disaster recovery for Hosted Clusters through automated etcd snapshots.

Requirements (Acceptance Criteria)

  • Automatic Snapshot Creation: etcd snapshots must be taken automatically at regular intervals.
  • S3 Storage: Support for any S3-compatible storage for snapshot storage (see the sketch after this list).
  • Snapshot Rotation and Retention Policy: Snapshots are rotated/removed after a specified period to manage storage efficiently.
  • Restoration SOP: Standard Operating Procedures for etcd restoration should be established, targeting a recovery time objective (RTO) of approximately 1 hour at most, preferably automated as well.
  • Metrics: Track Mean Time to Recovery (MTTR) for improved reliability. Do we have metrics?
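
A minimal sketch of one possible snapshot-and-upload flow, assuming direct etcd client access and an S3-compatible endpoint; TLS/auth options, retention, and scheduling are omitted, and the actual HyperShift backup mechanism may differ.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	awsconfig "github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// snapshotToS3 takes a point-in-time etcd snapshot and uploads it to an
// S3-compatible bucket. Endpoint, bucket, and key layout are placeholders.
func snapshotToS3(ctx context.Context, etcdEndpoint, s3Endpoint, bucket string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{etcdEndpoint}, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	rc, err := cli.Snapshot(ctx) // stream the snapshot from etcd
	if err != nil {
		return err
	}
	defer rc.Close()

	// Spool to a temp file so the upload body is seekable; a production
	// implementation might stream or use multipart upload instead.
	tmp, err := os.CreateTemp("", "etcd-snapshot-*.db")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()
	if _, err := io.Copy(tmp, rc); err != nil {
		return err
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		return err
	}

	cfg, err := awsconfig.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	s3c := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(s3Endpoint) // any S3-compatible storage
		o.UsePathStyle = true
	})
	key := fmt.Sprintf("etcd-snapshots/%s.db", time.Now().UTC().Format("20060102-150405"))
	_, err = s3c.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   tmp,
	})
	return err
}

func main() {
	if err := snapshotToS3(context.Background(), "https://etcd-client:2379", "https://minio.example.com", "hc-etcd-backups"); err != nil {
		panic(err)
	}
}
```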

 

Goal

  • Prescriptive guide for leveraging OADP/Velero to perform hosted cluster back up and restore

Why is this important?

  • Customers need guidance on how they can perform back up and restore using OADP and what can and can't be done.

Scenarios

  1. Disaster recovery
  2. Migration

Acceptance Criteria

  • Dev - Documentation
  • CI - Test that backs up and restores a non empty hosted cluster
  • QE - covered in Polarion test plan and tests implemented
  • DOCS - Inclusion in the release documentation

Dependencies (internal and external)

  1. OADP
  2. Velero

Open questions:

  1. What is the maximum scope that can be done without plugins?
  2. What is and is not backed up?
  3. Ordering considerations

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an engineer I need to perform a spike on backup and restore procedures for a HostedCluster using OADP, in order to know whether that could be an alternative or a resource for the new etcd Backup API for HCP.

Feature Overview (aka. Goal Summary)

Stop generating long-lived service account tokens. Long-lived service account tokens are currently generated in order to then create an image pull secret for the internal image registry. This feature calls for using the TokenRequest API to generate a bound service account token for use in the image pull secret.
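
A minimal sketch of how a bound token could be requested and packaged as a pull secret via the TokenRequest API. The registry host, token lifetime, secret name, and username placeholder are illustrative assumptions; the real controller in openshift-controller-manager may differ.

```go
package main

import (
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// buildPullSecret requests a short-lived, bound token for a service account
// and packages it as a .dockerconfigjson pull secret for the internal registry.
func buildPullSecret(ctx context.Context, cs kubernetes.Interface, ns, sa string) (*corev1.Secret, error) {
	exp := int64((4 * time.Hour).Seconds()) // assumed lifetime for this sketch
	tr, err := cs.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa, &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &exp},
	}, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}

	registry := "image-registry.openshift-image-registry.svc:5000"
	// The username portion is illustrative; the registry authenticates on the token.
	auth := base64.StdEncoding.EncodeToString([]byte("<token>:" + tr.Status.Token))
	cfg := map[string]interface{}{"auths": map[string]interface{}{registry: map[string]string{"auth": auth}}}
	raw, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}

	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: sa + "-dockercfg", Namespace: ns},
		Type:       corev1.SecretTypeDockerConfigJson,
		Data:       map[string][]byte{corev1.DockerConfigJsonKey: raw},
	}, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	sec, err := buildPullSecret(context.Background(), cs, "my-namespace", "default")
	fmt.Println(sec.GetName(), err)
}
```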

Goals (aka. expected user outcomes)

Use TokenRequest API to create image pull secrets. 
Performance benefits:

One less secret created per service account. This will result in at least three fewer secrets generated per namespace.

Security benefits:

Long-lived tokens are no longer recommended, as they present a possible security risk.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The upstream test `ServiceAccounts no secret-based service account token should be auto-generated` was previously patched to allow for the internal image registry's managed image pull secret to be present in the `Secrets` field. This will no longer be the case as of 4.16.

Post merge of API-1644, we can remove the patch entirely.

Executive Summary

Provide mechanisms for the builder service account to be made optional in core OpenShift.

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

  • Let cluster administrators disable the automatic creation of the "builder" service account when the Build capability is disabled on the cluster. This reduces potential attack vectors for clusters that do not run build or other CI/CD workloads. Example - fleets for mission-critical applications, edge deployments, security-sensitive environments.
  • Let cluster administrators enable/disable the generation of the "builder" service account at will. Applies to new installations with the "Build" capability enabled as well as upgraded clusters. This helps customers who are not able to easily provision new OpenShift clusters and block usage of the Build system through other means (ex: RBAC, 3rd party admission controllers (ex OPA, Kyverno)).

Requirements

Requirement / Notes / Is MVP:

  • Disable the service account controller related to Build/BuildConfig when the Build capability is disabled. Notes: when the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC. Is MVP: Yes
  • Option to disable the "builder" service account. Notes: even if the Build capability is enabled, allow admins to disable the "builder" service account generation; admins will need to bring their own service accounts/RBAC for builds to work. Is MVP: Yes

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

  • Build as an installation capability - see WRKLDS-695
  • Disabling the Build system through RBAC or admission controllers. The "builder" service account is the only thing that RBAC and admission control cannot block without significant cluster impact.

Out of scope

<Defines what is not included in this story>

  • Disabling the Build API separately from the capabilities feature

Dependencies

< Link or at least explain any known dependencies. >

  • Build capability: WRKLDS-695
  • Separate controllers for default service accounts: API-1651

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

  • In OCP 4.14, "Build" was introduced as an optional installation capability. This means that the BuildConfig API and subsystems are not guaranteed to be present on new clusters.
  • The "builder" service account is granted permission to push images to the OpenShift internal registry via a controller. There is risk that the service account can be used as an attack vector to overwrite images in the internal registry.
  • OpenShift has an existing API to configure the build system. See OCP documentation on the current supported options. The current OCP build test suite includes checks for these global settings. Source code.
  • Customers with larger footprints typically separate "CI/CD clusters" from "application clusters" that run production workloads. This is because CI/CD workloads (and building container images in particular) can have "noisy" consumption of resources that risk destabilizing running applications.

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

  • Must work for new installations as well as upgraded clusters.

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

  • Update the "Build configurations" doc so admins can understand the new feature.
  • Potential updates to the "Understanding BuildConfig" doc to include references to the serviceAccount option in the spec, as well as a section describing the permissions granted to the "builder" service account.

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

  • Disabling OCM Controllers (slides). Note that the controller names may be a bit out of date once API-1651 is done.
  • Install capabilities - OCP docs

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story (Required)

As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities I want the RBAC controllers for the builder and deployer service accounts and default image-registry rolebindings disabled when their respective capability is disabled.

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>

Background (Required)

<Describes the context or background related to this story>

In WRKLDS-695, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.

Out of scope

<Defines what is not included in this story>

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

    • Needs manual testing (OpenShift cluster deployed with all/some capabilities disabled). 

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

Acceptance Criteria (Mandatory)

  • Build and DeploymentConfig systems remain functional when the respective capability is enabled.
  • Build, DeploymentConfig, and Image-Puller RoleBinding controllers are not started when the respective capability is disabled.

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

  • Engineering: 5
  • QE: 2
  • Doc: 2

Legend

Unknown

Verified

Unsatisfied

Done Checklist

  • Code is completed, reviewed, documented and checked in
  • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
  • Continuous Delivery pipeline(s) is able to proceed with new code included
  • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
  • Acceptance criteria are met

Story (Required)

As an OpenShift engineer trying to use capabilities to enable/disable the Build and DeploymentConfig systems, I want to refactor the default rolebindings controller so that each respective capability runs a separate controller.

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer’s experience?>

Background (Required)

<Describes the context or background related to this story>

OpenShift has a controller that automatically creates role-bindings for service accounts in every namespace. Though only one controller operates, its logic contains forks that are specific to the Build and DeploymentConfig systems.

The goal is to refactor this into separate controllers so that individual ones can be disabled by the cluster-openshift-controller-manager-operator.

Out of scope

<Defines what is not included in this story>

  • Disabling the rolebindings controller via an operator.
  • Cleaning up rolebindings that are "orphaned" if the controller is disabled.

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

  • API-1651 - this was refactoring work taken on by the apiserver/auth team.

Acceptance Criteria (Mandatory)

<Describe edge cases to consider when implementing the story and defining tests>

<Provides a required and minimum list of acceptance tests for this story. More is expected as the engineer implements this story>

  • Separate rolebinding controllers exist for the builder and deployer service account rolebindings.
  • Build and DeploymentConfig systems remain functional when the respective capability is enabled.
  • The "image puller" role binding must continue to be created/reconciled.

INVEST Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

  • Eng: 3

Legend

Unknown
Verified
Unsatisfied

Done Checklist

  • Code is completed, reviewed, documented and checked in
  • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
  • Continuous Delivery pipeline(s) is able to proceed with new code included
  • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
  • Acceptance criteria are met

Problem statement

A 5G Core cluster communicates with many other clusters and remote hosts, aka remote sites. These remote sites are added, and sometimes removed temporarily or permanently, from the telecom operator's network. As of today, we need to add static routes on OCP nodes to be able to reach those remote sites, and updating static routes for each and every remote site change is unacceptable for Telco Operators, in particular as they do not have to configure static routes anywhere in their network since they rely on BGP to propagate routes across their network. They want OCP nodes to learn those BGP routes rather than creating a "script" that would translate BGP announcements into an NMSTATE configuration to be pushed onto OCP nodes.

 

More insights in (recording link on the title page) https://docs.google.com/presentation/d/1zW0-wmvtrU7dbApIaNWfvMbSZvLxoSMLa_gzRYn0Y3Y/edit#slide=id.gfb215b3717_0_3299

Vocabulary, definition

VRF: in this feature, a VRF is a Network VRF, not a Linux kernel VRF. A Network VRF is often implemented as a VLAN in a datacenter, but it is a purely logical entity on the routers/DCGW and is not visible on the OCP nodes. Please do not confuse this with another Feature that relies on Linux VRFs to map the Network VRFs on OCP nodes (https://issues.redhat.com/browse/TELCOSTRAT-76).

Feature description, scope

Any Kubernetes object/component learning/announcing routes, including pods, is not in this feature's scope. This feature is about OCP nodes (i.e. the Linux host) learning routes via BGP, and eventually monitoring their next hop with BFD (datacenter router, DCGW, fabric leaf, ...).

Route next hops can be on any OCP node interface (any VLAN, bond, or physical NIC): next hops are not necessarily reachable from the baremetal network. This means that we will have one BGP peer per VRF.

We want to be able to learn routes with the same weight/local-preference, translating to ECMP routes on the OCP nodes, and also routes with different weight/local-preference, translating to active/backup routes on OCP nodes. In all cases, we want to be able to monitor the routes via BFD as BGP timeouts are too high for some customers.

Illustration for active/backup - note that there can be more than two routers

 

Illustration for active/active (ECMP) - note that there can be more than two routers

Routing protocol scope

This feature is limited to BGP and BFD, and must support IPv4, IPv6, and the commonly used BGP attributes, typically the ones supported by MetalLB: https://docs.openshift.com/container-platform/4.12/networking/metallb/metallb-configure-bgp-peers.html#nw-metallb-bgppeer-cr_configure-metallb-bgp-peers

The OCP nodes will learn routes but will not announce routes, except the MetalLB ones (see the BGPPeer sketch below).
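
For reference, a minimal sketch of the MetalLB BGPPeer resource referenced above; the ASNs, peer address and BFD profile name are illustrative, and one such peer would exist per VRF/router:

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: dcgw-vrf-a            # hypothetical peer name
  namespace: metallb-system
spec:
  myASN: 64500                # illustrative local ASN
  peerASN: 64501              # illustrative DCGW / fabric leaf ASN
  peerAddress: 10.0.0.1       # illustrative next-hop router address
  bfdProfile: fast-detect     # references a BFDProfile resource so the session is monitored with BFD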

Scale requirements

Expectation is to have 3 VRFs and two routers per VRF. We should scale beyond that, in particular for the number of VRFs, but the number of routers per VRF should be 4 at a maximum, as best and sane practice is 2 and we do not want to encourage faulty/wrong network designs. Of course, we can amend this Feature's scope for future evolution if best practices evolve.

Interoperability requirements

FRR is interoperable with all/most existing routers, and as Red Hat is upstream based, interoperability tests are not required but any interoperability issue will be fixed (upstream first, like for any code change at Red Hat).

Assumption

This feature is only relevant for local gateway mode (not shared gateway).

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Have FRR-K8s as another BGP backend for MetalLB
  • Have the MetalLB Operator deploy both MetalLB and the FRR-K8s daemon
  • Have users configure the FRR instance running on each node beyond the capabilities offered by MetalLB (see the sketch after this list)
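
A rough sketch of the kind of per-node FRR configuration this implies, assuming the upstream frr-k8s FRRConfiguration API; the group/version, field names and values below are assumptions for illustration only, not a confirmed downstream API:

apiVersion: frrk8s.metallb.io/v1beta1    # assumed group/version
kind: FRRConfiguration
metadata:
  name: receive-all-routes               # hypothetical name
  namespace: metallb-system
spec:
  bgp:
    routers:
    - asn: 64500                          # illustrative local ASN
      neighbors:
      - address: 10.0.0.1                 # illustrative peer address
        asn: 64501
        toReceive:
          allowed:
            mode: all                     # learn every route advertised by the peer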

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CNF-8566

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Telecommunications providers look to displace Physical Network Functions (PNFs) with modern Virtual Network Functions (VNFs) at the Far Edge. Single Node OpenShift, as the CaaS layer in the vRAN vDU architecture, must achieve a higher standard of OpenShift upgrade speed and efficiency in comparison to PNFs.

Telecommunications providers currently deploy Firmware-based Physical Network Functions (PNFs) in their RAN solutions. These PNFs can be upgraded quickly due to their monolithic nature and image-based download-and-reboot upgrades. Furthermore, they often have the ability to retry upgrades and to roll back to the previous image if the new image fails. These Telcos are looking to displace PNFs with virtual solutions, but will not do so unless the virtual solutions have comparable operational KPIs to the PNFs.

Goals

Service Downtime

Service (vDU) Downtime is the time when the CNF is not operational and therefore no traffic is passing through the vDU. This has a significant impact as it degrades the customer’s service (5G->4G) or there’s an outright service outage. These disruptions are scheduled into Maintenance Windows (MW), but the Telecommunications Operator's primary goal is to keep service running, so getting vRAN solutions with OpenShift to near PNF-like Service Downtime is and always will be a primary requirement.

 

Upgrade Duration

Upgrading OpenShift is only one of many operations that occur during a Maintenance Window. Reducing the CaaS upgrade duration is meaningful to many teams within a Telecommunications Operator's organization, as this duration fits into a larger set of activities that put pressure on the time allotted to Red Hat software. OpenShift must reduce the upgrade duration significantly to compete with existing PNF solutions.

 

Failure Detection and Remediation

As mentioned above, the Service Downtime disruption duration must be as small as possible; this includes when there are failures. Hardware failures fall into a category called Break+Fix and are covered by TELCOSTRAT-165. In the case of software, failures must be detected and remediation must occur.

Detection includes monitoring the upgrade for stalls and failures and remediation would require the ability to rollback to the previously well-known-working version, prior to the failed upgrade.

 

Implicit Requirements

Upgrade To Any Release

The OpenShift product support terms are too short for Telco use cases, in particular vRAN deployments. The risk of Service Downtime drives Telecommunications Operators to a certify-deploy-and-then-don’t-touch model. One specific request from our largest Telco Edge customer is for 4 years of support.

These longer support needs drive a misalignment with the EUS->EUS upgrade path and drive the requirement that the Single Node OpenShift deployment can be upgraded from OCP X.y.z to any future release, where the target major and minor versions are decided by the Telecommunications Operator depending on timing and the desired feature set, and the target patch version is determined through Red Hat, vDU vendor, and customer maintenance and engineering validation.

 

Alignment with Related Break+Fix and Installation Requirements

Red Hat is challenged with improving multiple OpenShift Operational KPIs by our telecommunications partners and customers. Improved Break+Fix is tracked in TELCOSTRAT-165 and improved Installation is tracked in TELCOSTRAT-38.

 

Seamless Management within RHACM

Whatever methodology achieves the above requirements must ensure that the customer has a pleasant experience via RHACM and Red Hat GitOps. Red Hat’s current install and upgrade methodology is via RHACM and any new technologies used to improve Operational KPIs must retain the seamless experience from the cluster management solution. For example, after a cluster is upgraded it must look the same to a RHACM Operator.

 

Seamless Management when On-Node Troubleshooting

Whatever methodology achieves the above requirements must ensure that a technician troubleshooting a Single Node OpenShift deployment has a pleasant experience. All commands issued on the node must return output as it would before performing an upgrade.

 

Requirements

Y-Stream

  • CaaS Upgrade must complete in <2 hrs (single MW) (20 mins for PNF)
  • Minimize service disruption (customer impact) < 30 mins [cumulative] (5 mins for PNF)
  • Support In-Place Upgrade without vDU redeployment
  • Support backup/restore and rollback to known working state in case of unexpected upgrade failures

 

Z-Stream

  • CaaS Patch Release (z-release) Upgrade Requirements:
  • CaaS Upgrade must complete <15 mins
  • Minimize service disruption (customer impact) < 5 mins
  • Support In-Place Upgrade without vDU redeployment
  • Support backup/restore and rollback to known working state in case of unexpected upgrade failures

 

Rollback

  • Restore and Rollback functionality must complete in 30 minutes or less.

 

Upgrade Path

  • Allow for upgrades to any future supported OCP release.

 

References

Feature goal (what are we trying to solve here?)

A systemd service that runs on a golden image's first boot and configures the following:

 1. networking (the internal IP address requires special attention)

 2. Update the hostname (MGMT-15775)

 3. Execute recert (regenerate certs, cluster name and base domain, MGMT-15533)

 4. Start kubelet

 5. Apply the personalization info:

  1. Pull Secret
  2. Proxy
  3. ICSP
  4. DNS server
  5. SSH keys

DoD (Definition of Done)

  1. The API for configuring the networking, hostname, personalization manifests and executing recert is well defined
  2. The service configures the required attributes and starts a functional OCP cluster
  3. CI job that reconfigures a golden image and forms a functional cluster

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

    • Internal request

    • Catching up with OpenShift

      Reasoning (why it’s important?)

The following features depend on this functionality:

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

In IBI and IBU flows we need a way to change the nodeip-configuration hint file without a reboot and before MCO even starts. In order for MCO to be happy we need to remove this file from its management; to do that we will stop using a MachineConfig and move to Ignition.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Run TuneD in a container in one-shot mode and read the resulting kernel arguments, to apply them using a MachineConfig (MC) (see the sketch below).
This would be run in the bootstrap procedure of the OpenShift Installer, just before the MachineConfigOperator (MCO) procedure here
Initial considerations: https://docs.google.com/document/d/1zUpcpFUp4D5IM4GbM4uWbzbjr57h44dS0i4zP-hek2E/edit
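
For illustration, the kernel arguments produced by the one-shot TuneD run could then be rendered into a MachineConfig along these lines; the name, role label and argument values are placeholders:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-tuned-bootstrap-kernel-args            # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
  # Placeholder values standing in for whatever the TuneD profile emits.
  - skew_tick=1
  - nohz=on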

 

Webhooks created during bootstrap-in-place will cause failures when applying the resources subject to their admission. A permanent solution will be provided by resolving https://issues.redhat.com/browse/OCPBUGS-28550

We need a temporary workaround to unblock the development. This is easiest to do by changing the validating webhook failure policy to "Ignore"
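
A minimal sketch of that temporary workaround on a validating webhook; the webhook name, rules and service reference below are illustrative:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-validating-webhook        # hypothetical name
webhooks:
- name: validate.example.openshift.io     # hypothetical webhook
  admissionReviewVersions: ["v1"]
  sideEffects: None
  # Temporary workaround: do not block admission when the webhook backend
  # is unavailable during bootstrap-in-place.
  failurePolicy: Ignore
  clientConfig:
    service:
      name: example-webhook-service
      namespace: example-namespace
      path: /validate
  rules:
  - apiGroups: ["example.openshift.io"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["examples"]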

 

Problem statement

DPDK applications require dedicated CPUs, isolated from any preemption (other processes, kernel threads, interrupts), and this can be achieved with the “static” policy of the CPU Manager: the container resources need to include an integer number of CPUs with equal values in “limits” and “requests”. For instance, to get six exclusive CPUs:

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"

 

The six CPUs are dedicated to that container; however, non-trivial (that is, real) DPDK applications do not use all of those CPUs, as there is always at least one CPU running a slow path, processing configuration, printing logs (among DPDK coding rules: no syscall in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads which are intended to sleep, for instance infrastructure pthreads processing link change interrupts.

Can we envision going with two processes, one with the isolated cores, one with the slow-path ones, so we can have two containers? Unfortunately no: going to a multi-process design, where only dedicated pthreads would run in a given process, is not an option, as DPDK multi-process is being deprecated upstream and never picked up because it never properly worked. Fixing it and changing the DPDK architecture to systematically have two processes is absolutely not possible within a year, and would require all DPDK applications to be rewritten. Knowing that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.

The slow-path CPUs only consume a fraction of a real CPU and can safely run on the “shared” CPU pool of the CPU Manager; however, container specifications do not allow requesting two kinds of CPUs, for instance:

 

spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"

Why do we care about allocating one extra CPU per container?

  • Allocating one extra CPU means allocating an additional physical core, as the CPUs running the DPDK application should run on a dedicated physical core in order to get maximum and deterministic performance, since caches and CPU units are shared between the two hyperthreads.
  • CNFs are built with a minimum number of CPUs per container. Today this is still between 10 and 20, sometimes more, but the intent is to decrease the number of CPUs per container and increase the number of containers, as having containers that are too large to schedule wastes resources, like in the VNF days (the “tetris effect”); decreasing container size is the “cloud native” way to avoid that.

Let’s take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, with a slow path requiring 0.1 CPU, means that we waste 5 CPUs, i.e. 3 physical cores. With real-life numbers:

  • For a single datacenter composed of 100 nodes, we waste 300 physical cores
  • For a single datacenter composed of 500 nodes, we waste 1500 physical cores
  • For Single Node OpenShift deployed on 1 million nodes, we waste 3 million physical cores

Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…

 

Goals

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Questions to answer…

  • Would an implementation based on annotations be possible rather than an implementation requiring a container (so pod) definition change, like the CPU pooler does?

Out of Scope

Background, and strategic fit

This issue has been addressed lately by OpenStack.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

  • The feature needs documentation on how to configure OCP, create pods, and troubleshoot

Epic Goal

  • Provide a standardized upstream solution to achieve similar functionality to the mixed-cpu-node-plugin.
  • Provide a new API for requesting both shared and exclusive CPUs in the same container

Why is this important?

  • Upstream solutions are the Red Hat way
  • Decreases the overhead of maintaining an internal/downstream solution
  • Gain community traction.

Scenarios

  1. see previous work

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • KEP describing the solution, new APIs, deep dive, etc.
  • Having the solution merge in K8S/K8S-sigs relevant repo

Dependencies (internal and external)

  1. see previous work

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CNF-7603

Open questions::

  1. How to represent a pod which requests both shared and isolated CPUs? (DRA?!?)
  2. Where should the shared CPUs come from?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Test description: Create a performance profile with shared CPUs enabled. Create one guaranteed (gu) pod that has the shared CPU device enabled. Verify the pod has the shared CPUs exported, then disable the feature by modifying the PerformanceProfile, and verify that the cpuset of the pod doesn’t include the shared CPUs.

https://docs.google.com/document/d/1ci1pvLPrAaI5_K-I2IJf0MgYozUrhLM9ANOZkskMc_Y/edit#heading=h.iy76fjamvyjg

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Making the MixedCPUs feature GA during the 4.16 release time frame
  • Having a fully functional, comprehensive test suite exercising the feature on both cgroups versions (v1 and v2)
  • Having official OCP docs explaining how to use this feature. 

Why is this important?

  • This would unblock lots of options, including mixed CPU workloads where some CPUs could be shared among containers/pods (CNF-3706)
  • This would also allow further research on dynamic (simulated) hyper-threading (CNF-3743)

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CNF-7603
  2. https://issues.redhat.com/browse/CNF-9117 - Dev preview epic

Open questions:

 N/A

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network, and for that reason access to the host’s BMC is not allowed; additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it is at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore even after OpenShift has booted the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase the in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

requirement Notes isMvp?
     
     
     

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

As an operations person, I would like to have the tooling to write (flash) a generic seed image to disk so that it is bootable, and to run precaching as well, at scale.

Initial idea is to base the flow on anaconda: https://issues.redhat.com/browse/RHEL-2250

 

https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1702496826548079
 
 

When performing IBI, we do not currently wipe the installation disk beforehand.

As our experience with assisted-installer shows, this can create problems

We should see how to reuse the disk wiping that we already do in assisted: https://github.com/openshift/assisted-installer/blob/f4b8cfd85dfe8194aac489bacfb93ef8501fd290/src/installer/installer.go#L780

Allow attaching an ISO image that will be used for data on an already provisioned system using a BMH.

Currently this can be achieved using the existing BMH.Spec.Image fields, but this attempts to change the boot order of the system and relies on the host to fall back to the installed system when booting the image fails.
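
For context, the existing workaround mentioned above uses the BMH image fields roughly as sketched below; the host name and URL are placeholders, and this is exactly the behavior the new functionality is meant to replace:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                       # hypothetical host
  namespace: openshift-machine-api
spec:
  online: true
  image:
    # Placeholder data ISO; setting spec.image changes the boot order and
    # relies on the host falling back to the installed system.
    url: http://example.com/data.iso
    format: live-iso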

Scope questions:

  • Changes in Ironic?
    • Can we introduce an actual Ironic API for this (instead of vendor passthrough)?
      • Dmitry probed the community, the initial reaction was positive.
    • The functionality is probably mostly there but needs to be exposed
    • We'll likely also need to add support for this into Sushy
    • We expect the machine will already be provisioned when the ISO is attached
  • Changes in Metal3?
    • Yes, we'll need new functionality in the BMH
  • Changes in OpenShift?
    • no install-config or installer change
  • Spec/Design/Enhancements?
    • Ironic: probably not, but an RFE for sure
    • Metal3: probably yes, an API addition
    • OpenShift: no
  • Dependencies on other teams?
    • Telco field folks to help us bring hardware we can validate the approach on?
      • We can test on our hardware after a normal deployment

Feature Overview

  • Telco customers delivering servers to far edge sites (D-RAN DU) need the ability to upgrade or downgrade the servers' BIOS and firmware to specific versions, to ensure the server is configured as it was in their validated pattern. After delivering the bare metal to the site, this should be done prior to using ZTP to provision the node.

Goals

  • Allow the operator, via GitOps, to specify the BIOS image and any firmware images to be installed prior to ZTP
  • Seamless transition from pre-provisioning to existing ZTP solution
  • Integrate BIOS/Firmware upgrades/downgrades into TALM
  • (consider) integration with backup/restore recovery feature
  • Firmware could include: NICs, accelerators, GPUs/DPUs/IPUs

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • Can we provide the ability to fail back to a well known good firmware image if the upgrade/reboot fails?

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Currently the ROSA/ARO versions are not managed by the OTA team.
This Feature covers the engineering effort to move responsibility for OCP version management in OSD, ROSA and ARO from SRE-P to OTA.

Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1

Here are some objectives :

  • Managed clusters would get update recommendations from the Red Hat hosted OSUS directly, without intermediate layers like ClusterImageSet.
  • The new design should reduce the cost of maintaining versions for managed clusters including Hypershift hosted control planes.

Presentation from Jeremy Eder :

User Story

As a ROSA HyperShift customer I want to enforce that IMDSv2 is always the default, to ensure that I have the most secure setting by default.

Acceptance Criteria

  • IMDSv2 should be configured in HyperShift clusters by default

Default Done Criteria

  • All existing/affected SOPs have been updated.
  • New SOPs have been written.
  • Internal training has been developed and delivered.
  • The feature has both unit and end to end tests passing in all test
    pipelines and through upgrades.
  • If the feature requires QE involvement, QE has signed off.
  • The feature exposes metrics necessary to manage it (VALET/RED).
  • The feature has had a security review.
  • Contract impact assessment.
  • Service Definition is updated if needed.
  • Documentation is complete.
  • Product Manager signed off on staging/beta implementation.

Dates

Integration Testing:
Beta:
GA:

Current Status

GREEN | YELLOW | RED
GREEN = On track, minimal risk to target date.
YELLOW = Moderate risk to target date.
RED = High risk to target date, or blocked and need to highlight potential
risk to stakeholders.

References

Links to Gdocs, github, and any other relevant information about this epic.

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Epic Goal

The goal of this epic is to guarantee that all pods running within the ACM (Advanced Cluster Management) cluster adhere to Kubernetes Security Context Constraints (SCC). The implementation of a comprehensive SCC compliance checking system will proactively maintain a secure and compliant environment, mitigating security risks.

Why is this important?

Ensuring SCC compliance is critical for the security and stability of a Kubernetes cluster. 

Scenarios

A customer who is responsible for overseeing the operations of their cluster faces the challenge of maintaining a secure and compliant Kubernetes environment. The organization relies on the ACM cluster to run a variety of critical workloads across multiple namespaces. Security and compliance are top priorities, especially considering the sensitive nature of the data and applications hosted in the cluster.

Deployments to Investigate

Only Annotation Needed:

  • [ ] operator (Hypershift)

Further Investigation Needed

  • [ ] hypershift-addon-agent (Hypershift)
  • [ ] hypershift-install-job (Hypershift)

Acceptance Criteria

  • [ ] Develop a script capable of automated checks for SCC compliance for all pods within the ACM cluster, spanning multiple namespaces.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Value Statement

As an ACM admin, I want to add Kubernetes Security Context Constraints (SCC) V2 options to the component's resource YAML configuration to ensure that the Pod runs with the 'readonlyrootfilesystem' and 'privileged' settings, in order to enhance the security and functionality of our application.

In the resource config YAML, we need to add the following context:

securityContext:
  privileged: false
  readOnlyRootFilesystem: true

Affected resources:

  • [x] operator
  • [x] hypershift-addon-agent
  • [x] hypershift-install-job

Definition of Done for Engineering Story Owner (Checklist)

  • [x] Ensure that the Pod continues to function correctly with the new SCC V2 settings.
  • [x] Verify that the SCC V2 options are effective in limiting the Pod's privileges and restricting write access to the root filesystem.

Development Complete

  • The code is complete.
  • Functionality is working.
  • Any required downstream Docker file changes are made.

Tests Automated

  • [ ] Unit/function tests have been automated and incorporated into the
    build.
  • [ ] 100% automated unit/function test coverage for new or changed APIs.
  • Regression test is all we need for QE.

Secure Design

  • [ ] Security has been assessed and incorporated into your threat model.

Multidisciplinary Teams Readiness

Support Readiness

  • [ ] The must-gather script has been updated.

Epic Goal

This epic tracks any part of our codebase / solutions we implemented by taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we do not forget to improve it in a safer and more maintainable way.

Why is this important?

Maintainability and debuggability, and in general fighting technical debt, are critical to keeping velocity and ensuring overall high quality.

Scenarios

  1. N/A

Acceptance Criteria

  • depends on the specific card

Dependencies (internal and external)

  • depends on the specific card

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479 
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036

 Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Set of API updates, best practices adoption and cleanup to the e2e suite:

  • Replacing pointer package API
  • Passing contexts from the caller functions
  • Removing unused functions

Epic Goal

  • Reduce Installer Build Times

Why is this important?

Building CI Images has recently increased in duration, sometimes hitting 2 hours, which causes multiple problems:

  • Takes longer for devs to get feedback on PRs
  • Images can take so long to build that e2e tests hit timeout limits, causing them to fail and trigger retests, which wastes money

More importantly, the build times have gotten to a point where OSBS is failing to build the installer due to timeouts, which is making it impossible for ART to deliver the product or critical fixes.

Scenarios

  1. ...

Acceptance Criteria

  • Investigate what is causing spike in master build times
  • As much as possible, decrease total time jobs need to wait before e2e tests can begin

Out of Scope

  1. The scope of this epic is solely to improve the CI experience. This should focus on quick wins. The work of decreasing installer build times will go into another epic

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Create a new repo for the providers, build them into a container image, and import that image into the installer container image.

Hopefully this will save resources and decrease build times for CI jobs in Installer PRs.

Epic Goal

  • Gather tech debt stories for OCP 4.15

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature goal (what are we trying to solve here?)

The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into OpenShift release artifacts. We need to support this new platform in assisted-installer in order to provide a user-friendly way to enable such clusters, and to enable new-to-OpenShift cloud providers to quickly establish an installation process that is robust and will guide them toward success.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Feature goal (what are we trying to solve here?)

Allow using late binding via BMH, to support late binding via ZTP, part of RFE-4769

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

In order to allow declarative association of a host with a cluster in the late binding flow, annotation support will be added to the BMH.

The annotation is
bmac.agent-install.openshift.io/cluster-reference.
The annotation value is a JSON-encoded string containing a dictionary with keys [name, namespace] that correspond to the name and namespace of the cluster deployment.
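
A minimal sketch of a BareMetalHost carrying that annotation; the host name, namespace and cluster deployment reference are placeholders:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                       # hypothetical host
  namespace: example-infra
  annotations:
    # JSON-encoded reference to the target cluster deployment
    bmac.agent-install.openshift.io/cluster-reference: '{"name": "my-cluster", "namespace": "my-cluster"}'
spec:
  online: true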

Manage the effort for adding jobs for release-ocm-2.10 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - day before (make sure images were created)
  2. Add Assisted Installer fast forwards for ocm-2.x release <depends on #1> - need approval from test-platform team at https://coreos.slack.com/archives/CBN38N3MW 
  3. Branch-out assisted-installer components for ACM 2.(x-1) - <depends on #1, #2> - At the day of the FF
  4. Prevent merging into release-ocm-2.x - <depends on #3> - At the day of the FF
  5. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - At the day of the FF
  6. ClusterServiceVersion for release 2.(x-1) branch references "latest" tag <depends on #5> - After  #5
  7. Update external components to AI 2.x <depends on #3> - After a week, if there are no issues update external branches
  8. Remove unused jobs - after 2 weeks

 

1. Proposed title of this feature request

A) Getting VPA metrics in OpenShift's Prometheus/kube-state-metrics

2. Who is the customer behind the request?

Account: name: xxx

TAM customer: no

CSM customer: no

Strategic: no

3. What is the nature and description of the request?

A) Feature request so that the data in the VPA is made available in Prometheus via kube-state-metrics

4. Why does the customer need this? (List the business requirements here)

A) To view the data related to VPA

7. Is there already an existing RFE upstream or in Red Hat Bugzilla?

A) No

8. Does the customer have any specific time-line dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?

A) No

9. Is the sales team involved in this request and do they have any additional input?

A) No

10. List any affected packages or components.

11. Would the customer be able to assist in testing this functionality if implemented?

A) Yes

Epic Goal

  • Scrape Profiles was introduced as Tech Preview in 4.13; the goal is now to promote it to GA
  • The Scrape Profiles Enhancement Proposal should be merged
  • OpenShift developers that want to adopt the feature should have the necessary tooling and documentation on how to do so
  • OpenShift CI should validate, where possible, changes in profiles that might break a profile or cluster functionality

This has no link to a planning session, as this predates our Epic workflow definition.

Why is this important?

  • Enables users to minimize the resource overhead for Monitoring.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-2483

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Our documentation suggests creating an alert after configuring scrape sample limits.

That PrometheusRule object has two alerts configured within it [1]

`ApproachingEnforcedSamplesLimit` 

`TargetDown` 

The `TargetDown` alert is designed to fire after the `ApproachingEnforcedSamplesLimit` one, because the target is dropped once the enforced sample limit is reached.

The TargetDown alert is creating false positives - it's firing for reasons other than pods in the namespace having reached their enforced sample limit (e.g. the metrics endpoint may be down).

User-defined monitoring should provide out-of-the-box metrics that will help with troubleshooting:

  • Update Prometheus user-workload to enable additional scrape metrics [2]
  • Rewrite the ApproachingEnforcedSamplesLimit alert expression in the OCP documentation to something like "(scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9" (which reads as "alert when the number of ingested samples reaches 90% of the configured limit").
  • Document how a user would know that a target has hit the limit (e.g. the Targets page should have the information).

[1] - https://docs.openshift.com/container-platform/4.12/monitoring/configuring-the-monitoring-stack.html#creating-scrape-sample-alerts_configuring-the-monitoring-stack 

[2] - https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
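
As a hedged illustration only: the rewritten expression in the list above is plain PromQL, so a small Go check like the following (using the upstream Prometheus parser package) could be used to confirm it parses before it is copied into the documentation. The expression string is taken verbatim from the list above; everything else is illustrative.

package main

import (
	"fmt"

	"github.com/prometheus/prometheus/promql/parser"
)

func main() {
	// Rewritten alert expression proposed above.
	expr := `(scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9`
	if _, err := parser.ParseExpr(expr); err != nil {
		panic(err)
	}
	fmt.Println("expression parses as valid PromQL")
}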

Epic Goal

  • Make sure that some resources (statefulsets, especially those whose Pods take time to restart, for example Prometheus) are not unintentionally recreated to avoid downtime.

Why is this important?

  • Avoid downtime.
  • Avoid cases where Kubernetes cannot correctly handle the recreation of the resource (e.g. all of the resource's pods getting stuck).
  • prometheus-operator uses foreground deletion when an immutable field of a statefulset is modified, see https://issues.redhat.com/browse/OCPBUGS-17346 where matchLabels was modified.

Scenarios

Acceptance Criteria

  • We have a test (origin?) that makes sure of this during upgrades.
  • It should be easy to temporarily disable the test in case we cannot avoid a recreation during an upgrade.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. I thought about adding an origin test during upgrades, we had a discussion with Simon on https://issues.redhat.com/browse/OCPBUGS-17346?focusedId=22743037&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-22743037 (about the maintainability of such tests)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Make sure the statefulsets are not recreated after upgrade.

 

Recreating a statefulset should be an exception.

Epic Goal

  • Replace the "oauth-proxy" currently in use by monitoring components (Prometheus, Alert Manager, Thanos) with "kube-rbac-proxy" to streamline authentication and authorization.

Why is this important?

  • kube-rbac-proxy offers a unified and fine-grained configuration (different authorization for different paths) for performing authentication and authorization on behalf of Kubernetes workloads, ensuring tight security measures around service endpoints.
  • mTLS Implementation: Unlike oauth-proxy, kube-rbac-proxy is capable of implementing mutual TLS (mTLS), providing enhanced security through both client- and server-side validation.
  • Potential improvements in performance and resource consumption by skipping the authentication request (TokenReview) or authorization request (SubjectAccessReview) against Kubernetes (see the sketch after this list).
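
Here is a hedged sketch of the delegated authorization check, not the actual CMO configuration: the SubjectAccessReview that a kube-rbac-proxy style setup would issue to preserve today's behaviour of allowing any user who can "get" the namespace. The namespace and user names are illustrative assumptions.

package main

import (
	authorizationv1 "k8s.io/api/authorization/v1"
)

// namespaceGetCheck builds the access check delegated to the Kubernetes API:
// the request is allowed only if the user can "get" the given namespace.
func namespaceGetCheck(user string) *authorizationv1.SubjectAccessReview {
	return &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: user,
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:      "get",
				Resource:  "namespaces",
				Name:      "openshift-monitoring",
				Namespace: "openshift-monitoring",
			},
		},
	}
}

func main() {
	_ = namespaceGetCheck("system:serviceaccount:openshift-monitoring:prometheus-k8s")
}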

Scenarios

  1. Prometheus endpoints are secured using kube-rbac-proxy without any loss of data or functionality.
  2. Alert Manager endpoints are secured using kube-rbac-proxy without any loss of data or functionality.
  3. Thanos endpoints are secured using kube-rbac-proxy without any loss of data or functionality.

 

Acceptance Criteria

  • All monitoring components interact successfully with kube-rbac-proxy.
  • CI - MUST be running successfully with tests automated.
  • No regressions in monitoring functionality post-migration.
  • Documentation is updated to reflect the changes in authentication and authorization mechanisms.

Dependencies (internal and external)

 

Previous Work (Optional):

https://github.com/rhobs/handbook/pull/59/files

https://github.com/openshift/cluster-monitoring-operator/pull/1631

https://github.com/openshift/origin/pull/27031

https://github.com/openshift/cluster-monitoring-operator/pull/1580

https://github.com/openshift/cluster-monitoring-operator/pull/1552

 

Related Tickets:

Require read-only access to Alertmanager in developer view. 

https://issues.redhat.com/browse/RFE-4125

Common user should not see alerts in UWM. 

https://issues.redhat.com/browse/OCPBUGS-17850

Related ServiceAccounts.

https://docs.google.com/spreadsheets/d/1CIgF9dN4ynu-E7FLq0uBcoZPqOxwas20/edit?usp=sharing&ouid=108370831581113620824&rtpof=true&sd=true

Interconnection diagram in monitoring stack.

https://docs.google.com/drawings/d/16TOFOZZLuawXMQkWl3T9uV2cDT6btqcaAwtp51dtS9A/edit?usp=sharing

 

Open questions:

None.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In CMO, ThanosRuler pods have an Oauth-proxy on port 9091 for web access on all paths.

We are going to replace it with kube-rbac-proxy and constrain access to the /api/v1 paths.

The current behavior is to allow access to the ThanosRuler web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.

 

In CMO, Alertmanager pods in the openshift-monitoring namespace have an Oauth-proxy on port 9095 for web access on all paths.

We are going to replace it with kube-rbac-proxy and constrain access to the /api/v2 paths.

The current behavior is to allow access to the Alertmanager web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.

There is a request to allow read-only access to alerts in Developer view. kube-rbac-proxy can facilitate this functionality.

In CMO, Prometheus pods have an Oauth-proxy on port 9091 for web access on all paths.

We are going to replace it with kube-rbac-proxy and constrain access to the /api/v1 paths.

The current behavior is to allow access to the Prometheus web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression happens. We may need to use a stub custom resource to authorize both "post" and "get" HTTP requests from certain users.

 

The Insights operator is using this port; figure out how to keep its access after replacing the oauth-proxy: https://github.com/openshift/insights-operator/blob/master/pkg/controller/const.go

Its service accounts "gather" and "operator" should use the Prometheus endpoint. https://redhat-internal.slack.com/archives/CLABA9CHY/p1701345127689009

 

 

 

$ oc get clusterrole,role -n openshift-monitoring -o name | egrep -i -e 'monitoring-(alertmanager|rules)-(edit|view)' -e cluster-monitoring-view
clusterrole.rbac.authorization.k8s.io/cluster-monitoring-view
clusterrole.rbac.authorization.k8s.io/monitoring-rules-edit
clusterrole.rbac.authorization.k8s.io/monitoring-rules-view
clusterrole.rbac.authorization.k8s.io/openshift-cluster-monitoring-view
role.rbac.authorization.k8s.io/monitoring-alertmanager-edit

 
We have cluster roles for viewing metrics and editing alerts/silences, but no local role for viewing all.

I have a customer requesting read-only access to alerts in the Developer console.

In CMO we have static YAML assets that are added to our CMO container image and applied by CMO. We read them once from the file system and after that they are cached in memory.

Currently we use a loose unmarshal, i.e. fields that are superfluous (not part of the type unmarshaled into) are silently dropped.

We should use strict unmarshaling in order to catch config mistakes like https://issues.redhat.com//browse/OCPBUGS-24630.
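
A minimal sketch of the idea, assuming the assets are parsed with sigs.k8s.io/yaml (the asset type and data below are placeholders, not the real CMO types): strict unmarshaling makes an unknown field fail loudly instead of being silently dropped.

package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// asset is a placeholder for a statically shipped manifest type.
type asset struct {
	Name     string `json:"name"`
	Replicas int    `json:"replicas"`
}

func main() {
	// Note the typo'd key "replcas": loose unmarshaling would drop it silently.
	data := []byte("name: example\nreplicas: 2\nreplcas: 3\n")

	var a asset
	if err := yaml.UnmarshalStrict(data, &a); err != nil {
		fmt.Println("config mistake caught:", err)
		return
	}
	fmt.Printf("loaded asset: %+v\n", a)
}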

 

Additionally we could consider adding the static assets to our CMO binary via Go's embed.

Discussed, probably not worth it.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Remove dependency on github.com/pkg/errors

Why is this important?

NOTE:

  • github.com/pkg/errors.Wrap knows how to handle err=nil; with the standard library you need to handle that case separately (see the sketch after this list).
  • https://github.com/xdg-go/go-rewrap-errors may help with the migration, but it isn't complete (e.g. it doesn't handle the err=nil cases).
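
A minimal sketch of the migration pattern (the loadConfig helper is hypothetical and only used for illustration): errors.Wrap returns nil when err is nil, so the standard-library replacement has to guard that case explicitly.

package main

import (
	"errors"
	"fmt"
)

// loadConfig is a hypothetical callee that may fail.
func loadConfig() error {
	return errors.New("file not found")
}

func reconcile() error {
	// Before: return pkgerrors.Wrap(loadConfig(), "loading config")
	// After: wrap with %w, but only when there actually is an error.
	if err := loadConfig(); err != nil {
		return fmt.Errorf("loading config: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(reconcile())
}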

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Description of the problem:

On s390x with DHCP, users should not set a static IP for IPL nodes.
This would lead to duplicate IP kernel arguments from the coreos-installer.

E.g.: --installer-args '[\"--append-karg\",\"ip=encbdd0:dhcp\",\"--append-karg\",\"rd.neednet=1\",\"--append-karg\",\"ip=10.14.6.3::10.14.6.1:255.255.255.0:master-0.boea3e06.lnxero1.boe:encbdd0:none\",\"--append-karg\",\"nameserver=10.14.6.1\",\"--append-karg\",\"ip=[fd00::3]::[fd00::1]:64::encbdd0:none\",\"--append-karg\",\"nameserver=[fd00::1]\"

How reproducible:

Create a cluster with DHCP and IPL node(s) using a parm line containing static IP settings (IPv4 and/or IPv6).

Steps to reproduce:

1.

2.

3.

Actual results:

The karg setting for ip on the same device appears twice (DHCP and static IP).

Expected results:

If DHCP is used, then no static IP should be set in the parm line.

In the case of an LPAR installation it's possible that the command systemd-detect-virt --vm returns an exit code < 0, even though, when logged into the booted VM and run from the command line, it returns "none" without an error code.

This leads to an abort of further processing in system_vendor.go, and Manufacturer will not be set to IBM/S390.

To fix this glitch, the code that determines the Manufacturer should be moved before the call to the systemd-detect-virt --vm command.
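
A minimal, hedged sketch of the reordering (readManufacturer stands in for the real system_vendor.go logic and is purely illustrative): the manufacturer is read first, so a failing systemd-detect-virt call can no longer prevent it from being set.

package main

import (
	"os/exec"
	"strings"
)

// detect reads the manufacturer before consulting systemd-detect-virt, so a
// non-zero exit from the latter cannot discard the manufacturer value.
func detect(readManufacturer func() string) (manufacturer, virt string) {
	manufacturer = readManufacturer()

	out, err := exec.Command("systemd-detect-virt", "--vm").Output()
	if err != nil {
		// systemd-detect-virt exits non-zero when no virtualization is found.
		return manufacturer, "none"
	}
	return manufacturer, strings.TrimSpace(string(out))
}

func main() {
	_, _ = detect(func() string { return "IBM/S390" })
}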

Epic Goal

  • Improve IPI on Power VS in the 4.15 cycle
    • Changes to the installer to handle edge cases, fix bugs, and improve usability.
    • Switch to only support PER enabled zones.

Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

When the console dashboards plugin is present, the metrics tab does not respect a custom datasource.

Versions

console dashboards plugin: 0.1.0

Openshift: 4.16

Steps to reproduce

  1. A graph showing metrics from a custom datasource
  2. Click on "inspect" in a panel
  3. It shows data from a different instance from an entirely different datasource
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Based on the results of the spike as detailed in this document: https://docs.google.com/document/d/1EPEYd94NYS_LbFRFT4d1Mj9-ebibNYJYu_kHzwIwxaA/edit?usp=sharing we need to implement one of the suggested solutions for dashboards.


This story will focus on fixing timeout issues for label selectors and line charts. This will be done by breaking the long queries down into queries that can be answered within the timeout period.

Suggested steps are:

  • Perform query chunking on the panels based on time ranges. This is needed for label and line charts (see the sketch after this list).
    • Rather than wait for a query to time out, we should preemptively split the query into manageable chunks (1 day)
  • For the line charts, rather than replacing all of the data on a single fetch, the data should replace its specific time period, or all of the fetches should be aggregated before replacing everything
    • For segment replacement, this requires retrieving all data for a panel, removing the new data's time period, adding in the new data, and then bounding the data to the full time range to prevent data sprawl if the webpage is left open for a long time.
    • Full replacement would work the same way as the system currently works, but requires waiting for the results of all the requests before combining them and replacing the current data. This would overall be a simpler approach, but could leave potentially blank areas in a chart if one or more of the queries time out where there was already data from a previous API call
  • For the label selectors, the labels from each of the fetches should be combined before replacing the existing data.
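
Although the console work itself is frontend code, the chunking idea is language-agnostic; here is a minimal sketch in Go under the assumption of fixed one-day chunks. The function and parameter names are illustrative, not part of the console codebase.

package main

import (
	"fmt"
	"time"
)

// chunkRange splits a long query window into fixed-size chunks (e.g. one day)
// that can each be fetched within the timeout and stitched back together.
func chunkRange(start, end time.Time, step time.Duration) [][2]time.Time {
	var chunks [][2]time.Time
	for cur := start; cur.Before(end); {
		next := cur.Add(step)
		if next.After(end) {
			next = end
		}
		chunks = append(chunks, [2]time.Time{cur, next})
		cur = next
	}
	return chunks
}

func main() {
	end := time.Now()
	start := end.Add(-7 * 24 * time.Hour)
	for _, c := range chunkRange(start, end, 24*time.Hour) {
		fmt.Println(c[0].Format(time.RFC3339), "to", c[1].Format(time.RFC3339))
	}
}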

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Any ERRORs produced by TuneD will result in Degraded Tuned Profiles. Clean up upstream and NTO/PPC-shipped TuneD profiles and add ways of limiting the ERROR message count.
  • Review the policy of restarting TuneD on errors every resync period.  See: OCPBUGS-11150

Why is this important?

  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/PSAP-908

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Remark that:

  • The metric should only report the EC2 instance type for now. Proposed name: hypershift_nodepools_info (sketched below).
  • In particular, we should not try to resolve the total number of cores for a given nodepool.
    • Knowing how many cores there are in a given EC2 instance is not something you want to hard-code in HyperShift, or even have HyperShift deal with.
    • Resolving the total number of cores for a given EC2 instance type could be done through a recording rule built on top of this new metric.
  • Consider eventually reusing the hypershift_nodepools_available_replicas metric, unless you think adding a label giving the EC2 instance type would blur its semantics.

Slack thread:
https://redhat-internal.slack.com/archives/CCX9DB894/p1696515395395939

Acceptance criteria

  • A new metric is giving the EC2 instance type for a given nodepool
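
A minimal sketch of the proposed metric as a constant "info"-style gauge, assuming the common pattern of a value of 1 with metadata carried in labels; the label names are assumptions for illustration, not the final API.

package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// nodePoolInfo carries the EC2 instance type as a label, leaving any
// core-count resolution to recording rules built on top of it.
var nodePoolInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hypershift_nodepools_info",
		Help: "Metadata about a NodePool; the value is always 1.",
	},
	[]string{"name", "namespace", "ec2_instance_type"},
)

func main() {
	prometheus.MustRegister(nodePoolInfo)
	nodePoolInfo.WithLabelValues("example-pool", "clusters", "m5.xlarge").Set(1)
}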

We need to bump the version of Kubernetes and run a library sync for OCP 4.13. Two stories will be created, one for each activity.

Owner: Architect:

Story (Required)

As a Samples Operator Developer, I would like to run the library sync process, so the new libraries can be pushed to OCP 4.16

Background (Required)

This is a runbook we need to execute on every release of OpenShift

Glossary

NA

Out of scope

NA

In Scope

NA

Approach(Required)

Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities

Dependencies

Library Repo

Edge Case

Acceptance Criteria

 Library sync PR is merged in master

INVEST Checklist

 Dependencies identified
 Blockers noted and expected delivery timelines set
 Design is implementable
 Acceptance criteria agreed upon
 Story estimated

Legend

 Unknown
 Verified
 Unsatisfied

 

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Description
This epic tracks install server errors which require investigation into whether the RP or ARO-Installer can be more resilient or return a User Error instead of a Server Error.

This lives under shift improvement as it will reduce the number of incidents we get from customers due to `Internal Server Error` being returned.

How to Create Stories Under Epic
During the weekly SLO/SLA/SLI meeting, we will examine install failures on the cluster installation failure dashboard. We will aggregate the top occurring items which show up as Server Errors, and each story underneath will be an investigation required to figure out root cause and how we can either prevent it, be resilient to failures, or return a User Error.

Cluster installation Failures dashboard

AS AN ARO SRE
I WANT either:
SO THAT

1. Decorate the error in the ARO installer to return a more informative message as to why the SKU was not found.
2. Ensure that the OpenShift installer and the RP are validating the SKU in the same manner
3. If the validation is the same between the Installer and ARO installer, we have the option to remove the ARO installer validation step

Acceptance Criteria

Given: The RP and the ARO Installer validates the SKU in the same manner
When: The RP validates
Then: The ARO Installer does not

Given: The ARO Installer performs additional or improved validation compared to the RP
When: The ARO Installer validation fails due to missing SKU (failed validation)
Then: Enhance the log to include the SKU that was not found, providing us with more information to troubleshoot

Breadcrumbs

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To have PAO working in a HyperShift environment with feature parity with SNO (as much as possible)
  • Not to be enabled prior to 4.16 (at minimum)
  • Add sanity test(s) in HyperShift CI
  • Note: eventually HyperShift tests will run in HyperShift CI lanes

Why is this important?

  • HyperShift is a very interesting platform for clients, and PAO has a key role in node tuning, so making it work on HyperShift is a good way to ease the migration to this new platform, as clients will not lose their tuning capabilities.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No regressions are introduced

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When the PAO controller tries to create an event, we get the following error:

E0402 09:41:10.578920       1 event.go:280] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"perfprofile-hostedcluster01.17c0ad00e8c2abc7", GenerateName:"", Namespace:"clusters-hostedcluster01", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"ConfigMap", Namespace:"clusters-hostedcluster01", Name:"perfprofile-hostedcluster01", UID:"e8d4883f-9944-4827-acba-21f757444a21", APIVersion:"v1", ResourceVersion:"21684945", FieldPath:""}, Reason:"Creation succeeded", Message:"[hypershift:perfprofile-hostedcluster01] Succeeded to create all components", Source:v1.EventSource{Component:"performance-profile-controller", Host:""}, FirstTimestamp:time.Date(2024, time.March, 27, 16, 47, 57, 817465799, time.Local), LastTimestamp:time.Date(2024, time.April, 2, 9, 41, 10, 577809390, time.Local), Count:14, Type:"Normal", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "perfprofile-hostedcluster01.17c0ad00e8c2abc7" is forbidden: User "system:serviceaccount:clusters-hostedcluster01:cluster-node-tuning-operator" cannot patch resource "events" in API group "" in the namespace "clusters-hostedcluster01"' (will not retry!)

This means we should add the "events" resource to the `Role` of the NTO operator.
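
A hedged sketch of the kind of rule that would need to be added to the operator's Role (the verbs follow the "patch" failure in the error above plus "create"; the exact rule shape in the shipped manifests may differ):

package main

import (
	rbacv1 "k8s.io/api/rbac/v1"
)

// eventsRule grants the controller's service account permission to create and
// patch Event objects in its namespace.
func eventsRule() rbacv1.PolicyRule {
	return rbacv1.PolicyRule{
		APIGroups: []string{"", "events.k8s.io"},
		Resources: []string{"events"},
		Verbs:     []string{"create", "patch"},
	}
}

func main() {
	_ = eventsRule()
}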

PerformanceProfile objects are handled in a different way on HyperShift, so modifications in the Performance Profile controller are needed to handle this.

Basically, the Performance Profile controller has to reconcile ConfigMaps which have PerformanceProfile objects embedded in them, create the different manifests as usual, and then hand them over to the hosted cluster using different methods.

More info in the enhancement proposal

Target: To have a feature equivalence in Hypershift and Standalone deployments

 

[UPDATE] This story is about setting up the scaffolding for the actual HyperShift implementation.

Some changes are needed in the NodePool controller to enable running the Performance Profile controller as part of the Node Tuning Operator in HyperShift hosted control planes to manage node tuning of hosted nodes.

PerformanceProfile objects will be created in the management cluster, embedded into a ConfigMap, and referenced in a field of the NodePool API; the NodePool controller will then handle these objects and create a ConfigMap in the hosted cluster namespace for the Performance Profile controller to read.

More information in the enhancement proposal
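
As a rough, hedged sketch of the flow described above (key names and namespaces are assumptions for illustration, not the enhancement's final API), the controller side amounts to wrapping an already-serialized PerformanceProfile manifest in a ConfigMap:

package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// embedPerformanceProfile wraps a serialized PerformanceProfile manifest in a
// ConfigMap that a NodePool can reference and that can be handed to the
// hosted cluster namespace for the Performance Profile controller to read.
func embedPerformanceProfile(name, namespace, profileYAML string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
		},
		Data: map[string]string{
			"tuning": profileYAML,
		},
	}
}

func main() {
	_ = embedPerformanceProfile("perfprofile-hostedcluster01", "clusters",
		"apiVersion: performance.openshift.io/v2\nkind: PerformanceProfile\n")
}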

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

A separate PR is requested from the NTO team to document the rationale for setting intel_pstate by default. The idea of this story is to set the pstate mode depending on the underlying hardware generation. Per the recommendation from I///, newer hardware from the Ice Lake+ generation comes with HWP enabled, and the recommendation is therefore to keep intel_pstate set to active.

This PR should also modify all the e2e tests where rendered output is verified, because setting pstate to active requires modifications to those files.

This epic is to track any stories for 4.16 hypershift kubevirt development that do not fit cleanly within a larger effort.

Here are some examples of tasks that this "catch all" epic can capture

  • dependency update maintenance tasks
  • ci and testing changes/fixes
  • investigation spikes

This is a followup to CNV-29003, in which we enabled the NodePoolUpgrade test, but just for the Replace strategy, where a node pool release image update triggers a machine rollout resulting in the creation of new nodes with the updated RHCOS/kubelet version.

With the InPlace strategy, the machines are not recreated but are upgraded in place with a soft reboot. We've noticed it doesn't work for KubeVirt as expected: the updated node gets stuck at SchedulingDisabled and nodepool.status.version is not updated.

This task is to find the root cause, fix it, and make the InPlace nodepool test pass on the presubmit CI.

This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.

Here are some examples of tasks that this "catch all" epic can capture

  • dependency update maintenance tasks
  • ci and testing changes/fixes
  • investigation spikes

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

AC:

  • Create an integration test for the ConsoleYAMLSample CRD, which will create a YAML sample for a specific resource and test whether the sample is present for that resource.
  • Test the functionality of the "Try it" and "Download YAML" links.
  • Also test the display and list page.
  • After the tests are done, delete the created CR.

A hasty bug fix resulted in another bug. We should add integration tests that utilize cy.intercept in order to prevent such bugs from happening in the future.

AC:

Console frontend uses Yarn 1.22.15 from Sep 2021. We should update it to the latest v1 release.

There is currently no automated way to check for consistency of PatternFly package version resolutions. We should add a script (executed as part of the build) to ensure that each PatternFly package has exactly one version resolution.

Convert legacy ListPage to dynamic-plugin-sdk ListPage- components in Console VolumeSnapshots Storage

The legacy ListPage components are located in /frontend/packages/console-app/src/components/

  • volume-snapshot/volume-snapshot.tsx
  • volume-snapshot-content.tsx
  • volume-snapshot-class.tsx

Justification: A recent replacement of the legacy ListPage with dynamic-plugin-sdk ListPage- components in the VolumeSnapshotPVC tab component led to duplication of the RowFilter logic in the snapshotStatusFilters function, due to an incompatible type in RowFilter. Also, converting to dynamic-plugin-sdk ListPage- components would make the code more readable and simplify debugging of VolumeSnapshot components.

A.C.
  Find and replace legacy ListPage volume-snapshot pages with dynamic-plugin-sdk ListPage- components

 

 

As part of the spike to determine outdated plugins, the file-loader dev dependency is out of date and needs to be updated.

Acceptance criteria:

  • update file-loader dependency
  • check for breaking changes, fix build issues
  • get a clean run of pre merge tests 

After upgrading Cypress to v13, switching from the Chrome to the Electron browser, and publishing the deprecation of the Chrome browser, we should remove Chrome from the builder image:

https://github.com/openshift/console/blob/master/Dockerfile.builder#L59

Find instances of chrome: https://github.com/search?q=org%3Aopenshift+--browser+%24%7BBRIDGE_E2E_BROWSER_NAME%3A%3Dchrome%7D&type=code

AC:

  • Remove Chrome from the console builder image
  • Coordinate with other packages that Chrome will be deprecated

Recent regressions in the EventStream component have exposed some error-prone patterns. We should refactor the component to remove these patterns. This will harden the component and make it easier to maintain.

 

AC:

  • Refactor the EventStream component to address React anti-patterns

Once PR #12983 gets merged, Console application itself will use PatternFly v5 while also providing following PatternFly v4 packages as shared modules to existing dynamic plugins:

  • @patternfly/react-core
  • @patternfly/react-table
  • @patternfly/quickstarts

The above-mentioned PR will allow dynamic plugins to bring in their own PatternFly code if the relevant Console-provided shared modules (v4) are not sufficient.

Let's say we have a dynamic plugin that uses PatternFly v5 - since Console only provides v4 implementations of above shared modules, the plugin would bring in the whole v5 package(s) listed above.

There are two main issues here:

1. CSS consistency

  • Console application should be responsible for loading all PatternFly CSS, including older versions (like v4) for use with existing plugins.
  • Plugins should not bring in their own PatternFly CSS in order to avoid styling related bugs.

2. Optimal code sharing

  • Treating entire PatternFly packages as shared modules yields very big JS chunks, which means more data to load over the network and evaluate in the browser.
  • Instead of per-package, PatternFly v5 code should be federated per-component.

This story should address both issues mentioned above.

 


 

Acceptance criteria:

  • webpack generated Console plugin does not bundle PatternFly CSS
  • webpack generated Console plugin treats individual PatternFly v5 components as shared modules

This epic contains all the Dynamic Plugins related stories for OCP release-4.16 and implementing Core SDK utils.

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

#13679 - how to setup devel env. with Console and plugin servers running locally

#13521 - improve docs on shared modules (CONSOLE-3328)

#13586 - ensure that Console vs. SDK compat table is up-to-date

#13637 - how to migrate from PatternFly 4 to 5 (CONSOLE-3908)

#13637 - using correct MIME types when serving plugin assets - use "text/javascript" for all JS assets to ensure that "X-Content-Type-Options: nosniff" security header doesn't cause JS scripts to be blocked in the browser

#13637 - disable caching of plugin manifest JSON resource

Epic Goal

  • Add the ability to extend the Events Page. For each Event, the Lightspeed team would like to add an "Explain" Button. Please refer to screenshots

Why is this important?

  • This will enable Lightspeed to provide in-context information to users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The Lightspeed team wants to integrate with the console via events; Ben Parees is leading the effort. For that we need to update the Events page so that the Lightspeed dynamic plugin can create a button and pass a callback to it, which would explain the given event.

AC: 

  • Update the Events page to consume the existing extension which would be console.action/resource-provider extension

 

The observability team is doing a similar integration with Alerts:

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • The goal of this epic is to capture the stories that will cover the implementation of PatternFly's component-groups components, which were moved into PF from the console-shared package

Why is this important?

  • Console needs to re-implement its moved components
  • We need to offload console repository of any components that might be reusable in other projects

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Done when:

We have an enhancement drafted and socialized that

  • describes the patterns of rotation mechanisms that the MCO owns, which other components can utilize
  • lists the certs that the MCO component is responsible for

Should be reviewed by/contain provisions for

  • WMCO
  • SNO
  • HyperShift

Description of problem

Spin off of OCPBUGS-30192

The daemon process can exit due to health check failures in 4.16+, after we added apiserver server CA rotation handling. This came with the side effect that if the MCD happens to exit in the middle of an update (e.g. the image pull portion), the files/units would have been updated but the OS upgrade would not, blocking the upgrade indefinitely when the new container comes up.

Version-Release number of selected component

4.16

How reproducible

Only in BM CI so far, unsure if other issues contribute to this.

Steps to Reproduce

Get lucky and have api-int DNS break while the machine-config daemon is deploying updated files to disk. Unclear how to reliably trigger this, or distinguish from OCPBUGS-30192 and other failure modes.

Actual results

Expected results

Additional info

Description of problem:

Older clusters updating into or running 4.15.0-rc.0 (and possibly Engineering Candidates?) can have the Kube API server operator initiate certificate rollouts, including the api-int CA. Missing pieces in the pipeline to roll out the new CA to kubelets and other consumers lead the cluster to lock up when the Kubernetes API servers transition to using the new cert/CA pair when serving incoming requests. For example, nodes may go NotReady with kubelets unable to call in their status to an api-int signed by the new CA that they don't yet trust.

Version-Release number of selected component (if applicable):

Seen in two updates from 4.14.6 to 4.15.0-rc0. Unclear if Engineering Candidates were also exposed. 4.15.0-rc.1 and later will not be exposed because they have the fix for OCPBUGS-18761. They may still have broken logic for these CA rotations in place, but until the certs are 8y or more old, they will not trigger that broken logic.

How reproducible:

We're working on it. Maybe cluster-kube-apiserver-operator#1615.

Actual results:

Nodes go NotReady with kubelet failing to communicate with api-int because of tls: failed to verify certificate: x509: certificate signed by unknown authority.

Expected results:

Happy certificate rollout.

Additional info:

Rolling the api-int CA is complicated, and we seem to be missing a number of steps. It's probably worth working out details in a GDoc or something where we have a shared space to fill out the picture.

One piece is getting the api-int certificates out to the kubelet, where the flow seems to be:

  1. Kube API-server operator updates a Secret, like loadbalancer-serving-signer in openshift-kube-apiserver-operator (code).
  2. Kube API-server aggregates a number of certificates into the kube-apiserver-server-ca ConfigMap in the openshift-config-managed namespace (code).
  3. FIXME, possibly something in the Kube controller manager's ServiceAccount stack (and the serviceaccount-ca ConfigMap in openshift-kube-controller-manager) is handling getting the data from kube-apiserver-server-ca into node-bootstrapper-token?
  4. Machine-config operator consumes FIXME and writes a node-bootstrapper-token ServiceAccount Secret.
  5. Machine-config servers mount the node-bootstrapper-token Secret to /etc/mcs/bootstrap.
  6. Machine-config servers consume ca.crt from /etc/mcs/bootstrap-token and build a kubeconfig to serve in Ignition configs here as /etc/kubernetes/kubeconfig (code)
  7. Bootimage Ignition lays down the MCS-served content into the local /etc/kubernetes/kubeconfig, but only when the node is first born.
  8. FIXME propagates /etc/kubernetes/kubeconfig to /var/lib/kubelet/kubeconfig (FIXME:code Possibly the kubelet via --bootstrap-kubeconfig).
  9. The kubelet consumes /var/lib/kubelet/kubeconfig and uses its CA trust store when connecting to api-int (code).

That handles new-node creation, but not "the Kube API-server operator rolled the CA, and now we need to update existing nodes and restart their kubelets. And any pods using ServiceAccount kubeconfigs? And...?". This bug is about filling in those missing pieces in the cert-rolling pipeline (including having the Kube API server not use the new CA until it has been sufficiently rolled out to api-int clients, possibly including every ServiceAccount-consuming pod on the cluster?), and anything else that seems broken with the early cert rolls.

Somewhat relevant here is OCPBUGS-15367 currently managing /etc/kubernetes/kubeconfig permissions in the machine-config daemon to backstop for the file existing in the MCS-served Ignition config but not being a part of the rendered MachineConfig or the ControllerConfig stack.

After Layering and Hypershift GAs in 4.12, we need to remove the code and builds that are no longer associated with mainline OpenShift.

 

This describes non-customer-facing work.

Links to the effort:

The MCO and ART bits can be done ahead of time (except for the final config knob flip). Then we would merge the openshift/os, openshift/driver-toolkit PRs and do the ART config knob flip at roughly the same time.

Background

This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.

Goal

If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.

 

During my work on Project Labrador, I learned that there are advanced caching directives that one can add to their Containerfiles. These do things such as allowing the package manager cache to be kept out of the image build, but to remain after the build so that subsequent builds don't have to re-download the packages. Golang has a great incremental build story as well, provided that one leaves the caches intact.

To begin with, my Red Hat-issued ThinkPad P16v takes approximately 2 minutes and 42 seconds to perform an MCO image build (assuming the builder and base images are already prefetched).

A preliminary test shows that by using advanced caching directives, incremental builds can be reduced to as little as 45 seconds. Additionally, by moving the nmstate binary installation into a separate build stage and limiting which files are copied into that stage, we can achieve a cache hit under most conditions. This cache hit has the additional advantage that the stage does not require VPN access in order to reach the appropriate RPM repository.

 

Done When:

  • The Dockerfile in the MCO repository root has been modified to take advantage of advanced caching directives.
  • The nmstate installation has been moved into a separate build stage within the MCO's Dockerfile to increase the likelihood of cache hits.
  • These speed improvements will not likely be observed within OpenShift CI because the build contexts and caches are thrown away after builds since these builds run in containers themselves.

As an OpenShift developer, I want to know that my code is as secure as possible by running static analysis on each PR.

 

Periodically, scans are performed on all OpenShift repositories and the container images produced by those repositories. These scans usually result in numerous OCP bugs being opened into our queue (see linked bugs as an example), putting us in a more reactive state. Instead, we can perform these scans on each PR by following these instructions https://docs.ci.openshift.org/docs/how-tos/add-security-scanning/ to add this to our OpenShift CI configurations.

 

Done When:

  • A PR to the openshift/release repository is merged which enables this configuration in a non-gating capacity.
  • All MCO team members are onboarded into the Snyk scan dashboard.
  • A preliminary scan report is produced which highlights areas for improvement.

Epic Goal

  • Reduce the resource footprint and simplify the metal3 pod.
  • Reduce the maintenance burden on the upstream ironic team.

Why is this important?

  • Inspector is a separate service for purely historical reasons. It integrates pretty tightly with Ironic, but has to maintain its own database, which is synchronized with Ironic's via the API. This hurts performance and debuggability.
  • One fewer service will have a positive effect on the resource footprint of Metal3.

Scenarios

This is not a user-visible change.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No ironic-inspector service is running in the Metal3 pod. In-band inspection functionality is retained.

Dependencies (internal and external)

METAL-119

Previous Work (Optional):

METAL-119 provides the upstream ironic functionality

Having the cachito configuration in place, we can now start converting the ironic packages and their dependencies to install from source.

In CI and local builds, if REMOTE_SOURCES and REMOTE_SOURCES_DIR are not defined, they assume the value of ".", effectively enabling the COPY . . command in the Dockerfile.
To avoid potential issues we need to find an alternative and default REMOTE_SOURCES to something safer.

To avoid mistakes like using hashes from the wrong releases, we need a way to test them;
ideally this should also be automated in a CI job.

Since we're moving toward a source-based model for the OCP ironic-image using downstream packages, we're starting to see more and more discrepancies with the OKD version, which is based on CS9 and upstream packages, causing conflicts and issues due to missing or too-old dependencies.
For this reason we'd like to split the lists of installed packages between OCP and OKD, as was done for the ironic-agent-image.

Definition of done

In SaaS, allow users of assisted-installer UI or API, to install any published OCP version out of a supported list of x.y options.

Feature Origin

The feature probably originates from our own team. This feature will enhance the current workflow we're following to allow users to selectively install versions in the assisted-installer SaaS.

Until now we had to be contacted by individual users to allow a specific version (usually one that had been replaced by us with a newer version). In such cases, we would add the version to the relevant configuration file.

Feature usage

It's not possible to quantify the relevant numbers here, because users might be missing certain versions in assisted and simply give up on using it. In addition, it's not possible to know whether users intended to use a certain "old" version, or if it was just an arbitrary decision.

Osher De Paz can we know how many requests we had for "out-of-supported-list"?

Feature availability

It's essential to include this feature in the UI. Otherwise, users will get very confused about the feature parity between API and UI.

Osher De Paz there will always be features that exist in the API and not in the UI. We usually show in the UI features that are more common and we know that users will be interacting with them.

Why is this important?

  • We need a generic way of using a specific OCP version on the cloud and also for other platforms by the user

Scenarios

  1. We will need to add validation that the images for this version exist before installation.
  2.  

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Not sure how we will handle this requirement in a disconnected environment
  2.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: Test Plan
  • QE - Manual execution of the feature - done
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:

In the current user interface, when you select a multiarch OpenShift version, you must specify a CPU architecture from those available in the manifest. After making this selection and clicking "next," the user interface attempts to register a cluster with the chosen version and CPU architecture. However, the assisted-service encounters an error and fails to find the requested release image.

How reproducible:

Always.

Steps to reproduce:

1. Choose multiarch OpenShift version

2. Press "next"

Actual results:

Expected results:

Successfully registering a cluster.

Feature goal (what are we trying to solve here?)

  • When using ACM/MCE with infrastructure operator automatically import local cluster to enable adding nodes

DoD (Definition of Done)

When enabling infrastructure operator automatically import the cluster and enable users to add nodes to self cluster via Infrastructure operator

Does it need documentation support?

Yes, it's a new functionality that will need to be documented

Feature origin (who asked for this feature?)

Reasoning (why it’s important?)

  • Right now, in order to enable this flow, the user needs to install MCE, enable the infrastructure operator, and follow this guide in order to add nodes using the infrastructure operator; we would like to make this process easier for users
  • It will automatically provide an easy start with CIM

Competitor analysis reference

  • Do our competitors have this feature? N/A

Feature usage (do we have numbers/data?)

  • We are not collecting MCE data yet
  • We were asked several times by customer support how to run this flow

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API - UI will benefit from it by having the cluster prepared for the user
  • If it's for a specific customer we should consider using AMS - Some users would like to manage the cluster locally, otherwise why did they install MCE?
  • Does this feature exist in the UI of other installers? - No

Open questions

  1. How to handle static network - infrastructure operator is not aware of how the local network was defined
  2. How the api should look like? should the user specify a target namespace or should it be automatic?
  3. How to get the local kubeconfig? can we use the one that the pod is using?

Description of the problem:

Imported local cluster doesn't inherit proxy settings from AI.

How reproducible:

100%

Steps to reproduce:

1. Install 4.15 hub cluster with proxy enabled (IPv6)

2. Install 2.10 ACM

Actual results:

No proxy settings in `local-cluster` ACI. 

Expected results:

ACI `local-cluster` inherited proxy settings from AI pod.

Feature goal (what are we trying to solve here?)

This is an existing feature that was added but didn't go through any testing or documentation; it only works internally when the assisted installer uses it itself, and is completely broken when users try to use it.

See MGMT-12435, MGMT-15999, MGMT-16000 and MGMT-16002

Please describe what this feature is going to do.

This feature already exists, MGMT-16000 covers it and gives an example of what it looks like

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

  • Documentation
  • API support
  • QE

Does it need documentation support?

Yes

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
    • Users need to override installer manifests with surgical patches; we can't expect them to maintain a copy of an entire manifest that's ever changing
      • Actually, it looks like since 4.12 it's impossible to override installer manifests at all, which makes this feature even more urgent. Some things, like RFE-3981, cannot be changed post-installation; it has to be done during installation
  • How does this feature help the product?
    • Users get better control about particular parameters

Competitor analysis reference

  • Do our competitors have this feature?N/A

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Description of the problem:

This pattern: https://github.com/openshift/assisted-service/blob/efc20a5ea46368da70143455dc0300aebd79ce18/swagger.yaml#L6687

is too strict, it does not allow users to create patch manifests

See related MGMT-15999 and MGMT-16000
 
How reproducible:

100% 

Steps to reproduce:

1. Try to create a patch manifest:

curl http://localhost:8080/api/assisted-install/v2/clusters/${CLUSTER_ID}/manifests \
--json '
{
  "folder": "openshift",
  "file_name": "cluster-network-02-config.yml.patch_custom_ovn_v4internalsubnet",
  "content": "---\n- op: add\npath: /spec/defaultNetwork\nvalue:\novnKubernetesConfig:\nv4InternalSubnet: 100.65.0.0/16"
}
'

Actual results:

 Error

{"code":605,"message":"file_name in body should match '^[^/]*\\.(yaml|yml|json)$'"}

Expected results:

No error

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it's important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn't exist anywhere
  • Related data - the feature doesn't exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

There were several issues found in customer sites concerning connectivity checks:

  • On the none platform, there is connectivity to only a subset of the network interfaces.
  • On the none platform we need to sync the node-ip and the kubelet-ip.
  • We need to handle virtual interfaces, such as bridges, better. This calls for discovering whether an IP is external or internal in order to decide if the interface needs to be included in the connectivity check.

Currently, when computing L3 connectivity groups, there must be symmetrical connectivity between all nodes. This imposes the restriction that nodes cannot have networks that do not participate in cluster networking.

Instead, the L3 connectivity check will change to connected addresses. A connected address is an address that can be reached from all other hosts. So if a host has a connected address, then it is regarded as a host that belongs to the majority group (the validation).
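
A minimal sketch of the proposed check, under the assumption that connectivity results are available as a reachability map (the data layout and names are illustrative, not the assisted-service types):

package main

// hasConnectedAddress reports whether any of the host's addresses can be
// reached from every other host; such a host belongs to the majority group.
func hasConnectedAddress(host string, addrs []string, hosts []string, reachable map[string]map[string]bool) bool {
	for _, addr := range addrs {
		connected := true
		for _, other := range hosts {
			if other == host {
				continue
			}
			if !reachable[other][addr] {
				connected = false
				break
			}
		}
		if connected {
			return true
		}
	}
	return false
}

func main() {}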

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable it for OCP versions >= 4.15.

DoD (Definition of Done)

iscsi boot is enabled for ocp version >= 4.15 both in the UI and the backend. 

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` karg during install to enable iSCSI booting.
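
A minimal, hedged sketch of the idea (the helper below is illustrative, not the assisted-service implementation): when the host boots from iSCSI, the extra kernel argument is appended to the coreos-installer arguments.

package main

import "fmt"

// installerArgs appends rd.iscsi.firmware=1 to the coreos-installer kernel
// arguments when the host boots from iSCSI, so the installed system can bring
// the iSCSI boot volume up again.
func installerArgs(bootFromISCSI bool, args []string) []string {
	if bootFromISCSI {
		args = append(args, "--append-karg", "rd.iscsi.firmware=1")
	}
	return args
}

func main() {
	fmt.Println(installerArgs(true, nil))
}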

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle
    • NetApp
    • Cisco

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support and we want to allow customers to use it

Epic Goal

  • Set installation timeout to 24 hours (tokens expiration)
  • Warn user when installation part takes longer than reasonable amount of time
  • If installation is recoverable enable the user to fix the issue and complete the installation

Why is this important?

Scenarios

I created a SNO cluster through the SaaS. A minor issue prevented one of the ClusterOperators from reporting "Available = true", and after one hour, the install process hit a timeout and was marked as failed.

When I found the failed install, I was able to easily resolve the issue and get both the ClusterOperator and ClusterVersion to report "Available = true", but the SaaS no longer cared; as far as it was concerned, the installation failed and was permanently marked as such. It also would not give me the kubeadmin password, which is an important feature of the install experience and tricky to obtain otherwise.

A hard timeout can cause more harm than good, especially when applied to a system (openshift) that continuously tries to get to desired state without an absolute concept of an operation succeeding or failing; desired state just hasn't been achieved yet.

We should consider softening this timeout to be a warning that installation hasn't achieved completion as quickly as expected, without actively preventing a successful outcome.

Late binding scenario (kube-api):
A user tries to install a cluster with the late binding feature enabled (deleting the cluster will return the hosts to the InfraEnv); the installation times out and the cluster goes into an error state, and the user then connects to the cluster and fixes the issue.
AI will still think that the cluster is in error. If the user tries to perform day-2 operations on a cluster in the error state, they will fail; the only option is to delete the cluster and create another one that is marked as installed, but that will cause the hosts to boot from the discovery ISO.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Add an annotation (with one sentence) that explains what each Kube resource is used for. Sometimes we cannot or don't want to browse the code to find out what a resource is used for, and most of the time the name is not self-explanatory.

I don't think maintaining this will be a big deal as resources' purposes don't change a lot.

There is a standard annotation for this: kubernetes.io/description (https://kubernetes.io/docs/reference/labels-annotations-taints/#description).
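For illustration, a minimal Go sketch (using client-go) that merge-patches the standard kubernetes.io/description annotation onto an existing ConfigMap; the namespace, ConfigMap name, and description text are placeholders, not actual monitoring resources.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location; error handling kept minimal for brevity.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Merge-patch the standard description annotation onto an (illustrative) ConfigMap.
	patch := []byte(`{"metadata":{"annotations":{"kubernetes.io/description":"Holds the scrape configuration consumed by Prometheus."}}}`)
	cm, err := client.CoreV1().ConfigMaps("openshift-monitoring").Patch(
		context.TODO(), "example-config", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(cm.Annotations["kubernetes.io/description"])
}
```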

See https://issues.redhat.com/browse/MON-1634?focusedId=22170770&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22170770 for more context.

 

We don't clearly document which "3rd-party" monitoring APIs are supported. We should compile an exhaustive list of the API services that we officially support (a sketch of calling one of them follows below the list):

  • OpenShift routes to access the monitoring API endpoints from the outside.
  • Kubernetes services to access the monitoring API endpoints within the cluster.
  • For each API, the list of supported endpoints (e.g. /api/v1/query, /api/v1/query_range for Prometheus).
  • Which clusterrole/role bindings are needed to access each API service.

See RHDEVDOCS-4830 for the context.
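As an illustration of what such documentation needs to cover, here is a hedged Go sketch that calls one of these endpoints (/api/v1/query) through a route with a bearer token; the route host is a placeholder, and the token must be bound to whatever (cluster)role binding the documentation ends up listing for that API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Placeholder route host and a token read from the environment; both are assumptions.
	host := "https://thanos-querier-openshift-monitoring.apps.example.com"
	token := os.Getenv("OCP_TOKEN")

	q := url.Values{}
	q.Set("query", "up")

	req, err := http.NewRequest(http.MethodGet, host+"/api/v1/query?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	// Access is authorized with a bearer token bound to the appropriate
	// (cluster)role binding -- exactly what the list above should document.
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```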

Epic Goal

  • Graduate MetricsServer FeatureGate to GA

Why is this important?

  • For autoscaling, OpenShift needs a resource metrics implementation. Currently this depends on CMO's Prometheus stack
  • Some users would like to opt-out of running a fully fledged Prometheus stack, see https://issues.redhat.com/browse/MON-3152

Scenarios

  1. A cluster admin decides not to deploy a full monitoring stack; autoscaling must still work (see the sketch below).
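Regardless of whether the resource metrics come from the Prometheus-based adapter or from a metrics server, consumers read the same metrics.k8s.io API. A minimal Go sketch of that consumer view, using the standard Kubernetes metrics clientset and a kubeconfig from the default location (illustrative only, not part of the epic's deliverables):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The autoscaling path only cares that metrics.k8s.io answers; the backing
	// implementation (full Prometheus stack or metrics server) is invisible here.
	nodes, err := mc.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s cpu=%s memory=%s\n", n.Name, n.Usage.Cpu(), n.Usage.Memory())
	}
}
```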

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Follow FeatureGate Guidelines
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. TechPreview completed in MON-3153

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Graduate MetricsServer FeatureGate to GA

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Get rid of unnecessary UPDATE requests during syncs.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Migrate:

  • CreateOrUpdateSecret
  • CreateOrUpdateConfigMap: Requires a small adjustment in library-go
  • CreateOrUpdatePodDisruptionBudget
  • CreateOrUpdateService
  • CreateOrUpdateRoleBinding
  • CreateOrUpdateRole
  • CreateOrUpdateClusterRole
  • CreateOrUpdateClusterRoleBinding

See https://docs.google.com/document/d/1fm2SZs8HroexPQnqI0Ua85Y31-lars8gCsdwSJC3ngo/edit?usp=sharing for more details.
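The general shape of the change (illustrative only, not the actual library-go signatures) is to fetch the existing object, compare the managed fields, and skip the UPDATE entirely when nothing changed. A minimal Go sketch for a ConfigMap, assuming a client-go clientset:

```go
package sync

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOrUpdateConfigMap creates the ConfigMap if it does not exist and only
// issues an UPDATE when the managed fields actually differ from what is on the cluster.
func createOrUpdateConfigMap(ctx context.Context, c kubernetes.Interface, desired *corev1.ConfigMap) error {
	existing, err := c.CoreV1().ConfigMaps(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = c.CoreV1().ConfigMaps(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Skip the write entirely when the fields we manage are already in the desired state.
	if equality.Semantic.DeepEqual(existing.Data, desired.Data) &&
		equality.Semantic.DeepEqual(existing.Labels, desired.Labels) {
		return nil
	}

	updated := existing.DeepCopy()
	updated.Data = desired.Data
	updated.Labels = desired.Labels
	_, err = c.CoreV1().ConfigMaps(desired.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```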

This epic is to track stories that are not completed in MON-3378

There are a few places in CMO where we need to remove code after the release-4.16 branch is cut.

To find them, look for the "TODO" comments.

After we have replaced all oauth-proxy occurrences in the monitoring stack, we need to make sure that all references to oauth-proxy are removed from the cluster monitoring operator. Examples:

 

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Currently, a cloud connection is used to connect the Power VS DHCP private network and the VPC network in order to create a load balancer in the VPC for the ingress controller.

Cloud connections have a limitation of only 2 per zone.

With a Transit Gateway, 5 instances can be created per zone, and it is also a lot faster than a cloud connection since it uses PER.

https://cloud.ibm.com/docs/power-iaas?topic=power-iaas-per

https://cloud.ibm.com/docs/transit-gateway?topic=transit-gateway-getting-started&interface=ui

 

Currently, various serviceIDs are created by the create-infra command to be used by various cluster operators, and the storage operator's serviceID is exposed in the guest cluster, which should not happen. We need to reduce the scope of all the serviceIDs to be specific to the infra resources created for that cluster alone.

Epic Goal

  • Improve IPI on Power VS in the 4.16 cycle
    • Switch to CAPI provisioned bootstrap and control plane resources

Epic Goal
Through this epic, we will update our CI to use an available agent-based workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of Terraform in our deployments.

Why is this important?
There is an active initiative in OpenShift to remove Terraform from the openshift-installer.

Acceptance Criteria

  • All tasks within the epic are completed.

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

As a CI job author, I would like to be able to reference a yaml/json parsing tool that works across architectures and doesn't need to be downloaded for each unique step.

Rafael pointed out that Alessandro added multi-arch containers for yq for the UPI installer:
https://github.com/openshift/release/pull/49036#discussion_r1499870554

yq should have the ability to parse json.

We should evaluate if this can be added to the libvirt-installer image as well, and then used by all of our libvirt CI steps.

Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314

Basically the tl;dr here is that we need a way to ensure that machinesets are properly advertising the architecture that the nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up/down. This could be accomplished through user-driven means, like adding node architecture labels to machinesets; if we have to do this automatically, we need to do some more research and figure out an approach.

Epic Goal

  • The goal of this epic is to continue work on CI job delivery, some bug fixes related to multipath, and some updates to the assisted-image service.
  • With 4.16 we want to support LPAR on s390x
  • LPAR comes in two flavors: LPAR classic (iPXE boot) and LPAR DPM (ISO and iPXE boot)

On s390x there are some network configurations where MAC addresses are not static and lead to issues using Assisted Installer, Agent based Installer and HCP (see attached net conf).

For AI and HCP it is possible to patch the kernel arguments, but when using the UI a separate manual step via the API is needed, which is a bad user experience.
In addition, patching the kernel arguments for ABI is not possible.

To solve this, a config override parameter needs to be added to the parm file by the user, and the IP settings will be automatically passed to the coreos-installer regardless of what the user configures (DHCP or static IP using nmstate).

New configuration for parm file in cases marked red in the network configuration matrix.

The new parameter will be ai.ip_cfg_override and can be set to 0 or 1. When set to 1, the network configuration will be taken from the parm file regardless of what the user configures, for example via the Assisted Installer web UI form. This issue affects the Agent-based Installer and HCP.
The parameters for IP configuration will be "ip" and "nameserver".
An example parm line looks like:
rd.neednet=1 ai.ip_cfg_override=1 console=ttysclp0 coreos.live.rootfs_url=http://172.23.236.156:8080/assisted-installer/rootfs.img ip=10.14.6.3::10.14.6.1:255.255.255.0:master-0.boea3e06.lnxero1.boe:encbdd0:none nameserver=10.14.6.1 ip=[fd00::3]::[fd00::1]:64::encbdd0:none nameserver=[fd00::1] zfcp.allow_lun_scan=0 rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1 rd.zfcp=0.0.8002,0x500507630400d1e3,0x4000404600000000 random.trust_cpu=on rd.luks.options=discard ignition.firstboot ignition.platform.id=metal console=tty1 console=ttyS1,115200n8

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Bump OpenShift Router's HAProxy from 2.6 to 2.8.

Why is this important?

As a cluster administrator, I want OpenShift to include a recent HAProxy version, so that I have the latest available performance and security fixes.  

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • OpenShift Router is running HAProxy 2.8
  • Perf & Scale analysis is complete
  • Review of critical/major bug fixes between HAProxy 2.6 and 2.8

...

Dependencies (internal and external)

1. Perf & Scale

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Bump the openshift-router container image to RHEL9 with help from the ART Team.

Why is this important?

OpenShift is transitioning to RHEL9 base images and we must move to RHEL9 by 4.16. RHEL9 introduces OpenSSL 3.0, which has shown performance degradation in some scenarios; investigating that degradation is part of the acceptance criteria below.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • The openshift-router image is based on RHEL9
  • Investigation into the RHEL9 performance degradation is complete

...

Dependencies (internal and external)

  1. Perf & Scale
  2. ART Team

...

Previous Work (Optional):

  1. ART Issue: https://issues.redhat.com/browse/ART-8359
  2. RHEL9 Router Smoke Test: https://github.com/openshift/router/pull/538 
  3. RHEL9 ART Slack Thread

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Work with the ART team to get https://github.com/openshift-eng/ocp-build-data/pull/3895 merged, verify that everything is working appropriately, and ensure we keep our openshift-router CI jobs up to date by using RHEL9, since we disabled the automatic base image definition from ART.

The origin test "when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key" is failing due to a hardcoded certificate using SHA1 as the hash algorithm.

The certificate needs to be updated to use SHA256, as SHA1 isn't supported anymore.

More details: https://github.com/openshift/router/pull/538#issuecomment-1831925290 
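For reference, a hedged Go sketch of generating the kind of replacement fixture the test needs: a self-signed certificate for a 1024-bit RSA key signed with SHA256 instead of SHA1. This is not the actual origin test code; the subject and DNS name are placeholders.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	// 1024-bit key, matching what the test exercises; the point is the SHA256 signature.
	key, err := rsa.GenerateKey(rand.Reader, 1024)
	if err != nil {
		panic(err)
	}

	tmpl := &x509.Certificate{
		SerialNumber:       big.NewInt(1),
		Subject:            pkix.Name{CommonName: "www.example.com"},
		NotBefore:          time.Now(),
		NotAfter:           time.Now().AddDate(10, 0, 0),
		KeyUsage:           x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:        []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:           []string{"www.example.com"},
		SignatureAlgorithm: x509.SHA256WithRSA, // SHA1-signed certificates are no longer accepted
	}

	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}

	// Print PEM-encoded cert and key, ready to paste into a test fixture.
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	pem.Encode(os.Stdout, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
}
```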

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

Why is this important?

  • In 1.29, Kube flipped DisableKubeletCloudCredentialProviders to true by default; this broke our rebase tests, as the kubelet could no longer pull images from GCR
  • To mitigate this, we flipped the flag back to false
  • We must revert that flip before the feature is GA'd upstream
  • The cloud provider authentication providers (e.g. on GCP) become dependencies of the kubelet and must be configured via flags
  • As an example on GCP
    • We need to build the provider and ship it as an RPM (perhaps in the kubelet RPM? Can RPMs have dependency RPMs?)
    • The RPM should place the binary into a well known location on disk
    • We then need to create a configuration file and set the correct flags on Kubelet based on this configuration

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

Setting up the distgit (for production brew builds) depends on the ART team. This should be tackled early.

 

The PR to ocp-build-data should also be prioritised, as it blocks the PR to openshift/os. There is a separate CI Mirror used to run CI for openshift/os in order to merge, which can take a day to sync. 

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an openshift maintainer I want our build tooling to produce the gcr credential provider plugin so that it can be distributed in RHCOS to be used by kubelet.

Background

We need to ship the gcr credential provider via an rpm, so it is available to kubelet when it first starts.

To ship an rpm we must create a .spec file that provides information on how the package should be built

 

A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63

Steps

  • Copy the .spec file from the AWS PR
  • Set up rpmbuild and rpmlint locally (probably running in a container). This is a good introduction to the process.
  • Get the .spec file building locally: likely by calling make <target> in the %build section (you'll need to get a tarball for the source, and move it into `SOURCES`)
  • PR .spec to the root of `cloud-provider-gcp`

Stakeholders

  • cluster-infra
  • workloads team

Definition of Done

  • .spec building the gcp gcr credential plugin locally
  • PR to `cloud-provider-gcp`
  • Docs
  • N.A
  • Testing
  • N.A

User Story

As a user, I want kubelet to know how to authenticate with ACR automatically so that I don't have to roll credentials every 12h

Background

This functionality is being removed in tree from the kubelet, so we now need to provide it via a credential provider plugin

Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.

Steps

See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR

Stakeholders

  • cluster-infra team
  • workloads team

Definition of Done

  • MCO sets --image-credential-provider-config and --image-credential-provider-bin-dir for Azure
  • credential provider config exists on azure master and worker nodes
  • Tests updated to reflect the above changes
  • Docs
  • Add release note notifying of the change from in tree kubelet to an external process
  • Testing
  • Set up private registry on ACR
  • Set up a new OCP cluster and check that it can pull from the registry

User Story

As a user, I want kubelet to know how to authenticate with GCR automatically so that I don't have to roll credentials every 12h

Background

This functionality is being removed in tree from the kubelet, so we now need to provide it via a credential provider plugin

Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.

Steps

See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR

Stakeholders

  • cluster-infra team
  • workloads team

Definition of Done

  • MCO sets --image-credential-provider-config and --image-credential-provider-bin-dir for GCP
  • credential provider config exists on gcp master and worker nodes
  • Tests updated to reflect the above changes
  • Docs
  • Add release note notifying of the change from in tree kubelet to an external process
  • Testing
  • Set up private registry on GCR
  • Set up a new OCP cluster and check that it can pull from the registry

Background

To enable the use of the new Azure and GCP image credential providers, we need to enable the DisableKubeletCloudCredentialProviders feature gate in all cluster profiles.

Steps

  • Enable the feature gate
  • Verify that it doesn't affect HyperShift

Stakeholders

  • Cluster Infra
  • Maciej - wants this done before 4.16 closes out
  • HyperShift

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

User Story

As an openshift maintainer I want our build tooling to produce the acr credential provider plugin so that it can be distributed in RHCOS to be used by kubelet.

Background

We need to ship the acr credential provider via an rpm, so it is available to kubelet when it first starts.

To ship an rpm we must create a .spec file that provides information on how the package should be built

 

A working example for AWS is provided in this PR: https://github.com/openshift/cloud-provider-aws/pull/63

Steps

  • Copy the .spec file from the AWS PR
  • Set up rpmbuild and rpmlint locally (probably running in a container). This is a good introduction to the process.
  • Get the .spec file building locally: likely by calling make <target> in the %build section (you'll need to get a tarball for the source, and move it into `SOURCES`)
  • PR .spec to the root of `cloud-provider-azure`

Stakeholders

  • cluster-infra
  • workloads team

Definition of Done

  • .spec building the azure acr credential plugin locally
  • PR to `cloud-provider-azure`
  • Docs
  • N.A
  • Testing
  • N.A

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Stop setting `-cloud-provider` and `-cloud-config` arguments on KAS, KCM and MCO
  • Remove `CloudControllerOwner` condition from CCM and KCM ClusterOperators
  • Remove feature gating reliance in library-go IsCloudProviderExternal
  • Remove CloudProvider feature gates from openshift/api

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

As part of the migration to external cloud providers, the CCMO and KCMO used a CloudControllerOwner condition to show which component owned the cloud controllers.

This is no longer required and can be removed.
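Clearing the condition ultimately amounts to dropping it from the ClusterOperator's status conditions. A minimal Go sketch of that step, using the openshift/api config types but none of the operator-specific status plumbing (which is where the real change lives):

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// withoutCondition returns the condition list with the named condition removed.
// This mirrors what "clearing" the CloudControllerOwner condition amounts to.
func withoutCondition(conditions []configv1.ClusterOperatorStatusCondition, conditionType configv1.ClusterStatusConditionType) []configv1.ClusterOperatorStatusCondition {
	out := make([]configv1.ClusterOperatorStatusCondition, 0, len(conditions))
	for _, c := range conditions {
		if c.Type == conditionType {
			continue
		}
		out = append(out, c)
	}
	return out
}

func main() {
	conditions := []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
		{Type: configv1.ClusterStatusConditionType("CloudControllerOwner"), Status: configv1.ConditionTrue},
	}
	conditions = withoutCondition(conditions, "CloudControllerOwner")
	fmt.Println(len(conditions)) // 1: only Available remains
}
```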

Steps

  • Remove code from CCMO that looks for and gates on the KCMO condition
  • Ensure CCMO clears the condition
  • Ensure KCMO clears the condition

Stakeholders

  • Cluster Infra
  • Workloads team

Definition of Done

  • Clusters upgraded to 4.16 do not have a CloudControllerOwner condition set on the KCMO or CCMO ClusterOperators
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.

In 4.17 the code is expected to be deleted completely.

We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.

Steps

  • Update library go to remove reliance on feature gates
  • Update callers to no longer rely on feature gate accessor (KCMO, KASO, MCO, CCMO)
  • Remove feature gates from API repo

Stakeholders

  • Cluster Infra
  • MCO team
  • Workloads team
  • API server team

Definition of Done

  • Feature gates for external cloud providers are removed from the product
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

The kubelet no longer needs the cloud-config flag as it is no longer running in-tree code.

It is currently handled in the templates by this function which will need to be removed, along with any instances in the templates that call the function.

This should cause the flag to be omitted from future kubelet configuration.

Steps

  • Remove the cloud-config flag producing function from MCO
  • Handle any cleanup required (find kubelet files that reference the template function)

Stakeholders

  • Cluster Infra
  • MCO team

Definition of Done

  • Kubelet no longer has a specified value for `--cloud-config`
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Goal

  • The goal of this epic is to adjust HighOverallControlPlaneCPU alert thresholds when Workload Partitioning is enabled.

Why is this important?

  • On SNO clusters this might lead to false positives. It also makes sense to have such a mechanism because the control plane alert currently assumes all available CPUs, while the user can reserve fewer cores for the control plane to use.

Scenarios

  1. As a user, I want to enable workload partitioning and have my alert values adjusted accordingly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Open questions:

  1. Do we want to bump the alert threshold for SNO clusters because they run workloads on master nodes rather than worker nodes?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Create a separate SNO alert with the templating engine to adjust the alerting rules based on the workload partitioning mechanism.
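One possible shape for that templating (illustrative only; the metric expression, threshold, and parameter names are assumptions, not the actual CMO rule): derive the CPU budget from the cores reserved for the management partition when workload partitioning is enabled, then render it into the alert expression.

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

// Illustrative expression: alert when per-node CPU usage exceeds the allowed
// number of cores. Not the shipped HighOverallControlPlaneCPU rule.
const exprTmpl = `sum(rate(node_cpu_seconds_total{mode!="idle"}[2m])) by (instance) > {{ printf "%.2f" .AllowedCores }}`

type params struct {
	AllowedCores float64
}

// allowedCores scales the budget to the management partition when workload
// partitioning is enabled, instead of assuming all cores are available.
func allowedCores(workloadPartitioningEnabled bool, reservedManagementCores, totalCores int) float64 {
	const utilizationBudget = 0.9 // assumed budget: alert above 90% of the relevant core count
	cores := totalCores
	if workloadPartitioningEnabled && reservedManagementCores > 0 {
		cores = reservedManagementCores
	}
	return utilizationBudget * float64(cores)
}

func main() {
	tmpl := template.Must(template.New("expr").Parse(exprTmpl))
	// Example SNO node: 16 cores total, 4 reserved for management via workload partitioning.
	p := params{AllowedCores: allowedCores(true, 4, 16)}
	if err := tmpl.Execute(os.Stdout, p); err != nil {
		panic(err)
	}
	fmt.Println()
}
```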

Here is our overall tech debt backlog: ODC-6711

See included tickets, we want to clean up in 4.16.

Problem:

Future PatternFly upgrades shouldn't break our integration tests. Based on the upgrade to PatternFly 5, we should avoid relying on PatternFly classnames in our tests.

Goal:

Remove all usage of pf-* classname selectors in our test cases.

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/webterminal-plugin/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/shipwright-plugin/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/knative-plugin/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/topology/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/dev-console/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/pipelines-plugin/integration-tests`
     

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/helm-plugin/integration-tests`
     

Problem:

Increase and improve the automation health of ODC

Goal:

Why is it important?

Improve automation coverage for ODC

Use cases:

  1. <case>

Acceptance criteria:

  1. Increase automation coverage for devconsole package
  2. Improve tests in devconsole, helm, knative, pipelines, topology, and shipwright packages

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

Automating the Recently Used Resources Section in the Search Resource page Dropdown
As a user,

Acceptance Criteria

  1. Automating the ACs for epic ODC-7480

    Additional Details:

Problem:

Increase CI coverage for different packages in ODC

Goal:

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Increase tests in pre-merge CI for different packages of ODC
  2. Improve CI health

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

We released 4.15 with broken quick starts, see OCPBUGS-29992

We must ensure that our main features are covered by e2e tests.

Acceptance Criteria

  1. e2e that navigates to the QuickStart page from the main navigation
  2. Ensures that the quick start catalog is shown with at least X items
  3. e2e test that opens a quick start and navigates through it (pressing Next n times) to verify it is working
  4. Test closing the sidebar, etc.

Problem:

The Service Binding Operator will reach its end of life, and we want customers to be informed about that change. The first step for this is showing deprecation messages when customers use SBO.

Acceptance criteria:

  1. Show deprecation warnings to customers that use SBO and service bindings; this could include these places:
    1. Add page / flow
    2. Creating a SB yaml
    3. SB list page
    4. Topology when creating a SB / bind a component
    5. Topology if we found any SB in the current namespace?
  2. Provide an alternative (TBD!)

Note:

This only affects Service Bindings in our UI and not the general Operator based bindings.

Google Doc - End of Life for Service Binding Operator

Description of problem:

Since the Service Binding Operator is deprecated and will be removed with the OpenShift Container Platform 4.16 release, users should be notified about this in the console in the below pages

1. Add page / flow
2. Creating a SB yaml
3. SB list page
4. Topology when creating a SB / bind a component
5. Topology if we found any SB in the current namespace?  

Note: 

Confirm the warning text with UX.

Additional info:

https://docs.openshift.com/container-platform/4.15/release_notes/ocp-4-15-release-notes.html#ocp-4-15-deprecation-sbo

https://docs.google.com/document/d/1_L05xy7ZSK2xCLiqrrJBPwDoahmi78Ox6mw-l-IIT9M/edit#heading=h.jcsa7gh4tupt    

Description

As an engineer, I want to know how OLS is being used, so that I can know what to focus on and improve it.

Acceptance Criteria

  • It should be separate from the feedback system
    • Or should we have a “thumbs up/down” metric to telemetry? Hard to make use of that value without context though.
  • It should enable cluster monitoring scraping of our namespace
    • Only valid for openshift-* NSes? Probably use openshift-lightspeed as our NS
  • It should pick+compute set of metrics
    • Response time
    • Token counts
    • Total requests/requests per minute/hour
    • Conversation length (how many follow up questions are asked)
  • It should expose and secure (TLS) the metrics endpoint (see the sketch after this list)
  • It should whitelist metrics to send to telemetry
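A minimal sketch of the expose-and-secure part, written in Go with the Prometheus client library (the OLS service itself may use a different stack); the metric names, port, and certificate paths are assumptions, e.g. the cert/key could come from a service-serving certificate secret.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative metric names only; the real OLS metric set is what this story defines.
// These would be updated by the request-handling code, which is not shown here.
var (
	requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "ols_requests_total",
		Help: "Total number of OLS requests served.",
	})
	responseSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "ols_response_duration_seconds",
		Help: "Time spent answering an OLS request.",
	})
)

func main() {
	mux := http.NewServeMux()
	// Expose the metrics endpoint; cluster monitoring would scrape it
	// (e.g. via a ServiceMonitor in the openshift-lightspeed namespace).
	mux.Handle("/metrics", promhttp.Handler())

	// Serve over TLS; the cert/key paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "/etc/tls/tls.crt", "/etc/tls/tls.key", mux))
}
```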

Description

As an OLS product owner, I want metric data about OLS usage to be reported to RH's telemetry system so I can see how users are making use of OLS and potentially identify issues

How to add a metric to telemetry:
https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/

Acceptance Criteria

  • The subset of metrics OLS reports that we care about sending back to RH is in the telemetry whitelist, so that OpenShift forwards them from the cluster to RH
  • The metric data is visible in RH's telemetry systems (https://telemeter-lts.datahub.redhat.com/graph)

Notes:

  • OLM has these metrics exported, such as CSV-related metrics
  • alerts are sent to Telemeter automatically
  • metrics generated from recording rules can be sent, too
  • when creating PRs, we can send one metric per PR for easy review by the monitoring team

 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This is a response to https://issues.redhat.com/browse/OCPBUGS-12893 which is more of a feature request than a bug. The ask is that we put the ingress VIP in a fault state when there is no ingress controller present on the node so it won't take the VIP even if no other node takes it either.

I don't believe this is a bug because it's related to an unsupported configuration, but I still think it's worth doing because it will simplify our remote worker process. If we put the VIP in a fault state it won't be necessary to disable keepalived on remote workers; the ingress service just needs to be placed correctly and keepalived will do the right thing.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

Currently the alert details page supports adding links through the plugin extension; we also need buttons to add actions that do not redirect but instead trigger code, for example triggering the troubleshooting panel.

Outcomes

The monitoring plugin renders buttons and links provided by the actions plugin extension 

Description

Provide the ability to export data in a CSV format from the various Observability pages in the OpenShift console.

 

Initially this will include exporting data from any tables that we use.

Goals & Outcomes

Product Requirements:

A user will have the ability to click a button which will download the data in the current table in a CSV format. The user will then be able to take this downloaded file and use it to import the data into their system.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Any ERRORs produced by TuneD will result in degraded Tuned Profiles. Clean up upstream and NTO/PPC-shipped TuneD profiles and add ways of limiting the ERROR message count.
  • Review the policy of restarting TuneD on errors every resync period.  See: OCPBUGS-11150

Why is this important?

  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/PSAP-908

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:
As a software developer, I need to be better informed about the process of contributing to NTO, so that people do not mistakenly contribute to https://github.com/openshift/cluster-node-tuning-operator/tree/master/assets/tuned instead of to
https://github.com/redhat-performance/tuned

I also want to be able to create custom NTO images just by following HACKING.md file, which needs to be updated.

Acceptance criteria:

Additional information
There will likely be changes necessary in the NTO Makefile too.

Implement shared selector-based functionality for port groups and address sets, to be re-used by network policy, ANP, egress firewall, multicast and possibly other features in the future.

This is intended to

  • update port groups on pod creation (network policy is the main use case, so the pod is isolated before it is started)
  • share port groups and address set when possible (for performance, it is already implemented for network policy address sets)
  • share code between features that update port groups and address sets based on selectors (mostly for pods and nodes, possibly services and other objects in the future)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:

Failed to install OCP on the below LZ/WLZ zones. The common point in the regions below is that all of them have only one type of zone: LZ or WLZ. E.g. in af-south-1 only LZ is available (no WLZ), while in ap-northeast-2 only WLZ is available (no LZ).



Failed regions/zones:

af-south-1 ['af-south-1-los-1a']     
failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in af-south-1

ap-south-1 ['ap-south-1-ccu-1a', 'ap-south-1-del-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-south-1

ap-southeast-1 ['ap-southeast-1-bkk-1a', 'ap-southeast-1-mnl-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-1

me-south-1 ['me-south-1-mct-1a']     
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in me-south-1

ap-southeast-2 ['ap-southeast-2-akl-1a', 'ap-southeast-2-per-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-2

eu-north-1 ['eu-north-1-cph-1a', 'eu-north-1-hel-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in eu-north-1

ap-northeast-2 ['ap-northeast-2-wl1-cjj-wlz-1', 'ap-northeast-2-wl1-sel-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ap-northeast-2

ca-central-1 ['ca-central-1-wl1-yto-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ca-central-1

eu-west-2	['eu-west-2-wl1-lon-wlz-1', 'eu-west-2-wl1-man-wlz-1', 'eu-west-2-wl2-man-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in eu-west-2

    

Version-Release number of selected component (if applicable):

4.15.0-rc.3-x86_64
    

How reproducible:

    

Steps to Reproduce:

1) install OCP on above regions/zones
    

Actual results:

See description. 
    

Expected results:

Don't check LZ's availability while installing OCP in WLZ
Don't check WLZ's availability while installing OCP in LZ
    

Additional info:

    

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The installer is using a very old reference to the machine-config project for its APIs.
Those APIs have been moved to openshift/api. Update the imports and unit tests.

 

This will make an import error go away and make it easier to update in the future.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

This includes ibm-vpc-node-label-updater!

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • aws-efs-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • aws-ebs-csi-driver-operator (now part of csi-operator)
  • azure-disk-csi-driver-operator (now part of csi-operator)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Today we manually catch some regressions by eyeballing disruption graphs. 

There are two focuses for this epic: first, updates to the existing disruption logic to fix and tune it; second, considering new methods for collecting and analyzing disruption.

 

For the second part considerations are:

Design some automation to detect these kinds of regressions and alert TRT.

Would this be in sippy or something new? (too openshift specific?)

Bear in mind we'll soon need it for component memory and cpu usage as well.

Alerts should eventually be targeting the SLO we discussed in Infra arch call on Feb 7: https://docs.google.com/document/d/1QOXh7Me0w-4ad-c8HaTuQPvpG5cddUyA2b1j00H-MXQ/edit?usp=sharing

Make sure we gain testing coverage for metal and vSphere, which typically do not have the minimum 100 runs; how can we test these more broadly?

Today we fail the job if you're over the P99 for the last 3 weeks, as determined by a weekly PR to origin. The mechanism for creating that PR, reading its data, and running these tests breaks repeatedly without anyone realizing, and often does things we don't expect.

Disruption ebbs and flows constantly, especially at the 99th percentile; the test being run is not technically the same week to week.

We do still want to at least attempt to fail a job run if disruption was significant.

Examples we would not want to fail:

P99 2s, job got 4s. We don't care about a few seconds of disruption at this level. This happens all the time, and it's not a regression. Stop failing the test.

P99 60s, job got 65s. There's already huge disruption possible, a few seconds over is largely irrelevant week to week. Stop failing the test.

Layer 3 disruption monitoring is now our main focus point for catching more subtle regressions, this is just a first line of defence, best attempt at telling a PR author that your change may have caused a drastic disruption problem.

Changes proposed (see details in comments below for the data as to why):

  • Allow P99 + grace, where grace is 5s or 10% (see the sketch after this list).
  • Remove all fallbacks to historical data except based on release version. (or all of them, and shut down testing while data is accumulating for a new release?)
  • Disable all testing with empty Platform, whatever these are we should not be testing on it.
  • Fix the broken test names: disruptionlegacyapiservers monitortest.go testNames() is returning the same test name for new/reused connection testing and is likely causing flakes when it should be causing failures.
  • Standardize all disruption test names to something we can easily search for, see this link.
  • Recheck data if we focus on jobs that ONLY failed on disruption, and see if a more lenient grace would achieve the effects we want there.
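One reading of the proposed "P99 + grace" rule, as a minimal Go sketch (grace taken as the larger of 5s and 10% of the historical P99); it reproduces the two examples above, neither of which would fail any more.

```go
package main

import (
	"fmt"
	"time"
)

// allowedDisruption implements the proposed "P99 + grace" rule, reading grace as
// the larger of a fixed 5s and 10% of the historical P99.
func allowedDisruption(historicalP99 time.Duration) time.Duration {
	grace := time.Duration(float64(historicalP99) * 0.10)
	if grace < 5*time.Second {
		grace = 5 * time.Second
	}
	return historicalP99 + grace
}

func main() {
	// P99 2s, job got 4s: within 2s+5s=7s, so it no longer fails.
	fmt.Println(allowedDisruption(2*time.Second), 4*time.Second <= allowedDisruption(2*time.Second))
	// P99 60s, job got 65s: within 60s+6s=66s, so it no longer fails.
	fmt.Println(allowedDisruption(60*time.Second), 65*time.Second <= allowedDisruption(60*time.Second))
}
```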

Certain repositories, ovn/sdn, router, sometimes MCO, are prone to cause disruption regressions. We need to give engineers in these teams better visibility into when they're about to merge something that might break payloads.

We could request /payload on all PRs but this is expensive and a manual task that could still easily be forgotten.

In our epic TRT-787 we will likely soon have a little data on disruption in sippy, enough to know what we expect is normal, and how we're doing the last few days.

This card proposes a similar approach to risk analysis, or actually plugging right into risk analysis. A spyglass panel should be visible with data on disruption:

  • backend name
  • phase 1 disruption observed (if abnormal)
  • phase 2 disruption observed (conformance, if applicable)
  • total disruption
  • normal disruption over last X period, or fixed limit we've determined
  • delta
  • risk level, or standard deviations etc

We could then enhance the PR commenter to drop this information right in front of the user if anything looks out of the ordinary.

The intervals charts displayed at the top of all prow job runs have become a critical tool for TRT and OpenShift engineering in general, allowing us to determine what happened when, and in relation to other events. The tooling, however, is falling short in a number of areas we'd like to improve.

Goals:

  • sharable links
  • improved filtering
  • a live service rather than chart html as an artifact that is difficult to improve on and regenerate for old jobs

Stretch goals:

  • searchable intervals across many / all jobs

Drop locator and message.

Move tempStructuredLocator and tempStructuredMessage to locator/message.

Ground work for this was already laid by removing all use of the legacy fields.

Epic Goal

  • Align to OKR 2024 for OCP, having a blocking job for MicroShift.
  • Ensure that issues that break MicroShift are reported against the OCP release as blocking by preventing the payload from being promoted.

Why is this important?

  • Ensures stability of MicroShift when there are OCP changes.
  • Ensures stability against MicroShift changes.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

1. Proposed title of this feature request

sosreport(sos rpm) command to be included in the tools imagestream.

2. What is the nature and description of the request?

There is an imagestream called tools which is used for oc debug node/<node>. It provides several tools for debugging the node, but sos is not one of them. The sos report command is the most important command for getting support from Red Hat, and it should be included for debugging the node.

3. Why does the customer need this? (List the business requirements here)

Telco operators build their systems in disconnected environments, so it is hard to get additional rpms or images into their environments. If the sos command were included in the OCP platform via the tools imagestream, it would be very useful for them. There is a toolbox image for sosreport, but it is not included in the OpenShift release manifests, and some telco operators do not allow bringing in additional packages or images other than the OpenShift platform itself.

4. List any affected packages or components.

tools imagestream in Openshift release manifests

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Due to FIPS compatibility reasons, oc compiled on a specific RHEL version does not work on other versions (i.e. oc compiled on RHEL8 does not work on RHEL9). Therefore, as customers migrate to newer RHEL versions, oc's base image should be RHEL9.

This work covers changing the base image of tools, cli, deployer, cli-artifacts.

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

This improves supportability: oc will work on RHEL9 by default, regardless of whether FIPS is enabled on the host.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

ART has to perform some updates in ocp-build-data repository.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

1. from https://issues.redhat.com/browse/RFE-1576 add output compression
2. from earlier discussion about escalation add --since to limit the size of data (https://issues.redhat.com/browse/RFE-309)
3. from workloads architecture must-gather fallback to oc adm inspect clusteroperators when it can't run a pod (https://github.com/openshift/oc/pull/749)
4. ability to get kubelet logs, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1925328 (no longer relevant, feel free to reopen/re-report if needed)
5. fix error output from sync, see https://bugzilla.redhat.com/show_bug.cgi?id=1917850
6. Create link between nodes and pods from their respective nodes, so it's easier to navigate. (no longer applies)
7. Gather limited (x past hours/days) of logs see https://bugzilla.redhat.com/show_bug.cgi?id=1928468.
8. Use rsync for copying data off of pod see https://github.com/openshift/must-gather/pull/263#issuecomment-967024141 (moved under https://issues.redhat.com/browse/WRKLDS-1191)

[bug] Must-gather logs (not logged into the archive) - this also applies to inspect
[bug] timeouts don’t seem to happen at 10 min
[RFE] logging the collection of inspect in the timestamp file - the timestamp file should have more detailed information, probably similar to the first bug above (moved to https://issues.redhat.com/browse/WRKLDS-1190)

Pulling from https://issues.redhat.com/browse/WRKLDS-259. Currently, `oc adm must-gather` prints logs that are not timestamped when fallback is triggered (running `oc adm inspect` directly). E.g.:

$ ./oc adm must-gather
[must-gather      ] OUT Using must-gather plug-in image: registry.ci.openshift.org/ocp/4.16-2024-04-19-040249@sha256:addc9b013a4cbbe1067d652d032048e7b5f0c867174671d51cbf55f765ff35fc
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 1fa407e3-2ea7-4c34-8fba-8c4d5c3a0bda
ClientVersion: v4.2.0-alpha.0-2261-g17c015a
ClusterVersion: Stable at "4.16.0-0.ci-2024-04-19-040249"
ClusterOperators:
	All healthy and stable


[must-gather      ] OUT namespace/openshift-must-gather-9qrqc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-8c5gz ...
Gathering data for ns/openshift-config...
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
Gathering data for ns/openshift-config-managed...
Gathering data for ns/openshift-authentication...
Gathering data for ns/openshift-authentication-operator...
Gathering data for ns/openshift-ingress...
Gathering data for ns/openshift-oauth-apiserver...
Gathering data for ns/openshift-machine-api...
Gathering data for ns/openshift-cloud-controller-manager-operator...
Gathering data for ns/openshift-cloud-controller-manager...
...

The missing timestamps make it impossible to measure how much time each "Gathering data" step takes to complete. The completion times can help identify which steps take most of the collection time.

*Testing part*: the only test case here is to validate that every "significant" log line has a timestamp, where significant = a "Gathering data" line. This applies in both cases: the normal one, when a must-gather-??? pod/container is running, and the fallback, when the must-gather image cannot be pulled (e.g. it does not exist) and must-gather falls back to running the `oc adm inspect` code directly (which takes significantly longer to run).
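
A minimal Go sketch of the check described in the testing part. This is a hypothetical helper, and the exact expected timestamp format is an assumption:

// missingTimestamps returns the "Gathering data" lines that do not start with a timestamp.
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Assumed format, e.g. "2024-04-19T04:12:33Z Gathering data for ns/openshift-config..."
var timestamped = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?\s+Gathering data`)

func missingTimestamps(log string) []string {
	var bad []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Gathering data") && !timestamped.MatchString(line) {
			bad = append(bad, line)
		}
	}
	return bad
}

func main() {
	log := "Gathering data for ns/openshift-config...\n2024-04-19T04:12:33Z Gathering data for ns/openshift-ingress...\n"
	for _, l := range missingTimestamps(log) {
		fmt.Println("missing timestamp:", l)
	}
}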

*Documenting part*: no-doc

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

External consumers of MachineSets(), such as hive, need to be able to customize the client that queries the OpenStack cloud for trunk support.

OSASINFRA-3420, eliminating what looked like tech debt, removed that enablement, which had been added via a revert of a previous similar removal.

Reinstate the customizability, and include a docstring explanation to hopefully prevent it being removed again.

Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/32

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
Invalid volume size when restoring a VolumeSnapshot as a new PVC. The size unit is undefined and the size appears as TiB. Please check the attachment.

  1. oc get vs -n syq-test
     NAME                      READYTOUSE   SOURCEPVC      RESTORESIZE   SNAPSHOTCLASS
     isilon-data1-snapshot-1   true         isilon-data1   2219907496    isilon-snapclass-dev01
     isilon-datas-snapshot     true         isilon-datas   2219907496    isilon-snapclass-dev01
     test1-syq-snapshot        true         test1-syq      1Ki           isilon-snapclass-dev01
     test1-syq-snapshot-11     true         test1-syq      1106          isilon-snapclass-dev01

[Env]
dell isilon CSI volumes

  • Additional info:
    Issue doesn't happen with ODF

Description of problem:

oc-mirror creates an invalid itms-oc-mirror.yaml file when working with an OCI image; when creating the ITMS from that file, the following error is hit:

oc create -f itms-oc-mirror.yaml 
The ImageTagMirrorSet "itms-operator-0" is invalid: spec.imageTagMirrors[0].source: Invalid value: "//app1/noo": spec.imageTagMirrors[0].source in body should match '^\*(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+$|^((?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?)(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?$'

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1)  Copy the operator as OCI format to localhost:
`skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 oci:///app1/noo/redhat-operator-index --remove-signatures`

2)  Use the following ImageSetConfiguration for the mirror: cat config-multi-op.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
    - catalog: oci:///app1/noo/redhat-operator-index
      packages:
        - name: odf-operator
`oc-mirror --config config-multi-op.yaml file://outmulitop   --v2`


3) Do diskToMirror:
`oc-mirror --config config-multi-op.yaml --from file://outmulitop  --v2 docker://ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi`

4) Create cluster resource with file: itms-oc-mirror.yaml
   `oc create -f itms-oc-mirror.yaml`

Actual results: 

4) failed to create ImageTagMirrorSet
oc create -f itms-oc-mirror.yaml 
The ImageTagMirrorSet "itms-operator-0" is invalid: spec.imageTagMirrors[0].source: Invalid value: "//app1/noo": spec.imageTagMirrors[0].source in body should match '^\*(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+$|^((?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?)(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?$'

cat itms-oc-mirror.yaml 
---
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  creationTimestamp: null
  name: itms-operator-0
spec:
  imageTagMirrors:
  - mirrors:
    - ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi
    source: //app1/noo
status: {}

Expected results:

4) succeed to create the cluster resource

Description of problem:

    If the authentication.config/cluster Type=="" but the OAuth/User APIs are already missing, the console-operator won't update the authentication.config/cluster status with its own client, as it crashes on being unable to retrieve OAuthClients.

Version-Release number of selected component (if applicable):

    4.15.0

How reproducible:

    100%

Steps to Reproduce:

    1. scale oauth-apiserver to 0
    2. set featuregates to TechPreviewNoUpgrade
    3. watch the authentication.config/cluster .status.oidcClients

Actual results:

    The client for the console does not appear.

Expected results:

    The client for the console should appear.    

Additional info:

    

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Description of problem:

After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem: after CVO goes through the code path with the timeouts, it starts logging errors for the core manifest reconciliation loop:

I0208 11:22:57.107819       1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
I0208 11:22:57.107887       1 task_graph.go:474] Canceled worker 1 while waiting for work
I0208 11:22:57.107900       1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
I0208 11:22:57.107911       1 task_graph.go:474] Canceled worker 0 while waiting for work                                                                                                                                                                                                                              
I0208 11:22:57.107918       1 task_graph.go:523] Workers finished
I0208 11:22:57.107925       1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
I0208 11:22:57.107938       1 sync_worker.go:1169] Summarizing 1 errors
I0208 11:22:57.107947       1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
E0208 11:22:57.107966       1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)

This is caused by locks. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:

SyncWorker.Update ->
  SyncWorker.loadUpdatedPayload ->
    SyncWorker.syncPayload ->
      payloadRetriever.RetrievePayload

RetrievePayload can take 2 or 4 minutes before it times out, so CVO holds the lock for this whole wait.

The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to 2*minimum reconcile interval so it will be set to a value between 4 and 8 minutes. While in the reconciling mode, the manifest graph is split into multiple "tasks" where smaller sequences of these tasks are applied in parallel. Individual tasks in these series are iterated over and each iteration uses a consistentReporter to report status via its Update method, which also acquires the lock on the following call sequence:

SyncWorker.apply ->
  { for _, task := range tasks ... ->
    consistentReporter.Update ->
      statusWrapper.Report ->
        SyncWorker.updateApplyStatus ->

This leads to the following sequence:

1. apply is called with a timeout between 4 and 8 minutes
2. in parallel, SyncWorker.Update starts and acquires the lock
3. tasks under apply wait on the reporter to acquire lock
4. after 2 or 4 minutes, RetrievePayload under SyncWorker.Update times out and terminates, SyncWorker.Update terminates and releases the lock
5. tasks under apply report results after briefly acquiring the lock, start to do their thing
6. in parallel, SyncWorker.Update starts again and acquires the lock
7. further iterations over tasks under apply wait on the reporter to acquire lock
8. context passed to apply times out
9. Canceled worker 0 while waiting for work... errors
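
A minimal Go sketch (not the CVO code itself) that reproduces the contention pattern above: a long-running update holding the same mutex that the per-task reporter needs stalls every status report for the duration of the retrieval.

package main

import (
	"fmt"
	"sync"
	"time"
)

type worker struct {
	mu sync.Mutex
}

// update mirrors SyncWorker.Update holding the lock across a slow retrieval.
func (w *worker) update() {
	w.mu.Lock()
	defer w.mu.Unlock()
	time.Sleep(2 * time.Second) // stands in for a RetrievePayload call that takes minutes
}

// reportTask mirrors the consistentReporter briefly taking the lock per task.
func (w *worker) reportTask(i int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	fmt.Printf("task %d reported at %s\n", i, time.Now().Format("15:04:05.000"))
}

func main() {
	w := &worker{}
	go w.update()
	time.Sleep(100 * time.Millisecond) // let update grab the lock first
	for i := 0; i < 3; i++ {
		w.reportTask(i) // each report blocks until update releases the lock
	}
}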

Version-Release number of selected component (if applicable):

4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896

How reproducible:

always in a certain cluster configuration

Steps to Reproduce:

1. in a disconnected cluster, upgrade to an unreachable payload image with --force
2. observe the CVO log

Actual results:

CVO starts to fail reconciling manifests

Expected results:

no failures, cluster continues to try retrieving the image but no interference with manifest reconciliation

Additional info:

This problem was discovered by Evgeni Vakhonin while testing fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22

https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still gets CVO into a better shape - previously the RetrievePayload could be running for a much longer time (hours), preventing the CVO from working at all.

When the cluster gets into this buggy state, the solution is to abort the upgrade that fails to verify or download.

Description of problem:

   Due to RHEL9 incorporating OpenSSL 3.0, HaProxy will refuse to start if provided with a cert using SHA1-based signature algorithm. RHEL9 is being introduced in 4.16. This means customers updating from 4.15 to 4.16 with a SHA1 cert will find their router in a failure state.


My Notes from experimenting with various ways of using a cert in ingress:
- Routes with SHA1 spec.tls.certificate WILL prevent HaProxy from reloading/starting
- It is NOT limited to FIPS; I broke a non-FIPS cluster with this
- Routes with SHA1 spec.tls.caCertificate will NOT prevent HaProxy starting, but route is rejected, due to extended route validation failure:
    - lastTransitionTime: "2024-01-04T20:18:01Z"
      message: 'spec.tls.certificate: Invalid value: "redacted certificate data":
        error verifying certificate: x509: certificate signed by unknown authority
        (possibly because of "x509: cannot verify signature: insecure algorithm SHA1-RSA
        (temporarily override with GODEBUG=x509sha1=1)" while trying to verify candidate
        authority certificate "www.exampleca.com")'

- Routes with SHA1 spec.tls.destinationCACertificate will NOT prevent HaProxy from starting. It actually seems to work as expected
- IngressController with SHA1 spec.defaultCertificate WILL prevent HaProxy from starting.
- IngressController with SHA1 spec.clientTLS.clientCA will NOT prevent HaProxy from starting.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%    

Steps to Reproduce:

    1. Create a Ingress Controller with spec.defaultCertificate or a Route with spec.tls.certificate as a SHA1 cert
    2. Roll out the router   

Actual results:

    Router fails to start

Expected results:

    Router should start

Additional info:

    We've previously documented via story in RHEL9 epic: https://issues.redhat.com/browse/NE-1449

The initial fix for this issue was merged as [https://github.com/openshift/router/pull/555]. That fix is currently causing some problems, notably causing the openshift/cluster-ingress-operator repository's {{TestRouteAdmissionPolicy}} E2E test to fail intermittently, which causes the e2e-azure, e2e-gcp-operator, and e2e-aws-operator CI jobs to fail intermittently.

Note: In the solution, we only intend to reject **routes** with SHA1 cert on spec.tls.certificate. Ingress Controller with SHA1 cert on spec.defaultCertificate will NOT be rejected.
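
For illustration, a small Go sketch (not the router's or the operator's own validation) showing how a SHA1-signed certificate can be detected with crypto/x509 ahead of an update:

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// isSHA1Signed reports whether the first certificate in the PEM data uses a SHA1-based signature.
func isSHA1Signed(pemBytes []byte) (bool, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return false, fmt.Errorf("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	switch cert.SignatureAlgorithm {
	case x509.SHA1WithRSA, x509.ECDSAWithSHA1, x509.DSAWithSHA1:
		return true, nil
	default:
		return false, nil
	}
}

func main() {
	pemBytes, err := os.ReadFile("tls.crt") // path is illustrative
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	sha1Signed, err := isSHA1Signed(pemBytes)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("SHA1-signed:", sha1Signed)
}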

Description of problem:

    Remove react-helmet from the list of shared modules in console/frontend/packages/console-dynamic-plugin-sdk/src/shared-modules.ts (noted as deprecated from last release)

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/145

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Check scripts for the on-premise keepalived static pods only check haproxy, which only directs to the kube-apiserver pods. They do not take into consideration whether the control plane node has a healthy machine-config-server.

This may be a problem because, in a failure scenario, it may be required to rebuild nodes, and the machine-config-server is required for that (so that ignition configs are served).

One example is the etcd restore procedure (https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html). In our case, the following happened (I'd suggest reading the recovery procedure before this sequence of events):
- Machine config server was healthy on the recovery control plane node but not on the other hosts.
- At this point, we can only guarantee the health of the recovery control plane node because the non-recovery ones are to be replaced and must be removed first from the cluster (node objects deleted) so that OVN-Kubernetes control plane can work properly.
- The keepalived check scripts were succeeding in the non-recovery control plane nodes because their haproxy pods were up and running. That is fine from kube-apiserver point of view, actually, but does not take machine config server into consideration.
- As the machine-config-server was not reachable, provision of the new masters required by the procedure was impossible.

In parallel to this bug, I'll be raising another bug to improve the restore procedure. Basically, asking to stop the keepalived static pods on the non-recovery control plane nodes. This would prevent the exact situation above.

However, there are other situations where machine-config-server pods may be unhealthy and we should not just be manually stopping keepalived. In such cases, keepalived should take machine-config-server into consideration.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Under some failure scenarios, where machine-config-server is not healthy in one control plane node.

Steps to Reproduce:

1. Try to provision new machine for recovery.
2.
3.

Actual results:

Machine-config-server not serving because keepalived assigned the VIP to one node that doesn't have a working machine-config-server pod.

Expected results:

Keepalived to take machine-config-server health into consideration while doing failover.

Additional info:

Possible ideas to fix:
- Create a check script for the machine-config-server check; it may have less weight than the kube-apiserver ones (a rough sketch follows this list).
- Include machine-config-server endpoint in the haproxy of the kube-apiservers.
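
A rough Go sketch of the first idea. The real keepalived check script would be shell; the probe here only does a TCP connect to the standard machine-config-server port (22623), and the non-zero exit on failure is what a keepalived track_script would key off.

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "localhost:22623", 2*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, "machine-config-server not reachable:", err)
		os.Exit(1) // keepalived treats a non-zero exit as a failed check
	}
	conn.Close()
}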

Please review the following PR: https://github.com/openshift/installer/pull/7819

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

OCP 4.14.5
multicluster-engine.v2.4.1
advanced-cluster-management.v2.9.0

Attempt to run the create a spoke cluster:

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  creationTimestamp: "2023-12-08T16:59:25Z"
  finalizers:
  - agentclusterinstall.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: infraenv-spoke
  namespace: infraenv-spoke
  ownerReferences:
  - apiVersion: hive.openshift.io/v1
    kind: ClusterDeployment
    name: infraenv-spoke
    uid: 34f1fe43-2af2-4880-b4ca-fb9ab8df13df
  resourceVersion: "3468594"
  uid: 79a42bdf-db1f-4500-b689-8b3813bd27a6
spec:
  clusterDeploymentRef:
    name: infraenv-spoke
  imageSetRef:
    name: 4.14-test
  networking:
    clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    serviceNetwork:
    - 172.30.0.0/16
    userManagedNetworking: true
  provisionRequirements:
    controlPlaneAgents: 3
    workerAgents: 2
status:
  conditions:
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: SyncOK
    reason: SyncOK
    status: "True"
    type: SpecSynced
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The cluster is not ready to begin the installation
    reason: ClusterNotReady
    status: "False"
    type: RequirementsMet
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: 'The cluster''s validations are failing: '
    reason: ValidationsFailing
    status: "False"
    type: Validated
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation has not yet started
    reason: InstallationNotStarted
    status: "False"
    type: Completed
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation has not failed
    reason: InstallationNotFailed
    status: "False"
    type: Failed
  - lastProbeTime: "2023-12-08T16:59:30Z"
    lastTransitionTime: "2023-12-08T16:59:30Z"
    message: The installation is waiting to start or in progress
    reason: InstallationNotStopped
    status: "False"
    type: Stopped
  debugInfo:
    eventsURL: https://assisted-service-rhacm.apps.sno-0.qe.lab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZjU4MjVmMTctNTg0OS00OTljLWE1NDctNjJmMDc4ZDU3MDJiIn0.qpSeZuqLwZ3cr3qn6AZo665o1ANp45YVE6IWUv7Gdn1RmapG4HZaxsUUY4iswkRMiqIfka_pLHFnBeVzXSTbrg&cluster_id=f5825f17-5849-499c-a547-62f078d5702b
    logsURL: ""
    state: insufficient
    stateInfo: Cluster is not ready for install
  platformType: None
  progress:
    totalPercentage: 0
  userManagedNetworking: true
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  creationTimestamp: "2023-12-08T16:59:26Z"
  finalizers:
  - infraenv.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: infraenv-spoke
  namespace: infraenv-spoke
  resourceVersion: "3468794"
  uid: 6254bbb3-5531-4665-bb78-f073b439b023
spec:
  clusterRef:
    name: infraenv-spoke
    namespace: infraenv-spoke
  cpuArchitecture: s390x
  ipxeScriptType: ""
  nmStateConfigLabelSelector: {}
  pullSecretRef:
    name: infraenv-spoke-pull-secret
status:
  agentLabelSelector:
    matchLabels:
      infraenvs.agent-install.openshift.io: infraenv-spoke
  bootArtifacts:
    initrd: ""
    ipxeScript: ""
    kernel: ""
    rootfs: ""
  conditions:
  - lastTransitionTime: "2023-12-08T16:59:51Z"
    message: 'Failed to create image: cannot use Minimal ISO because it''s not compatible
      with the s390x architecture on version 4.14.6 of OpenShift'
    reason: ImageCreationError
    status: "False"
    type: ImageCreated
  debugInfo:
    eventsURL: ""



 oc get clusterimagesets.hive.openshift.io 4.14-test -o yaml
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  creationTimestamp: "2023-12-08T18:11:29Z"
  generation: 1
  name: 4.14-test
  resourceVersion: "3514589"
  uid: 32e2ba8d-6bb7-4e4b-b3a5-63fa8224d144
spec:
  releaseImage: registry.ci.openshift.org/ocp-s390x/release-s390x@sha256:f024a617c059bf2cbf4a669c2a19ab4129e78a007c6863b64dd73a413c0bdf46


oc get agentserviceconfigs.agent-install.openshift.io agent -o yaml
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  creationTimestamp: "2023-12-08T18:10:42Z"
  finalizers:
  - agentserviceconfig.agent-install.openshift.io/ai-deprovision
  generation: 1
  name: agent
  resourceVersion: "3514534"
  uid: ef204896-25f1-4ff3-ae60-c80c2f45cd30
spec:
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  mirrorRegistryRef:
    name: mirror-registry-ca
  osImages:
  - cpuArchitecture: x86_64
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/4.14/latest/rhcos-live.x86_64.iso
    version: "4.14"
  - cpuArchitecture: arm64
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/aarch64/dependencies/rhcos/4.14/latest/rhcos-live.aarch64.iso
    version: "4.14"
  - cpuArchitecture: ppc64le
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/ppc64le/dependencies/rhcos/4.14/latest/rhcos-live.ppc64le.iso
    version: "4.14"
  - cpuArchitecture: s390x
    openshiftVersion: "4.14"
    rootFSUrl: ""
    url: https://mirror.openshift.com/pub/openshift-v4/s390x/dependencies/rhcos/4.14/latest/rhcos-live.s390x.iso
    version: "4.14"
status:
  conditions:
  - lastTransitionTime: "2023-12-08T18:10:42Z"
    message: AgentServiceConfig reconcile completed without error.
    reason: ReconcileSucceeded
    status: "True"
    type: ReconcileCompleted
  - lastTransitionTime: "2023-12-08T18:11:23Z"
    message: All the deployments managed by Infrastructure-operator are healthy.
    reason: DeploymentSucceeded
    status: "True"
    type: DeploymentsHealthy



Description of problem:

Looking at the code snippet at line 198, the wg.Add(1) should be moved closer to the function it is waiting for (line 226).

Having another function in between that could exit early could leave the controller in a state where it is waiting for a deferred wg.Done() that will never run, meaning that the controller will never terminate.
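
A minimal Go sketch (hypothetical, not the controller's actual code) of the pattern being described: wg.Add(1) should sit immediately before the goroutine whose deferred wg.Done() balances it, so that no intermediate call can bail out after the Add but before the goroutine starts.

package main

import "sync"

func run(setup func() error, work func()) error {
	var wg sync.WaitGroup

	// Buggy shape (shown as a comment):
	//   wg.Add(1)
	//   if err := setup(); err != nil {
	//       return err // the goroutine below never starts, so a later wg.Wait() hangs
	//   }

	if err := setup(); err != nil {
		return err
	}

	wg.Add(1) // correct: Add right before launching the work it accounts for
	go func() {
		defer wg.Done()
		work()
	}()

	wg.Wait()
	return nil
}

func main() {
	_ = run(func() error { return nil }, func() {})
}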

Version-Release number of selected component (if applicable):

Found on the master branch while cross-referencing errors/logs for a cluster.

How reproducible:

Not reproducible.

Additional info:

Not required: a resolution has already been found

  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
    • Done

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/112

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Kube apiserver pod keeps crashing when tested against v1.29 rebase

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Run hypershift e2e agains v1.29 rebase
    2.
    3.
    

Actual results:

    Fails https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1810-periodics-e2e-aws-ovn/1732734494688940032

Expected results:

    Succeeds

Additional info:

    Kube apiserver pod is crashlooping with:
E1208 21:17:06.619997 1 run.go:74] "command failed" err="group version flowcontrol.apiserver.k8s.io/v1alpha1 that has not been registered"

Description of problem:

1. [sig-network][Feature:EgressFirewall] egressFirewall should have no impact outside its namespace [Suite:openshift/conformance/parallel] 
2. [sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]

The issue arises during the execution of the above tests and appears to be related to the image in use, specifically, the image located at https://quay.io/repository/redhat-developer/nfs-server?tab=tags&tag=1.1 (quay.io/redhat-developer/nfs-server:1.1). 
This image does not include the 'ping' executable for the s390x architecture, leading to the following error in the prow job logs:
...
msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6mg9v --kubeconfig=/tmp/configfile3768380277 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-11T19:04:52Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n"
...

Our suggested fix: build a new s390x image that contains the ping binary.

Version-Release number of selected component (if applicable):

 

How reproducible:

The issue is reproducible when the test container (quay.io/redhat-developer/nfs-server:1.1) is scheduled on an s390x node, leading to test failures.

Steps to Reproduce:

1. Have a multi-arch cluster (x86 + s390x day-2 worker node attached)
2. Execute the two tests
3. Try a few times so that the pod gets assigned to the s390x node

Actual results from prow job:

Run #0: Failed expand_less30s{  fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:70]: Unexpected error:
    <*fmt.wrapError | 0xc005924300>: 
    Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:
    StdOut>
    time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
    command terminated with exit code 255
    StdErr>
    time="2023-10-12T07:17:02Z" level=error msg="exec failed: unable to start container process: exec: \"ping\": executable file not found in $PATH"
    command terminated with exit code 255
    exit status 255
    
    {
        msg: "Error running /usr/bin/oc --namespace=e2e-test-no-egress-firewall-e2e-6r9zh --kubeconfig=/tmp/configfile3961753222 exec dummy -- ping -c 1 8.8.8.8:\nStdOut>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nStdErr>\ntime=\"2023-10-12T07:17:02Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"ping\\\": executable file not found in $PATH\"\ncommand terminated with exit code 255\nexit status 255\n",
        err: <*exec.ExitError | 0xc0059242e0>{
            ProcessState: {
                pid: 78611,
                status: 65280,
                rusage: {
                    Utime: {Sec: 0, Usec: 168910},
                    Stime: {Sec: 0, Usec: 60897},
                    Maxrss: 206428,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 4199,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 753,
                    Nivcsw: 149,
                },
            },
            Stderr: nil,
        },
    }
occurred
Ginkgo exit error 1: exit with code 1}

Expected results:

Passed

Additional info:

This issue pertains to a specific bug on the s390x architecture and additionally impacts the libvirt-s390x prow job.

Description of problem:

When trying to delete a ClusterResourceQuota resource using the foreground deletion cascading strategy, the resource gets stuck in that state and the removal never completes.

Once the background deletion cascading strategy is used, it is removed immediately.

Now, given that OpenShift GitOps uses the foreground deletion cascading strategy by default, this exposes some challenges when managing ClusterResourceQuota resources with OpenShift GitOps.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-23-223425 but also previous version of OpenShift Container Platform 4 are affected

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4
2. Create the ClusterResourceQuota as shown below

$ bat -p /tmp/crq.yaml
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  creationTimestamp: null
  name: blue
spec:
  quota:
    hard:
      pods: "10"
      secrets: "20"
  selector:
    annotations: null
    labels:
      matchLabels:
        color: nocolor

3. Delete the ClusterResourceQuota using "oc delete --cascade=foreground clusterresourcequota blue"

Actual results:

$ oc delete --cascade=foreground clusterresourcequota blue
clusterresourcequota.quota.openshift.io "blue" deleted

Is stuck and won't finish, the resource looks as shown below.

$ oc get clusterresourcequota blue -o yaml
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"quota.openshift.io/v1","kind":"ClusterResourceQuota","metadata":{"annotations":{},"creationTimestamp":null,"name":"blue"},"spec":{"quota":{"hard":{"pods":"10","secrets":"20"}},"selector":{"annotations":null,"labels":{"matchLabels":{"color":"nocolor"}}}}}
  creationTimestamp: "2023-10-24T07:37:48Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-10-24T07:59:47Z"
  finalizers:
  - foregroundDeletion
  generation: 2
  name: blue
  resourceVersion: "60554"
  uid: c18dd92c-afeb-47f4-a944-8b55be4037d7
spec:
  quota:
    hard:
      pods: "10"
      secrets: "20"
  selector:
    annotations: null
    labels:
      matchLabels:
        color: nocolor

Expected results:

The ClusterResourceQuota should be deleted using the foreground deletion cascading strategy without getting stuck, as there does not appear to be any OwnerReference still around and blocking removal.

Additional info:


Please review the following PR: https://github.com/openshift/prometheus-operator/pull/270

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/97

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/29

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

There is an oauthclients degraded condition that never gets removed, meaning that once it is set due to an issue on a cluster, it won't be unset.

Version-Release number of selected component (if applicable):

    

How reproducible:

Sporadically, when the AuthStatusHandlerFailedApply condition is set on the console operator status conditions.

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Upstream, we started using the ValidatingAdmissionPolicy API to enforce package uniqueness for ClusterExtension.

This API is still not enabled by default in OCP 4.16 (Kubernetes 1.29), but it should be enabled with TechPreviewNoUpgrade. For some reason our E2E CI job fails despite the fact that we are running it with tech preview.

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-22293. The following is the description of the original issue:

Description of problem:

Upgrading from 4.13.5 to 4.13.17 fails at network operator upgrade

Version-Release number of selected component (if applicable):

 

How reproducible:

Not sure since we only had one cluster on 4.13.5.

Steps to Reproduce:

1. Have a cluster on version 4.13.5 witn ovn kubernetes
2. Set desired update image to quay.io/openshift-release-dev/ocp-release@sha256:c1f2fa2170c02869484a4e049132128e216a363634d38abf292eef181e93b692
3. Wait until it reaches network operator

Actual results:

Error message: Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: failed to apply / update (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: DaemonSet.apps "ovnkube-master" is invalid: [spec.template.spec.containers[1].lifecycle.preStop: Required value: must specify a handler type, spec.template.spec.containers[3].lifecycle.preStop: Required value: must specify a handler type]

Expected results:

Network operator upgrades successfully

Additional info:

Since I'm not able to attach files please gather all required debug data from https://access.redhat.com/support/cases/#/case/03645170

 

Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/94

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Trying to run without the --node-upgrade-type param fails with "spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\"",


although --help documents it as having a default value of 'InPlace'.

Version-Release number of selected component (if applicable):

 [kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: af9c0b3ce9c612ec738762a8df893c7598cbf157. Latest supported OCP: 4.15.0

How reproducible:

  happens all the time   

Steps to Reproduce:

    1.on an hosted cluster setup run :
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help
Creates basic functional NodePool resources for Agent platformUsage:
  hcp create nodepool agent [flags]Flags:
  -h, --help   help for agentGlobal Flags:
      --cluster-name string             The name of the HostedCluster nodes in this pool will join. (default "example")
      --name string                     The name of the NodePool.
      --namespace string                The namespace in which to create the NodePool. (default "clusters")
      --node-count int32                The number of nodes to create in the NodePool. (default 2)
      --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
      --release-image string            The release image for nodes; if this is empty, defaults to the same release image as the HostedCluster.
      --render                          Render output as YAML to stdout instead of applying.
     

2.try to run with default value of --node-upgrade-type:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2


Actual results:

[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
2024-02-06T19:57:03+02:00       ERROR   Failed to create nodepool       {"error": "NodePool.hypershift.openshift.io \"nodepool-of-extra1\" is invalid: spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
        /home/kni/hypershift_working/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
        /home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
        /home/kni/hypershift_working/hypershift/product-cli/main.go:60
runtime.main
        /home/kni/hypershift_working/go/src/runtime/proc.go:250
Error: NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
    

Expected results:

   should pass, just as it does when you add the param explicitly:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type InPlace
NodePool nodepool-of-extra1 created
[kni@ocp-edge119 ~]$ 

Additional info:

A related issue is that the --help output differs depending on whether it is used together with other parameters or not:

[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help > long.help.out
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --help > short.help.out
[kni@ocp-edge119 ~]$ diff long.help.out short.help.out 
14c14
<       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
---
>       --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace
[kni@ocp-edge119 ~]$ 


 

Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help to drain/maintain a node and recover without manual intervention when multiple nodes or pod instances are misbehaving.

to prevent possible issues similar to https://issues.redhat.com//browse/OCPBUGS-23796
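
For reference, the upstream mechanism for this is the spec.unhealthyPodEvictionPolicy field on a policy/v1 PodDisruptionBudget. A minimal Go sketch, assuming the upstream client API types (names like "example-pdb" and the selector are illustrative), of a PDB that opts in to evicting unhealthy pods:

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	minAvailable := intstr.FromInt(2)
	policy := policyv1.AlwaysAllow // evict not-ready pods even when no disruptions are allowed
	pdb := policyv1.PodDisruptionBudget{
		TypeMeta:   metav1.TypeMeta{APIVersion: "policy/v1", Kind: "PodDisruptionBudget"},
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "example"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable:               &minAvailable,
			Selector:                   &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
			UnhealthyPodEvictionPolicy: &policy,
		},
	}
	out, _ := yaml.Marshal(pdb)
	fmt.Println(string(out)) // prints the manifest that would be applied
}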

Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/37

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The kubelet is running with `unconfined_service_t`. It should run as `kubelet_exec_t`. This is causing all our plugins to fail because of SELinux denials.

sh-5.1# ps -AZ | grep kubelet
system_u:system_r:unconfined_service_t:s0 8719 ? 00:24:50 kubelet

This issue was previously observed and resolved in 4.14.10. 

Version-Release number of selected component (if applicable):

OCP 4.15

How reproducible:

Run ps -AZ | grep kubelet to see kubelet running with wrong label

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Kubelet is running as unconfined_service_t

Expected results:

    Kubelet should run as kubelet_exec_t

Additional info:

    

Two quality-of-life improvements for e2e charts:

  • wrap tooltips in e2e chart to make sure that tooltips don't overflow X when a long label is used.
  • copy segment label contents to clipboard when its clicked

Description of problem:
gatewayConfig.ipForwarding allows any invalid value, but it should enforce "", "Restricted" or "Global".
 
You can currently even do really funky stuff with that:

oc edit network.operator/cluster
(...)
 15 spec:                                                                                                                   
 16   clusterNetwork:                                                                                                       
 17   - cidr: 10.128.0.0/14                                                                                                 
 18     hostPrefix: 23                                                                                                      
 19   - cidr: fd01::/48                                                                                                     
 20     hostPrefix: 64                                                                                                      
 21   defaultNetwork:                                                                                                       
 22     ovnKubernetesConfig:                                                                                                
 23       egressIPConfig: {}                                                                                                
 24       gatewayConfig:                                                                                                    
 25         ipForwarding: $(echo 'Im injected'; lscpu)
$ oc get pods -n openshift-ovn-kubernetes ovnkube-node-24628 -o yaml | grep sysctl -C5
      fi

      # If IP Forwarding mode is global set it in the host here.
      ip_forwarding_flag=
      if [ "$(echo 'Im injected'; lscpu)" == "Global" ]; then
        sysctl -w net.ipv4.ip_forward=1
        sysctl -w net.ipv6.conf.all.forwarding=1
      else
        ip_forwarding_flag="--disable-forwarding"
      fi

      NETWORK_NODE_IDENTITY_ENABLE=
$ oc logs -n openshift-ovn-kubernetes ovnkube-node-24628 -c ovnkube-controller | grep inje -A5
++ echo 'Im injected'
++ lscpu
+ '[' 'Im injected
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             112

I wouldn't consider this a security issue, because I have to be the admin to do that, and as the admin I can also simply modify the pod, but it's not very elegant to allow for this sort of code injection, even by the admin.
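
A minimal Go sketch (not the cluster-network-operator's actual validation) of the kind of allow-list check that would reject a value like the injected string above before it is templated into the node startup script:

package main

import (
	"fmt"
	"os"
)

// validateIPForwarding accepts only the documented values for gatewayConfig.ipForwarding.
func validateIPForwarding(v string) error {
	switch v {
	case "", "Restricted", "Global":
		return nil
	default:
		return fmt.Errorf("invalid gatewayConfig.ipForwarding %q: must be \"\", \"Restricted\" or \"Global\"", v)
	}
}

func main() {
	if err := validateIPForwarding(`$(echo 'Im injected'; lscpu)`); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}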

Description of problem:

Hosted control plane kube scheduler pods crashloop on clusters created with Kube 1.29 rebase     

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Create a hosted cluster using 4.16 kube rebase code base
    2. Wait for the cluster to come up
    

Actual results:

    Cluster never comes up because kube scheduler pod crashloops

Expected results:

    Cluster comes up

Additional info:

    The kube scheduler configuration generated by the control plane operator is using the v1beta3 version of the configuration. That version is no longer included in Kubernetes v1.29

PR https://github.com/openshift/monitoring-plugin/pull/83 was intended to only modify the images built for local testing, but it accidentally changed the default Dockerfile, leading to a mismatch between the nginx config and the Dockerfile used in CI. This causes the monitoring-plugin to fail to load in CI builds.

Description of problem:

    capi-based installer failing with missing openshift-cluster-api namespace

Version-Release number of selected component (if applicable):

    

How reproducible:

Always in CustomNoUpgrade

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Install failure

Expected results:

    namespace create, install succeeds or does not error on missing namespace

Additional info:

    

As a maintainer of the HyperShift repo, I would like to remove unused functions from the code base to reduce the code footprint of the repo.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/90

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.15
Start Time: 2024-02-08T00:00:00Z
End Time: 2024-02-14T23:59:59Z
Success Rate: 91.30%
Successes: 63
Failures: 6
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 735
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20upgrade-micro%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-02-14%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-02-08%2000%3A00%3A00&testId=openshift-tests-upgrade%3A37f1600d4f8d75c47fc5f575025068d2&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers&upgrade=upgrade-micro&upgrade=upgrade-micro&variant=standard&variant=standard

Note: When you look at the link above you will notice some of the failures mention the bare metal operator.  That's being investigated as part of https://issues.redhat.com/browse/OCPBUGS-27760.  There have been 3 cases in the last week where the console was in a fail loop.  Here's an example:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1757637415561859072

 

We need help understanding why this is happening and what needs to be done to avoid it.

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/130

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CCO reports credsremoved mode in metrics when the cluster is actually in the default mode. 
See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/47349/rehearse-47349-pull-ci-openshift-cloud-credential-operator-release-4.16-e2e-aws-qe/1744240905512030208 (OCP-31768). 

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always. 

Steps to Reproduce:

1. Create an AWS cluster with CCO in the default mode (which ends up as mint)
2. Get the value of the cco_credentials_mode metric
    

Actual results:

credsremoved    

Expected results:

mint    

Root cause:

The controller-runtime client used in metrics calculator (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L77) is unable to GET the root credentials Secret (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L184) since it is backed by a cache which only contains target Secrets requested by other operators (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/cmd/operator/cmd.go#L164-L168).
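A minimal sketch of one possible fix direction, assuming the metrics calculator has access to the controller-runtime manager: read the root credentials Secret with the manager's live API reader instead of the cache-backed client (function and argument names below are illustrative, not the actual CCO code):

package metricsfix

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rootSecret fetches the cloud root credentials Secret with mgr.GetAPIReader(),
// an uncached reader, bypassing the cache that only holds the target Secrets
// requested by other operators.
func rootSecret(ctx context.Context, mgr ctrl.Manager, name, namespace string) (*corev1.Secret, error) {
	var reader client.Reader = mgr.GetAPIReader()
	secret := &corev1.Secret{}
	key := types.NamespacedName{Name: name, Namespace: namespace}
	if err := reader.Get(ctx, key, secret); err != nil {
		return nil, err
	}
	return secret, nil
}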

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/94

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/160

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Enable the installer's AWS SDK install path and create a C2S cluster; the install hits the following fatal error:

level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create bootstrap resources: failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role): RequestCanceled: request context canceled


    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-03-140457
4.16.0-0.nightly-2024-01-03-193825
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable AWS SDK install and create a C2S cluster
    2.
    3.
    

Actual results:

failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role), bootstrap process failed
    

Expected results:

bootstrap process can be finished successfully.
    

Additional info:

No issue with the Terraform-based install path.
    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

A new deployment of BM IPI using a provisioning network with IPv6 is showing:

http://XXXX:XXXX:XXXX:XXXX::X:6180/images/ironic-python-agent.kernel....
connection timed out (http://ipxe.org/4c0a6092)" error

Version-Release number of selected component (if applicable):

Openshift 4.12.32
Also seen in Openshift 4.14.0-rc.5 when adding new nodes

How reproducible:

Very frequent

Steps to Reproduce:

1. Deploy cluster using BM with provided config
2.
3.

Actual results:

Consistent failures depending on the version of OCP used to deploy

Expected results:

No error, successful deployment

Additional info:

Things checked while the bootstrap host is active and the installation information is still valid (and failing):
- tried downloading the "ironic-python-agent.kernel" file from different places (bootstrap, bastion hosts, another provisioned host) and in all cases it worked:
[core@control-1-ru2 ~]$ curl -6 -v -o ironic-python-agent.kernel http://[XXXX:XXXX:XXXX:XXXX::X]:80/images/ironic-python-agent.kernel
\*   Trying XXXX:XXXX:XXXX:XXXX::X...
\* TCP_NODELAY set
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to XXXX:XXXX:XXXX:XXXX::X (xxxx:xxxx:xxxx:xxxx::x) port 80   #0)
> GET /images/ironic-python-agent.kernel HTTP/1.1
> Host: [xxxx:xxxx:xxxx:xxxx::x]
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 27 Oct 2023 08:28:09 GMT
< Server: Apache
< Last-Modified: Thu, 26 Oct 2023 08:42:16 GMT
< ETag: "a29d70-6089a8c91c494"
< Accept-Ranges: bytes
< Content-Length: 10657136
<
{ [14084 bytes data]
100 10.1M  100 10.1M    0     0   597M      0 --:--:-- --:--:-- --:--:--  597M
\* Connection #0 to host xxxx:xxxx:xxxx:xxxx::x left intact

This verifies some of the components like the network setup and the httpd service running on ironic pods.

- Also gathered a listing of the contents of the ironic pod running in podman, especially the shared directory. The contents of /shared/html/inspector.ipxe seem correct compared to a working installation, and all files look to be in place.

- Logs from the ironic container show the errors coming from the node being deployed; the curl request is also shown here for comparison:

xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx::x - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 200 10657136 "-" "curl/7.61.1"
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"

This seems like an issue with iPXE and IPv6.

 

 

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/102

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Cluster install fails on IBM Cloud and machine-api-controllers is stuck in CrashLoopBackOff

Version-Release number of selected component (if applicable):

from 4.16.0-0.nightly-2024-02-02-224339

How reproducible:

Always

Steps to Reproduce:

    1. Install cluster on IBMCloud
    2.
    3.
    

Actual results:

Cluster install failed
$ oc get node                       
NAME                     STATUS   ROLES                  AGE     VERSION
maxu-16-gp2vp-master-0   Ready    control-plane,master   7h11m   v1.29.1+2f773e8
maxu-16-gp2vp-master-1   Ready    control-plane,master   7h11m   v1.29.1+2f773e8
maxu-16-gp2vp-master-2   Ready    control-plane,master   7h11m   v1.29.1+2f773e8

$ oc get machine -n openshift-machine-api           
NAME                           PHASE   TYPE   REGION   ZONE   AGE
maxu-16-gp2vp-master-0                                        7h15m
maxu-16-gp2vp-master-1                                        7h15m
maxu-16-gp2vp-master-2                                        7h15m
maxu-16-gp2vp-worker-1-xfvqq                                  7h5m
maxu-16-gp2vp-worker-2-5hn7c                                  7h5m
maxu-16-gp2vp-worker-3-z74z2                                  7h5m

openshift-machine-api                              machine-api-controllers-6cb7fcdcdb-k6sv2                     6/7     CrashLoopBackOff   92 (31s ago)     7h1m

$ oc logs -n openshift-machine-api  -c  machine-controller  machine-api-controllers-6cb7fcdcdb-k6sv2                          
I0204 10:53:34.336338       1 main.go:120] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x285fe72]
goroutine 25 [running]:
k8s.io/klog/v2/textlogger.(*tlogger).Enabled(0x0?, 0x0?)
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/k8s.io/klog/v2/textlogger/textlogger.go:81 +0x12
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc000438100, 0x0?)
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:114 +0x92
github.com/go-logr/logr.Logger.Info({{0x3232210?, 0xc000438100?}, 0x0?}, {0x2ec78f3, 0x17}, {0x0, 0x0, 0x0})
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/github.com/go-logr/logr/logr.go:276 +0x72
sigs.k8s.io/controller-runtime/pkg/metrics/server.(*defaultServer).Start(0xc0003bd2c0, {0x322e350?, 0xc00058a140})
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go:185 +0x75
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002c4540)
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223 +0xc8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 24
	/go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:207 +0x19d

Expected results:

 Cluster install succeeds

Additional info:

This may be related to this PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/34
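If the root cause is the uninitialized logr sink seen in the stack trace, one hedged fix sketch (not necessarily what the linked PR does) is to set a concrete controller-runtime logger before the manager starts:

package main

import (
	"k8s.io/klog/v2/textlogger"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Setting a concrete logger means controller-runtime's delegating log
	// sink never forwards calls into a nil textlogger, which is what the
	// SIGSEGV in (*tlogger).Enabled above points at.
	ctrl.SetLogger(textlogger.NewLogger(textlogger.NewConfig()))

	// ... construct and start the manager as usual ...
}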

Description of problem:

    The code requires the `s3:HeadBucket` permission (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/aws/utils.go#L57), but that IAM action doesn't exist (the warning output below also shows `s3:HeadBucket`). The AWS docs say the permission needed for the HeadBucket call is `s3:ListBucket`: https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadBucket.html

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Try to install cluster with minimal permissions without s3:HeadBucket
    2.
    3.
    

Actual results:

level=warning msg=Action not allowed with tested creds action=iam:DeleteUserPolicy
level=warning msg=Tested creds not able to perform all requested actions
level=warning msg=Action not allowed with tested creds action=s3:HeadBucket
level=warning msg=Tested creds not able to perform all requested actions
level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: AWS credentials cannot be used to either create new creds or use as-is
Installer exit with code 1

Expected results:

    Only `s3:ListBucket` should be checked.

Additional info:

    

Description of problem:

The Go docs for the install-config's platform.aws.lbType field are misleading, as are those on the ingress object (oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type).

Both say:
"When this field is specified, the default ingresscontroller will be created using the specified load-balancer type."

That is true, but what is missing is that ALL ingresscontrollers will be created using the specified load-balancer type by default (not just the default ingresscontroller).

This missing information can be confusing to users.

Version-Release number of selected component (if applicable):

    4.12+

How reproducible:

    100%

Steps to Reproduce:

openshift-install explain installconfig.platform.aws.lbType
- or -
oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type   

Actual results:

./openshift-install explain installconfig.platform.aws.lbType
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>

LBType is an optional field to specify a load balancer type. When this field is specified, the default ingresscontroller will be created using the specified load-balancer type. 

...
[same with ingress.spec.loadBalancer.platform.aws.type]

Expected results:

My suggestion:

./openshift-install explain installconfig.platform.aws.lbType
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>

LBType is an optional field to specify a load balancer type. When this field is specified, all ingresscontrollers (including the default ingresscontroller) will be created using the specified load-balancer type by default.

...
[same with ingress.spec.loadBalancer.platform.aws.type]

Additional info:

    Since the change is the same for both the install-config and the ingress object, this bug handles both.
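A sketch of how the godoc on the field could read; the field name and struct placement below are assumed from the install-config output above, not taken from the actual API source:

package aws

// Platform is a trimmed-down illustration; only the field in question is shown.
type Platform struct {
	// LBType is an optional field to specify a load balancer type.
	// When this field is specified, all ingresscontrollers (including the
	// default ingresscontroller) will be created using the specified
	// load-balancer type by default.
	LBType string `json:"lbType,omitempty"`
}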

Description of problem:

    Since OCP 4.15 we see an issue where an OLM-deployed operator is unable to operate in multiple watched namespaces. It works fine with a single watched namespace (subscription). The same test also passes if we deploy the operator from files instead of via OLM.
Based on the operator log it looks like a permission issue. The same test works fine on OCP 4.14 and older.

Version-Release number of selected component (if applicable):

Server Version: 4.15.0-ec.3
Kubernetes Version: v1.28.3+20a5764

How reproducible:

Always    

Steps to Reproduce:

    0. oc login OCP4.15
    1. git clone https://gitlab.cee.redhat.com/amq-broker/claire
    2. make -f Makefile.downstream build ARTEMIS_VERSION=7.11.4 RELEASE_TYPE=released
    3. make -f Makefile.downstream operator_test OLM_IIB=registry-proxy.engineering.redhat.com/rh-osbs/iib:636350 OLM_CHANNEL=7.11.x  TESTS=ClusteredOperatorSmokeTests TEST_LOG_LEVEL=debug DISABLE_RANDOM_NAMESPACES=true

Actual results:

    Can't deploy artemis broker custom resource in given namespace (permission issue - see details below) 

Expected results:

    Successfully deployed broker on watched namespaces

Additional info:

Log from AMQ Broker operator - seems like some permission issues since 4.15

    E0103 10:04:54.425202       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemis: failed to list *v1beta1.ActiveMQArtemis: activemqartemises.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemises" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425207       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemisSecurity: failed to list *v1beta1.ActiveMQArtemisSecurity: activemqartemissecurities.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemissecurities" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425221       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "pods" in API group "" in the namespace "cluster-testsa"
W0103 10:04:54.425296       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.ActiveMQArtemisScaledown: activemqartemisscaledowns.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemisscaledowns" in API group "broker.amq.io" in the namespace "cluster-testsa"

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1980

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/689

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

console-operator is updating the OIDC status without checking the feature gate    

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Set up an OCP cluster without an external OIDC provider, using the default OAuth.

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

The OIDC-related conditions are being surfaced in the console-operator's config conditions.

Expected results:

    the OIDC related conditions should not be surfaced in the console-operator's config conditions.

Additional info:

    

Description of problem:

OpenShift Installer supports HTTP proxy configuration in a restricted environment. However, the bootstrap node doesn't seem to use the given proxy when it fetches its ignition assets.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-04-27-113605

How reproducible:

Always

Steps to Reproduce:

1. try IPI installation in a restricted/disconnected network with "publish: Internal", and without using Google Private Access 

Actual results:

The installation failed, because bootstrap node failed to fetch its ignition config.

Expected results:

The installation should succeed.

Additional info:

We previously fixed a similar issue on AWS (and Alibaba Cloud) via https://bugzilla.redhat.com/show_bug.cgi?id=2090836.
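A minimal sketch, assuming the bootstrap ignition fetch goes through a plain net/http client: honoring the cluster-wide proxy is a matter of building the transport with ProxyFromEnvironment (or an explicit proxy URL from the install-config). The URL below is a placeholder:

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// ProxyFromEnvironment makes the fetch respect HTTP_PROXY/HTTPS_PROXY/
	// NO_PROXY, which is what a restricted GCP install with "publish:
	// Internal" needs when grabbing the bootstrap ignition config.
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
	}

	resp, err := client.Get("https://example.internal/bootstrap.ign") // placeholder URL
	if err != nil {
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}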

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Support for apiVersion v1alpha1 has been removed, so the apiVersion should be upgraded to v1beta1.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Unable to view the alerts and metrics pages; they render as a blank page.

Version-Release number of selected component (if applicable):

4.15.0-nightly

How reproducible:

Always

Steps to Reproduce:

Click on any alert under "Notification Panel" to view more, and you will be redirected to the alert page.
    

Actual results:

The user is unable to view any alerts or metrics.

Expected results:

The user should be able to view all/individual alerts and metrics.

Additional info:

N.A

Description of problem:

CAPI E2Es fail to start in some CAPI providers' release branches.
They fail with the following error:

`go: errors parsing go.mod:94/tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain`

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95

Version-Release number of selected component (if applicable):

4.15    

How reproducible:

Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

  This is because the script launching the e2e is launching it from the `main` branch of the cluster-capi-operator (which has some backward-incompatible Go toolchain changes), rather than the correctly matching release branch.

Impact assessment of OCPBUGS-24009

Which 4.y.z to 4.y'.z' updates increase vulnerability?

Any upgrade up to 4.15.{current-z}

Which types of clusters?

Any non-Microshift cluster with an operator installed via OLM before upgrade to 4.15. After upgrading to 4.15, re-installing a previously uninstalled operator may also cause this issue. 

What is the impact? Is it serious enough to warrant removing update recommendations?

OLM Operators can't be upgraded and may incorrectly report failed status.

How involved is remediation?

Delete the resources associated with the OLM installation that are named in the failure message reported by the olm-operator.

A failure message similar to this may appear on the CSV:

InstallComponentFailed install strategy failed: rolebindings.rbac.authorization.k8s.io "openshift-gitops-operator-controller-manager-service-auth-reader" already exists

The following resource types have been observed to encounter this issue and should be safe to delete:

  • ClusterRoleBinding suffixed with "-system:auth-delegator"
  • Service
  • RoleBinding suffixed with "-auth-reader"

Under no circumstances should a user delete a CustomResourceDefinition (CRD) if the same error occurs and names such a resource, as data loss may occur. Note that we have not seen this type of resource named in the error from any of our users so far.

Labeling the problematic resources with olm.managed: "true" then restarting the olm-operator pod in the openshift-operator-lifecycle-manager namespace may also resolve the issue if the resource appears risky to delete.

Is this a regression?

Functionality which worked in 4.14 may break after upgrading to 4.15; however, this is not a regression in the strict sense but a new issue related to performance improvements added to OLM in 4.15.

https://issues.redhat.com/browse/OCPBUGS-24009

https://issues.redhat.com/browse/OCPBUGS-31080

https://issues.redhat.com/browse/OCPBUGS-28845

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/67

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Recently a user was attempting to change the Virtual Machine Folder for a cluster installed on vSphere. The user used the "vSphere Connection Configuration" panel to complete this process. Upon updating the path and clicking "Save Configuration", cluster-wide issues emerged, including nodes not coming back online after a reboot.

OpenShift nodes eventually crashed with an error resulting from an incorrectly parsed folder, because the surrounding string-literal quote characters were missing.

While this was exhibited on OCP 4.13, other versions may be affected.
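A hedged sketch of the parsing hazard, assuming the folder path is written into an INI-style cloud-provider config (this is illustrative, not the operator's actual rendering code): a value containing spaces must be emitted with its surrounding quotes, otherwise it is truncated at the first space when read back.

package main

import "fmt"

// renderFolder writes the vCenter VM folder into an INI-style config line.
// Quoting the value keeps paths such as "/DC/vm/My Folder" intact; without
// the quotes a consumer may only see "/DC/vm/My" after parsing.
func renderFolder(folder string) string {
	return fmt.Sprintf("folder = %q\n", folder)
}

func main() {
	fmt.Print(renderFolder("/Datacenter/vm/My Cluster Folder"))
	// Output: folder = "/Datacenter/vm/My Cluster Folder"
}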

Description of problem:

  Invalid memory address or nil pointer dereference in Cloud Network Config Controller  

Version-Release number of selected component (if applicable):

    4.12

How reproducible:

    sometimes

Steps to Reproduce:

    1. Happens by itself sometimes
    2.
    3.
    

Actual results:

    Panic and pod restarts

Expected results:

    Panics due to Invalid memory address or nil pointer dereference  should not occur

Additional info:

    E0118 07:54:18.703891 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 93 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x203c8c0?, 0x3a27b20})
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0003bd090?})
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x203c8c0, 0x3a27b20})
/usr/lib/golang/src/runtime/panic.go:884 +0x212
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).AssignPrivateIP(0xc0001ce700, {0xc000696540, 0x10, 0x10}, 0xc000818ec0)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:146 +0xcf0
github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc000986000, {0xc000896a90, 0xe})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:327 +0x1013
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc000720d80, {0x1e640c0?, 0xc0003bd090?})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc000720d80)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(0xc000504ea0?)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27b3220, 0xc000894480}, 0x1, 0xc0000aa540)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1a40b30]
goroutine 93 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0003bd090?})
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x203c8c0, 0x3a27b20})
/usr/lib/golang/src/runtime/panic.go:884 +0x212
github.com/openshift/cloud-network-config-controller/pkg/cloudprovider.(*Azure).AssignPrivateIP(0xc0001ce700, {0xc000696540, 0x10, 0x10}, 0xc000818ec0)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/cloudprovider/azure.go:146 +0xcf0
github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig.(*CloudPrivateIPConfigController).SyncHandler(0xc000986000, {0xc000896a90, 0xe})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/cloudprivateipconfig/cloudprivateipconfig_controller.go:327 +0x1013
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem.func1(0xc000720d80, {0x1e640c0?, 0xc0003bd090?})
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:152 +0x11c
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).processNextWorkItem(0xc000720d80)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:162 +0x46
github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).runWorker(0xc000504ea0?)
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:113 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x27b3220, 0xc000894480}, 0x1, 0xc0000aa540)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92 +0x25
created by github.com/openshift/cloud-network-config-controller/pkg/controller.(*CloudNetworkConfigController).Run
/go/src/github.com/openshift/cloud-network-config-controller/pkg/controller/controller.go:99 +0x3aa
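A hedged sketch of the defensive check that would avoid this crash, assuming the panic comes from dereferencing an unset field on the instance's network interface; the types below are illustrative stand-ins, not the real cloud-network-config-controller structures:

package cloudprovider

import (
	"fmt"
	"net"
)

// nicProperties is an illustrative stand-in for the Azure NIC properties the
// controller walks when assigning a private IP.
type nicProperties struct {
	IPConfigurations *[]string
}

// assignPrivateIP mirrors the shape of the failing call: every pointer in the
// chain is checked before it is dereferenced, so a NIC that came back without
// IP configurations produces an error instead of a panic.
func assignPrivateIP(ip net.IP, props *nicProperties) error {
	if props == nil || props.IPConfigurations == nil || len(*props.IPConfigurations) == 0 {
		return fmt.Errorf("cannot assign %s: network interface has no IP configurations", ip)
	}
	// ... append the IP to the existing configurations ...
	return nil
}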

User Story:

As a developer, I want to be able to:

  • Use the latest library version (v3) of `google.golang.org/api/cloudresourcemanager`, which has been extended with the APIs required for resource manager tags (see the sketch at the end of this story).

so that I can achieve

  • The tags-related implementation can be kept in line with the existing framework available for using GCP clients.

Acceptance Criteria:

Description of criteria:

  • UTs are updated, if any changes are required by the library update.
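A minimal sketch of wiring the v3 client (the project name is a placeholder, credentials come from Application Default Credentials, and this is not the repository's actual client framework):

package main

import (
	"context"
	"fmt"
	"log"

	cloudresourcemanager "google.golang.org/api/cloudresourcemanager/v3"
)

func main() {
	ctx := context.Background()

	// v3 is the version that exposes the TagKeys/TagValues surfaces needed
	// for resource manager tags.
	svc, err := cloudresourcemanager.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder project name; v3 uses the "projects/PROJECT_ID" form.
	project, err := svc.Projects.Get("projects/my-project").Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(project.DisplayName)
}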

Description of problem:

Job link: powervs » zstreams » zstream-ocp4x-powervs-london06-p9-current-upgrade #282 Console [Jenkins] (ibm.com)

Must-gather link

A long snippet from the e2e log:

external internet 09/01/23 07:26:09.624
Sep  1 07:26:09.624: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 http://www.google.com:80'
STEP: creating an egressfirewall object 09/01/23 07:26:09.903
STEP: calling oc create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml 09/01/23 07:26:09.903
Sep  1 07:26:09.904: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml'
egressfirewall.k8s.ovn.org/default created
STEP: sending traffic to control plane nodes should work 09/01/23 07:26:22.122
Sep  1 07:26:22.130: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443'
Sep  1 07:26:23.358: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:
StdOut>
command terminated with exit code 28
StdErr>
command terminated with exit code 28
[AfterEach] [sig-network][Feature:EgressFirewall]
  github.com/openshift/origin/test/extended/util/client.go:180
STEP: Collecting events from namespace "e2e-test-egress-firewall-e2e-2vvzx". 09/01/23 07:26:23.358
STEP: Found 4 events. 09/01/23 07:26:23.361
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {multus } AddedInterface: Add eth0 [10.131.0.89/23] from ovn-kubernetes
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-quay-io-redhat-developer-nfs-server-1-1-dlXGfzrk5aNo8EjC" already present on machine
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Created: Created container egressfirewall-container
Sep  1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Started: Started container egressfirewall-container
Sep  1 07:26:23.363: INFO: POD             NODE                                           PHASE    GRACE  CONDITIONS
Sep  1 07:26:23.363: INFO: egressfirewall  lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT  }]
Sep  1 07:26:23.363: INFO: 
Sep  1 07:26:23.367: INFO: skipping dumping cluster info - cluster too large
Sep  1 07:26:23.383: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-egress-firewall-e2e-2vvzx-user}, err: <nil>
Sep  1 07:26:23.398: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-egress-firewall-e2e-2vvzx}, err: <nil>
Sep  1 07:26:23.414: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~X_2HPGEj3O9hpd-3XKTckrp9bO23s_7zlJ3Tkn7ncBE}, err: <nil>
[AfterEach] [sig-network][Feature:EgressFirewall]
  github.com/openshift/origin/test/extended/util/client.go:180
STEP: Collecting events from namespace "e2e-test-no-egress-firewall-e2e-84f48". 09/01/23 07:26:23.414
STEP: Found 0 events. 09/01/23 07:26:23.416
Sep  1 07:26:23.417: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Sep  1 07:26:23.417: INFO: 
Sep  1 07:26:23.421: INFO: skipping dumping cluster info - cluster too large
Sep  1 07:26:23.446: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-no-egress-firewall-e2e-84f48-user}, err: <nil>
Sep  1 07:26:23.451: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-no-egress-firewall-e2e-84f48}, err: <nil>
Sep  1 07:26:23.457: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~2Lk8-jWfwpdyo59E9YF7kQFKH2LBUSvnbJdKj7rOzn4}, err: <nil>
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  dump namespaces | framework.go:196
STEP: dump namespace information after failure 09/01/23 07:26:23.457
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  tear down framework | framework.go:193
STEP: Destroying namespace "e2e-test-no-egress-firewall-e2e-84f48" for this suite. 09/01/23 07:26:23.457
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  dump namespaces | framework.go:196
STEP: dump namespace information after failure 09/01/23 07:26:23.462
[DeferCleanup (Each)] [sig-network][Feature:EgressFirewall]
  tear down framework | framework.go:193
STEP: Destroying namespace "e2e-test-egress-firewall-e2e-2vvzx" for this suite. 09/01/23 07:26:23.463
fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:155]: Unexpected error:
    <*fmt.wrapError | 0xc001dd50a0>: {
        msg: "Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:\nStdOut>\ncommand terminated with exit code 28\nStdErr>\ncommand terminated with exit code 28\nexit status 28\n",
        err: <*exec.ExitError | 0xc001dd5080>{
            ProcessState: {
                pid: 140483,
                status: 7168,
                rusage: {
                    Utime: {Sec: 0, Usec: 149480},
                    Stime: {Sec: 0, Usec: 19930},
                    Maxrss: 222592,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 1536,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 596,
                    Nivcsw: 173,
                },
            },
            Stderr: nil,
        },
    }
    Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:
    StdOut>
    command terminated with exit code 28
    StdErr>
    command terminated with exit code 28
    exit status 28
    
occurred
Ginkgo exit error 1: exit with code 1
failed: (18.7s) 2023-09-01T11:26:23 "[sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created  [Suite:openshift/conformance/parallel]"

Version-Release number of selected component (if applicable):

4.13.11

How reproducible:

This e2e failure is not consistently reproducible.

Steps to Reproduce:

1. Start a z-stream job via Jenkins
2. Monitor the e2e run

Actual results:

The e2e test fails

Expected results:

e2e should pass

Additional info:

 

Description of problem:

After installing some operators, the ResolutionFailed condition shows up after some time:

 

$ kubectl get subscription.operators -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ResolutionFailed:.status.conditions[?(@.type=="ResolutionFailed")].status,MSG:.status.conditions[?(@.type=="ResolutionFailed")].message'
NAMESPACE                   NAME                                                                         ResolutionFailed   MSG
infra-sso                   rhbk-operator                                                                True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
metallb-system              metallb-operator-sub                                                         True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
multicluster-engine         multicluster-engine                                                          True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
open-cluster-management     acm-operator-subscription                                                    True               [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
openshift-cnv               kubevirt-hyperconverged                                                      True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-gitops-operator   openshift-gitops-operator                                                    True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-local-storage     local-storage-operator                                                       True               [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-nmstate           kubernetes-nmstate-operator                                                  <none>             <none>
openshift-operators         devworkspace-operator-fast-redhat-operators-openshift-marketplace            <none>             <none>
openshift-operators         external-secrets-operator                                                    <none>             <none>
openshift-operators         web-terminal                                                                 <none>             <none>
openshift-storage           lvms                                                                         <none>             <none>
openshift-storage           mcg-operator-stable-4.14-redhat-operators-openshift-marketplace              <none>             <none>
openshift-storage           ocs-operator-stable-4.14-redhat-operators-openshift-marketplace              <none>             <none>
openshift-storage           odf-csi-addons-operator-stable-4.14-redhat-operators-openshift-marketplace   <none>             <none>
openshift-storage           odf-operator                                                                 <none>             <none> 

 

In the package server logs you can see that at one point the catalog source is not available; after a while the catalog source becomes available again, but the error doesn't disappear from the subscription.

Package server logs: 

time="2023-12-05T14:27:09Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.37.69:50051: connect: connection refused\"" source="{redhat-operators openshift-marketplace}"
time="2023-12-05T14:27:09Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:28:26Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:30:23Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:35:56Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:39:40Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:46:07Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:47:37Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:48:21Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:49:53Z" level=info msg="updating  

 

Version-Release number of selected component (if applicable):

4.14.3    

How reproducible:

 

Steps to Reproduce:

    1. Install an operator for example metallb
    2. Wait until the catalog pod is unavailable for some time.
    3. ResolutionFailed doesn't disappear anymore     

Actual results:

ResolutionFailed doesn't disappear from the subscription anymore.

Expected results:

ResolutionFailed disappears from the subscription.

Additional info:
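As a hedged illustration only (OLM's Subscription status uses its own condition types, not metav1.Condition, and this is not OLM source code), the general pattern for this class of bug is to remove the failure condition once resolution succeeds again:

package olmsketch

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clearResolutionFailed drops a stale ResolutionFailed condition after a
// successful resolver run, so a transient catalog outage does not stick to
// the object forever.
func clearResolutionFailed(conditions *[]metav1.Condition, resolutionSucceeded bool) {
	if resolutionSucceeded {
		apimeta.RemoveStatusCondition(conditions, "ResolutionFailed")
	}
}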

[sig-network][endpoints] admission [apigroup:config.openshift.io] [It] blocks manual creation of EndpointSlices pointing to the cluster or service network

[Suite:openshift/conformance/parallel]
github.com/openshift/origin/test/extended/networking/endpoint_admission.go:81

[FAILED] error getting endpoint controller service account 

 

We want to use the latest version of CAPO in MAPO. We need to revendor CAPO to version 0.9 before the 4.16 FF.

There are several API changes that might require Matt's help.

Description of problem:

NHC failed to watch Metal3 remediation template    

Version-Release number of selected component (if applicable):

OCP4.13 and higher

How reproducible:

    100%

Steps to Reproduce:

    1. Create Metal3RemediationTemplate
    2. Install NHC v0.7.0
    3. Create NHC with Metal3RemediationTemplate     

Actual results:

E0131 14:07:51.603803 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource "metal3remediationtemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope

E0131 14:07:59.912283 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3Remediation: unknown

W0131 14:08:24.831958 1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource

Expected results:

    No errors

Additional info:
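A hedged sketch of the kubebuilder RBAC markers that would grant the missing list/watch access, assuming NHC generates its ClusterRole with controller-gen (the type name below is only an anchor for the markers, not the real reconciler):

package controllers

// The markers below are read by controller-gen to produce the ClusterRole;
// they grant the list/watch access the errors above say is missing.
//
//+kubebuilder:rbac:groups=infrastructure.cluster.x-k8s.io,resources=metal3remediationtemplates,verbs=get;list;watch
//+kubebuilder:rbac:groups=infrastructure.cluster.x-k8s.io,resources=metal3remediations,verbs=get;list;watch;create;update;patch;delete

// NodeHealthCheckReconciler is shown only to anchor the markers; the real
// reconciler lives in the NHC operator repository.
type NodeHealthCheckReconciler struct{}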

    

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2308

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-router-551-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1752661032133726208

The funky job renaming done in these is breaking risk analysis. The disruption checking actually ran if you look in the logs, but we don't get far enough to generate the HTML template, I suspect because of the error returned by risk analysis.

It would be nice if both worked, but I'd be happy just to get the disruption portion populated for now. I don't think the overall risk analysis will be easy.

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1979

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In a UPI cluster there are no MachineSet or Machine resources, so when a user visits the Machines or MachineSets list page, only the plain text 'Not found' is shown.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-16-113018    

How reproducible:

Always    

Steps to Reproduce:

1. setup UPI cluster
2. Go to the MachineSets and Machines list pages and check the empty state message

Actual results:

2. Only the plain 'Not found' text is shown

Expected results:

2. For other resources we show the richer text 'No <resourcekind> found', so these pages should likewise show 'No Machines found' and 'No MachineSets found'

Additional info:

    

Description of problem:

Currently a MachineConfiguration is only effective when it is named 'cluster', yet we can create additional MachineConfigurations with other names, such as:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: ndp-file-action-none
  namespace: openshift-machine-config-operator
spec:
  nodeDisruptionPolicy:
    files:
      - path: /etc/test
        actions:
           - type: None

But only the 'cluster' object takes effect, which confuses users.

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

Create a MachineConfiguration with any name except 'cluster'

Actual results:

new MachineConfiguration won't be effective

Expected results:

If the functionality only works with the 'cluster' object, CR creation with other names should be rejected (see the sketch below).
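
Following up on the expected result above, a minimal sketch of the kind of check, for example in an admission webhook or API validation, that would reject a MachineConfiguration not named 'cluster' at creation time; the function name and error wording are assumptions:

package main

import (
    "errors"
    "fmt"
)

// validateMachineConfigurationName is a hypothetical admission-time check that
// mirrors the expected behaviour: only the singleton named "cluster" is acted
// on, so any other name should be rejected at creation time.
func validateMachineConfigurationName(name string) error {
    if name != "cluster" {
        return errors.New("MachineConfiguration must be named 'cluster'; other names have no effect")
    }
    return nil
}

func main() {
    fmt.Println(validateMachineConfigurationName("cluster"))              // <nil>
    fmt.Println(validateMachineConfigurationName("ndp-file-action-none")) // rejected
}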

Additional info:

    

Description of problem:


The customer has a custom apiserver certificate.

This error can be found while trying to uninstall any operator by console:

openshift-console/pods/console-56494b7977-d7r76/console/console/logs/current.log:

2023-10-24T14:13:21.797447921+07:00 E1024 07:13:21.797400 1 operands_handler.go:67] Failed to get new client for listing operands: Get "https://api.<cluster>.<domain>:6443/api?timeout=32s": x509: certificate signed by unknown authority

When trying the same request from inside the console pod, we see no issue.

The root CA that signs the API server certificate is present and trusted in the pod.

The code that appears to provoke this issue is:

https://github.com/openshift/console/blob/master/pkg/server/operands_handler.go#L62-L70
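
A minimal sketch of one way the operands client could be built so that it trusts the pod's system trust store (where, per the report, the custom API server CA is already installed), optionally extended with an explicitly provided CA bundle. The function name and the extraCAPath parameter are assumptions, not the console's actual code:

package main

import (
    "crypto/tls"
    "crypto/x509"
    "net/http"
    "os"
)

// newTrustingClient returns an HTTP client that trusts the system CA bundle
// plus an optional extra CA file (extraCAPath is a hypothetical parameter).
func newTrustingClient(extraCAPath string) (*http.Client, error) {
    pool, err := x509.SystemCertPool()
    if err != nil {
        return nil, err
    }
    if extraCAPath != "" {
        pem, err := os.ReadFile(extraCAPath)
        if err != nil {
            return nil, err
        }
        pool.AppendCertsFromPEM(pem)
    }
    return &http.Client{
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{RootCAs: pool},
        },
    }, nil
}

func main() {
    if _, err := newTrustingClient(""); err != nil {
        panic(err)
    }
}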

Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Version-Release number of selected component (if applicable):

 4.14

How reproducible:

1. oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'

2. curl <svc>:<port>

3. oc scale --replicas=3 deploy/<deploy>

4. oc scale --replicas=0 deploy/<deploy>

5. oc scale --replicas=3 deploy/<deploy>

Actual results:

 Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720

Expected results:

 Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520

Additional info:

See the hostname in the server log output for each command.

$ oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'

Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520

$ oc scale --replicas=1 deploy/<deploy>

Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47082
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47088
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54832
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54848

$ oc scale --replicas=3 deploy/<deploy>

Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720

Description of problem:

The installer does not precheck whether the node architecture and the VM (instance) type are consistent on AWS and GCP; the check works on Azure.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-multi-2023-12-06-195439 

How reproducible:

   Always 

Steps to Reproduce:

    1. Configure the compute architecture field as arm64 but choose an amd64 instance type in install-config
    2.Create cluster 
    3.Check installation     

Actual results:

Azure will precheck if architecture is consistent with instance type when creating manifests, like:
12-07 11:18:24.452 [INFO] Generating manifests files.....12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64

But AWS and GCP have no such precheck; the installation fails later, after many resources have already been created. This case is more likely to happen in a multi-arch cluster.

Expected results:

The installer should precheck architecture against VM type, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure).
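
A minimal sketch of the kind of install-config precheck described above: compare the architecture requested for a machine pool with the architecture of the chosen instance type and fail fast, similar to the Azure error shown earlier. The instance-type lookup table is a hypothetical stand-in for the platform metadata the installer would actually query:

package main

import (
    "fmt"
)

// instanceTypeArch is a hypothetical lookup table standing in for the
// platform metadata (AWS/GCP instance type -> CPU architecture).
var instanceTypeArch = map[string]string{
    "m6i.xlarge": "amd64",
    "m6g.xlarge": "arm64",
}

// validateArchitecture fails fast when the machine pool architecture in the
// install-config does not match the architecture of the selected instance type.
func validateArchitecture(poolArch, instanceType string) error {
    arch, ok := instanceTypeArch[instanceType]
    if !ok {
        return nil // unknown type: leave validation to the cloud API
    }
    if arch != poolArch {
        return fmt.Errorf("instance type %q architecture %q does not match install config architecture %q",
            instanceType, arch, poolArch)
    }
    return nil
}

func main() {
    fmt.Println(validateArchitecture("arm64", "m6i.xlarge")) // mismatch -> error
    fmt.Println(validateArchitecture("arm64", "m6g.xlarge")) // ok -> <nil>
}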

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/80

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/10

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Hypershift Operator is scheduling control plane pods on nodes that are being deleted

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

https://web-rca.devshift.net/incident/ITN-2024-00068
    1. HO was trying to create an HCP on a node that was being deleted
    2. HO couldn't find the paired node because it was already deleted
  
 Forcing the removal of the pending node (blocked by PDB) resolves the issue

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/101

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Creating a Serverless Deployment with the "Scaling" options "Min Pods"/"Max Pods" set uses the deprecated Knative annotations "autoscaling.knative.dev/minScale" / "maxScale";

the correct current ones are "autoscaling.knative.dev/min-scale" / "max-scale".

The same problem applies to "autoscaling.knative.dev/targetUtilizationPercentage", which should be "autoscaling.knative.dev/target-utilization-percentage".

Prerequisites (if any, like setup, operators/versions):

Serverless operator

Steps to Reproduce

  1. Install serverless operator
  2. Create KnativeServing in knative-serving namespace
  3. create a test "foobar" namespace
  4. Go to <console>/deploy-image/ns/foobar
  5. Use gcr.io/knative-samples/helloworld-go  as "Image name from external registry" (or any webserver image listening on :8080)
  6. Choose "Serverless Deployment" for the "resource type"
  7. Click on "Scaling" in "Click on the names to access advanced options for ..."
  8. Set "2" for "Min Pods" and "3" for "Max Pods"
  9. Create

Actual results:

The created ksvc resource has

 spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "3"
        autoscaling.knative.dev/minScale: "2"
        autoscaling.knative.dev/targetUtilizationPercentage: "70"

Expected results:

The created ksvc should have

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/min-scale: "2"
        autoscaling.knative.dev/target-utilization-percentage: "70"

Reproducibility (Always/Intermittent/Only Once): Always

Build Details:

4.14.8

Workaround:

None required at the moment; current Serverless still supports the deprecated "minScale"/"maxScale" annotations.

Additional info:

https://docs.openshift.com/serverless/1.31/knative-serving/autoscaling/serverless-autoscaling-developer-scale-bounds.html

https://issues.redhat.com/browse/SRVKS-910

 

Description of problem:

    Tekton Results API endpoint failed to fetch data on airgapped cluster.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/installer/pull/7816

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

If the user specifies baselineCapabilitySet: None in the install-config and does not specifically enable the capability baremetal, yet still uses platform: baremetal then the install will reliably fail.

This failure takes the form of a timeout with the bootkube logs (not easily accessible to the user) full of errors like:

bootkube.sh[46065]: "99_baremetal-provisioning-config.yaml": unable to get REST mapping for "99_baremetal-provisioning-config.yaml": no matches for kind "Provisioning" in version "metal3.io/v1alpha1"
bootkube.sh[46065]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"

Since the installer can tell when processing the install-config if the baremetal capability is missing, we should detect this and error out immediately to save the user an hour of their life and us a support case.

Although this was found on an agent install, I believe the same will apply to a baremetal IPI install.
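
A minimal sketch of the early validation suggested above: when the install-config uses the baremetal platform but the resolved capability set does not include the baremetal capability, error out immediately instead of timing out at bootkube. The function name and the simplified shape of the capability set are assumptions:

package main

import (
    "fmt"
)

// validateBaremetalCapability rejects the install-config combination described
// above: platform "baremetal" with a capability set that omits "baremetal".
func validateBaremetalCapability(platform string, enabledCapabilities map[string]bool) error {
    if platform == "baremetal" && !enabledCapabilities["baremetal"] {
        return fmt.Errorf("platform baremetal requires the baremetal capability; " +
            "enable it via additionalEnabledCapabilities or a baselineCapabilitySet that includes it")
    }
    return nil
}

func main() {
    // baselineCapabilitySet: None with no additional capabilities enabled.
    fmt.Println(validateBaremetalCapability("baremetal", map[string]bool{}))
    // capability explicitly enabled.
    fmt.Println(validateBaremetalCapability("baremetal", map[string]bool{"baremetal": true}))
}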

Description of problem:

documentationBaseURL still points to 4.14

Version-Release number of selected component (if applicable):

4.16 

How reproducible:

Always

Steps to Reproduce:

1.Check documentationBaseURL on 4.16 cluster: 
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/

2.
3.

Actual results:

1.documentationBaseURL is still pointing to 4.14

Expected results:

1.documentationBaseURL should point to 4.16

Additional info:

 

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/66

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-os-images/pull/34

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Incorrect help info for loglevel when using --v2 flag 

 

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1) Check `oc-mirror --v2 -h` help info
2) When use command `oc-mirror --config config.yaml --from file://out docker://xxxxxx.com:5000/test  --v2 --max-nested-paths 2 --loglevel trace`

Actual results: 

oc-mirror --config config.yaml --from file://out docker://ec2-18-188-118-33.us-east-2.compute.amazonaws.com:5000/test  --v2 --max-nested-paths 2 --loglevel trace
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 
2024/03/20 07:18:30  [INFO]   : mode diskToMirror 
2024/03/20 07:18:30  [TRACE]  : creating signatures directory out/working-dir/signatures 
2024/03/20 07:18:30  [TRACE]  : creating release images directory out/working-dir/release-images 
2024/03/20 07:18:30  [TRACE]  : creating release cache directory out/working-dir/hold-release 
2024/03/20 07:18:30  [TRACE]  : creating operator cache directory out/working-dir/hold-operator 
2024/03/20 07:18:30  [ERROR]  :  error parsing local storage configuration : invalid loglevel trace Must be one of [error, warn, info, debug]

Expected results:

Show correct help information

 

Additional info:

The same applies to `--strict-archive archiveSize`; the information is not clear about how to use it.

Description of problem:

[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS arm instance types "c7gd.2xlarge , m7gd.xlarge"

Version-Release number of selected component (if applicable):

    4.15.3

How reproducible:

    Always

Steps to Reproduce:

    1. Create an OpenShift cluster on AWS with instance types "c7gd.2xlarge , m7gd.xlarge"
    2. Check the csinode allocatable volumes count 
    3. Create statefulset with 1 pvc mounted and max allocatable volumes count replicas with nodeAffinity 
    apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: $VOL_COUNT_LIMIT
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - $NODE_NAME
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi
    4. All statefulset replicas should become ready.

Actual results:

In step 4, the statefulset's 26th replica (pod) is stuck at ContainerCreating because the volume couldn't be attached to the node (the csinode allocatable volumes count is incorrect)
$ oc get no/ip-10-0-22-114.ec2.internal -oyaml|grep 'instance'
    beta.kubernetes.io/instance-type: m7gd.xlarge
    node.kubernetes.io/instance-type: m7gd.xlarge
 $ oc get csinode/ip-10-0-22-114.ec2.internal -oyaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-03-20T02:16:34Z"
  name: ip-10-0-22-114.ec2.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-10-0-22-114.ec2.internal
    uid: acb9a153-bb9b-4c4a-90c1-f3e095173ce2
  resourceVersion: "19281"
  uid: 12507a73-898d-441a-a844-41c7de290b5b
spec:
  drivers:
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-00ec014c5676a99d2
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
$ export VOL_COUNT_LIMIT="26"
$ export NODE_NAME="ip-10-0-22-114.ec2.internal"
$ envsubst < sts-vol-limit.yaml| oc apply -f -
statefulset.apps/statefulset-vol-limit created
$ oc get sts
NAME                    READY   AGE
statefulset-vol-limit   25/26   169m

$ oc describe po/statefulset-vol-limit-25
Name:             statefulset-vol-limit-25
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-22-114.ec2.internal/10.0.22.114
Start Time:       Wed, 20 Mar 2024 18:56:08 +0800
Labels:           app=my-svc
                  apps.kubernetes.io/pod-index=25
                  controller-revision-hash=statefulset-vol-limit-7db55989f7
                  statefulset.kubernetes.io/pod-name=statefulset-vol-limit-25
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.128.2.53/23"],"mac_address":"0a:58:0a:80:02:35","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0...
Status:           Pending
IP:
IPs:              <none>
Controlled By:    StatefulSet/statefulset-vol-limit
Containers:
  openshifttest:
    Container ID:
    Image:          quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/storage from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zkwqx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-statefulset-vol-limit-25
    ReadOnly:   false
  kube-api-access-zkwqx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           167m                 default-scheduler        Successfully assigned default/statefulset-vol-limit-25 to ip-10-0-22-114.ec2.internal
  Warning  FailedAttachVolume  166m (x2 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": context deadline exceeded
  Warning  FailedAttachVolume  30s (x87 over 166m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-b43ec1d0-4fa3-4e87-a80b-6ad912160273" : rpc error: code = Internal desc = Could not attach volume "vol-0a7cb8c5859cf3f96" to node "i-00ec014c5676a99d2": attachment of disk "vol-0a7cb8c5859cf3f96" failed, expected device to be attached but was attaching

Expected results:

    In step 4, all statefulset replicas should become ready.

Additional info:

    For the AWS arm instance types "c7gd.2xlarge" and "m7gd.xlarge", the allocatable volume count should be "25", not "26".

Description of problem:

Customer is trying to create RHOCP cluster with Domain name 123mydomain.com

In RedHat Hybrid Cloud Console customer is getting below error : 
~~~
Failed to update the cluster
DNS format mismatch: 123mydomain.com domain name is not valid
~~~

*** As per the regex check described in KCS - https://access.redhat.com/solutions/5517531
a domain name starting with a numeric character is valid, e.g. 123mydomain.com

Below is the regex check used to determine domain name validity:
 [a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]

*** From the validations the assisted installer does as per : https://github.com/openshift/assisted-service/blob/master/pkg/validations/validations.go 

The below regexps are applied:
baseDomainRegex          = `^[a-z]([\-]*[a-z\d]+)+$`
dnsNameRegex             = `^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$`
wildCardDomainRegex      = `^(validateNoWildcardDNS\.).+\.?$`
hostnameRegex            = `^[a-z0-9][a-z0-9\-\.]{0,61}[a-z0-9]$`
installerArgsValuesRegex = `^[A-Za-z0-9@!#$%*()_+-=//.,";':{}\[\]]+$`
  
This means the domain name must start with a letter [a-z].
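
A small, self-contained check of the behaviour described above, using the assisted-service regexes quoted verbatim; it shows that a domain starting with a digit, such as 123mydomain.com, is rejected because the first character must be a letter. The example domains are illustrative only:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Regexes copied verbatim from the assisted-service validations quoted above.
    baseDomainRegex := regexp.MustCompile(`^[a-z]([\-]*[a-z\d]+)+$`)
    dnsNameRegex := regexp.MustCompile(`^([a-z]([\-]*[a-z\d]+)*\.)+[a-z\d]+([\-]*[a-z\d]+)+$`)

    // Full domain names: the leading digit causes the rejection seen in the console.
    fmt.Println(dnsNameRegex.MatchString("123mydomain.com")) // false
    fmt.Println(dnsNameRegex.MatchString("mydomain123.com")) // true

    // Single labels against the base-domain check: same leading-letter requirement.
    fmt.Println(baseDomainRegex.MatchString("123mydomain")) // false
    fmt.Println(baseDomainRegex.MatchString("mydomain123")) // true
}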

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Open RedHat Hybrid Cloud Console
2. Go to Clusters
3. Create Cluster
4. Go to Datacenter 
5. Under Assisted Installer -> Create Cluster
6. Enter Cluster Name mytestcluster and enter Domain Name 123mydomain.com
7. Click on Next

Actual results:

A domain name with a numeric character first and then letters, e.g. 123mydomain.com, is shown as invalid in the Red Hat Hybrid Cloud Console Assisted Installer, throwing the error:
Failed to create new cluster
DNS format mismatch: 123mydomain.com domain name is not valid

Expected results:

A domain name with a numeric character first and then letters, e.g. 123mydomain.com, must be accepted as valid in the OpenShift Red Hat Hybrid Cloud Console Assisted Installer

Description of problem:

It would make debugging easier if we included the namespace in the message for these alerts: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69

Version-Release number of selected component (if applicable):

4.12.x

How reproducible:

Always

Steps to Reproduce:

1. 
2.
3.

Actual results:

No namespace in the alert message

Expected results:

 

Additional info:

 

Description of problem:

OCPBUGS-29424 revealed that setting the node status update frequency in kubelet (introduced with OCPBUGS-15583) causes a lot of control plane CPU usage.

The reason is that the increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that usually react to node changes (API server, etcd, PDB guard pod controllers, or any other static-pod-based machinery).

Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s causes the CPU to drop immediately. 

Version-Release number of selected component (if applicable):

Versions where OCPBUGS-15583 was backported. This includes 4.16, 4.15.0, 4.14.8, 4.13.33, and the next 4.12.z likely 4.12.51.

How reproducible:

always    

Steps to Reproduce:

1. create a cluster that contains a fix for OCPBUGS-15583
2. observe the apiserver metrics (eg rate(apiserver_request_total[5m])), those should show abnormal values for pod/configmap GET
    alternatively, the rate of node updates is increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])) 

     

Actual results:

the node status updates every 10s, which causes high CPU usage on control plane operators and apiserver

Expected results:

the node status should not update that frequently, meaning the control plane CPU usage should go down again 

Additional info:

slack thread with the node team:
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
    

In TRT-1476, we created a VM that served as an endpoint where we can test connectivity in gcp.
We want one for Azure.

In TRT-1477, we created some code in origin to send HTTP GETs to that endpoint as a test to ensure connectivity remains working. Do this also for Azure.

Description of the problem:

When trying to create a cluster with the s390x architecture, an error occurs that stops cluster creation. The error is "cannot use Skip MCO reboot because it's not compatible with the s390x architecture on version 4.15.0-ec.3 of OpenShift"

How reproducible:

Always

Steps to reproduce:

Create cluster with architecture s390x

Actual results:

Create failed

Expected results:

Create should succeed

Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/24

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

In https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-06-031624, I see several PRs involving moving crio metrics. This payload is being rejected on TargetDown alerts on AWS minor upgrades.
 
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]e-from-stable-4.15-e2e-aws-ovn-upgrade/1754707749708500992
 
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system expand_less 0s { TargetDown was at or above info for at least 1m58s on platformidentification.JobType

{Release:"4.16", FromRelease:"4.15", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"}

(maxAllowed=1s): pending for 15m0s, firing for 1m58s: Feb 06 04:48:42.698 - 118s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="crio", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}}
 

Description of the problem:

BMH and Machine resources not created for ZTP day-2 control-plane nodes
 

How reproducible:

 100%

Steps to reproduce:

1. Use ZTP to add control-plane nodes to an existing baremetal spoke cluster that was installed using ZTP

Actual results:

 CSRs are not being approved automatically because Machine and BMH resources are not being created due to this condition which excludes control plane nodes. This condition seems to be old and no longer relevant, as it was written before adding day-2 control plane nodes was supported

Expected results:

Machine and BMH resources are being created and as a result CSRs are being approved automatically

Description of problem:

    Logs for PipelineRuns fetched from the Tekton Results API are not loading

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to the Log tab of PipelineRun fetched from the Tekton Results
    2.
    3.
    

Actual results:

    Logs window is empty with a loading indicator

Expected results:

Logs should be shown

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try to deploy in mad02 or mad04 with powervs
    2. Cannot import boot image
    3. fail
    

Actual results:

Fail    

Expected results:

Cluster comes up    

Additional info:

    

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/222

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/129

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ibm-vpc-block-csi-driver deployment is missing sidecar metrics and the kube-rbac-proxy sidecar

Version-Release number of selected component (if applicable):

4.10+

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/206

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/633

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If GloballyDisableIrqLoadBalancing is disabled in the performance profile, then IRQs should be balanced across all CPUs minus the CPUs that are explicitly removed by crio via the pod annotation irq-load-balancing.crio.io: "disable"

There's an issue where the scheduler plugin in tuned attempts to affine all IRQs to the non-isolated cores. Isolated here means non-reserved, not truly isolated cores. This is directly at odds with the user intent. So now we have tuned fighting with crio/irqbalance, each trying to do different things. 

Scenarios
- If a pod gets launched with the annotation after tuned has started, at runtime or after a reboot - ok 
- On a reboot, if tuned recovers after the guaranteed pod has been launched - broken
- If tuned restarts at runtime for any reason - broken

Version-Release number of selected component (if applicable):

   4.14 and likely earlier

How reproducible:

    See description

Steps to Reproduce:

    1.See description 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

Description of problem:

Since the singular variant of APIVIP/IngressVIP has been removed as part of https://github.com/openshift/installer/pull/7574, the appliance disk image e2e job is now failing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static

The job fails since the appliance supports only 4.14, which still requires the singular variant of the VIP properties.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

Always    

Steps to Reproduce:

1. Invoke appliance e2e job on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static
    

Actual results:

Job fails with the following validation error:
"the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs"
Due to missing apiVIP and ingressVIP in AgentClusterInstall.

Expected results:

AgentClusterInstall should also include the singular 'apiVIP' and 'ingressVIP', and the e2e job should complete successfully

Additional info:

    

Description of problem:

    The design doc for ImageDigestMirrorSet states:
"ImageContentSourcePolicy CRD will be marked as deprecated and will be supported during all of 4.x. Update and coexistence of ImageDigestMirrorSet/ ImageTagMirrorSet and ImageContentSourcePolicy is supported. We encourage users to move to IDMS while supporting both in the cluster, but will not remove ICSP in OCP 4.x.".
see: https://github.com/openshift/machine-config-operator/blob/master/docs/ImageMirrorSetDesign.md#goals

see also:
https://github.com/openshift/enhancements/blob/master/enhancements/api-review/add-new-CRD-ImageDigestMirrorSet-and-ImageTagMirrorSet-to-config.openshift.io.md#update-the-implementation-for-migration-path
for the rationale behind it.

but the hypershift-operator is reading ImageContentSourcePolicy only if no ImageDigestMirrorSet exists on the cluster, see:
https://github.com/openshift/hypershift/blob/main/support/globalconfig/imagecontentsource.go#L101-L102

Version-Release number of selected component (if applicable):

    4.14, 4.15, 4.16

How reproducible:

    100%

Steps to Reproduce:

    1. Set both an ImageContentSourcePolicy and ImageDigestMirrorSet with different content on the management cluster
    2.
    3.    

Actual results:

the hypershift-operator consumes only the ImageDigestMirrorSet content, ignoring the ImageContentSourcePolicy one.

Expected results:

since both ImageDigestMirrorSet and ImageContentSourcePolicy (although deprecated) are still supported on the management cluster, the hypershift-operator should align.    

Additional info:

currently oc-mirror (v1) only generates imageContentSourcePolicy.yaml without any imageDigestMirrorSet.yaml equivalent, breaking the hypershift disconnected scenario on clusters where an IDMS is already there for other reasons.
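
A minimal sketch of the behaviour this report argues for: read mirrors from both ImageDigestMirrorSet and ImageContentSourcePolicy resources and merge them per source, instead of ignoring ICSPs whenever an IDMS exists. The struct below is a simplified stand-in for the real API types, not the hypershift-operator's actual code:

package main

import "fmt"

// digestMirror is a simplified stand-in for the relevant parts of the
// IDMS/ICSP API types: a source repository and its mirrors.
type digestMirror struct {
    Source  string
    Mirrors []string
}

// mergeMirrors combines entries from IDMS and ICSP objects, de-duplicating
// mirrors per source, rather than preferring one resource kind over the other.
func mergeMirrors(sets ...[]digestMirror) map[string][]string {
    merged := map[string][]string{}
    seen := map[string]map[string]bool{}
    for _, set := range sets {
        for _, m := range set {
            if seen[m.Source] == nil {
                seen[m.Source] = map[string]bool{}
            }
            for _, mirror := range m.Mirrors {
                if !seen[m.Source][mirror] {
                    seen[m.Source][mirror] = true
                    merged[m.Source] = append(merged[m.Source], mirror)
                }
            }
        }
    }
    return merged
}

func main() {
    idms := []digestMirror{{Source: "registry.redhat.io/redhat", Mirrors: []string{"mirror.example.com/redhat"}}}
    icsp := []digestMirror{{Source: "quay.io/openshift-release-dev", Mirrors: []string{"mirror.example.com/release"}}}
    fmt.Println(mergeMirrors(idms, icsp))
}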

Description of problem:

When mirroring an OCI catalog with v2, after creating the catalog source the pod is not present; describing the catalog source shows the following error:

oc describe catalogsource cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Name:         cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Namespace:    openshift-marketplace
Labels:       <none>
Annotations:  <none>
API Version:  operators.coreos.com/v1alpha1
Kind:         CatalogSource
Metadata:
  Creation Timestamp:  2024-03-29T02:49:47Z
  Generation:          1
  Resource Version:    53264
  UID:                 69a39693-b29b-4fa4-a6da-de31dc3d521c
Spec:
  Image:        ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  Source Type:  grpc
Status:
  Message:  couldn't ensure registry server - error ensuring pod: : error creating new pod: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c-: Pod "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7da785sd" is invalid: metadata.labels: Invalid value: "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c": must be no more than 63 characters
  Reason:   RegistryServerError
Events:     <none>

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1)  Copy the operator as OCI format to localhost:
`skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 oci:///app1/noo/redhat-operator-index --remove-signatures`

2)  Use following imagesetconfigure for mirror: cat config-multi-op.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
    - catalog: oci:///app1/noo/redhat-operator-index
      packages:
        - name: odf-operator
`oc-mirror --config config-multi-op.yaml file://outmulitop   --v2`


3) Do diskTomirror :
`oc-mirror --config config-multi-op.yaml --from file://outmulitop  --v2 docker://ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi`

4) Create cluster resource with file: itms-oc-mirror.yaml
   `oc create -f cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c.yaml`

Actual results: 

4) The pod for the catalogsource is not present

oc describe catalogsource cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Name:         cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
Namespace:    openshift-marketplace
Labels:       <none>
Annotations:  <none>
API Version:  operators.coreos.com/v1alpha1
Kind:         CatalogSource
Metadata:
  Creation Timestamp:  2024-03-29T02:49:47Z
  Generation:          1
  Resource Version:    53264
  UID:                 69a39693-b29b-4fa4-a6da-de31dc3d521c
Spec:
  Image:        ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  Source Type:  grpc
Status:
  Message:  couldn't ensure registry server - error ensuring pod: : error creating new pod: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c-: Pod "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7da785sd" is invalid: metadata.labels: Invalid value: "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c": must be no more than 63 characters
  Reason:   RegistryServerError
Events:     <none>

cat cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: null
  name: cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  namespace: openshift-marketplace
spec:
  image: ec2-3-139-239-15.us-east-2.compute.amazonaws.com:5000/multi/redhat-operator-index:8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c
  sourceType: grpc
status: {} 

Expected results:

4) The catalog source pod should be running correctly.
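
The failure above is the Kubernetes 63-character limit on label values: the generated CatalogSource name is reused as a pod label and exceeds it. A minimal sketch of one way a tool could generate a label-safe name by truncating and appending a short hash; the exact naming scheme oc-mirror should use is an assumption:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// labelSafeName truncates a generated name so it fits within the 63-character
// limit for Kubernetes label values, appending a short hash of the original
// name to keep the result unique.
func labelSafeName(name string) string {
    const maxLen = 63
    if len(name) <= maxLen {
        return name
    }
    sum := sha256.Sum256([]byte(name))
    suffix := hex.EncodeToString(sum[:])[:8]
    return name[:maxLen-len(suffix)-1] + "-" + suffix
}

func main() {
    long := "cs-redhat-operator-index-8bfb449c24d03d6ddbd05d3de9fe7a7dae4a2ecdb8f84487f28d24d6ca2d175c"
    short := labelSafeName(long)
    fmt.Println(short, len(short)) // always <= 63 characters
}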

The story is to track the i18n upload/download routine tasks which are performed every sprint. 

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

RHOCP installation on RHOSP fails with an error

~~~

$ ansible-playbook -i inventory.yaml security-groups.yaml
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.36.5 is smaller than minimum version 1.0."}

~~~

Packages Installed :

ansible-2.9.27-1.el8ae.noarch Fri Oct 13 06:56:05 2023
python3-netaddr-0.7.19-8.el8.noarch Fri Oct 13 06:55:44 2023
python3-openstackclient-4.0.2-2.20230404115110.54bf2c0.el8ost.noarch Tue Nov 21 01:38:32 2023
python3-openstacksdk-0.36.5-2.20220111021051.feda828.el8ost.noarch Fri Oct 13 06:55:52 2023

Document followed :
https://docs.openshift.com/container-platform/4.13/installing/installing_openstack/installing-openstack-user.html#installation-osp-downloading-modules_installing-openstack-user

Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/293

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If these pods are evicted, they lose all knowledge of existing DHCP leases, and any pods using DHCP IPAM will fail to renew their lease, even after the pod is re-created.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Use a NAD with ipam: dhcp.
    2. Delete the dhcp-daemon pod on the same node as your workload.
    3. Observe the lease expire on the DHCP server / get reissued to a different pod, causing a network outage from duplicate addresses.
    

Actual results:

dhcp-daemon 

Expected results:

The dhcp-daemon pod does not get evicted before workloads, because of system-node-critical

Additional info:

All other Multus components use system-node-critical:
  priority: 2000001000
  priorityClassName: system-node-critical
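
A minimal sketch of the expected fix in terms of the pod spec: give the dhcp-daemon the same system-node-critical priority class the other Multus components use, so it is not evicted before workloads. Expressed here with the core/v1 Go types; the container image is a placeholder and the surrounding DaemonSet wiring is omitted:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    // Pod template fragment for the dhcp-daemon, matching the priority used by
    // the other Multus components listed above.
    spec := corev1.PodSpec{
        PriorityClassName: "system-node-critical",
        Containers: []corev1.Container{
            {Name: "dhcp-daemon", Image: "example.com/dhcp-daemon:latest"}, // image is a placeholder
        },
    }
    fmt.Println(spec.PriorityClassName)
}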

Please review the following PR: https://github.com/openshift/telemeter/pull/522

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/136

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When upgrading a cluster from 4.13.23 to 4.14.3, the machine-config CO gets stuck due to a content mismatch error on all nodes.

Node node-xxx-xxx is reporting: "unexpected on-disk state
      validating against rendered-master-734521b50f69a1602a3a657419ed4971: content
      mismatch for file \"/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt\""

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Perform an upgrade from 4.13.x to 4.14.x
    2. 
    3.
    

Actual results:

    machine-config stalls during upgrade

Expected results:

    the "content mismatch" shouldn't happen anymore according to the MCO engineering team

Additional info:

    

Since approximately 12 April, all FIPS CI is broken, with the authentication operator failing to come up.

Sippy

The oauth-openshift containers are failing with the message:

Copying system trust bundle
FIPS mode is enabled, but the required OpenSSL backend is unavailable

This is due to https://github.com/openshift/oauth-server/commit/8a6f3a11a4b25e3e22152252720490b9f355ce53 changing the base image to RHEL 9 while leaving the builder image as RHEL 8. When the binary starts, it can not find the RHEL 8 OpenSSL it was linked against.

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/router/pull/550

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/658

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Conditional risks have looser naming restrictions:

	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinLength=1
	// +required
	Name string `json:"name"`

...than condition Reason field:

	// +required
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MaxLength=1024
	// +kubebuilder:validation:MinLength=1
	// +kubebuilder:validation:Pattern=`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`
	Reason string `json:"reason" protobuf:"bytes,5,opt,name=reason"`

CVO can use a risk name as a condition reason, so when the name of an applying risk is not valid as a reason, CVO will start trying to update the ClusterVersion resource with an invalid value.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

always

Steps to Reproduce:

Make the cluster consume update graph data containing a conditional edge with a risk with a name that does not follow the Condition.Reason restriction, e.g. uses a - character. The risk needs to apply on the cluster. For example:

{
  "nodes": [
    {"version": "CLUSTER-BOT-VERSION", "payload": "CLUSTER-BOT-PAYLOAD"},
    {"version": "4.12.22", "payload": "quay.io/openshift-release-dev/ocp-release@sha256:1111111111111111111111111111111111111111111111111111111111111111"}
  ],
  "conditionalEdges": [
    {
      "edges": [{"from": "CLUSTER-BOT-VERSION", "to": "4.12.22"}],
      "risks": [
        {
          "url": "https://github.com/openshift/api/blob/8891815aa476232109dccf6c11b8611d209445d9/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1515C4-L1520",
          "name": "OCPBUGS-9050",
          "message": "THere is no validation that risk name is a valid Condition.Reason so let's just use a - character in it.",
          "matchingRules": [{"type": "PromQL", "promql": { "promql": "group by (type) (cluster_infrastructure_provider)"}}]
        }
      ]
    }
  ]
}

Then, observe the ClusterVersion status field after the cluster has a chance to evaluate the risk:

$ oc get clusterversion version -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Progressing" or .type=="Available" or .type=="Failing")'

Actual results:

{
  "lastTransitionTime": "2023-09-01T13:21:49Z",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2023-09-01T13:21:49Z",
  "message": "ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'",
  "status": "True",
  "type": "Failing"
}
{
  "lastTransitionTime": "2023-09-01T13:14:34Z",
  "message": "Error ensuring the cluster version is up to date: ClusterVersion.config.openshift.io \"version\" is invalid: status.conditionalUpdates[0].conditions[0].reason: Invalid value: \"OCPBUGS-9050\": status.conditionalUpdates[0].conditions[0].reason in body should match '^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$'",
  "status": "False",
  "type": "Progressing"
}

Expected results:

No errors, CVO continues to work and either uses some sanitized version of the name as the reason, or maybe uses something generic, like RiskApplies.

CVO does not get stuck after consuming data from external source

Additional info:

1. We should add CI for PRs to o/cincinnati-graph-data so we never create invalid data
2. We should sanitize the field in CVO code so that CVO never attempts to submit an invalid ClusterVersion.status.conditionalUpdates.condition.reason
3. We should further restrict the conditional update risk name in the CRD so it is guaranteed compatible with Condition.Reason
4. We should sanitize the field in CVO code after it is read from OSUS so that CVO never attempts to submit an invalid (after we do 3) ClusterVersion.conditionalUpdates.name (a minimal sanitizer sketch follows this list)
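
A minimal sketch of the sanitization suggested in items 2 and 4: map an arbitrary risk name onto a string that satisfies the Condition.Reason pattern quoted above, replacing disallowed characters and falling back to a generic reason when nothing usable remains. The exact rewriting rules and fallback value are assumptions, not CVO's actual behaviour:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// reasonPattern is the Condition.Reason validation quoted above.
var reasonPattern = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// sanitizeReason rewrites a conditional-update risk name into a valid
// Condition.Reason: disallowed characters become underscores, a leading
// non-letter gets a prefix, and an empty result falls back to "RiskApplies".
func sanitizeReason(name string) string {
    replaced := regexp.MustCompile(`[^A-Za-z0-9_,:]`).ReplaceAllString(name, "_")
    replaced = strings.TrimRight(replaced, ",:")
    if replaced == "" {
        return "RiskApplies"
    }
    if !regexp.MustCompile(`^[A-Za-z]`).MatchString(replaced) {
        replaced = "Risk_" + replaced
    }
    if !reasonPattern.MatchString(replaced) {
        return "RiskApplies"
    }
    return replaced
}

func main() {
    fmt.Println(sanitizeReason("OCPBUGS-9050"))                            // OCPBUGS_9050
    fmt.Println(reasonPattern.MatchString(sanitizeReason("OCPBUGS-9050"))) // true
}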

Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/59

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When a customer certificate and an SRE certificate are configured and approved, revoking the customer certificate causes access to the cluster using a kubeconfig with the SRE cert to be denied

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Create a cluster
    2. Configure a customer cert and a sre cert, they are approved
    3. Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied
    

Actual results:

   Revoke a customer cert, access to the cluster using kubeconfig with sre cert gets denied

Expected results:

   Revoke a customer cert, access to the cluster using kubeconfig with sre cert succeeds

Additional info:

    

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/327

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/prometheus/pull/195

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The Clear button in the Upload JAR file form is not working; the user needs to close the form in order to remove the previously selected JAR file.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Open Upload Jar File form from Add Page
    2. Upload a JAR file
    3. Remove the JAR file by using the Clear button
    

Actual results:

    The selected JAR file is not removed even after using "Clear" button

Expected results:

    The "Clear" button should remove the selected file from the form.

Additional info:

    

Please review the following PR: https://github.com/openshift/oauth-server/pull/140

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/183

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem: see the Worker CloudFormation Template section of "Installing a cluster on AWS using CloudFormation templates" (OpenShift Container Platform 4.13): https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-user-infra.html#installation-cloudformation-worker_installing-aws-user-infra

    In the OpenShift documentation, under the manual AWS CloudFormation templates, the CloudFormation template for worker nodes describes the Subnet and WorkerSecurityGroupId parameters as referring to the master nodes. Based on the variable names, the descriptions should refer to worker nodes instead.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When creating a HostedCluster with the 'NodePort' service publishing strategy, the VMs (guest nodes) try to contact HCP services such as ignition and oauth. If these services are colocated on the same infra node, they can't be reached via NodePort because the 'virt-launcher' NetworkPolicy blocks the traffic.
We need to explicitly allow access to the oauth and ignition-server-proxy pods so they can be reached from the virtual machines on the same node.
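
For illustration only, a sketch (not the actual HyperShift change; the label keys and values below are assumptions) of the kind of NetworkPolicy that would allow virt-launcher pods to reach the ignition-server-proxy pods, expressed with the Kubernetes Go types:

package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// allowVirtLauncherToIgnition sketches a NetworkPolicy that permits ingress to the
// ignition-server-proxy pods from virt-launcher pods in the same HCP namespace.
// The label selectors below are illustrative assumptions, not the real labels.
func allowVirtLauncherToIgnition(namespace string) *networkingv1.NetworkPolicy {
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "allow-virt-launcher-to-ignition",
			Namespace: namespace,
		},
		Spec: networkingv1.NetworkPolicySpec{
			// Applies to the ignition-server-proxy pods (selector assumed).
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "ignition-server-proxy"},
			},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					// virt-launcher pods are assumed to carry this label.
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"kubevirt.io": "virt-launcher"},
					},
				}},
			}},
		},
	}
}

func main() {
	np := allowVirtLauncherToIgnition("clusters-example-hcp") // namespace name is made up
	fmt.Println(np.Name)
}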

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always, if conditions are met

Steps to Reproduce:

    1. As described above
    2.
    3.
    

Actual results:

VMs are not joining the cluster as nodes if the ignition server is running on the same infra node as the VM.

Expected results:

All VMs are joining the cluster as nodes, and the HostedCluster is eventually Completed and Available

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/153

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

OCPBUGS-11856 added a patch to change termination log permissions manually. Since 4.15.z this is no longer necessary, as it is fixed by the lumberjack dependency bump.

This bug tracks reverting that carry patch.

As part of our investigation into GCP disruption we want an endpoint separate from the cluster under test but inside GCP to monitor for connectivity.

One approach is to use a GCP Cloud Function with a HTTP Trigger

Another alternative is to stand up our own server and collect logging

 

We need to consider the cost of implementation, the cost of maintenance, and how well the implementation lines up with our overall test scenario (we want to use this as a control to compare with reaching a pod within a cluster under test)

 

We may want to also consider standing up similar endpoints in AWS and Azure in the future.

 

A separate story will cover monitoring the endpoint from within Origin

  • We want to capture the audit id and log it when we receive an incoming request (a minimal handler sketch follows below)
  • The audit id could include the build id, or another field could be used to correlate back to the job instance
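
A minimal sketch of the self-hosted-server alternative mentioned above: a plain Go HTTP endpoint that logs an incoming identifier so requests can be correlated back to the CI job. The header name used here is an assumption, not an agreed interface, and this is not existing Origin code.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// The header name is hypothetical; the monitor in origin would need to send
		// something (for example the CI build/job ID) that lets us correlate requests.
		auditID := r.Header.Get("X-Disruption-Audit-Id")
		log.Printf("ts=%s remote=%s audit_id=%q",
			time.Now().UTC().Format(time.RFC3339Nano), r.RemoteAddr, auditID)
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
	// The same handler shape would also work behind a GCP Cloud Function HTTP trigger,
	// minus the explicit ListenAndServe below.
	log.Fatal(http.ListenAndServe(":8080", nil))
}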

Please review the following PR: https://github.com/openshift/route-override-cni/pull/51

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Currently, when creating an Azure cluster, only the first node of the nodePool becomes ready and joins the cluster; all other Azure machines are stuck in the `Creating` state.

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/113

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/599

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/86

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/coredns/pull/107

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/98

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

In order to use hostPath volumes, containers in kubernetes must be started with the privileged flag set. This is because this flag toggles an SELinux boolean that cannot be toggled by enabling any particular capability. (Empirical testing shows the same restriction does not apply to emptyDir volumes.)

Since the baremetal components rely on hostPath volumes for a number of purposes, this prevents many of them from running unprivileged.

However, there are a number of containers that do not use any hostPath volumes and need only an added capability, if anything. These should be specified explicitly instead of just setting privileged mode to enable everything.
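
As an illustration of that last point, a sketch (the container name and capability are placeholders, not the actual baremetal component settings) of granting a single capability instead of full privileged mode, using the Kubernetes core/v1 Go types:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	privileged := false

	// Instead of setting Privileged: true, request only what the container actually
	// needs. The capability below is a placeholder for whatever the component requires.
	sc := &corev1.SecurityContext{
		Privileged: &privileged,
		Capabilities: &corev1.Capabilities{
			Add: []corev1.Capability{"NET_ADMIN"},
		},
	}

	container := corev1.Container{
		Name:            "example-baremetal-sidecar", // hypothetical name
		Image:           "example.invalid/image:latest",
		SecurityContext: sc,
	}
	fmt.Printf("%s requests capabilities %v, privileged=%v\n",
		container.Name, sc.Capabilities.Add, *sc.Privileged)
}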

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/59

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/160

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The story is to track the i18n upload/download routine tasks which are performed every sprint. 

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when they are ready

  - Review the translated strings and open a pull request

  - Open a follow-up story for the next sprint

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/242

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Began permafailing somewhere in https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-14-214308

Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1768395075936587776

{ fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:199]: cluster is still being upgraded: registry.build02.ci.openshift.org/ci-op-588yb1d9/release@sha256:98f570afbb8492d9b393eecc929266e987ba75088af72b234b81d2702d63f75e Ginkgo exit error 1: exit with code 1}

{Cluster did not complete upgrade: timed out waiting for the condition: Could not update customresourcedefinition "infrastructures.config.openshift.io" (48 of 887): the object is invalid, possibly due to local cluster configuration }

We suspect the latter message implicates https://github.com/openshift/api/pull/1802 and a revert is open now.

Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1710501463301079

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/488

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.20 True False 43h Cluster version is 4.11.20

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

The serviceAccount cloud-provider does not exist. Neither in kube-system nor in any other namespace.

It's therefore not clear what this ClusterRoleBinding does, what use case it fulfills, and why it references a non-existent serviceAccount.

From a security point of view, it is recommended to remove references to non-existent serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the missing serviceAccount and gaining undesired permissions.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4 (all version from what we have found)

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml

Actual results:

$ oc get clusterrolebinding system:openshift:kube-controller-manager:gce-cloud-provider -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2023-01-11T13:16:47Z"
  name: system:openshift:kube-controller-manager:gce-cloud-provider
  resourceVersion: "6079"
  uid: 82a81635-4535-4a51-ab83-d2a1a5b9a473
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:kube-controller-manager:gce-cloud-provider
subjects:
- kind: ServiceAccount
  name: cloud-provider
  namespace: kube-system

$ oc get sa cloud-provider -n kube-system
Error from server (NotFound): serviceaccounts "cloud-provider" not found

Expected results:

The serviceAccount called cloud-provider to exist or otherwise the ClusterRoleBinding to be removed.

Additional info:

Finding related to a Security review done on the OpenShift Container Platform 4 - Platform

[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-kni-infra

The test is now passing near 0% of the time on metal-ipi.

Started around Feb 10th.

event [namespace/openshift-kni-infra node/master-1.ostest.test.metalkube.org pod/haproxy-master-1.ostest.test.metalkube.org hmsg/64785a22cf - Back-off restarting failed container haproxy in pod haproxy-master-1.ostest.test.metalkube.org_openshift-kni-infra(336080d8c1b455c151170524132c026d)] happened 295 times

Possible relation to haproxy 2.8 merge?

Logs indicate the error is:

/bin/bash: line 47: socat: command not found

Description of the problem:
Environment: ACM running on a SNO. ACM is dual stack, but spoke cluster is IPv6 only

Attempting to deploy the spoke cluster via ZTP fails with proxy errors appearing in the ironic-python-agent logs:

Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 CRITICAL ironic-python-agent [-] Unhandled error: requests.exceptions.ProxyError: HTTPSConnectionPool(host='10.240.92.11', port=5050): Max retries exceeded with url: /v1/continue (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Traceback (most recent call last):
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 696, in urlopen
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent self._prepare_proxy(conn)
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 964, in _prepare_proxy
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent conn.connect()
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connection.py", line 366, in connect
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent self._tunnel()
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib64/python3.9/http/client.py", line 930, in _tunnel
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent raise OSError(f"Tunnel connection failed:")
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent OSError: Tunnel connection failed: 403 Forbidden
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent During handling of the above exception, another exception occurred:
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent Traceback (most recent call last):
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent resp = conn.urlopen(
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent retries = retries.increment(
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent raise MaxRetryError(_pool, url, error or ResponseError(cause))
Mar 26 19:11:53 sancrvdu3.cran-openshift.bete.ericy.com podman[2825]: 2024-03-26 19:11:53.361 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.240.92.11', port=5050): Max retries exceeded with url: /v1/continue (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))

How reproducible:
Always
 

Steps to reproduce:

1. Deploy ACM on a dual-stack SNO

2. Configure ACM for ZTP/GitOps

3. Use ZTP to deploy a spoke IPv6 only SNO

Actual results:
Proxy errors when the ironic agent attempts to communicate with ACM. The ironic-python-agent.conf incorrectly specifies the IPv4 endpoint:

 

$ cat /etc/ironic-python-agent.conf
[DEFAULT]
api_url = https://192.168.92.11:6385
inspection_callback_url = https://192.168.92.11:5050/v1/continue
insecure = True
enable_vlan_interfaces = all

 

 
Expected results:
Spoke cluster is successfully deployed via ZTP.
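
For illustration, a small sketch (hypothetical, not the assisted-service code) of picking a callback endpoint that matches the spoke's address family and formatting it correctly for IPv6, which is the kind of selection the generated ironic-python-agent.conf above is missing:

package main

import (
	"fmt"
	"net"
	"net/url"
)

// pickEndpoint returns a URL for the first candidate address that matches the
// wanted IP family (IPv6 for an IPv6-only spoke). Purely illustrative.
func pickEndpoint(candidates []string, wantIPv6 bool, port, path string) (string, bool) {
	for _, c := range candidates {
		ip := net.ParseIP(c)
		if ip == nil {
			continue
		}
		isV6 := ip.To4() == nil
		if isV6 != wantIPv6 {
			continue
		}
		u := url.URL{
			Scheme: "https",
			// JoinHostPort brackets IPv6 literals, e.g. [fd00::11]:5050.
			Host: net.JoinHostPort(c, port),
			Path: path,
		}
		return u.String(), true
	}
	return "", false
}

func main() {
	// Addresses are made up for the example: an IPv4 hub address and an IPv6 one.
	candidates := []string{"192.168.92.11", "fd00:abcd::11"}
	if ep, ok := pickEndpoint(candidates, true, "5050", "/v1/continue"); ok {
		fmt.Println(ep) // https://[fd00:abcd::11]:5050/v1/continue
	}
}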

Code to decide whether to update the Pull Secret replicates most of the functionality of the ApplySecret() func in library-go, which it then calls anyway.

This is hard to read, and misleading for anybody wanting to add similar functionality.

Description of problem:

Currently in oc-mirror v2 the user has no way of determining whether an error that occurs during a mirror is an actual failure or a flake. We need to provide a way for the user to determine such errors easily, which will make the product/user experience better.
    

Version-Release number of selected component (if applicable):

    [knarra@knarra-thinkpadx1carbon7th openshift-tests-private]$ oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404221110.p0.g0e2235f.assembly.stream.el9-0e2235f", GitCommit:"0e2235f4a51ce0a2d51cfc87227b1c76bc7220ea", GitTreeState:"clean", BuildDate:"2024-04-22T16:05:56Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

     Always
    

Steps to Reproduce:

    1.  Install latest oc-mirror v2
    2.  Run mirror2disk via command `oc-mirror -c <config.yaml> file://out --v2`
    3.  Now run disk2mirror via command `oc-mirror -c <config.yaml> --from file://out docker:<localhost:5000>/mirror
    

Actual results:

    sometimes mirror fails with error 
 2024/04/24 13:15:38  [ERROR]  : [Worker] err: copying image 3/3 from manifest list: reading blob sha256:418a8fe842682e4eadab6f16a6ac8d30550665a2510090aa9a29c607d5063e67: fetching blob: unauthorized: Access to the requested resource is not authorized
    

Expected results:

     There should be a way for the user to determine whether the error that happened was an actual error or a transient race. The above error is not intuitive
    

Additional info:

    Discussed here https://redhat-internal.slack.com/archives/C050P27C71S/p1714025295014549
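
A minimal sketch of the general idea only: this is not oc-mirror's code, and the substring checks are assumptions. It shows how a copy step could be retried with backoff and the final error labeled as likely transient versus permanent, so the user can tell the two apart:

package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// likelyTransient guesses whether a mirroring error is worth retrying.
// The matched substrings are illustrative assumptions, not an exhaustive list.
func likelyTransient(err error) bool {
	msg := strings.ToLower(err.Error())
	for _, s := range []string{"timeout", "connection reset", "too many requests", "unauthorized"} {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// copyWithRetry retries a copy operation a few times with backoff and wraps the
// final error with a hint about whether it looked transient.
func copyWithRetry(copyFn func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = copyFn(); err == nil {
			return nil
		}
		if !likelyTransient(err) {
			return fmt.Errorf("permanent error (not retried): %w", err)
		}
		time.Sleep(time.Duration(i+1) * 2 * time.Second)
	}
	return fmt.Errorf("error persisted after %d attempts (possibly a transient registry flake): %w", attempts, err)
}

func main() {
	err := copyWithRetry(func() error {
		return errors.New("fetching blob: unauthorized: Access to the requested resource is not authorized")
	}, 3)
	fmt.Println(err)
}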
    

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-28-192827

 

 [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] 

{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:360]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"cluster-node-tuning-operator\" produces more watch requests than expected: watchrequestcount=209, upperbound=184, ratio=1.14",
    ]
to be empty
Ginkgo exit error 1: exit with code 1} 

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/136

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Enable French and Spanish in the OCP Console

A.C.

  • Add French and Spanish to User Preferences dropdown
  • Download French and Spanish translations from Memsource portal when it is ready
  • Review and open a pull request

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/39

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/console/pull/13434

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/oc/pull/1627

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration
    

Version-Release number of selected component (if applicable):

    

How reproducible:

Not always

Steps to Reproduce:

1. Enable the egressfirewall, externalIP, multicast, multus, network-policy, and service-idle features.
2. Start migrating the cluster from SDN to OVN
     

Actual results:

[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
validatingwebhookconfiguration.admissionregistration.k8s.io "sre-techpreviewnoupgrade-validation" deleted
[weliang@weliang ~]$ oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited
[weliang@weliang ~]$ oc get node
NAME                          STATUS   ROLES                  AGE   VERSION
ip-10-0-20-154.ec2.internal   Ready    control-plane,master   86m   v1.28.5+9605db4
ip-10-0-45-93.ec2.internal    Ready    worker                 80m   v1.28.5+9605db4
ip-10-0-49-245.ec2.internal   Ready    worker                 74m   v1.28.5+9605db4
ip-10-0-57-37.ec2.internal    Ready    infra,worker           60m   v1.28.5+9605db4
ip-10-0-60-0.ec2.internal     Ready    infra,worker           60m   v1.28.5+9605db4
ip-10-0-62-121.ec2.internal   Ready    control-plane,master   86m   v1.28.5+9605db4
ip-10-0-62-56.ec2.internal    Ready    control-plane,master   86m   v1.28.5+9605db4
[weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" --  chroot /host cat /etc/kubernetes/kubelet.conf | grep NetworkLiveMigration ; done
Starting pod/ip-10-0-20-154ec2internal-debug-9wvd8 ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-45-93ec2internal-debug-rwvls ...
To use host binaries, run `chroot /host`
    "NetworkLiveMigration": true,Removing debug pod ...
Starting pod/ip-10-0-49-245ec2internal-debug-rp9dt ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-57-37ec2internal-debug-q5thk ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-60-0ec2internal-debug-zp78h ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-62-121ec2internal-debug-42k2g ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
Starting pod/ip-10-0-62-56ec2internal-debug-s99ls ...
To use host binaries, run `chroot /host`Removing debug pod ...
    "NetworkLiveMigration": true,
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/live-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$ 
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2024-01-06-062415   True        False         True       4h1m    Internal error while updating operator configuration: could not apply (/, Kind=) /cluster, err: failed to apply / update (operator.openshift.io/v1, Kind=Network) /cluster: Operation cannot be fulfilled on networks.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
[weliang@weliang ~]$ oc get node
NAME                          STATUS   ROLES                  AGE     VERSION
ip-10-0-2-52.ec2.internal     Ready    worker                 3h54m   v1.28.5+9605db4
ip-10-0-26-16.ec2.internal    Ready    control-plane,master   4h2m    v1.28.5+9605db4
ip-10-0-32-116.ec2.internal   Ready    worker                 3h54m   v1.28.5+9605db4
ip-10-0-32-67.ec2.internal    Ready    infra,worker           3h38m   v1.28.5+9605db4
ip-10-0-35-11.ec2.internal    Ready    infra,worker           3h39m   v1.28.5+9605db4
ip-10-0-39-125.ec2.internal   Ready    control-plane,master   4h2m    v1.28.5+9605db4
ip-10-0-6-117.ec2.internal    Ready    control-plane,master   4h2m    v1.28.5+9605db4
[weliang@weliang ~]$ oc get Network.operator.openshift.io/cluster -o json
{
    "apiVersion": "operator.openshift.io/v1",
    "kind": "Network",
    "metadata": {
        "creationTimestamp": "2024-01-08T13:28:07Z",
        "generation": 417,
        "name": "cluster",
        "resourceVersion": "236888",
        "uid": "37fb36f0-c13c-476d-aea1-6ebc1c87abe8"
    },
    "spec": {
        "clusterNetwork": [
            {
                "cidr": "10.128.0.0/14",
                "hostPrefix": 23
            }
        ],
        "defaultNetwork": {
            "openshiftSDNConfig": {
                "enableUnidling": true,
                "mode": "NetworkPolicy",
                "mtu": 8951,
                "vxlanPort": 4789
            },
            "ovnKubernetesConfig": {
                "egressIPConfig": {},
                "gatewayConfig": {
                    "ipv4": {},
                    "ipv6": {},
                    "routingViaHost": false
                },
                "genevePort": 6081,
                "mtu": 8901,
                "policyAuditConfig": {
                    "destination": "null",
                    "maxFileSize": 50,
                    "maxLogFiles": 5,
                    "rateLimit": 20,
                    "syslogFacility": "local0"
                }
            },
            "type": "OVNKubernetes"
        },
        "deployKubeProxy": false,
        "disableMultiNetwork": false,
        "disableNetworkDiagnostics": false,
        "kubeProxyConfig": {
            "bindAddress": "0.0.0.0"
        },
        "logLevel": "Normal",
        "managementState": "Managed",
        "migration": {
            "mode": "Live",
            "networkType": "OVNKubernetes"
        },
        "observedConfig": null,
        "operatorLogLevel": "Normal",
        "serviceNetwork": [
            "172.30.0.0/16"
        ],
        "unsupportedConfigOverrides": null,
        "useMultiNetworkPolicy": false
    },
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-01-08T13:28:07Z",
                "status": "False",
                "type": "ManagementStateDegraded"
            },
            {
                "lastTransitionTime": "2024-01-08T17:29:52Z",
                "status": "False",
                "type": "Degraded"
            },
            {
                "lastTransitionTime": "2024-01-08T13:28:07Z",
                "status": "True",
                "type": "Upgradeable"
            },
            {
                "lastTransitionTime": "2024-01-08T17:26:38Z",
                "status": "False",
                "type": "Progressing"
            },
            {
                "lastTransitionTime": "2024-01-08T13:28:20Z",
                "status": "True",
                "type": "Available"
            }
        ],
        "readyReplicas": 0,
        "version": "4.15.0-0.nightly-2024-01-06-062415"
    }
}
[weliang@weliang ~]$ 

Expected results:

OVN live migration pass

Additional info:

must-gather: https://people.redhat.com/~weliang/must-gather1.tar.gz
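
The "the object has been modified" message above is the API server's standard optimistic-concurrency conflict. A minimal sketch of the usual client-go pattern for retrying such an update (using the generic dynamic client; this is not the actual cluster-network-operator code, and the annotation key is illustrative):

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group: "operator.openshift.io", Version: "v1", Resource: "networks",
	}

	// Re-read and re-apply the change on every conflict instead of failing with
	// "the object has been modified; please apply your changes to the latest version".
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		obj, err := client.Resource(gvr).Get(context.TODO(), "cluster", metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Mutate the freshly fetched object here; the annotation key is made up.
		annotations := obj.GetAnnotations()
		if annotations == nil {
			annotations = map[string]string{}
		}
		annotations["example.openshift.io/touched"] = "true"
		obj.SetAnnotations(annotations)

		_, err = client.Resource(gvr).Update(context.TODO(), obj, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		panic(err)
	}
}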

User Story

As a developer I want to remove the NoUpgrade annotation from the CAPI IPAM CRDs so that I can promote them to General Availability

Background

The SPLAT team is planning to have the CAPI IPAM CRDs promoted to GA because they need them in a component they are promoting to GA.

Steps

  • remove the NoUpgrade annotation from the CAPI IPAM CRDs

Stakeholders

  • SPLAT
  • Cluster Infra Team

Definition of Done

  • manifests-generator PR merged
  • cluster-api repo PR merged

Description of problem:

OTA team wants to rename the `supported but not recommended` update edges to `known issues`

Version-Release number of selected component (if applicable):

    openshift-4.16

Expected results:

`supported but not recommended` edges are renamed to `known issues`

Additional info:

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L191

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L216

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L219

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219/frontend/public/components/modals/cluster-update-modal.tsx#L234

https://github.com/openshift/console/blob/962ab7b443f87e45486741444a32053a807d3219[…]rontend/public/components/cluster-settings/cluster-settings.tsx    

Please review the following PR: https://github.com/openshift/router/pull/547

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CredentialsRequest for Azure AD workload identity contains unnecessary permissions under `virtualMachines/extensions`.   Specifically write and delete.  
    

Version-Release number of selected component (if applicable):

4.14.0+
    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Create a cluster without the CredentialsRequest permissions mentioned
    2. Scale machineset
    3. See no permission errors
    

Actual results:

We have unnecessary permissions, but still no errors
    

Expected results:

Still no permission errors after removal.
    

Additional info:

RHCOS doesn't leverage virtual machine extensions.  It appears as though the code path is dead.  
    

Description of problem:

When extracting `oc` and `openshift-install` from a release payload, the warnings below are shown, which might be confusing for the user. To make this clearer, please update the warning to add the image names to the kubectl version mismatch message, in addition to the version list.

Version-Release number of selected component (if applicable):

always

How reproducible:

Always
    

Steps to Reproduce:

    1. Run command to extract oc & openshift-install using `oc adm extract`
    2. Run oc adm release info --commits <payload>
    3.
    

Actual results:

    $ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-03-05-032119
warning: multiple versions reported for the kubectl: 1.29.1,1.28.2,1.29.0

    

Expected results:

     Show the image names that need the Kubernetes bump along with the kubectl version
    

Additional info:

   Thread here:  https://redhat-internal.slack.com/archives/GK58XC2G2/p1709565188855519
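
A tiny sketch of the requested output shape (purely hypothetical; this is not how `oc` formats the warning today): grouping the payload images by the kubectl version they report, so the warning can name which images need the Kubernetes bump:

package main

import (
	"fmt"
	"sort"
	"strings"
)

func main() {
	// Image -> reported kubectl version; the mapping below is made up for the example.
	versions := map[string]string{
		"hyperkube":                       "1.29.1",
		"cluster-kube-apiserver-operator": "1.29.0",
		"oc":                              "1.28.2",
	}

	// Group the images by version so the warning can name them.
	byVersion := map[string][]string{}
	for image, v := range versions {
		byVersion[v] = append(byVersion[v], image)
	}

	var parts []string
	for v, images := range byVersion {
		sort.Strings(images)
		parts = append(parts, fmt.Sprintf("%s (%s)", v, strings.Join(images, ", ")))
	}
	sort.Strings(parts)
	fmt.Printf("warning: multiple versions reported for kubectl: %s\n", strings.Join(parts, "; "))
}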
    

Description of problem:

In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are sad about api-int X.509:

$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority

Version-Release number of selected component (if applicable)

4.15.3, so we have 4.15.2's OCPBUGS-30304 but not 4.15.5's OCPBUGS-30237.

How reproducible

Seen in two clusters after updating from 4.14 to 4.15.3.

Steps to Reproduce

Unclear.

Actual results

Sad multus pods.

Expected results

Happy cluster.

Additional info

$ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
...
Certificate chain
 0 s:CN = api-int.REDACTED
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
...
 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
...

So that's created seconds after the update was initiated. We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

$ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.

We were able to recover individual nodes via:

  1. oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig  from any machine with an admin kubeconfig
  2. copy to all nodes as /etc/kubernetes/kubeconfig
  3. on each node rm /var/lib/kubelet/kubeconfig
  4. restart each node
  5. approve each kubelet CSR
  6. delete the node's multus-* pod.

Please review the following PR: https://github.com/openshift/etcd/pull/236

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/36

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If OLMPlacement is set to management and the cluster comes up with disableAllDefaultSources set to true, removing it from the HostedCluster CR does not remove disableAllDefaultSources in the guest cluster; it is still set to true

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

When running a build (e.g. oc start-build) the build fails with reason CannotRetrieveServiceAccount and message Unable to look up the service account secrets for this build.

When platform specific passwords are included in the install-config.yaml they are stored in the generated agent-cluster-install.yaml, which is included in the output of the agent-gather command. These passwords should be redacted.

Steps to Reproduce:

1. Install a cluster using Azure Workload Identity
2. Check the value of the cco_credentials_mode metric

Actual results:

mode = manual    

Expected results:

mode = manualpodidentity

Additional info:

The cco_credentials_mode metric reports manualpodidentity mode for an AWS STS cluster. 

Description of problem:

 Facing error while creating manifests:

./openshift-install create manifests --dir openshift-config
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: unexpected end of JSON input

Using below document :

https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-vpc.html#installation-gcp-config-yaml_installing-gcp-vpc

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In the "request-serving" deployment model for HCPs, the request-serving nodes are being memory starved by the k8s API server. This has an observed impact on limiting the number of nodes a guest HCP cluster can provision, especially during upgrade events.

This is a spike card to investigate setting the API Priority and Fairness [1] configuration, and exactly what configuration would be necessary to set.

[1] https://kubernetes.io/docs/concepts/cluster-administration/flow-control/

Description of problem:

OVN br-int OVS flows do not get updated on other nodes when a node's bond MAC address changes to the other slave interface after a reboot. This causes network traffic coming from the SDN of one node to get dropped when it hits the node that changed the MAC address on its bond interface.  

Version-Release number of selected component (if applicable): 4.12+

How reproducible: 100% of the time after rebooting if the mac changes. mac does not always change.

Steps to Reproduce:

1. Capture bond0 mac before reboot
2. Reboot host
3. Confirm mac change
4.  oc run --rm -it test-pod-sdn --image=registry.redhat.io/openshift4/network-tools-rhel8  --overrides='{"spec": {"tolerations": [{"operator": "Exists"}],"nodeSelector":{"kubernetes.io/hostname":"nodeb-not-rebooted"}}}' /bin/bash
5. Ping rebooted node

Actual results:

The ping reaches the rebooted node but is dropped because the MAC address belongs to the other slave interface, not the one the bond is using. 

Expected results:

OVS flows should update on all nodes after a reboot if the MAC changes 

Additional info:

  If we restart NetworkManager a couple of times, this triggers the OVS flows to get updated; we are not sure why. 

Possible workarounds 
 -  https://access.redhat.com/solutions/6972925
 - Statically set the mac of bond0 to one of the slave interfaces. 

Description of problem:

During the control plane upgrade e2e test, it seems that the openshift apiserver becomes unavailable during the upgrade process. The test is run on an HA control plane, and this should not happen.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Often

Steps to Reproduce:

1. Create a hosted cluster with HA control plane and wait for it to become available
2. Upgrade the hosted cluster to a newer release
3. While upgrading, monitor whether the openshift apiserver is available by querying APIService resources or resources served by the openshift apiserver (see the sketch below).
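
A rough sketch of one way step 3 could be automated (the group/version choice, polling interval, and kubeconfig path are assumptions; this is not the e2e test's actual code), using the discovery client to poll an API group served by the openshift apiserver:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/guest.kubeconfig") // path is a placeholder
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Poll a group/version served by the openshift apiserver; failures during the
	// upgrade window indicate an availability gap.
	for i := 0; i < 60; i++ {
		if _, err := dc.ServerResourcesForGroupVersion("apps.openshift.io/v1"); err != nil {
			fmt.Printf("%s openshift apiserver unavailable: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			fmt.Printf("%s openshift apiserver available\n", time.Now().Format(time.RFC3339))
		}
		time.Sleep(5 * time.Second)
	}
}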

Actual results:

The openshift apiserver is unavailable at some point during the upgrade

Expected results:

The openshift apiserver is available throughout the upgrade

Additional info:

 

Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/103

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   A few of the CI tests are continuously failing for the Jenkins Pipeline build strategy. As this strategy has been deprecated since 4.10, we should skip these tests to unblock the PRs.

Version-Release number of selected component (if applicable):

4.16.0
    

How reproducible:

    Almost Everytime

Actual results:

    Tests failing

Expected results:

    All the tests should pass.

Additional info:

Observed in: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/343#issuecomment-2057851847

Currently, importing the HyperShift API packages in other projects also brings in the dependencies and conflicts of the rest of the non-API packages. This is a request to create a Go submodule containing only the API packages.

Once we cut beta and the API is stable we should move it into its own repo

Description of the problem:

In PSI, BE master ~2.30 - a massive number of the following messages appear in the cluster events: "Cluster was updated with api-vip <IP ADDRESS>, ingress-vip <IP ADDRESS>".

This message repeats itself every minute, five times (I guess this is related to the number of hosts?).

The installation was started, but aborted due to network connection issues.

I've tried to reproduce in staging, but couldn't.

How reproducible:

Still checking

Steps to reproduce:

1. 

2.

3.

Actual results:

 

Expected results:
This message should not be shown more than once

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Reproducer:
1. On a GCP cluster, create an ingress controller with internal load balancer scope, like this:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: foo
  namespace: openshift-ingress-operator
spec:
  domain: foo.<cluster-domain>
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal

2. Wait for load balancer service to complete rollout

$ oc -n openshift-ingress get service router-foo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-foo LoadBalancer 172.30.101.233 10.0.128.5 80:32019/TCP,443:32729/TCP 81s

3. Edit ingress controller to set spec.endpointPublishingStrategy.loadBalancer.scope to External

the load balancer service (router-foo in this case) should get an external IP address, but currently it keeps the 10.x.x.x address that was already assigned.

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4070

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

The SAST scans keep coming up with bogus positive results from test and vendor files. This bug is just a placeholder to allow us to backport the change to ignore those files.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Running the command coreos-installer iso kargs show no longer works with the 4.13 Agent ISO. Instead we get this error:

$ coreos-installer iso kargs show agent.x86_64.iso
Writing manifest to image destination
Storing signatures
Error: No karg embed areas found; old or corrupted CoreOS ISO image.

This is almost certainly due to the way we repack the ISO as part of embedding the agent-tui binary in it.

It worked fine in 4.12. I have tested both with every version of coreos-installer from 0.14 to 0.17

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/101

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/37

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

All the apiservers:

  • kube-apiserver
  • openshift-apiserver
  • oauth-apiserver

Expose both `apiserver_request_slo_duration_seconds` and `apiserver_request_sli_duration_seconds`. The SLI metric was introduced in Kubernetes 1.26 as a replacement of `apiserver_request_slo_duration_seconds` which was deprecated in Kubernetes 1.27. This change is only a renaming so both metrics expose the same data. To avoid storing duplicated data in Prometheus, we need to drop the `apiserver_request_slo_duration_seconds` in favor of `apiserver_request_sli_duration_seconds`.

Since https://github.com/openshift/installer/pull/8093 merged, CI jobs for the agent appliance have been broken. It appears that the agent-register-cluster.service is no longer getting enabled.

These tests look to have been mistakenly deleted in 4.14 during the big monitortest refactor. We could use them right now to identify gcp jobs with spatter disruption.

[sig-network] there should be nearly zero single second disruptions for _
[sig-network] there should be reasonably few single second disruptions for _

Find out what happened and get them restored. The code is still there, but it looks like some assumptions about extracting the backend name may have been broken somewhere.

Description of problem:

When creating an application via devfile "Import from Git" in the Developer console using a GitLab repo, the following error blocks creation.
It only happens when using GitLab, not GitHub, and the CLI operation based on "oc new-app" works fine. In other words, the issue affects only the Dev console.

  Could not fetch kubernetes resource "/deploy.yaml" for component "kubernetes-deploy" from Git repository https://{gitlaburl}.

Version-Release number of selected component (if applicable):

4.15.z

How reproducible:

Always

Steps to Reproduce:

You can always reproduce it with the following steps.
a. Switch to the "Developer" perspective in the web console.
b. Go to "+Add", then click "Import from Git" in the "Git Repository" section of the page.
c. Enter "https://<GITLAB HOSTNAME>/XXXX/devfile-sample-go-basic.git" in the "Git Repo URL" text box.
d. Select "GitLab" in the "Git type" drop-down.
e. You see the error messages below.

Actual results:

The "/deploy.yaml" file path evaluated as invalid one with 400 response status during the process as follows.
Look at the URL, "/%2Fdeploy.yaml" shows us leading slash was duplicated there.

  Request URL:
    https://<GITLAB HOSTNAME>/api/v4/projects/yyyy/repository/files/%2Fdeploy.yaml/raw?ref=main
  Response:
    {"error":"file_path should be a valid file path"}

Expected results:

 The request URL for the "deploy.yaml" file should have the duplicated leading slash removed so that it provides the correct file path.

 Request URL:
   https://<GITLAB HOSTNAME>/api/v4/projects/yyyy/repository/files/deploy.yaml/raw?ref=main
 Response:
   "deploy.yaml" contents.

Additional info:

I submitted a pull request to fix this here: https://github.com/openshift/console/pull/13812 
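
Purely as an illustration of the intended behavior (the real fix is in the console PR above, which is TypeScript), here is a minimal Go sketch, with an assumed helper name, that trims the duplicated leading slash before URL-encoding the file path for the GitLab files API:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// gitlabRawFileURL is a hypothetical helper: it strips any leading slash from
// the devfile-relative path before URL-encoding it, so "/deploy.yaml" becomes
// ".../files/deploy.yaml/raw?ref=main" rather than ".../files/%2Fdeploy.yaml/raw?ref=main".
func gitlabRawFileURL(host, projectID, filePath, ref string) string {
	cleaned := strings.TrimPrefix(filePath, "/")
	return fmt.Sprintf("https://%s/api/v4/projects/%s/repository/files/%s/raw?ref=%s",
		host, projectID, url.PathEscape(cleaned), ref)
}

func main() {
	fmt.Println(gitlabRawFileURL("gitlab.example.com", "yyyy", "/deploy.yaml", "main"))
}
```
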
A HostedCluster/HostedControlPlane was stuck uninstalling. Inspecting the CPO logs showed:

"error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"

Unfortunately, I do not have enough access to the AWS account to inspect this security group, though I know it is the default worker security group because it's recorded in the hostedcluster .status.platform.aws.defaultWorkerSecurityGroupID
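
As a hypothetical troubleshooting aid (not part of the CPO code), a DependencyViolation on a security group can usually be traced to network interfaces that still reference it. A minimal aws-sdk-go sketch, assuming credentials and region come from the environment:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ec2.New(sess)

	// List the ENIs that still reference the default worker security group;
	// these are the "dependent objects" behind the DependencyViolation error.
	out, err := client.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("group-id"),
			Values: []*string{aws.String("sg-04abe599e5567b025")},
		}},
	})
	if err != nil {
		panic(err)
	}
	for _, eni := range out.NetworkInterfaces {
		fmt.Printf("%s attached to: %s\n", aws.StringValue(eni.NetworkInterfaceId), aws.StringValue(eni.Description))
	}
}
```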

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

I haven't tried to reproduce it yet, but can do so and update this ticket when I do. My theory is:

Steps to Reproduce:

1. Create an AWS HostedCluster, wait for it to create/populate defaultWorkerSecurityGroupID
2. Attach the defaultWorkerSecurityGroupID to anything else in the AWS account unrelated to the HCP cluster
3. Attempt to delete the HostedCluster

Actual results:

CPO logs:
"error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"
HostedCluster Status Condition
  - lastTransitionTime: "2023-11-09T22:18:09Z"
    message: ""
    observedGeneration: 3
    reason: StatusUnknown
    status: Unknown
    type: CloudResourcesDestroyed

Expected results:

I would expect that the CloudResourcesDestroyed status condition on the hostedcluster would reflect this security group as holding up the deletion instead of having to parse through logs.

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-api-provider-openstack/pull/300

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Priority Class override for ignition-server deployment was accidentally ripped out when a new reconcileProxyDeployment() func was introduced. 

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1.Create a cluster with priority class override opted in
    2.Override priority class in HC
    3.Check ignition server deployment priority class     

Actual results:

doesn't override priority class    

Expected results:

overridden priority class

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/94

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  When the user clicks 'Cancel' on any Secret creation page, it does not return to the Secrets list page

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-06-062415

How reproducible:

    Always

Steps to Reproduce:

    1. Go to Create Key/value secret|Image pull secret|Source secret|Webhook secret|FromYaml page
       eg:/k8s/ns/default/secrets/~new/generic
    2. Click Cancel button
    3.
    

Actual results:

    The page does not go back to Secrets list page
    eg: /k8s/ns/default/core~v1~Secret

Expected results:

    The page should go back to the Secrets list page

Additional info:

    

Description of problem:

Node Overview Pane not displaying    

Version-Release number of selected component (if applicable):

    

How reproducible:

    In the openshift console, under Compute > Node > Node Details >
the Overview tab does not display

Steps to Reproduce:

 In the openshift console, under Compute > Node > Node Details >
the Overview tab does not display     

Actual results:

    Overview tab does not display   

Expected results:

    Overview tab should display   

Additional info:



Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Go to a PVC's "VolumeSnapshots" tab; it shows the error "Oh no! Something went wrong."
    

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-03-140457
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Create a pvc in project. Go to the pvc's "VolumeSnapshots" tab.
    2.
    3.
    

Actual results:

1. The error "Oh no! Something went wrong." shows up on the page.
    

Expected results:

1. Should show volumesnapshot related to the pvc without error.
    

Additional info:

screenshot: https://drive.google.com/file/d/1l0i0DCFh_q9mvFHxnftVJL0AM1LaKFOO/view?usp=sharing 
    

Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-monitoring.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.15
Start Time: 2024-01-04T00:00:00Z
End Time: 2024-01-10T23:59:59Z
Success Rate: 42.31%
Successes: 11
Failures: 15
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 151
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Monitoring&confidence=95&environment=sdn%20no-upgrade%20amd64%20gcp%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=sdn&network=sdn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-10%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-04%2000%3A00%3A00&testId=openshift-tests%3A567152bb097fa9ce13dd2fb6885e094a&testName=%5Bsig-arch%5D%20events%20should%20not%20repeat%20pathologically%20for%20ns%2Fopenshift-monitoring&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial

Description of problem:

    If a cluster is installed using a proxy and the username used for connecting to the proxy contains the characters "%40" (encoding a "@" when a domain is provided), the installation fails. The failure happens because the proxy variables written to the file "/etc/systemd/system.conf.d/10-default-env.conf" on the bootstrap node are ignored by systemd. This issue was already fixed in MCO (BZ 1882674 - fixed in RHOCP 4.7), but it appears to affect the bootstrap process in 4.13 and 4.14, causing the installation to not start at all.

Version-Release number of selected component (if applicable):

    4.14, 4.13

How reproducible:

    100% always

Steps to Reproduce:

    1. Create an install-config.yaml file with "%40" in the middle of the username used for the proxy.
    2. Start the cluster installation.
    3. Bootstrap fails because the proxy variables are not used.
    

Actual results:

Installation fails because systemd fails to load the proxy variables if "%" is present in the username.

Expected results:

    Installation to succeed using a username with "%40" for the proxy. 

Additional info:

File "/etc/systemd/system.conf.d/10-default-env.conf" for the bootstrap should be generated in a way accepted by systemd.    

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-aws-ovn/1763616334094012416/artifacts/e2e-aws-ovn/run-e2e/artifacts/TestUpgradeControlPlane/namespaces/e2e-clusters-5tmhk-example-j4msj/core/pods/logs/cloud-network-config-controller-9689f46c8-w8h64-controller-previous.log

fatal error: concurrent map read and map write

goroutine 31 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002e81c0, {{0x0, 0x0}, {0xc000bd17c6, 0x2}, {0xc000bd17c0, 0x6}})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x67

goroutine 82 [runnable]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName(0xc0002e81c0, {{0x2ede1b9, 0x1a}, {0x2eb1bb8, 0x2}, {0x2ebb205, 0xa}}, {0x3388ab8?, 0xc00062e540})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:176 +0x2b2
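
The traceback shows one goroutine reading from the scheme (Scheme.New) while another is still registering types (AddKnownTypeWithName); a runtime.Scheme is not safe for concurrent registration and lookup. An illustrative sketch (not the controller's actual code) of the usual remedy, finishing all registration before the scheme is shared:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	scheme := runtime.NewScheme()

	// Register every type up front, before the scheme is handed to any goroutine.
	if err := corev1.AddToScheme(scheme); err != nil {
		panic(err)
	}

	// Only now is it safe for concurrent consumers to resolve kinds.
	obj, err := scheme.New(schema.GroupVersionKind{Version: "v1", Kind: "Pod"})
	fmt.Println(obj != nil, err)
}
```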

Description of problem:

When using OpenShift 4.15 on ROSA Hosted Control Planes, after disabling the ImageRegistry, the default secrets and service accounts are still being created.

This behavior should not occur once the registry is removed:

https://docs.openshift.com/rosa/nodes/pods/nodes-pods-secrets.html#auto-generated-sa-token-secrets_nodes-pods-secrets   

Version-Release number of selected component (if applicable):

4.15   

How reproducible:

Always

Steps to Reproduce:

    1. Deploy ROSA 4.15 HCP Cluster
    2. Set spec.managementState = "Removed" on the cluster.config.imageregistry.operator.openshift.io. The image registry will be removed
    3. Create a new OpenShift Project
    4. Observe the builder, default and deployer ServiceAccounts and their associated Secrets are still created
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Based on the Azure doc [1], NCv2-series Azure virtual machines (VMs) were retired on September 6, 2023. VMs can no longer be provisioned on those instance types.

So remove standardNCSv2Family from the azure doc tested_instance_types_x86_64 on 4.13+.

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/ncv2-series    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Cluster installation fails on NCv2-series instance types
    2.
    3.
    

Actual results:

 

Expected results:

    

Additional info:

    

Description of problem:

Usernames can contain all kinds of characters that are not allowed in resource names. Hash the name instead and use hex representation of the result to get a usable identifier.    
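
A minimal sketch of that approach (the prefix and digest length are assumptions, not the console's actual naming scheme):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// userSettingsConfigMapName is a hypothetical helper: hashing the username and
// using the hex digest yields a name that is always valid for a Kubernetes
// resource, regardless of which characters the OIDC username contains.
func userSettingsConfigMapName(username string) string {
	sum := sha256.Sum256([]byte(username))
	return "user-settings-" + hex.EncodeToString(sum[:])[:16]
}

func main() {
	fmt.Println(userSettingsConfigMapName("alice@example.com"))
}
```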

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. log in to the web console configured with a login to a 3rd party OIDC provider
    2. go to the User Preferences page / check the logs in the javascript console
    

Actual results:

The User Preferences page shows empty values instead of defaults.
The javascript console reports things like
```
consoleFetch failed for url /api/kubernetes/api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-kubeadmin r: configmaps "user-settings-kubeadmin" not found
```

Expected results:

   I am able to persist my user preferences. 

Additional info:

    

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/247

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Client side throttling observed when running the metrics controller. 

Steps to Reproduce:

1. Install an AWS cluster in mint mode
2. Enable debug log by editing cloudcredential/cluster
3. Wait for the metrics loop to run for a few times
4. Check CCO logs

Actual results:

// 7s consumed by metrics loop which is caused by client-side throttling 
time="2024-01-20T19:43:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
I0120 19:43:56.251278       1 request.go:629] Waited for 176.161298ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.451311       1 request.go:629] Waited for 197.182213ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.651313       1 request.go:629] Waited for 197.171082ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.850631       1 request.go:629] Waited for 195.251487ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
...
time="2024-01-20T19:44:03Z" level=info msg="reconcile complete" controller=metrics elapsed=7.231061324s

Expected results:

No client-side throttling when running the metrics controller. 
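
The waits in the log come from client-go's default client-side rate limiter (QPS 5, burst 10). One possible mitigation, sketched below under the assumption that the metrics loop uses a standard client-go client, is to raise those limits on the rest.Config; the actual fix may instead reduce the number of direct API GETs (for example by using a lister/cache).

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newLessThrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// client-go defaults are QPS=5, Burst=10; the metrics loop above hits that
	// limit and waits ~200ms per request.
	cfg.QPS = 50
	cfg.Burst = 100
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newLessThrottledClient(); err != nil {
		fmt.Println("not running in a cluster:", err)
	}
}
```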

Description of problem:

    All images have been removed from quay.io/centos7, and the oc new-app unit tests rely heavily on these images and have started failing. See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1716/pull-ci-openshift-oc-master-unit/1773203483667730432

Version-Release number of selected component (if applicable):

    probably all

How reproducible:

    Open a PR and see that pre-submit unit test fails

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    


Customer pentest shows that the Server header is returned by admin console when browsing

https://console-openshift-console$domain/locales/resource.json?lng=en&ns=plugin__odf-console

This could disclose version information that helps a potential attacker identify applicable CVEs.

Response header:

Server: nginx/1.20.1

Description of problem:

   As a logged-in user, I am unable to log out of a cluster configured with an external OIDC provider.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Login into cluster with external OIDC setup
    2.
    3.
    

Actual results:

    Unable to logout

Expected results:

    Logout successfully 

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/218

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/30

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

W0109 17:47:02.340203       1 builder.go:109] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'

Description of problem:

There are multiple dashboards, so this page title should be "Dashboards" rather than "Dashboard". The admin console version of this page is already titled "Dashboards".    

Version-Release number of selected component (if applicable):

4.16

Steps to Reproduce:

    1. Open "Developer" View > Observe

Actual results:

    See the tab title is "Dashboard"

Expected results:

    The tab title should be "Dashboards"

Description of problem:

    Package that we use for Power VS has recently been revealed to be unmaintained. We should remove it in favor of maintained solutions.

Version-Release number of selected component (if applicable):

    4.13.0 onward

How reproducible:

It's always used    

Steps to Reproduce:

    1. Deploy with IPI on Power VS
    2. Use bluemix-go
    3.
    

Actual results:

    bluemix-go is used

Expected results:

bluemix-go should be avoided    

Additional info:

    

Description of problem:

The hypershift operator introduced Azure customer-managed-key etcd encryption in https://github.com/openshift/hypershift/pull/3183. The implementation will not work in any cloud other than Azure Public Cloud, because the keyvault URL suffix is hardcoded to vault.azure.net here: https://github.com/openshift/hypershift/blob/cd4d4c69a64d8983da04d7bb26ea39a72109e135/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L4871, and that is only the public cloud keyvault domain suffix. The cloud-specific keyvault domain suffixes are listed here: https://learn.microsoft.com/en-us/azure/key-vault/general/about-keys-secrets-certificates#dns-suffixes-for-object-identifiers
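
A hypothetical sketch of cloud-aware suffix selection (the function name and mapping keys are assumptions, not hypershift's code):

```go
package main

import "fmt"

// Key Vault DNS suffixes per Azure cloud, per the Microsoft documentation
// linked above.
var keyVaultDNSSuffix = map[string]string{
	"AzurePublicCloud":       "vault.azure.net",
	"AzureUSGovernmentCloud": "vault.usgovcloudapi.net",
	"AzureChinaCloud":        "vault.azure.cn",
}

func keyVaultURL(cloud, vaultName string) string {
	suffix, ok := keyVaultDNSSuffix[cloud]
	if !ok {
		suffix = "vault.azure.net" // fall back to public cloud
	}
	return fmt.Sprintf("https://%s.%s", vaultName, suffix)
}

func main() {
	fmt.Println(keyVaultURL("AzureUSGovernmentCloud", "my-vault"))
}
```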
    

Version-Release number of selected component (if applicable):

    Since https://github.com/openshift/hypershift/pull/3183 was merged

How reproducible:

    Every time

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    The keyvault domain is hardcoded to work specifically for public cloud, but will not work for Azure Gov cloud when using etcd encryption with customer-managed keys

Expected results:

    The keyvault domain to fetch from will use the correct cloud's domain suffix as outlined here: https://learn.microsoft.com/en-us/azure/key-vault/general/about-keys-secrets-certificates#dns-suffixes-for-object-identifiers

Additional info:

    

A runbook for the alert TargetDown will be useful for OpenShift users.
This runbook should explain:
1. how to identify which targets are down
2. how to investigate the reason why the target goes offline
3. resolution of common causes bringing down the target

Related cases:

Description of problem:

We want to switch a trigger from auto to manual or vice versa. We can do it with the CLI: 'oc set triggers deployment/<name> --manual'. That normally sets the deployment annotation metadata.annotations.image.openshift.io/triggers to "paused: true" (or "paused: false" when set back to auto). But when we enable or disable the auto trigger by editing the deployment from the web console, it writes the annotation with "pause: false" or "pause: true", without the 'd'.

Version-Release number of selected component (if applicable):

    

How reproducible:

Create a simple httpd application. Follow [1] to set the trigger using the CLI. Steps to set the trigger from the console:

Web console->deployment-> Edit deployment > Form view-> Images section -> Enable Deploy image from an image stream tag -> Enable Auto deploy when new Image is available an save the changes -> check annotations

[1] https://docs.openshift.com/container-platform/4.12/openshift_images/triggering-updates-on-imagestream-changes.html

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

 

Expected results:

    

Additional info:

code: https://github.com/openshift/console/blob/master/frontend/packages/dev-console/src/utils/resource-label-utils.ts#L78

Description of problem:

Altinfra build jobs are failing

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always

Steps to Reproduce:

1.Build master installer and use latest nightly 4.16 release image
2.Run CAPI enabled installer with FeatureSet CustomNoUpgrade and featureGates: ["ClusterAPIInstall=true"]

    

Actual results:

Cluster fails to complete bootstrap

Expected results:

Cluster is able to install completely

Additional info:

This bug is to track investigation into why altinfra e2e jobs were failing for:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-installer-master-altinfra-e2e-vsphere-capi-ovn
Upon looking into it, the etcd operator was not being created. We saw the following:

CVO:

E0402 17:18:59.959209       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
E0402 17:19:03.862993       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
E0402 17:19:09.157126       1 task.go:124] error running apply for etcd "cluster" (108 of 937): failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1"
I0402 17:19:20.234944       1 task_graph.go:550] Result of work: [Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers Cluster operator kube-apiserver is not available Cluster operator machine-api is not available Cluster operator authentication is not available Cluster operator image-registry is not available Cluster operator ingress is not available Cluster operator monitoring is not available Cluster operator openshift-apiserver is not available Could not update rolebinding "openshift/cluster-samples-operator-openshift-edit" (536 of 937): resource may have been deleted Could not update oauthclient "console" (597 of 937): the server does not recognize this resource, check extension API servers Could not update imagestream "openshift/driver-toolkit" (659 of 937): resource may have been deleted Could not update role "openshift/copied-csv-viewer" (727 of 937): resource may have been deleted Could not update role "openshift-console-operator/prometheus-k8s" (855 of 937): resource may have been deleted Could not update role "openshift-console/prometheus-k8s" (859 of 937): resource may have been deleted]
I0402 17:19:20.235037       1 sync_worker.go:1166] Update error 108 of 937: UpdatePayloadResourceTypeMissing Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "Etcd" in version "operator.openshift.io/v1")
* Could not update etcd "cluster" (108 of 937): the server does not recognize this resource, check extension API servers 


Description of problem:

After running ./openshift-install destroy cluster, the TagCategory still exists

# ./openshift-install destroy cluster --dir cluster --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-12-18-220750
DEBUG Built from commit 2b894776f1653ab818e368fa625019a6de82a8c7
DEBUG Power Off Virtual Machines
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-2
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-1
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-master-0
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
DEBUG Powered off                                   VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Virtual Machines
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-rhcos-generated-region-generated-zone
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-2
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-1
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-master-0
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
INFO Destroyed                                     VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Folder
INFO Destroyed                                     Folder=sgao-devqe-spn2w
DEBUG Delete                                        StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
INFO Destroyed                                     StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
DEBUG Delete                                        Tag=sgao-devqe-spn2w
INFO Deleted                                       Tag=sgao-devqe-spn2w
DEBUG Delete                                        TagCategory=openshift-sgao-devqe-spn2w
INFO Deleted                                       TagCategory=openshift-sgao-devqe-spn2w
DEBUG Purging asset "Metadata" from disk
DEBUG Purging asset "Master Ignition Customization Check" from disk
DEBUG Purging asset "Worker Ignition Customization Check" from disk
DEBUG Purging asset "Terraform Variables" from disk
DEBUG Purging asset "Kubeconfig Admin Client" from disk
DEBUG Purging asset "Kubeadmin Password" from disk
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk
DEBUG Purging asset "Cluster" from disk
INFO Time elapsed: 29s
INFO Uninstallation complete!

# govc tags.category.ls | grep sgao
openshift-sgao-devqe-spn2w

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-18-220750

How reproducible:

    always

Steps to Reproduce:

    1. IPI install OCP on vSphere
    2. Destroy cluster installed, check TagCategory

Actual results:

    TagCategory still exist

Expected results:

    TagCategory should be deleted

Additional info:

    Also reproduced in openshift-install-linux-4.14.0-0.nightly-2023-12-20-184526 and 4.13.0-0.nightly-2023-12-21-194724, while 4.12.0-0.nightly-2023-12-21-162946 does not have this issue

Description of problem:

Based on this and this component readiness data that compares success rates for those two particular tests, we are regressing ~7-10% between the current 4.15 master and 4.14.z (iow. we made the product ~10% worse).

 

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1720630313664647168

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1719915053026643968

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721475601161785344

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-serial/1724202075631390720

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-upi-serial/1721927613917696000

These jobs and their failures are all caused by increased etcd leader elections disrupting seemingly unrelated test cases across the VSphere AMD64 platform.

Since this particular platform's business significance is high, I'm setting this as "Critical" severity.

Please get in touch with me or Dean West if more teams need to be pulled into investigation and mitigation.

 

Version-Release number of selected component (if applicable):

4.15 / master

How reproducible:

Component Readiness Board

Actual results:

The etcd leader elections are elevated. Some jobs indicate it is due to disk i/o throughput OR network overload. 

Expected results:

1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document this or work with VSphere infrastructure provider to fix this problem.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.

Additional info:

 

Please review the following PR: https://github.com/openshift/images/pull/155

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We merged this ART PR which bumps base images. And then the bumper reverted the changes here: https://github.com/openshift/operator-framework-operator-controller/pull/88/files.

I still see the ART bump commit in main, but there is an "Add OpenShift specific files" commit on top of it with older images. Actually, we now have two "Add OpenShift specific files" commits in main:

And every UPSTREAM: <carry>-prefixed commit seems to be duplicated on top of synced changes.

Expected result:

  • Bumper doesn't override/revert UPSTREAM: <carry>-prefixed commit contributed directly into the downstream repos. Order of UPSTREAM: <carry>-prefixed commits should be respected.

Description of problem:

We upgraded our OpenShift Cluster from 4.4.16 to 4.15.3 and multiple operators are now in "Failed" status with the following CSV conditions such as:
- NeedsReinstall installing: deployment changed old hash=5f6b8fc6f7, new hash=5hFv6Gemy1Zri3J9ulXfjG9qOzoFL8FMsLNcLR
- InstallComponentFailed install strategy failed: rolebindings.rbac.authorization.k8s.io "openshift-gitops-operator-controller-manager-service-auth-reader" already exists

All other failures refer to a similar "auth-reader" rolebinding that already exist.
 
    

Version-Release number of selected component (if applicable):

OpenShift 4.15.3
    

How reproducible:

Happened on several installed operators but on the only cluster we upgraded (our staging cluster)
    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:

All operators should be up-to-date
    

Additional info:


This may be related to https://github.com/operator-framework/operator-lifecycle-manager/pull/3159 
    

Description of problem:

https://issues.redhat.com/browse/OCPBUGS-22710?focusedId=23594559&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-23594559

 
It seems the issue is still here. Tested on 4.15.0-0.nightly-2023-12-04-223539: there is a status.message for each zone, but there is no summarized status, so moving this back to Assigned.
apbexternalroute yaml file is:

apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: default-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9" 

and Status section as below:

% oc get apbexternalroute
NAME                   LAST UPDATE   STATUS
default-route-policy   12s <--- still empty
% oc describe apbexternalroute default-route-policy | tail -n 10
Status:
Last Transition Time: 2023-12-06T02:12:11Z
Messages:
qiowang-120620-gtt85-master-2.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-master-0.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-a-55fzx.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-master-1.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-b-m98ms.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
qiowang-120620-gtt85-worker-c-vtl8q.c.openshift-qe.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events: <none> 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

This update is compatible with recent kube-openapi changes so that we can drop our replace k8s.io/kube-openapi => k8s.io/kube-openapi v0.0.0-20230928195430-ce36a0c3bb67 introduced in 53b387f4f54c8426526478afd0fd3e2b4e7aec66.

Please review the following PR: https://github.com/openshift/csi-operator/pull/78

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/54

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/81

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/491

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Component Readiness has found a potential regression in [sig-arch][Early] CRDs for openshift.io should have subresource.status [Suite:openshift/conformance/parallel].

Probability of significant regression: 98.48%

Sample (being evaluated) Release: 4.16
Start Time: 2024-03-21T00:00:00Z
End Time: 2024-03-27T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 99.28%
Successes: 138
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20no-upgrade%20amd64%20metal-ipi%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-03-27%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-03-21%2000%3A00%3A00&testId=openshift-tests%3Ab3e170673c14c432c14836e9f41e7285&testName=%5Bsig-arch%5D%5BEarly%5D%20CRDs%20for%20openshift.io%20should%20have%20subresource.status%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial

In examining these test failures, we found what is actually a pretty random grouping of failing tests, and as a group they are likely a fairly significant part of why component readiness is reporting so much red on metal right now.

In March the metal team modified some configuration such that a portion of metal jobs can now land in a couple new environments, one of them ibmcloud.

The test linked above helped find the pattern: opening the spyglass chart in prow shows a clear pattern that we then found in many other failed metal jobs:

  • pod-logs section full of etcd logging problems where reads and writes took too long
  • a vertical line of disruption across multiple backends
  • an abnormal vertical line of etcd leader elections jumping around
  • a vertical line of failed e2e tests

All of these line up within the same vertical space indicating the problem was at the same time, and the pod-logs section is as full as ever.

Derek Higgins has pulled ibmcloud out of rotation until they can attempt some SSD for etcd.

This bug is for introduction of a test that will make this symptom of etcd being very unhealthy visible as a test failure, both to communicate to engineers who look at the runs and help them understand this critical failure, and to help us locate runs affected because no single existing test can really do this today.

Azure and GCP jobs normally log these etcd warnings 3-5k times in a CI run; these ibmcloud runs were showing 30-70k. A limit of 10k was chosen based on examining the data in BigQuery: only 50 jobs have exceeded that this month, all of them metal and agent jobs.
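
A rough sketch of the kind of check being introduced (the matched substring and the 10k limit here are illustrative assumptions):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// countSlowEtcdWarnings counts etcd "took too long" overload warnings in a
// gathered pod log file.
func countSlowEtcdWarnings(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "took too long") {
			count++
		}
	}
	return count, scanner.Err()
}

func main() {
	const limit = 10000
	n, err := countSlowEtcdWarnings("etcd.log")
	if err != nil {
		fmt.Println(err)
		return
	}
	if n > limit {
		fmt.Printf("FAIL: %d slow etcd warnings exceeds limit %d\n", n, limit)
	}
}
```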

Description of problem:

The cluster-network-operator in hypershift, when templating in-cluster resources, does not use the node-local address of the client-side haproxy load balancer that runs on all nodes. This bypasses a level of health checking of the redundant backend apiserver addresses that is performed by the local kube-apiserver-proxy pods running on every node in a hypershift environment. In environments where the backend API servers are not fronted by an additional cloud load balancer, this leads to a percentage of request failures from in-cluster components when a control plane endpoint goes down, even if other endpoints are available.

Version-Release number of selected component (if applicable):

  4.16 4.15 4.14

How reproducible:

    100%

Steps to Reproduce:

    1. Setup a hypershift cluster in a baremetal/non cloud environment where there are redundant API servers behind a DNS that point directly to the node IPs.
    2. Power down one of the control plane nodes
    3. Schedule workload into cluster that depends on kube-proxy and/or multus to setup networking configuration
    4. You will see errors like the following 
```
add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout
```
    

Actual results:

    When a control plane node fails intermittent timeouts occur when kube-proxy/multus resolve the dns and a failed control plane node ip is returned

Expected results:

    No requests fail (which will be the case if all traffic is routed through the node-local load balancer instance).

Additional info:

    Additionally: control plane components in the management cluster that live next to the apiserver add unneeded dependencies by using an external DNS entry to talk to the kube-apiserver, when they could use the local kube-apiserver address so that all of that traffic stays on cluster-local networking.

This is a regression due to the fix for https://issues.redhat.com/browse/OCPBUGS-23069.

When using dual-stack networks with networks other than OVN or SDN a validation failure results. For example when using this networking config:

networking:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 25
    - cidr: fd01::/48
      hostPrefix: 64
  networkType: Calico

The following error will be returned:

{
  "id": "network-prefix-valid",
  "status": "failure",
  "message": "Unexpected status ValidationError"
},

When the clusterNetwork prefixes are removed the following error will result:

{
  "id": "network-prefix-valid",
  "status": "failure",
  "message": "Invalid Cluster Network prefix: Host prefix, now 0, must be a positive integer."
},
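
An illustrative sketch of the intended validation behavior (type and function names are assumptions): only OpenShiftSDN and OVNKubernetes should require a positive host prefix on clusterNetwork entries, so other network types such as Calico should pass.

```go
package main

import "fmt"

type clusterNetwork struct {
	CIDR       string
	HostPrefix int
}

// validateHostPrefix enforces a positive host prefix only for the network
// types that actually consume it.
func validateHostPrefix(networkType string, nets []clusterNetwork) error {
	if networkType != "OpenShiftSDN" && networkType != "OVNKubernetes" {
		return nil // e.g. Calico manages its own prefix semantics
	}
	for _, n := range nets {
		if n.HostPrefix <= 0 {
			return fmt.Errorf("invalid cluster network prefix for %s: host prefix must be a positive integer", n.CIDR)
		}
	}
	return nil
}

func main() {
	nets := []clusterNetwork{{CIDR: "10.128.0.0/14"}, {CIDR: "fd01::/48"}}
	fmt.Println(validateHostPrefix("Calico", nets)) // <nil>
}
```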

Description of problem:

It has been noticed that ovs-monitor-ipsec fails to import the cert into the NSS db with the following error.

2024-04-17T19:57:21.140989157Z 2024-04-17T19:57:21Z |  6  | reconnect | INFO | unix:/var/run/openvswitch/db.sock: connecting...
2024-04-17T19:57:21.142234972Z 2024-04-17T19:57:21Z |  9  | reconnect | INFO | unix:/var/run/openvswitch/db.sock: connected
2024-04-17T19:57:21.170709468Z 2024-04-17T19:57:21Z |  14 | ovs-monitor-ipsec | INFO | Tunnel ovn-69b991-0 appeared in OVSDB
2024-04-17T19:57:21.171379359Z 2024-04-17T19:57:21Z |  16 | ovs-monitor-ipsec | INFO | Tunnel ovn-52bc87-0 appeared in OVSDB
2024-04-17T19:57:21.171826906Z 2024-04-17T19:57:21Z |  18 | ovs-monitor-ipsec | INFO | Tunnel ovn-3e78bb-0 appeared in OVSDB
2024-04-17T19:57:21.172300675Z 2024-04-17T19:57:21Z |  20 | ovs-monitor-ipsec | INFO | Tunnel ovn-12fb32-0 appeared in OVSDB
2024-04-17T19:57:21.172726970Z 2024-04-17T19:57:21Z |  22 | ovs-monitor-ipsec | INFO | Tunnel ovn-8a4d01-0 appeared in OVSDB
2024-04-17T19:57:21.178644919Z 2024-04-17T19:57:21Z |  24 | ovs-monitor-ipsec | ERR | Import cert and key failed.
2024-04-17T19:57:21.178644919Z b"No cert in -in file '/etc/openvswitch/keys/ipsec-cert.pem' matches private key\n80FBF36CDE7F0000:error:05800074:x509 certificate routines:X509_check_private_key:key values mismatch:crypto/x509/x509_cmp.c:405:\n"
2024-04-17T19:57:21.179581526Z 2024-04-17T19:57:21Z |  25 | ovs-monitor-ipsec | ERR | traceback
2024-04-17T19:57:21.179581526Z Traceback (most recent call last):
2024-04-17T19:57:21.179581526Z   File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1382, in <module>
2024-04-17T19:57:21.179581526Z     main()
2024-04-17T19:57:21.179581526Z   File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1369, in main
2024-04-17T19:57:21.179581526Z     monitor.run()
2024-04-17T19:57:21.179581526Z   File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 1176, in run
2024-04-17T19:57:21.179581526Z     if self.ike_helper.config_global(self):
2024-04-17T19:57:21.179581526Z   File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 521, in config_global
2024-04-17T19:57:21.179581526Z     self._nss_import_cert_and_key(cert, key, name)
2024-04-17T19:57:21.179581526Z   File "/usr/share/openvswitch/scripts/ovs-monitor-ipsec", line 809, in _nss_import_cert_and_key
2024-04-17T19:57:21.179581526Z     os.remove(path)
2024-04-17T19:57:21.179581526Z FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ovs_certkey_ef9cf1a5-bfb2-4876-8fb3-69c6b22561a2.p12'

Version-Release number of selected component (if applicable):

 4.16.0   

How reproducible:

Hit on the CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-cluster-network-operator-master-e2e-ovn-ipsec-step-registry/1780660589492703232

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

openshift-install failed with error:

time="2024-04-17T19:34:47Z" level=error msg="Cluster initialization failed because one or more operators are not functioning properly.\nThe cluster should be accessible for troubleshooting as detailed in the documentation linked below,\nhttps://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html\nThe 'wait-for install-complete' subcommand can then be used to continue the installation"
time="2024-04-17T19:34:47Z" level=error msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator authentication is degraded\n* Cluster operators monitoring, openshift-apiserver are not available"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-cluster-network-operator-master-e2e-ovn-ipsec-step-registry/1780660589492703232/artifacts/e2e-ovn-ipsec-step-registry/ipi-install-install/artifacts/.openshift_install-1713382487.log

Expected results:

The cluster must come up with all COs running and IPsec enabled for E-W traffic.

Additional info:

It seems the ovn-ipsec-host pod's ovn-keys init container writes empty content into /etc/openvswitch/keys/ipsec-cert.pem even though the corresponding CSR contains a certificate in its status.
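
A sketch of one possible hardening, assuming the init container uses a standard client-go clientset (this is not the actual ovn-ipsec code): wait until the approved CSR actually carries a signed certificate before writing ipsec-cert.pem, so the PEM on disk is never empty when ovs-monitor-ipsec imports it.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForSignedCert polls the CSR until status.certificate is populated.
func waitForSignedCert(ctx context.Context, cs kubernetes.Interface, csrName string) ([]byte, error) {
	for {
		csr, err := cs.CertificatesV1().CertificateSigningRequests().Get(ctx, csrName, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if len(csr.Status.Certificate) > 0 {
			return csr.Status.Certificate, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Println(err)
		return
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	pem, err := waitForSignedCert(ctx, cs, "my-ipsec-csr") // CSR name is a placeholder
	if err != nil {
		fmt.Println(err)
		return
	}
	_ = os.WriteFile("/etc/openvswitch/keys/ipsec-cert.pem", pem, 0o600)
}
```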

Unknown machine config nodes are listed whose names do not correspond to any node in the current cluster. In my cluster there are 6 nodes, but I can see 10 machine config nodes.

// current node
$ oc get node
NAME                                        STATUS   ROLES                  AGE     VERSION
ip-10-0-12-209.us-east-2.compute.internal   Ready    worker                 3h48m   v1.28.3+59b90bd
ip-10-0-23-177.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd
ip-10-0-32-216.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd
ip-10-0-42-207.us-east-2.compute.internal   Ready    worker                 53m     v1.28.3+59b90bd
ip-10-0-71-71.us-east-2.compute.internal    Ready    worker                 3h46m   v1.28.3+59b90bd
ip-10-0-81-190.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd

// current mcn
$ oc get machineconfignode
NAME                                        UPDATED   UPDATEPREPARED   UPDATEEXECUTED   UPDATEPOSTACTIONCOMPLETE   UPDATECOMPLETE   RESUMED
ip-10-0-12-209.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-23-177.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-32-216.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-42-207.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-53-5.us-east-2.compute.internal     True      False            False            False                      False            False
ip-10-0-56-84.us-east-2.compute.internal    True      False            False            False                      False            False
ip-10-0-58-210.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-58-99.us-east-2.compute.internal    False     True             True             Unknown                    False            False
ip-10-0-71-71.us-east-2.compute.internal    True      False            False            False                      False            False
ip-10-0-81-190.us-east-2.compute.internal   True      False            False            False                      False            False

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-04-162702

How reproducible:

Consistently

Steps to Reproduce:

1. setup cluster with 4.15.0-0.nightly-2023-12-04-162702 on aws
2. enable featureSet: TechPreviewNoUpgrade
3. apply file based mc few times.
4. check node list
5. check machine config node list
     
    

Actual results:

there are some unknown machine config nodes found

Expected results:

machine config node number should be same as cluster node number

Additional info:

must-gather: https://drive.google.com/file/d/1-VTismwXXZ9sYMHi8hDL7vhwzjuMn92n/view?usp=drive_link    

Please review the following PR: https://github.com/openshift/node_exporter/pull/140

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We have detected several bugs in Console dynamic plugin SDK v1 as part of Kubevirt plugin PR #1804

These bugs affect dynamic plugins which target Console 4.15+

1. Build errors related to Hot Module Replacement chunk files

ERROR in [entry] [initial] kubevirt-plugin.494371abc020603eb01f.hot-update.js
Missing call to loadPluginEntry

2. Build warnings issued by dynamic-module-import-loader

LOG from @openshift-console/dynamic-plugin-sdk-webpack/lib/webpack/loaders/dynamic-module-import-loader ../node_modules/ts-loader/index.js??ruleSet[1].rules[0].use[0]!./utils/hooks/useKubevirtWatchResource.ts
<w> Detected parse errors in /home/vszocs/work/kubevirt-plugin/src/utils/hooks/useKubevirtWatchResource.ts

3. Build warnings related to PatternFly shared modules

WARNING in shared module @patternfly/react-core
No required version specified and unable to automatically determine one. Unable to find required version for "@patternfly/react-core" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/@openshift-console/dynamic-plugin-sdk/package.json). It need to be in dependencies, devDependencies or peerDependencies.

How to reproduce

1. git clone Kubevirt plugin repo
2. switch to commit containing changes from PR #1804
3. yarn install && yarn dev to update dependencies and start local dev server

Description of problem:

The gstreamer1 package (and its plugins) includes certain video/audio codecs, which create licensing concerns for our Partners, who embed our solutions (OCP) and deliver them to their end customers.

The ose-network-tools container image (seemingly applicable to all OCP releases) includes a dependency on the gstreamer1 rpm (and its plugin rpms, like gstreamer1-plugins-bad-free). The request is to reconsider this dependency and, if possible, remove it entirely. It is a blocking issue which prevents our partners from delivering their solution in the field.

It is an indirect dependency: ose-network-tools includes wireshark, wireshark depends on qt5-multimedia, which in turn depends on gstreamer1-plugins-bad-free.

First question: is wireshark really needed for network-tools? Wireshark is a GUI tool, so the dependency is not clear.
Second question: would wireshark-cli be sufficient for the needed purposes instead? The CLI version does not depend on qt5 and its transitive dependencies.

Version-Release number of selected component (if applicable):

    Seems applicable to all active OCP releases.

How reproducible:

    Steps to Reproduce:
    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/58

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The agent-based installer does not support the TechPreviewNoUpgrade featureSet, and by extension nor does it support any of the features gated by it. Because of this, there is no warning about one of these features being specified - we expect the TechPreviewNoUpgrade feature gate to error out when any of them are used.

However, we don't warn about TechPreviewNoUpgrade itself being ignored, so if the user does specify it then they can use some of these non-supported features without being warned that their configuration is ignored.

We should fail with an error when TechPreviewNoUpgrade is specified, until such time as AGENT-554 is implemented.
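For reference, a minimal sketch of the kind of install-config fragment that should now be rejected by the agent-based installer (all values other than featureSet are illustrative):

apiVersion: v1
baseDomain: example.com
metadata:
  name: example-cluster
featureSet: TechPreviewNoUpgrade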

Description of problem:

    The node-network-identity deployment should conform to hypershift control plane expectations: all applicable containers should have a liveness probe, and a readiness probe if the container is an endpoint for a service.
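A minimal sketch of the kind of probes expected on the container (path, port, and timings are assumptions, not the actual manifest):

livenessProbe:
  httpGet:
    path: /healthz
    port: 9743
    scheme: HTTPS
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9743
    scheme: HTTPS
  periodSeconds: 5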

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    No liveness or readiness probes

Expected results:

 

Additional info:

    

Description of problem:

    Environment file /etc/kubernetes/node.env is overwritten after node restart. 

There is a typo in https://github.com/openshift/machine-config-operator/blob/master/templates/common/aws/files/usr-local-bin-aws-kubelet-nodename.yaml where the variable should be changed to NODEENV wherever NODENV is found.
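A minimal, hypothetical shell sketch of the kind of mismatch described above (the variable names and guard logic are illustrative, not the actual template contents): the guard checks one spelling while the write path uses another, so the file is regenerated on every boot and manual edits are lost.

NODEENV=/etc/kubernetes/node.env
if [ -e "${NODENV}" ]; then   # typo: NODENV is never set, so this guard never fires
  exit 0
fi
echo "KUBELET_NODE_NAME=$(hostname)" > "${NODEENV}"   # overwrites any manual edits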

Version-Release number of selected component (if applicable):

    

How reproducible:

  Easy

Steps to Reproduce:

    1. Change contents of /etc/kubernetes/node.env
    2. Restart node
    3. Notice changes are lost
    

Actual results:

  

Expected results:

     /etc/kubernetes/node.env should not be changed after restart of a node

Additional info:

    

Description of problem:

 "create serverless function" functionality in the Openshift UI. When you add a (random) repository it shows a warning saying "func.yaml is not present and builder strategy is not s2i" but without any further link or information. That's not a very good UX imo.  Could we add a link to explain to the user what that entails?

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    https://redhat-internal.slack.com/archives/CJYKV1YAH/p1706639383940559

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/59

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We frequently receive inquiries regarding the versions of monitoring components (such as Prometheus, Alertmanager, etc.) that are used in a given OCP version.
Currently, obtaining this information requires several manual steps on our part, e.g.:

  • Identify the relevant GitHub repository.
  • Check out the appropriate branch.
  • Locate the file that contains the version.

What if we automate this?

How about a view that displays the versions of all monitoring components for all recent OCP versions?
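Until such a view exists, a hedged sketch of how this can be answered on a live cluster today (pod and container names are the usual defaults and may differ per release):

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- prometheus --version
$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- alertmanager --version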

Description of problem:

The installer does not precheck whether the node architecture and VM type are consistent on AWS and GCP; the check works on Azure.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-multi-2023-12-06-195439 

How reproducible:

   Always 

Steps to Reproduce:

    1. Configure the compute architecture field to arm64 but choose an amd64 instance type in install-config
    2. Create the cluster
    3. Check the installation

Actual results:

Azure will precheck if architecture is consistent with instance type when creating manifests, like:
12-07 11:18:24.452 [INFO] Generating manifests files.....12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64

But AWS and GCP don't have this precheck; the installation fails later, after many resources have already been created. This case is more likely to happen with multi-arch clusters. An illustrative fragment is shown below.
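An illustrative install-config fragment of the mismatch (values are examples only): the compute pool requests arm64 while the chosen instance type is amd64, which Azure rejects at manifest time but AWS/GCP only fail on mid-install.

compute:
- name: worker
  architecture: arm64
  platform:
    aws:
      type: m6i.xlarge   # amd64 instance type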

Expected results:

The installer should precheck architecture against VM type, especially for platforms that support heterogeneous clusters (AWS, GCP, Azure).

Additional info:

    

Description of problem:

Upgrading OCP from 4.14.7 to a 4.15.0 nightly build failed on a Provider cluster which is part of a provider-client setup.
Platform: IBM Cloud Bare Metal cluster.

Steps done:

Step 1.

$ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
clusterversion.config.openshift.io/version patched

Step 2:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837

The cluster was not upgraded successfully.

 
$ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837   True        False         False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.nightly-2024-01-18-050837   True        False         True       111s    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
console                                    4.15.0-0.nightly-2024-01-18-050837   False       False         False      111s    RouteHealthAvailable: console route is not admitted
dns                                        4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6."
etcd                                       4.15.0-0.nightly-2024-01-18-050837   True        False         True       12d     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379"  Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379"  Healthy:true Took:9.090133ms Error:<nil>}]...
image-registry                             4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     Progressing: All registry resources are removed...
machine-config                             4.14.7                               True        True          True       7d22h   Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]]
monitoring                                 4.15.0-0.nightly-2024-01-18-050837   False       True          True       7m54s   UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded
network                                    4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)...
node-tuning                                4.15.0-0.nightly-2024-01-18-050837   True        True          False      98m     Working towards "4.15.0-0.nightly-2024-01-18-050837"


$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       True       3              0                   0                     1                      12d
worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      12d


$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.7    True        True          120m    Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors


$ oc get nodes
NAME                          STATUS                     ROLES                         AGE   VERSION
baremetal2-01.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
baremetal2-02.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
baremetal2-03.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
baremetal2-04.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
baremetal2-05.qe.rh-ocs.com   Ready                      worker                        12d   v1.28.5+c84a6b8
baremetal2-06.qe.rh-ocs.com   Ready,SchedulingDisabled   control-plane,master,worker   12d   v1.27.8+4fab27b
----------------------------------------------------

During the efforts to bring the cluster back to a good state, these steps were done:
The node baremetal2-06.qe.rh-ocs.com was uncordoned.

Tried to upgrade again using the command

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:  Reason: ClusterOperatorsDegraded
  Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500


Upgrade to 4.15.0-0.nightly-2024-01-22-051500 also was not successful.
Node baremetal2-01.qe.rh-ocs.com was drained manually to see if that works.

Some clusteroperators stayed on the previous version. Some moved to Degraded state. 

$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       False      3              1                   1                     0                      13d
worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      13d


$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     11d
rook-ceph-mon-pdb                                 N/A             1                 1                     11d
rook-ceph-osd                                     N/A             1                 1                     3h17m


$ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
-1         5.23672  root default                                                   
-5         1.74557      host baremetal2-01-qe-rh-ocs-com                           
 1    ssd  0.87279          osd.1                             up   1.00000  1.00000
 4    ssd  0.87279          osd.4                             up   1.00000  1.00000
-7         1.74557      host baremetal2-02-qe-rh-ocs-com                           
 3    ssd  0.87279          osd.3                             up   1.00000  1.00000
 5    ssd  0.87279          osd.5                             up   1.00000  1.00000
-3         1.74557      host baremetal2-05-qe-rh-ocs-com                           
 0    ssd  0.87279          osd.0                             up   1.00000  1.00000
 2    ssd  0.87279          osd.2                             up   1.00000  1.00000


OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/

 

Version-Release number of selected component (if applicable):

Initial version:
OCP 4.14.7
ODF 4.14.4-5.fusion-hci
OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380
Local Storage: local-storage-operator.v4.14.0-202312132033
OpenShift Data Foundation Client : ocs-client-operator.v4.14.4-5.fusion-hci

How reproducible:

Reporting the first occurrence of the issue.

Steps to Reproduce:

    1. On a Provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP
    

Actual results:

    OCP upgrade is not successful. Some operators become degraded. The worker machineconfigpool has 1 degraded machine.

Expected results:

OCP upgrade from 4.14.7 to the nightly build should succeed.

Additional info:

    There are 3 hosted client clusters present

Description of problem:

apbexternalroute and egressfirewall status shows as empty on a hypershift hosted cluster

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-17-173511 

How reproducible:

always

Steps to Reproduce:

1. Set up hypershift and log in to the hosted cluster
% oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-128-55.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-129-197.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-135-106.us-east-2.compute.internal   Ready    worker   125m   v1.28.4+7aa0a74
ip-10-0-140-89.us-east-2.compute.internal    Ready    worker   125m   v1.28.4+7aa0a74


2. create new project test
% oc new-project test


3. create apbexternalroute and egressfirewall on hosted cluster
apbexternalroute yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: apbex-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
% oc apply -f apbexroute.yaml 
adminpolicybasedexternalroute.k8s.ovn.org/apbex-route-policy created

egressfirewall yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to: 
      cidrSelector: 0.0.0.0/0
% oc apply -f egressfw.yaml 
egressfirewall.k8s.ovn.org/default created


4. oc get apbexternalroute and oc get egressfirewall

Actual results:

The status shows as empty:
% oc get apbexternalroute
NAME                 LAST UPDATE   STATUS
apbex-route-policy   49s                     <--- status is empty
% oc describe apbexternalroute apbex-route-policy | tail -n 8
Status:
  Last Transition Time:  2023-12-19T06:54:17Z
  Messages:
    ip-10-0-135-106.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-129-197.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-128-55.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-140-89.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events:  <none>

% oc get egressfirewall
NAME      EGRESSFIREWALL STATUS
default                           <--- status is empty 
% oc describe egressfirewall default | tail -n 8
    Type:             Allow
Status:
  Messages:
    ip-10-0-129-197.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-128-55.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-140-89.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-135-106.us-east-2.compute.internal: EgressFirewall Rules applied
Events:  <none>

Expected results:

The status should be shown correctly

Additional info:


Description of problem:

console-config sets telemeterClientDisabled: true even though the telemeter client is NOT disabled

Version-Release number of selected component (if applicable):

a cluster launched by image built with cluster-bot: build 4.16-ci,openshift/console#13677,openshift/console-operator#877    

How reproducible:

Always    

Steps to Reproduce:

1. Check if telemeter client is enabled
$ oc -n openshift-monitoring get pod | grep telemeter-client
telemeter-client-7cc8bf56db-7wcs5                       3/3     Running   0          83m 
$ oc get cm cluster-monitoring-config -n openshift-monitoring
Error from server (NotFound): configmaps "cluster-monitoring-config" not found

2. Check console-config settings
$ oc get cm console-config -n openshift-console -o yaml
apiVersion: v1
data:
  console-config.yaml: |
    apiVersion: console.openshift.io/v1
    auth:
      authType: openshift
      clientID: console
      clientSecretFile: /var/oauth-config/clientSecret
      oauthEndpointCAFile: /var/oauth-serving-cert/ca-bundle.crt
    clusterInfo:
      consoleBaseAddress: https://xxxxx
      controlPlaneTopology: HighlyAvailable
      masterPublicURL: https://xxxxx:6443
      nodeArchitectures:
      - amd64
      nodeOperatingSystems:
      - linux
      releaseVersion: 4.16.0-0.test-2024-03-18-024238-ci-ln-0q7bq2t-latest
    customization:
      branding: ocp
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.16/
    kind: ConsoleConfig
    monitoringInfo:
      alertmanagerTenancyHost: alertmanager-main.openshift-monitoring.svc:9092
      alertmanagerUserWorkloadHost: alertmanager-main.openshift-monitoring.svc:9094
    plugins:
      monitoring-plugin: https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/
    providers: {}
    servingInfo:
      bindAddress: https://[::]:8443
      certFile: /var/serving-cert/tls.crt
      keyFile: /var/serving-cert/tls.key
    session: {}
    telemetry:
      telemeterClientDisabled: "true"
kind: ConfigMap
metadata:
  creationTimestamp: "2024-03-19T01:20:23Z"
  labels:
    app: console
  name: console-config
  namespace: openshift-console
  resourceVersion: "27723"
  uid: 2f9282c3-1c4a-4400-9908-4e70025afc33    

 

Actual results:

In cm/console-config, telemeterClientDisabled is set to 'true'

Expected results:

The telemeterClientDisabled property should reflect the real status of the telemeter client.

The telemeter client is not disabled because:
1. the telemeter-client pod is running
2. the user didn't disable the telemeter client manually (the 'cluster-monitoring-config' configmap doesn't exist); explicitly disabling it would look like the sketch below
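A sketch of the documented cluster-monitoring-config mechanism for disabling the telemeter client (no such ConfigMap exists on this cluster, which is the point):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    telemeterClient:
      enabled: false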

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/154

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We should add a Dockerfile that is optimized for running the tests locally. (The current Dockerfile assumes it is running with the CI setup.)

A PR introduced a change causing failures in all SDN jobs.

We need to revert the change and then update the test to allow this state for SDN-configured clusters.

Once the test is fixed, we can reintroduce the PR and validate with a payload-blocking test.

Not sure which component this bug should be associated with.

I am not even sure if importing respects ImageTagMirrorSet.

We could not figure it out in the Slack conversation.

https://redhat-internal.slack.com/archives/C013VBYBJQH/p1709583648013199

 

Description of problem:

The expected behaviour of ImageTagMirrorSet, redirecting pulls from the proxy to quay.io, did not work.

Version-Release number of selected component (if applicable):

oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.3   True        False         7d4h    

Steps to Reproduce:

oc --context build02 get ImageTagMirrorSet quay-proxy -o yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"config.openshift.io/v1","kind":"ImageTagMirrorSet","metadata":{"annotations":{},"name":"quay-proxy"},"spec":{"imageTagMirrors":[{"mirrors":["quay.io/openshift/ci"],"source":"quay-proxy.ci.openshift.org/openshift/ci"}]}}
  creationTimestamp: "2024-03-05T03:49:59Z"
  generation: 1
  name: quay-proxy
  resourceVersion: "4895378740"
  uid: 69fb479e-85bd-4a16-a38f-29b08f2636c3
spec:
  imageTagMirrors:
  - mirrors:
    - quay.io/openshift/ci
    source: quay-proxy.ci.openshift.org/openshift/ci


oc --context build02 tag --source docker quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest hongkliu-test/proxy-test-2:011 --as system:admin
Tag proxy-test-2:011 set to quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest.

oc --context build02 get is proxy-test-2 -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/image.dockerRepositoryCheck: "2024-03-05T20:03:02Z"
  creationTimestamp: "2024-03-05T20:03:02Z"
  generation: 2
  name: proxy-test-2
  namespace: hongkliu-test
  resourceVersion: "4898915153"
  uid: f60b3142-1f5f-42ae-a936-a9595e794c05
spec:
  lookupPolicy:
    local: false
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest
    generation: 2
    importPolicy:
      importMode: Legacy
    name: "011"
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/hongkliu-test/proxy-test-2
  publicDockerImageRepository: registry.build02.ci.openshift.org/hongkliu-test/proxy-test-2
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2024-03-05T20:03:02Z"
      message: 'Internal error occurred: quay-proxy.ci.openshift.org/openshift/ci:ci_ci-operator_latest:
        Get "https://quay-proxy.ci.openshift.org/v2/": EOF'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: "011"

Actual results:

The status of the stream shows that it still tries to connect to quay-proxy.

Expected results:

The request goes to quay.io directly.

Additional info:

The proxy has been shut down completely just to simplify the case. When it was on, the access logs showed the proxy receiving the requests for the image.
oc scale deployment qci-appci -n ci --replicas 0
deployment.apps/qci-appci scaled

I also checked the pull secret in the namespace, and it has the correct pull credentials for both the proxy and quay.io.
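For completeness, one way to confirm the ImageTagMirrorSet was rendered onto the nodes (this only covers the container runtime's pull path, not the apiserver-side import path used by the image stream; the node name is a placeholder):

$ oc --context build02 debug node/<node-name> -- chroot /host grep -A3 'quay-proxy.ci.openshift.org' /etc/containers/registries.conf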

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/79

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When clicking on the output image link on a Shipwright BuildRun details page, the link leads to the imagestream details page but shows a 404 error.

The image link is:

https://console-openshift-console.apps...openshiftapps.com/k8s/ns/buildah-example/imagestreams/sample-kotlin-spring%3A1.0-shipwright

The BuildRun spec

apiVersion: shipwright.io/v1beta1
kind: BuildRun
metadata: 
  generateName: sample-spring-kotlin-build-
  name: sample-spring-kotlin-build-xh2dq
  namespace: buildah-example
  labels: 
    build.shipwright.io/generation: '2'
    build.shipwright.io/name: sample-spring-kotlin-build
spec: 
  build: 
    name: sample-spring-kotlin-build
status: 
  buildSpec: 
    output: 
      image: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    paramValues: 
      - name: run-image
        value: 'paketocommunity/run-ubi-base:latest'
      - name: cnb-builder-image
        value: 'paketobuildpacks/builder-jammy-tiny:0.0.176'
      - name: app-image
        value: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    source: 
      git: 
        url: 'https://github.com/piomin/sample-spring-kotlin-microservice.git'
      type: Git
    strategy: 
      kind: ClusterBuildStrategy
      name: buildpacks
  completionTime: '2024-02-12T12:15:03Z'
  conditions: 
    - lastTransitionTime: '2024-02-12T12:15:03Z'
      message: All Steps have completed executing
      reason: Succeeded
      status: 'True'
      type: Succeeded
  output: 
    digest: 'sha256:dc3d44bd4d43445099ab92bbfafc43d37e19cfaf1cac48ae91dca2f4ec37534e'
  source: 
    git: 
      branchName: master
      commitAuthor: Piotr Mińkowski
      commitSha: aeb03d60a104161d6fd080267bf25c89c7067f61
  startTime: '2024-02-12T12:13:21Z'
  taskRunName: sample-spring-kotlin-build-xh2dq-j47ql

Looking at recent CI metal-ipi CI jobs

Some of the bootstrap failures seem to be caused by master nodes failing to come up.

Search https://search.dptools.openshift.org/?search=Got+0+worker+nodes%2C+%5B12%5D+master+nodes%2C&maxAge=336h&context=-1&type=build-log&name=metal-ipi&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none
43 results over the last 14 days 

e.g.
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-upgrade-ovn-ipv6/1779842483996332032

level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).

Please review the following PR: https://github.com/openshift/image-registry/pull/390

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    All e2e-ibmcloud-ovn testing is failing due to repeated events of liveness or readiness probes failing during MonitorTests.

Version-Release number of selected component (if applicable):

    4.16.0-0.ci.test-2024-02-20-184205-ci-op-lghcpt9x-latest

How reproducible:

    Appears to be 100%

Steps to Reproduce:

    1. Setup IPI cluster on IBM Cloud
    2. Run OCP Conformance w/ MonitorTests (CI does this on IBM Cloud related PR's)
    

Actual results:

    Failed OCP Conformance tests, due to MonitorTests failure:

: [sig-arch] events should not repeat pathologically for ns/openshift-cloud-controller-manager expand_less0s{  2 events happened too frequently

event happened 43 times, something is wrong: namespace/openshift-cloud-controller-manager node/ci-op-lghcpt9x-52953-tk4vl-master-2 pod/ibm-cloud-controller-manager-6c5f8594c5-bpnm8 hmsg/d91441a732 - reason/ProbeError Liveness probe error: Get "https://10.241.129.4:10258/healthz": dial tcp 10.241.129.4:10258: connect: connection refused result=reject 
body: 
 From: 20:25:44Z To: 20:25:45Z
event happened 43 times, something is wrong: namespace/openshift-cloud-controller-manager node/ci-op-lghcpt9x-52953-tk4vl-master-1 pod/ibm-cloud-controller-manager-6c5f8594c5-wn4fq hmsg/fda26f2bbf - reason/ProbeError Liveness probe error: Get "https://10.241.64.6:10258/healthz": dial tcp 10.241.64.6:10258: connect: connection refused result=reject 
body: 
 From: 20:25:54Z To: 20:25:55Z}


: [sig-arch] events should not repeat pathologically for ns/openshift-oauth-apiserver expand_less0s{  1 events happened too frequently

event happened 25 times, something is wrong: namespace/openshift-oauth-apiserver node/ci-op-lghcpt9x-52953-tk4vl-master-1 pod/apiserver-c5ff4776b-kqg7c hmsg/c9e932e38d - reason/ProbeError Readiness probe error: HTTP probe failed with statuscode: 500 result=reject 
body: [+]ping ok
[+]log ok
[+]etcd ok
[-]etcd-readiness failed: reason withheld
[+]informer-sync ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/openshift.io-StartUserInformer ok
[+]poststarthook/openshift.io-StartOAuthInformer ok
[+]poststarthook/openshift.io-StartTokenTimeoutUpdater ok
[+]shutdown ok
readyz check failed

 From: 20:25:04Z To: 20:25:05Z}

Expected results:

    Passing OCP Conformance (w/ MonitorTests) test

Additional info:

    The frequent (perhaps only) failures appear to occur via:

[sig-arch] events should not repeat pathologically for ns/openshift-cloud-controller-manager

[sig-arch] events should not repeat pathologically for ns/openshift-oauth-apiserver

    I am unsure of the cause of the liveness/readiness probe failures as of yet, and unsure whether the underlying infrastructure is the cause (and if so, which resource).

Description of problem:

    Standalone OCP encrypts various resources at rest in etcd:
https://docs.openshift.com/container-platform/4.14/security/encrypting-etcd.html
HyperShift control planes are only encrypting secrets. We should have parity with standalone.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    Always

Steps to Reproduce:

    1. Create HyperShift standalone control plane
    2. Check that configmaps, routes, oauth access tokens or oauth authorize tokens are encrypted
    

Actual results:

    Those resources are not encrypted

Expected results:

    Those resources are encrypted

Additional info:

Resources to be encrypted are configured here:
https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/kas/kms/aws.go#L121-L126    
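A hedged way to spot-check step 2 above from the management cluster, assuming the hosted control plane's etcd pod has etcdctl preconfigured with client certificates (the namespace variable and key path are illustrative):

$ oc -n "${HCP_NAMESPACE}" exec -c etcd etcd-0 -- etcdctl \
    get /kubernetes.io/configmaps/default/kube-root-ca.crt --print-value-only | head -c 64

An unencrypted value is a readable protobuf/JSON blob; an encrypted one is stored with a provider prefix such as k8s:enc:kms:v1:<name>: or k8s:enc:aescbc:v1:<key>:.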

Description of problem:

After build02 is upgraded to 4.16.0-ec.4 from 4.16.0-ec.3, the CSRs are not auto-approved. As a result, provisioned machines cannot become nodes of the cluster.

Version-Release number of selected component (if applicable):

oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.4   True        False         4h28m

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

Michael McCune feels the group "system:serviceaccounts" was missing in the CSR.
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710875084740869?thread_ts=1710861842.471739&cid=CBZHF4DHC

An inspection of the namespace openshift-cluster-machine-approver:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710863462860809?thread_ts=1710861842.471739&cid=CBZHF4DHC

 

A workaround to approve the CSRs manually on b02:

https://github.com/openshift/release/pull/50016
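A hedged sketch of approving the pending CSRs by hand (the standard documented loop, not necessarily what the linked PR automates):

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve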

 

[sig-api-machinery] ValidatingAdmissionPolicy [Privileged:ClusterAdmin] [FeatureGate:ValidatingAdmissionPolicy] [Beta] should type check a CRD [Suite:openshift/conformance/parallel] [Suite:k8s]

This test appears to fail a little too often. It seems to only run on techpreview clusters (presumably due to the Beta tag in the name), but I was worried it's an indication that something isn't ready to graduate from techpreview, so I figured this is worth a bug.

Even so, a 93% pass rate is a little too low; we would like someone to investigate and get this test's pass rate up. When it fails, it's typically the only thing killing the job run. The output is always:

{  fail [k8s.io/kubernetes@v1.29.0/test/e2e/apimachinery/validatingadmissionpolicy.go:349]: Expected
    <[]v1beta1.ExpressionWarning | len:0, cap:0>: nil
to have length 2
Ginkgo exit error 1: exit with code 1}

View this link for sample job runs; I would focus on those with 2 failures, indicating this was the only failing test in the job.

Description of problem:

 Set up a cluster on vSphere with a user-managed ELB, with this install-config.yaml:

    apiVIPs:
      - 10.38.153.2
    ingressVIPs:
      - 10.38.153.3
    loadBalancer:
      type: UserManaged
networking:
  machineNetwork:
    - cidr: "10.38.153.0/25"
featureSet: TechPreviewNoUpgrade

After the cluster is started, keepalived is found still running on the worker nodes.

omc get pod -n openshift-vsphere-infra
NAME                                                   READY   STATUS    RESTARTS   AGE
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-0            2/2     Running   0          1h
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-1            2/2     Running   0          59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-master-2            2/2     Running   0          59m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74      2/2     Running   0          39m
coredns-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k      2/2     Running   0          37m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-0-tqc74   2/2     Running   0          39m
keepalived-ci-op-2kch7ldp-72b07-7l4vs-worker-1-s654k   2/2     Running   0          37m

 
 

Version-Release number of selected component (if applicable):

4.15

How reproducible:

always

Steps to Reproduce:

    1. setup vsphere on multi-subnet network with ELB, job
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328     2.
    3.
    

Actual results:

    

Expected results:

    keepalived should not be running on worker nodes.

Additional info:

    must-gather logs:  https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/46458/rehearse-46458-periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/1732560150155235328/artifacts/vsphere-ipi-zones-multisubnets-external-lb-usermanaged-f28/gather-must-gather/artifacts/must-gather.tar

Add the possibility to have other search filters in the resource list toolbar

Why is this important?

  • This will let plugins define additional search filters on top of Name and Label in a resource list page

Scenarios

  1. In the nmstate-console-plugin I would like to define other filters, like an IP search, in the NMState list. This search will let the user filter the nmstate resources that are inside a subnet or that have specific IPs.

For now, using the props and some hacks we were able to change the Name search into an IP search but we would like to have both.

https://issues.redhat.com/browse/CNV-36247

Description of problem:

    The Observe -> Alerting, Metrics, and Targets pages do not load as expected; a blank page is shown

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-07-041003

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Observe -> Alerting, Metrics, or Targets page directly
    2.
    3.
    

Actual results:

    Blank page, no data is loaded

Expected results:

    The pages work as normal

Additional info:

 Failed to load resource: the server responded with a status of 404 (Not Found)
/api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%2715ace915-53d3-4455-b7e3-b7a5a4796b5c%27:1

Failed to load resource: the server responded with a status of 403 (Forbidden)
main-chunk-bb9ed989a7f7c65da39a.min.js:1 API call to get support level has failed r: Access denied due to cluster policy.
    at https://console-openshift-console.apps.ci-ln-9fl1l5t-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-bb9ed989a7f7c65da39a.min.js:1:95279
(anonymous) @ main-chunk-bb9ed989a7f7c65da39a.min.js:1
/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/#ALL_NS#/clusterserviceversions?:1
        
        
       Failed to load resource: the server responded with a status of 404 (Not Found)
vendor-patternfly-5~main-chunk-95cb256d9fa7738d2c46.min.js:1 Modal: When using hasNoBodyWrapper or setting a custom header, ensure you assign an accessible name to the the modal container with aria-label or aria-labelledby.

Description of problem:

   Invalid CN is not bubbled up in the CR 

Version-Release number of selected component (if applicable):

    4.15.0-rc7

How reproducible:

    always

Steps to Reproduce:

# generate a key with invalid CN
openssl genrsa -out myuser4.key 2048
openssl req -new -key myuser4.key -out myuser4.csr -subj "/CN=baduser/O=system:masters"
# get cert in the CSR
# apply the CSR
# Status remains in Accepted, but it is not Issued
% oc get csr | grep 29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr
29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr   4m29s   hypershift.openshift.io/ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1.customer-break-glass   system:admin                                                                60m                 Approved
# No status in the CSR status:
  conditions:
  - lastTransitionTime: "2024-02-16T14:06:41Z"
    lastUpdateTime: "2024-02-16T14:06:41Z"
    message: The requisite approval resource exists.
    reason: ApprovalPresent
    status: "True"
    type: Approved
# pki controller shows the error
 oc logs control-plane-pki-operator-bf6d75d5f-h95rf -n ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1 | grep "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr"
I0216 14:06:41.842414       1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestApproved' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" in is approved
I0216 14:06:41.848623       1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"ocm-integration-29ecg6n5bkugrh6io4his24ser3bt16n-ad-int1", Name:"control-plane-pki-operator", UID:"b63dbaa9-18f7-4ee6-8473-8a38bdb6f2df", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CertificateSigningRequestInvalid' "29ecg6n5bkugrh6io4his24ser3bt16n-5-customer-break-glass-csr" is invalid: invalid certificate request: subject CommonName must begin with "system:customer-break-glass:"     

Actual results:

    

Expected results:

    The status in the CSR should show the failure and the error

Additional info:

    

When upgrading an HC from 4.13 to 4.14, after admin-acking the API deprecation check, the upgrade is still blocked because the ClusterVersionUpgradeable condition on the HC is Unknown. This is because the CVO in the guest cluster does not have an Upgradeable condition anymore.

hypershift#1614 gave us the router Deployment (descended from the private-router Deployment), but it lacks PDB coverage. For example:

$ git --no-pager log -1 --oneline origin/main
f3f421bc7 (origin/release-4.16, origin/release-4.15, origin/main, origin/HEAD) Merge pull request #3183 from muraee/azure-kms
$ git --no-pager grep 'func [^(]*\(Deployment\|PodDisruptionBudget\)' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/ingress/router.go:func ReconcileRouterDeployment(deployment *appsv1.Deployment, ownerRef config.OwnerRef, deploymentConfig config.DeploymentConfig, image string, config *corev1.ConfigMap) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/deployment.go:func ReconcileKubeAPIServerDeployment(deployment *appsv1.Deployment,
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/pdb.go:func ReconcilePodDisruptionBudget(pdb *policyv1.PodDisruptionBudget, p *KubeAPIServerParams) error {

Both the ingress and kas packages have Reconcile*Deployment methods. Only kas has a ReconcilePodDisruptionBudget method.

This bug is asking for router to get a covering PDB too, because being able to evict all router-* pods simultaneously (for the cluster flavors that have replicas > 1 on that Deployment) can make the incoming traffic unreachable. And some of that Route traffic looks like stuff that folks would want to be reliably reachable:

$ git --no-pager grep 'func Reconcile[^(]*Route(' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPublicRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPrivateRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileInternalRoute(route *routev1.Route, owner *metav1.OwnerReference) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityExternalRoute(route *routev1.Route, ownerRef config.OwnerRef, hostname string, defaultIngressDomain string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityInternalRoute(route *routev1.Route, ownerRef config.OwnerRef) error {

Test plan:

1. Install a hosted cluster.
2. Log into the managment cluster, and find the namespace of the hosted cluster $NAMESPACE.
3. Evict both router pods (using a raw create, because there isn't more convenient syntax yet):

oc -n "${NAMESPACE}" get -l app=private-router -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | while read NAME
do
  oc create -f - <<EOF --raw "/api/v1/namespaces/${NAMESPACE}/pods/${NAME}/eviction"
{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${NAME}"}}
EOF
done

If that clears out both router pods right after the other, ingress will probably hiccup. And with the PDB in place, I'd expect the second eviction to fail.
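A minimal sketch of the kind of covering PDB being requested, assuming the app=private-router label used in the eviction loop above (the name and maxUnavailable are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: router
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: private-router

With such a PDB in place, the second eviction in the test plan should be rejected while one router pod is already disrupted.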

Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/291

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The Helm Plugin's index view parses a given chart entry's versions into multiple tiles if the individual entry names vary.

This is inconsistent with the Helm CLI experience, which treats all items in an index entry (i.e. all versions of a given chart) as part of the same chart.

Version-Release number of selected component (if applicable):

All

How reproducible:

100%

Steps to Reproduce:

    1. Open the Developer Console, Helm Plugin
    2. Select a namespace and Click to create a helm release
    3. Search for the developer-hub chart in the catalog (this is an example demonstrating the problem)
    

Actual results:

There are two tiles for Developer Hub, but only one index entry in the corresponding index (https://charts.openshift.io)

Expected results:

A single tile should exist for this single index entry.

Additional info:

The cause of this is an expected indexing inconsistency, but the experience should align with the Helm CLI's behavior, and should still represent a single catalog tile per index entry.
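An illustrative index.yaml fragment (not the actual charts.openshift.io content): both versions live under the single developer-hub entry, so the Helm CLI treats them as one chart even though their name fields differ, and the catalog should render one tile for them.

entries:
  developer-hub:
  - name: developer-hub
    version: 1.1.0
  - name: redhat-developer-hub
    version: 1.0.0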

Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/300

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/100

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/432

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/642

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/208

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The hypershift operator ignores RegistryOverrides (from ICSP/IDMS) when inspecting the control-plane-operator image, so on disconnected clusters the user must explicitly set the hypershift.openshift.io/control-plane-operator-image annotation to point to the mirrored image on the internal registry.

Example:
the correct match is in the IDMS:
# oc get imagedigestmirrorset -oyaml | grep -B2 registry.ci.openshift.org/ocp/4.14-2024-02-14-135111
 ...
    - mirrors:
      - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
      source: registry.ci.openshift.org/ocp/4.14-2024-02-14-135111

Creating a hosted cluster with:
hcp create cluster kubevirt --image-content-sources /home/mgmt_iscp.yaml  --additional-trust-bundle /etc/pki/ca-trust/source/anchors/registry.2.crt --name simone3 --node-pool-replicas 2 --memory 16Gi --cores 4 --root-volume-size 64 --namespace local-cluster --release-image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:66c6a46013cda0ad4e2291be3da432fdd03b4a47bf13067e0c7b91fb79eb4539 --pull-secret /tmp/.dockerconfigjson --generate-ssh

on the hostedCluster object we see:
status:
  conditions:
  - lastTransitionTime: "2024-02-14T22:01:30Z"
    message: 'failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6:
      failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6:
      unauthorized: authentication required'
    observedGeneration: 3
    reason: ReconciliationError
    status: "False"
    type: ReconciliationSucceeded


and in the logs of the hypershift operator:
{"level":"info","ts":"2024-02-14T22:18:11Z","msg":"registry override coincidence not found","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"simone3","namespace":"local-cluster"},"namespace":"local-cluster","name":"simone3","reconcileID":"6d6a2479-3d54-42e3-9204-8d0ab1013745","image":"4.14-2024-02-14-135111"}
{"level":"error","ts":"2024-02-14T22:18:12Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"simone3","namespace":"local-cluster"},"namespace":"local-cluster","name":"simone3","reconcileID":"6d6a2479-3d54-42e3-9204-8d0ab1013745","error":"failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6: unauthorized: authentication required","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}


So the hypershift operator is not using the RegistryOverrides mechanism to inspect the image from the internal registry (virthost.ostest.test.metalkube.org:5000/localimages/local-release-image in this example).

Explicitly setting the annotation:
hypershift.openshift.io/control-plane-operator-image: virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6
on the HostedCluster, pointing directly to the mirrored control-plane-operator image, is required to proceed in disconnected environments.

Version-Release number of selected component (if applicable):

    4.14, 4.15, 4.16

How reproducible:

    100%

Steps to Reproduce:

    1. Try to deploy a HostedCluster in a disconnected environment without explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation.
    2.
    3.
    

Actual results:

    A reconciliation error is reported on the HostedCluster object:
status:
  conditions:
  - lastTransitionTime: "2024-02-14T22:01:30Z"
    message: 'failed to look up image metadata for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6:
      failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-02-14-135111@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6:
      unauthorized: authentication required'
    observedGeneration: 3
    reason: ReconciliationError
    status: "False"
    type: ReconciliationSucceeded

The HostedCluster is not spawned.

Expected results:

    The hypershift operator also uses the RegistryOverrides mechanism for the control-plane-operator image.
    Explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation is not required.

Additional info:

    - Maybe related to OCPBUGS-29110
    - Explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation to point to the mirrored image on the internal registry is a valid workaround (a hedged command sketch follows this list).
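
A hedged sketch of applying that workaround with the names used in the reproduction above (adjust the HostedCluster name, namespace, and digest to your environment):

$ oc annotate hostedcluster simone3 -n local-cluster \
    hypershift.openshift.io/control-plane-operator-image=virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:84c74cc05250d0e51fe115274cc67ffcf0a4ac86c831b7fea97e484e646072a6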

Description of problem:

    The Repositories list page breaks with a TypeError:
    cannot read properties of undefined (reading `pipelinesascode.tekton.dev/repository`)

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://drive.google.com/file/d/1TpH_PTyBxNX0b9SPZ2yS8b-q-tbvp6Ok/view?usp=sharing

 

On clusters with a large number of Services with externalIPs or Services of type LoadBalancer, ovnkube-node initialization can take up to 50 minutes.

The problem is that after a node reboot done by MCO, the unschedulable taint is removed from the node, so the scheduler places pods on that node which get stuck in ContainerCreating, while other nodes continue to go down for reboot, making the workloads unavailable (if no PDB exists to protect the workload; a minimal sketch follows).
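
A minimal PodDisruptionBudget sketch for the protection mentioned above (the name and label selector are hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-workload-pdb         # hypothetical name
spec:
  minAvailable: 1               # MCO's node drain is blocked while fewer replicas would remain available
  selector:
    matchLabels:
      app: my-workload          # hypothetical label on the workload's pods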

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/60

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/26

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/84

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Routers are restarting due to memory issues.

Version-Release number of selected component (if applicable):

    OCP 4.12.45

How reproducible:

    not easy
Router restarts due to memory issues:
~~~
3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: Get "http://localhost:1936/healthz/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Liveness probe error: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Liveness probe failed: Get "http://localhost:1936/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
3h40m       Normal    Killing      pod/router-default-56c9f67f66-j8xwn                        Container router failed liveness probe, will be restarted
3h40m       Warning   ProbeError   pod/router-default-56c9f67f66-j8xwn                        Readiness probe error: HTTP probe failed with statuscode: 500...
3h40m       Warning   Unhealthy    pod/router-default-56c9f67f66-j8xwn                        Readiness probe failed: HTTP probe failed with statuscode: 500
~~~

The node only hosts the router replica, and from Prometheus it can be verified that the routers are consuming all the memory in a short period of time, ~20G within an hour.

At some point, the number of haproxy processes increases and ends up consuming all memory resources, leading to a service disruption in a production environment.

As the console is one of the services with the highest activity according to the router stats, so far the customer has been deleting the console pod, which brings the haproxy process count down from 45 to 12.

The customer would like guidance on how to identify the process that is consuming the memory; haproxy monitoring is enabled but no dashboard is available.

Router stats from when the router has 8G/6G/3G of memory available have been requested.

Additional info:

The customer claims that this is happening only in OCP 4.12.45, as another active cluster is still on version 4.10.39 and is not affected. The upgrade is blocked because of this.

Requested action:
* hard-stop-after might be an option, but the customer expects information about the side effects of this configuration (a hedged example of setting it follows this list).
* How can the console connections be reset from haproxy?
* Is there any documentation about haproxy Prometheus queries?
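
A hedged sketch of enabling hard-stop-after: recent OpenShift releases expose it through an annotation on the IngressController (or on ingresses.config/cluster); the exact annotation key and supported values should be verified against the 4.12 documentation before use.

$ oc -n openshift-ingress-operator annotate ingresscontroller/default \
    ingress.operator.openshift.io/hard-stop-after=30m

The trade-off is that long-lived connections (for example console websockets) still held by an old haproxy process are cut once the timer expires, instead of the old processes lingering and accumulating memory.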

Description of problem:

The control-plane-machine-set operator pod is stuck in a CrashLoopBackOff state with "panic: runtime error: invalid memory address or nil pointer dereference" while extracting the failureDomain from the ControlPlaneMachineSet. Below is the error trace for reference.
~~~
2024-04-04T09:32:23.594257072Z I0404 09:32:23.594176       1 controller.go:146]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="c282f3e3-9f9d-40df-a24e-417ba2ea4106"
2024-04-04T09:32:23.594257072Z I0404 09:32:23.594221       1 controller.go:125]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55"
2024-04-04T09:32:23.594274974Z I0404 09:32:23.594257       1 controller.go:146]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55"
2024-04-04T09:32:23.597509741Z I0404 09:32:23.597426       1 watch_filters.go:179] reconcile triggered by infrastructure change
2024-04-04T09:32:23.606311553Z I0404 09:32:23.606243       1 controller.go:220]  "msg"="Starting workers" "controller"="controlplanemachineset" "worker count"=1
2024-04-04T09:32:23.606360950Z I0404 09:32:23.606340       1 controller.go:169]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.609322467Z I0404 09:32:23.609217       1 panic.go:884]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.609322467Z I0404 09:32:23.609271       1 controller.go:115]  "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference [recovered]
2024-04-04T09:32:23.612540681Z     panic: runtime error: invalid memory address or nil pointer dereference
2024-04-04T09:32:23.612540681Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]
2024-04-04T09:32:23.612540681Z 
2024-04-04T09:32:23.612540681Z goroutine 255 [running]:
2024-04-04T09:32:23.612540681Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
2024-04-04T09:32:23.612571624Z panic({0x1c8ac60, 0x31c6ea0})
2024-04-04T09:32:23.612571624Z     /usr/lib/golang/src/runtime/panic.go:884 +0x213
2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.VSphereProviderConfig.ExtractFailureDomain(...)
2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/vsphere.go:120
2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.providerConfig.ExtractFailureDomain({{0x1f2a71a, 0x7}, {{{{...}, {...}}, {{...}, {...}, {...}, {...}, {...}, {...}, ...}, ...}}, ...})
2024-04-04T09:32:23.612588145Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/providerconfig.go:212 +0x23c
~~~
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The control-plane-machine-set operator is stuck in a CrashLoopBackOff state during the cluster upgrade.
    

Expected results:

The control-plane-machine-set operator should upgrade without any errors.
    

Additional info:

This is happening during the cluster upgrade of a vSphere IPI cluster from OCP version 4.14.z to 4.15.6 and may impact other z-stream releases.
From the official docs [1], providing the failure domain for the vSphere platform is a Tech Preview feature. A hedged inspection sketch follows.
[1] https://docs.openshift.com/container-platform/4.15/machine_management/control_plane_machine_management/cpmso-configuration.html#cpmso-yaml-failure-domain-vsphere_cpmso-configuration
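
A hedged triage sketch (not taken from the report): check whether the vSphere failure domains referenced by the ControlPlaneMachineSet are actually populated before the upgrade; an empty or missing vsphere stanza here would be consistent with the nil pointer dereference in ExtractFailureDomain.

$ oc -n openshift-machine-api get controlplanemachineset cluster \
    -o jsonpath='{.spec.template.machines_v1beta1_machine_openshift_io.failureDomains}'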
    

Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In 4.16.0-0.nightly-2024-04-07-182401 with Prometheus Operator 0.73.0, there are too many warnings of "'bearerTokenFile' is deprecated, use 'authorization' instead."; see below.

$ oc -n openshift-monitoring logs -c prometheus-operator deploy/prometheus-operator 
level=info ts=2024-04-08T07:06:17.191301889Z caller=main.go:186 msg="Starting Prometheus Operator" version="(version=0.73.0, branch=rhaos-4.16-rhel-9, revision=3541f90)"
level=info ts=2024-04-08T07:06:17.195797026Z caller=main.go:187 build_context="(go=go1.21.7 (Red Hat 1.21.7-1.el9) X:loopvar,strictfipsruntime, platform=linux/amd64, user=root, date=20240405-12:29:19, tags=strictfipsruntime)"
level=info ts=2024-04-08T07:06:17.195888428Z caller=main.go:198 msg="namespaces filtering configuration " config="{allow_list=\"\",deny_list=\"\",prometheus_allow_list=\"openshift-monitoring\",alertmanager_allow_list=\"openshift-monitoring\",alertmanagerconfig_allow_list=\"\",thanosruler_allow_list=\"openshift-monitoring\"}"
level=info ts=2024-04-08T07:06:17.212735844Z caller=main.go:227 msg="connection established" cluster-version=v1.29.3+e994e5d
level=warn ts=2024-04-08T07:06:17.228748881Z caller=main.go:75 msg="resource \"scrapeconfigs\" (group: \"monitoring.coreos.com/v1alpha1\") not installed in the cluster"
level=info ts=2024-04-08T07:06:17.25637504Z caller=operator.go:335 component=prometheus-controller msg="Kubernetes API capabilities" endpointslices=true
level=warn ts=2024-04-08T07:06:17.258012256Z caller=main.go:75 msg="resource \"prometheusagents\" (group: \"monitoring.coreos.com/v1alpha1\") not installed in the cluster"
level=info ts=2024-04-08T07:06:17.360652572Z caller=server.go:298 msg="starting insecure server" address=127.0.0.1:8080
level=info ts=2024-04-08T07:06:17.602723953Z caller=operator.go:283 component=thanos-controller msg="successfully synced all caches"
level=info ts=2024-04-08T07:06:17.686834878Z caller=operator.go:313 component=alertmanager-controller msg="successfully synced all caches"
level=info ts=2024-04-08T07:06:17.687014402Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2024-04-08T07:06:17.696906656Z caller=operator.go:392 component=prometheus-controller msg="successfully synced all caches"
level=info ts=2024-04-08T07:06:17.698997412Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2024-04-08T07:06:17.904295505Z caller=operator.go:572 component=alertmanager-controller key=openshift-monitoring/main msg="sync alertmanager"
level=warn ts=2024-04-08T07:06:18.111274725Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver-operator/openshift-apiserver-operator
level=warn ts=2024-04-08T07:06:18.111387227Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver
level=warn ts=2024-04-08T07:06:18.111430218Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver-operator-check-endpoints
level=warn ts=2024-04-08T07:06:18.11149249Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication-operator/authentication-operator
level=warn ts=2024-04-08T07:06:18.111554601Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication/oauth-openshift
level=warn ts=2024-04-08T07:06:18.111637633Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cloud-credential-operator/cloud-credential-operator
level=warn ts=2024-04-08T07:06:18.111697614Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:18.111733495Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:18.111784766Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:18.111819506Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:18.111895078Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:18.111944309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:18.11197813Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:18.112071132Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:18.112151634Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-machine-approver/cluster-machine-approver
level=warn ts=2024-04-08T07:06:18.112226245Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-version/cluster-version-operator
level=warn ts=2024-04-08T07:06:18.112256916Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-config-operator/config-operator
level=warn ts=2024-04-08T07:06:18.112284327Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console-operator/console-operator
level=warn ts=2024-04-08T07:06:18.112310487Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console/console
level=warn ts=2024-04-08T07:06:18.112339628Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager-operator/openshift-controller-manager-operator
level=warn ts=2024-04-08T07:06:18.112370889Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager/openshift-controller-manager
level=warn ts=2024-04-08T07:06:18.112397339Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns-operator/dns-operator
level=warn ts=2024-04-08T07:06:18.11243773Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns/dns-default
level=warn ts=2024-04-08T07:06:18.112484231Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-etcd-operator/etcd-operator
level=warn ts=2024-04-08T07:06:18.112532742Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-image-registry/image-registry
level=warn ts=2024-04-08T07:06:18.112575493Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress-operator/ingress-operator
level=warn ts=2024-04-08T07:06:18.112648155Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress/router-default
level=warn ts=2024-04-08T07:06:18.112684775Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-insights/insights-operator
level=warn ts=2024-04-08T07:06:18.112738886Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver-operator/kube-apiserver-operator
level=warn ts=2024-04-08T07:06:18.112771917Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver/kube-apiserver
level=warn ts=2024-04-08T07:06:18.112834288Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager-operator/kube-controller-manager-operator
level=warn ts=2024-04-08T07:06:18.11288797Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager/kube-controller-manager
level=warn ts=2024-04-08T07:06:18.112923101Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler-operator/kube-scheduler-operator
level=warn ts=2024-04-08T07:06:18.112974211Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler
level=warn ts=2024-04-08T07:06:18.113004992Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler
level=warn ts=2024-04-08T07:06:18.113031193Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/cluster-autoscaler-operator
level=warn ts=2024-04-08T07:06:18.113082674Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:18.113111174Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:18.113137205Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:18.113180076Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-operator
level=warn ts=2024-04-08T07:06:18.113207577Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-controller
level=warn ts=2024-04-08T07:06:18.113243277Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-daemon
level=warn ts=2024-04-08T07:06:18.113268968Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-operator
level=warn ts=2024-04-08T07:06:18.113303009Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-marketplace/marketplace-operator
level=warn ts=2024-04-08T07:06:18.113566255Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-monitoring/promtail-monitor
level=warn ts=2024-04-08T07:06:18.113659677Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-multus-admission-controller
level=warn ts=2024-04-08T07:06:18.113690037Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-network
level=warn ts=2024-04-08T07:06:18.113716478Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-diagnostics/network-check-source
level=warn ts=2024-04-08T07:06:18.113760539Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-operator/network-operator
level=warn ts=2024-04-08T07:06:18.113789389Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-oauth-apiserver/openshift-oauth-apiserver
level=warn ts=2024-04-08T07:06:18.11382366Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/catalog-operator
level=warn ts=2024-04-08T07:06:18.113849491Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/olm-operator
level=warn ts=2024-04-08T07:06:18.113882881Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/package-server-manager-metrics
level=warn ts=2024-04-08T07:06:18.113910142Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-control-plane-metrics
level=warn ts=2024-04-08T07:06:18.113939212Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node
level=warn ts=2024-04-08T07:06:18.113965423Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node
level=warn ts=2024-04-08T07:06:18.114005374Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-route-controller-manager/openshift-route-controller-manager
level=warn ts=2024-04-08T07:06:18.114032265Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-service-ca-operator/service-ca-operator
level=warn ts=2024-04-08T07:06:18.114075275Z caller=promcfg.go:1806 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1
level=info ts=2024-04-08T07:06:18.372521592Z caller=operator.go:766 component=prometheus-controller key=openshift-monitoring/k8s msg="sync prometheus"
level=warn ts=2024-04-08T07:06:19.52908448Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver-operator/openshift-apiserver-operator
level=warn ts=2024-04-08T07:06:19.529206143Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver
level=warn ts=2024-04-08T07:06:19.529264914Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-apiserver/openshift-apiserver-operator-check-endpoints
level=warn ts=2024-04-08T07:06:19.529314545Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication-operator/authentication-operator
level=warn ts=2024-04-08T07:06:19.529363736Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-authentication/oauth-openshift
level=warn ts=2024-04-08T07:06:19.529496399Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cloud-credential-operator/cloud-credential-operator
level=warn ts=2024-04-08T07:06:19.52954309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:19.529610031Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:19.529675583Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:19.529722024Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:19.529773425Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-monitor
level=warn ts=2024-04-08T07:06:19.529840396Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:19.529940188Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:19.530042201Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-csi-drivers/shared-resource-csi-driver-node-monitor
level=warn ts=2024-04-08T07:06:19.530145063Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-machine-approver/cluster-machine-approver
level=warn ts=2024-04-08T07:06:19.530242295Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-cluster-version/cluster-version-operator
level=warn ts=2024-04-08T07:06:19.530318036Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-config-operator/config-operator
level=warn ts=2024-04-08T07:06:19.530379448Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console-operator/console-operator
level=warn ts=2024-04-08T07:06:19.530423309Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-console/console
level=warn ts=2024-04-08T07:06:19.53046613Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager-operator/openshift-controller-manager-operator
level=warn ts=2024-04-08T07:06:19.530515121Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-controller-manager/openshift-controller-manager
level=warn ts=2024-04-08T07:06:19.530600663Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns-operator/dns-operator
level=warn ts=2024-04-08T07:06:19.530658014Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-dns/dns-default
level=warn ts=2024-04-08T07:06:19.530718695Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-etcd-operator/etcd-operator
level=warn ts=2024-04-08T07:06:19.530768006Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-image-registry/image-registry
level=warn ts=2024-04-08T07:06:19.530829528Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress-operator/ingress-operator
level=warn ts=2024-04-08T07:06:19.530882449Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ingress/router-default
level=warn ts=2024-04-08T07:06:19.53093667Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-insights/insights-operator
level=warn ts=2024-04-08T07:06:19.530991941Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver-operator/kube-apiserver-operator
level=warn ts=2024-04-08T07:06:19.531039122Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-apiserver/kube-apiserver
level=warn ts=2024-04-08T07:06:19.531094903Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager-operator/kube-controller-manager-operator
level=warn ts=2024-04-08T07:06:19.531137024Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-controller-manager/kube-controller-manager
level=warn ts=2024-04-08T07:06:19.531180345Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler-operator/kube-scheduler-operator
level=warn ts=2024-04-08T07:06:19.531224986Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler
level=warn ts=2024-04-08T07:06:19.531270967Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-kube-scheduler/kube-scheduler
level=warn ts=2024-04-08T07:06:19.531334098Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/cluster-autoscaler-operator
level=warn ts=2024-04-08T07:06:19.53138266Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:19.5314245Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:19.531463661Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-controllers
level=warn ts=2024-04-08T07:06:19.531513562Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-api/machine-api-operator
level=warn ts=2024-04-08T07:06:19.531555783Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-controller
level=warn ts=2024-04-08T07:06:19.531626765Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-daemon
level=warn ts=2024-04-08T07:06:19.531689586Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-machine-config-operator/machine-config-operator
level=warn ts=2024-04-08T07:06:19.531733467Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-marketplace/marketplace-operator
level=warn ts=2024-04-08T07:06:19.532134636Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-monitoring/promtail-monitor
level=warn ts=2024-04-08T07:06:19.532233158Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-multus-admission-controller
level=warn ts=2024-04-08T07:06:19.532507644Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-multus/monitor-network
level=warn ts=2024-04-08T07:06:19.532567965Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-diagnostics/network-check-source
level=warn ts=2024-04-08T07:06:19.532635257Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-network-operator/network-operator
level=warn ts=2024-04-08T07:06:19.532683058Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-oauth-apiserver/openshift-oauth-apiserver
level=warn ts=2024-04-08T07:06:19.532728279Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/catalog-operator
level=warn ts=2024-04-08T07:06:19.53277187Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/olm-operator
level=warn ts=2024-04-08T07:06:19.532821821Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-operator-lifecycle-manager/package-server-manager-metrics
level=warn ts=2024-04-08T07:06:19.532863662Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-control-plane-metrics
level=warn ts=2024-04-08T07:06:19.532904153Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node
level=warn ts=2024-04-08T07:06:19.532944204Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-ovn-kubernetes/monitor-ovn-node
level=warn ts=2024-04-08T07:06:19.532990574Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-route-controller-manager/openshift-route-controller-manager
level=warn ts=2024-04-08T07:06:19.533037166Z caller=promcfg.go:1374 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1 service_monitor=openshift-service-ca-operator/service-ca-operator
level=warn ts=2024-04-08T07:06:19.533089337Z caller=promcfg.go:1806 component=prometheus-controller msg="'bearerTokenFile' is deprecated, use 'authorization' instead." version=2.50.1

Example ServiceMonitor with bearerTokenFile that causes the warning in the prometheus operator:

$ oc -n openshift-apiserver-operator get servicemonitor openshift-apiserver-operator -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
 ...
$ oc explain servicemonitor.spec.endpoints.bearerTokenFile
GROUP:      monitoring.coreos.com
KIND:       ServiceMonitor
VERSION:    v1

FIELD: bearerTokenFile <string>

DESCRIPTION:
    File to read bearer token for scraping the target.
    Deprecated: use `authorization` instead.
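
For comparison, a hedged sketch of the non-deprecated form on a ServiceMonitor endpoint fragment; the Secret name and key are hypothetical, and the built-in ServiceMonitors above would need an equivalent migration from their component operators:

spec:
  endpoints:
  - authorization:
      type: Bearer
      credentials:                 # hypothetical Secret holding the scrape token
        name: metrics-client-token
        key: token
    interval: 30s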

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-04-07-182401   True        False         52m     Cluster version is 4.16.0-0.nightly-2024-04-07-182401

How reproducible:

with Prometheus Operator 0.73.0

Steps to Reproduce:

1. check prometheus-operator logs
    

Actual results:

too many warnings for "'bearerTokenFile' is deprecated, use 'authorization' instead."

Expected results:

no warnings

Description of problem:

"Oh no! Something went wrong." will shown on Pending pod details page

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-04-14-063437

How reproducible:

always

Steps to Reproduce:

1. Create a dummy pod with Pending status, e.g.:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
    
OR 

apiVersion: v1
kind: Pod
metadata:
  name: dummy-pod
spec:
  containers:
  - name: dummy-pod
    image: ubuntu
  restartPolicy: Always
  nodeSelector:
    testtype: pending


2. Navigate to Pod Details page
3.
    

Actual results:

"Oh no! Something went wrong." is shown:

TypeError
Description:Cannot read properties of undefined (reading 'restartCount')

Component trace:
    at fe (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:562500)
    at div
    at div
    at ve (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:563346)
    at div
    at ke (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:571308)
    at i (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:329180)
    at _ (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-4dc722526d0f0470939e.min.js:31:4920)
    at ne (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-4dc722526d0f0470939e.min.js:31:10364)
    at Suspense
    at div
    at k (https://console-openshift-console.apps.qe-daily-416-0415.qe.azure.devcluster.openshift.com/static/main-chunk-7643d3f1edb399bb7d65.min.js:1:118938)

Expected results:

    No error page; the pod details page renders normally.

Additional info:

Enable pod security labels when creating the pod via the UI:
$ oc label namespace <ns> security.openshift.io/scc.podSecurityLabelSync=false --overwrite
$ oc label namespace <ns> pod-security.kubernetes.io/enforce=privileged --overwrite
$ oc label namespace <ns> pod-security.kubernetes.io/audit=privileged --overwrite
$ oc label namespace <ns> fix

Description of problem:

When using the generated file to create the CatalogSource, the creation failed with the following error:
[root@preserve-fedora36 cluster-resources]# oc create -f cs-redhat-operator-index-v4-15.yaml 
The CatalogSource "cs-redhat-operator-index-v4-15" is invalid: 
* spec.icon.base64data: Required value
* spec.icon.mediatype: Required value
[root@preserve-fedora36 cluster-resources]# cat cs-redhat-operator-index-v4-15.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: null
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  icon: {}
  image: ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
status: {}

 

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. Use the following ImageSetConfiguration to mirror to local storage:
cat config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
#archiveSize: 8
storageConfig:
  local:
    path: /app1/ocmirror/offline
mirror:
  platform:
    channels:
    - name: stable-4.12                                             
      type: ocp
      minVersion: '4.12.46'
      maxVersion: '4.12.46'
      shortestPath: true
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: advanced-cluster-management                                  
      channels:
      - name: release-2.9             
    - name: compliance-operator
      channels:
      - name: stable
    - name: multicluster-engine
      channels:
      - name: stable-2.4
      - name: stable-2.5
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest                        
  - name: registry.redhat.io/rhel8/support-tools:latest
  - name: registry.access.redhat.com/ubi8/nginx-120:latest
  - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
  - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
`oc-mirror --config config.yaml  file://operatortest --v2`
2. Mirror to the registry:
`oc-mirror  --config config.yaml --from file://operatortest   docker://ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000  --v2`

3. Create the CatalogSource with the generated file:
cat cs-redhat-operator-index-v4-15.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: null
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  icon: {}
  image: ec2-3-144-93-237.us-east-2.compute.amazonaws.com:5000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
status: {}

oc create -f cs-redhat-operator-index-v4-15.yaml 
The CatalogSource "cs-redhat-operator-index-v4-15" is invalid: 
* spec.icon.base64data: Required value
* spec.icon.mediatype: Required value

Actual results: 

Failed to create catalogsource by the created file.

Expected results:

No error.
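As a hedged workaround sketch (an assumption, not a confirmed fix), dropping the empty icon stanza from the generated file before applying it should avoid the validation error, since spec.icon only requires base64data and mediatype when the field is present; yq is optional here:

$ yq -i 'del(.spec.icon)' cs-redhat-operator-index-v4-15.yaml   # or simply delete the "icon: {}" line by hand
$ oc create -f cs-redhat-operator-index-v4-15.yaml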

Please review the following PR: https://github.com/openshift/cluster-api/pull/190

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The installation of OpenShift Container Platform 4.13.4 fails fairly frequently compared to previous versions when installing with a proxy configured.

The error reported by the MachineConfigPool is as shown below.

  - lastTransitionTime: "2023-07-04T10:36:44Z"
    message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io
      \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com
      is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\"
      not found", Node master2.example.com is reporting:
      "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\"
      not found"'

According to https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx this seems to be a known condition, but it is not clear how to prevent it from happening and therefore ensure installations work as expected.

The major difference found between /etc/mcs-machine-config-content.json on the OpenShift Container Platform 4 - Control-Plane Node and the rendered-master-${hash} are within the following files.

 - /etc/mco/proxy.env
 - /etc/kubernetes/kubelet-ca.crt

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.4

How reproducible:

Random

Steps to Reproduce:

1. Install OpenShift Container Platform 4.13.4 on AWS with platform:none, proxy defined and both machineCIDR and machineNetwork.cidr set.

Actual results:

Installation is stuck and will eventually fail as the MachineConfigPool is failing to rollout required MachineConfig for master MachineConfigPool

  - lastTransitionTime: "2023-07-04T10:36:44Z"
    message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io
      \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com
      is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\"
      not found", Node master2.example.com is reporting:
      "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\"
      not found"'

Expected results:

Installation to work or else provide meaningful error messaging 

Additional info:

https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx was checked, and Red Hat Engineering was then consulted as it was not clear how to proceed.
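A hedged diagnostic sketch (assuming cluster access while the installation is stuck) to see exactly which fields differ between the served config and the rendered MachineConfig; the node name and hash are placeholders:

$ oc get machineconfig | grep rendered-master
$ oc debug -q node/master0.example.com -- chroot /host cat /etc/mcs-machine-config-content.json > mcs-content.json
$ oc get machineconfig rendered-master-<hash> -o json > rendered-master.json
$ diff <(jq -S . mcs-content.json) <(jq -S . rendered-master.json)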

Description of problem:

If a ROSA HCP customer uses the default worker security group that the CPO creates for some other purpose (e.g. creates their own VPC Endpoint or EC2 instance using this security group) and then starts an uninstallation, the uninstallation will hang indefinitely because the CPO is unable to delete the security group.

https://github.com/openshift/hypershift/blob/9e6255e5e44c8464da0850f8c19dc085bdbaf8cb/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L317-L331    

Version-Release number of selected component (if applicable):

4.14.8    

How reproducible:

100%    

Steps to Reproduce:

    1. Create a ROSA HCP cluster
    2. Attach the default worker security group to some other object unrelated to the cluster, like an EC2 instance or VPC Endpoint
    3. Uninstall the ROSA HCP cluster

Actual results:

The uninstall hangs without much feedback to the customer    

Expected results:

Either that the uninstall gives up and moves on eventually, or that clear feedback is provided to the customer, so that they know that the uninstall is held up because of an inability to delete a specific security group id. If this feedback mechanism is already in place, but not wired through to OCM, this may not be an OCPBUGS and could just be an OCM bug instead!    

Additional info:
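As a hedged diagnostic sketch (not part of the original report), the AWS CLI can show what is still attached to the default worker security group and therefore blocking its deletion; the security group ID below is a placeholder:

$ aws ec2 describe-network-interfaces \
    --filters Name=group-id,Values=sg-0123456789abcdef0 \
    --query 'NetworkInterfaces[].{ID:NetworkInterfaceId,Description:Description,Status:Status}'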

    

The current description of the HighOverallControlPlaneCPU alert is wrong for single-node OpenShift (SNO) cases and can mislead users. We need to add information about SNO clusters to the alert's description.

Description of problem:

Internal registry Pods will panic while deploying OCP on `ca-west-1` AWS Region

Version-Release number of selected component (if applicable):

4.14.2    

How reproducible:

Every time    

Steps to Reproduce:

    1. Deploy OCP on `ca-west-1` AWS Region

Actual results:

$ oc logs image-registry-85b69cd9fc-b78sb -n openshift-image-registry
time="2024-02-08T11:43:09.287006584Z" level=info msg="start registry" distribution_version=v3.0.0+unknown go.version="go1.20.10 X:strictfipsruntime" openshift_version=4.14.0-202311021650.p0.g5e7788a.assembly.stream-5e7788a
time="2024-02-08T11:43:09.287365337Z" level=info msg="caching project quota objects with TTL 1m0s" go.version="go1.20.10 X:strictfipsruntime"
panic: invalid region provided: ca-west-1

goroutine 1 [running]:
github.com/distribution/distribution/v3/registry/handlers.NewApp({0x2873f40?, 0xc00005c088?}, 0xc000581800)
    /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:130 +0x2bf1
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp({0x2873f40, 0xc00005c088}, 0x0?, {0x2876820?, 0xc000676cf0})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:96 +0xb9
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp({0x2873f40?, 0xc00005c088}, {0x285ffd0?, 0xc000916070}, 0xc000581800, 0xc00095c000, {0x0?, 0x0})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:138 +0x485
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer({0x2873f40, 0xc00005c088}, 0xc000581800, 0xc00095c000)
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:212 +0x38a
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute({0x2858b60, 0xc000916000})
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:166 +0x86b
main.main()
    /go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:93 +0x496
    

Expected results:

The internal registry is deployed with no issues    

Additional info:

This is a new AWS Region we are adding support to. The support will be backported to 4.14.z    
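One possible mitigation sketch, under the assumption (not confirmed here) that the S3 storage driver skips its built-in region check when an explicit regionEndpoint is set in the registry operator configuration:

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge \
    -p '{"spec":{"storage":{"s3":{"region":"ca-west-1","regionEndpoint":"https://s3.ca-west-1.amazonaws.com"}}}}'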

Description of problem

The Route API documentation states that the default value for the spec.tls.insecureEdgeTerminationPolicy field is "Allow". However, the observable default behavior is that of "None".

Version-Release number of selected component (if applicable)

OpenShift 3.11 and earlier and OpenShift 4.1 through 4.16.

How reproducible

100%.

Steps to Reproduce

1. Check the documentation: oc explain routes.spec.tls.insecureEdgeTerminationPolicy
2. Create an example application and edge-terminated route without specifying insecureEdgeTerminationPolicy, and try to connect to the route using HTTP:

oc adm new-project hello-openshift
oc -n hello-openshift create -f https://raw.githubusercontent.com/openshift/origin/56867df5e362aab0d2d8fa8c225e6761c7469781/examples/hello-openshift/hello-pod.json
oc -n hello-openshift expose pod hello-openshift
oc -n hello-openshift create route edge --service=hello-openshift
curl -k https://hello-openshift-hello-openshift.apps.<cluster domain>
curl -I http://hello-openshift-hello-openshift.apps.<cluster domain>

Actual results

The documentation states that "Allow" is the default:

% oc explain routes.spec.tls.insecureEdgeTerminationPolicy                        
KIND:     Route
VERSION:  route.openshift.io/v1

FIELD:    insecureEdgeTerminationPolicy <string>

DESCRIPTION:
     insecureEdgeTerminationPolicy indicates the desired behavior for insecure
     connections to a route. While each router may make its own decisions on
     which ports to expose, this is normally port 80.

     * Allow - traffic is sent to the server on the insecure port
     (edge/reencrypt terminations only) (default). * None - no traffic is
     allowed on the insecure port. * Redirect - clients are redirected to the
     secure port.

However, in practice, the default seems to be "None":

% oc adm new-project hello-openshift
Created project hello-openshift
% oc -n hello-openshift create -f https://raw.githubusercontent.com/openshift/origin/56867df5e362aab0d2d8fa8c225e6761c7469781/examples/hello-openshift/hello-pod.json
pod/hello-openshift created
% oc -n hello-openshift expose pod hello-openshift
service/hello-openshift exposed
% oc -n hello-openshift create route edge --service=hello-openshift
route.route.openshift.io/hello-openshift created
% oc -n hello-openshift get routes/hello-openshift -o yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2024-04-02T22:59:32Z"
  labels:
    name: hello-openshift
  name: hello-openshift
  namespace: hello-openshift
  resourceVersion: "27147"
  uid: 50029f66-a089-4ec0-be04-91f176883e2b
spec:
  host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
  tls:
    termination: edge
  to:
    kind: Service
    name: hello-openshift
    weight: 100
  wildcardPolicy: None
status:
  ingress:
  - conditions:
    - lastTransitionTime: "2024-04-02T22:59:32Z"
      status: "True"
      type: Admitted
    host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
    routerCanonicalHostname: router-default.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
    routerName: default
    wildcardPolicy: None
  - conditions:
    - lastTransitionTime: "2024-04-02T22:59:32Z"
      status: "True"
      type: Admitted
    host: hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
    routerCanonicalHostname: router-custom.custom.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
    routerName: custom
    wildcardPolicy: None
% curl -k https://hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org
Hello OpenShift!
% curl -I http://hello-openshift-hello-openshift.apps.8fbd3fa1605eb7f8632a.hypershift.aws-2.ci.openshift.org 
HTTP/1.0 503 Service Unavailable
pragma: no-cache
cache-control: private, max-age=0, no-cache, no-store
content-type: text/html

Expected results

Given the API documentation, I would maybe expect to see insecureEdgeTerminationPolicy: Allow in the route definition, and I would definitely expect the curl http:// command to succeed.

Alternatively, I would expect the API documentation to state that the default for insecureEdgeTerminationPolicy is "None", based on the observed behavior.
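For comparison, a minimal sketch of explicitly requesting the documented behavior (this assumes an explicit Allow is honored; the output above only shows the implicit default):

% oc -n hello-openshift create route edge hello-openshift-allow --service=hello-openshift --insecure-policy=Allow
% curl -I http://hello-openshift-allow-hello-openshift.apps.<cluster domain>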

Additional info

The current "(default)" text was added in https://github.com/openshift/origin/pull/10983/commits/dc1aecd4bcdae7525536180bab2a0a0083aaa0f4.

Description of problem:

    During automation test execution for the dev-console package, it is observed that Cypress abruptly fails the ongoing test with "uncaught:exception: ResizeObserver loop limit exceeded", but there is no visible failure in the UI.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info: Screenshot

Description of problem:

Installing a cluster with Azure workload identity against a 4.16 nightly build failed, as some cluster operators are degraded.
$ oc get co | grep -v "True        False         False"
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.16.0-0.nightly-2024-02-07-200316   False       False         True       153m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jima416a1.qe.azure.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
console                                    4.16.0-0.nightly-2024-02-07-200316   False       True          True       141m    DeploymentAvailable: 0 replicas available for console deployment...
ingress                                                                         False       True          True       137m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)

Ingress LB public IP is pending to be created
$ oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.199.169   <pending>     80:32007/TCP,443:30229/TCP   154m
router-internal-default   ClusterIP      172.30.112.167   <none>        80/TCP,443/TCP,1936/TCP      154m


Detected that the CCM pods are in CrashLoopBackOff with an error:
$ oc get pod -n openshift-cloud-controller-manager
NAME                                              READY   STATUS             RESTARTS         AGE
azure-cloud-controller-manager-555cf5579f-hz6gl   0/1     CrashLoopBackOff   21 (2m55s ago)   160m
azure-cloud-controller-manager-555cf5579f-xv2rn   0/1     CrashLoopBackOff   21 (15s ago)     160m

error in ccm pod:
I0208 04:40:57.141145       1 azure.go:931] Azure cloudprovider using try backoff: retries=6, exponent=1.500000, duration=6, jitter=1.000000
I0208 04:40:57.141193       1 azure_auth.go:86] azure: using workload identity extension to retrieve access token
I0208 04:40:57.141290       1 azure_diskclient.go:68] Azure DisksClient using API version: 2022-07-02
I0208 04:40:57.141380       1 azure_blobclient.go:73] Azure BlobClient using API version: 2021-09-01
F0208 04:40:57.141471       1 controllermanager.go:314] Cloud provider azure could not be initialized: could not init cloud provider azure: no token file specified. Check pod configuration or set TokenFilePath in the options

Version-Release number of selected component (if applicable):

4.16 nightly build    

How reproducible:

Always    

Steps to Reproduce:

    1. Install cluster with azure workload identity
    2.
    3.
    

Actual results:

    Installation failed because some operators are degraded

Expected results:

    Installation is successful.

Additional info:
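A hedged diagnostic sketch (generic commands, not taken from this report) to check whether the CCM pods actually have a projected service-account token and a token file path configured:

$ oc -n openshift-cloud-controller-manager get pods
$ oc -n openshift-cloud-controller-manager get pod azure-cloud-controller-manager-555cf5579f-hz6gl -o yaml | grep -i -A3 token
$ oc -n openshift-cloud-controller-manager get configmaps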

 

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/8

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ccoctl consumes CredentialsRequests extracted from OpenShift releases and manages secrets associated with those requests for the cluster. Over time, ccoctl has grown a number of CredentialRequest filters, including deletion annotations in CCO-175 and tech-preview annotations in cco#444.

But with OTA-559, 4.14 and later oc adm release extract ... learned about an --included parameter, which allows oc to perform that "will the cluster need this credential?" filtering, and there is no longer a need for ccoctl to perform that filtering, or for ccoctl callers to have to think through "do I need to enable tech-preview CRs for this cluster or not?".

Version-Release number of selected component (if applicable):

4.14 and later.

How reproducible:

100%.

Steps to Reproduce:

$ cat <<EOF >install-config.yaml 
> apiVersion: v1
> platform:
>   gcp:
>     dummy: data
> featureSet: TechPreviewNoUpgrade
> EOF
$ oc adm release extract --included --credentials-requests --install-config install-config.yaml --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test --credentials-requests-dir=credentials-requests

Actual results:

ccoctl doesn't dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview.

Expected results:

ccoctl does dry-run create the TechPreviewNoUpgrade openshift-cluster-api-gcp CredentialsRequest unless you pass it --enable-tech-preview=false.

Additional info:

Longer-term, we likely want to go through some phases of deprecating and maybe eventually removing --enable-tech-preview and the ccoctl-side filtering. But for now, I think we want to pivot to defaulting to true, so that anyone with existing flows that do not include the new --included extraction has an easy way to keep their workflow going (they can set --enable-tech-preview=false). And I think we should backport that to 4.14's ccoctl to simplify OSDOCS-4158's docs#62148. But we're close enough to 4.14's expected GA, that it's worth some consensus-building and alternative consideration, before trying to rush changes back to 4.14 branches.
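If the default flips to true as proposed, callers that already rely on oc adm release extract --included could opt out roughly like this (a usage sketch based on the flag named above, not a committed interface):

$ ccoctl gcp create-all --dry-run --name=test --region=test --project=test \
    --credentials-requests-dir=credentials-requests --enable-tech-preview=false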

Description of problem

I had a version of MTC installed on my cluster when it was running a prior version. I had deleted it some time ago, long before upgrading to 4.15. I upgraded it to 4.15 and needed to reinstall to take a look at something, but found the operator would not install.

I originally tried with 4.15.0, but on failure upgraded to 4.15.3 to see if it would resolve the issue, but it did not.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.15.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.15.3
Kubernetes Version: v1.28.7+6e2789b

How reproducible:

Always as far as I can tell. I have at least two clusters where I was able to reproduce it.

Steps to Reproduce:

    1. Install Migration Toolkit for Containers on OpenShift 4.14
    2. Uninstall it
    3. Upgrade to 4.15
    4. Try to install it again

Actual results:

The operator never installs. The UI just shows "Upgrade status: Unknown Failure"

Observe the catalog operator logs and note errors like:
E0319 21:35:57.350591       1 queueinformer_operator.go:319] sync {"update" "openshift-migration"} failed: bundle unpacking failed with an error: [roles.rbac.authorization.k8s.io "c1572438804f004fb90b6768c203caad96c47331f7ecc4f68c3cf6b43b0acfd" already exists, roles.rbac.authorization.k8s.io "724788f6766aa5ba19b24ef4619b6a8e8e856b8b5fb96e1380f0d3f5b9dcb7a" already exists]

If you delete the roles, you'll get the same for rolebindings, then the same for jobs.batch, and then configmaps.

Expected results:

Operator just installs

Additional info:

If you clean up all these resources the operator will install successfully.    
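A hedged cleanup sketch, assuming the leftover bundle-unpack objects sit in the catalog namespace (openshift-marketplace) or the operator namespace (openshift-migration); the names below are placeholders taken from the catalog operator error messages:

$ oc -n openshift-marketplace get roles,rolebindings,jobs,configmaps | grep <hash-from-error>
$ oc -n openshift-migration get roles,rolebindings,jobs,configmaps | grep <hash-from-error>
$ oc -n <namespace> delete role/<leftover> rolebinding/<leftover> job/<leftover> configmap/<leftover>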

Description of problem

The cluster-ingress-operator repository vendors controller-runtime v0.16.3, which uses Kubernetes 1.28 packages. OpenShift 4.16 is based on Kubernetes 1.29.

Version-Release number of selected component (if applicable)

4.16.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.16/go.mod.

Actual results

The sigs.k8s.io/controller-runtime package is at v0.16.3.

Expected results

The sigs.k8s.io/controller-runtime package is at v0.17.0 or newer.

Additional info

https://github.com/openshift/cluster-ingress-operator/pull/1016 already bumped the k8s.io/* packages to v0.29.0, but ideally the controller-runtime package should be bumped too. The controller-runtime v0.17 release includes some breaking changes, such as the removal of apiutil.NewDiscoveryRESTMapper; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.17.0.
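For reference, a typical bump in the cluster-ingress-operator repository would look roughly like the following, followed by fixing any compile errors from the v0.17 breaking changes (e.g. the apiutil.NewDiscoveryRESTMapper removal); this is a sketch, not the actual PR:

$ go get sigs.k8s.io/controller-runtime@v0.17.0
$ go mod tidy
$ go mod vendor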

Description of problem:

When the console with a custom route is disabled before a cluster upgrade and re-enabled after the upgrade, the console cannot be accessed successfully.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-02-15-111607

How reproducible:

Always

Steps to Reproduce:

1. Launch a cluster with available update.
2. Create custom route for console in ingress configuration:
# oc edit ingresses.config.openshift.io cluster
spec:
  componentRoutes:
  - hostname: console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: console
    namespace: openshift-console
  - hostname: openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com
    name: downloads
    namespace: openshift-console
  domain: apps.qe-413-0216.qe.devcluster.openshift.com
3. After custom route is created, access console with custom route.
4. Remove console by setting managementState as Removed in console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Removed
  operatorLogLevel: Normal
5. Upgrade cluster to a target version.
6. Enable console by setting managementState as Managed in console operator:
# oc edit consoles.operator.openshift.io cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
7. After console resources are created, access console url.

Actual results:

3. Console could be accessed through custom route.
4. Console resources are removed. And all cluster operators are in normal status
# oc get all -n openshift-console
No resources found in openshift-console namespace.

5. Upgrade succeeds, all cluster operators are in normal status
6. Console resources are created:

# oc get all -n openshift-console
NAME                             READY   STATUS    RESTARTS   AGE
pod/console-69d88985b-bvh46      1/1     Running   0          3m41s
pod/console-69d88985b-fwhjf      1/1     Running   0          3m41s
pod/downloads-6b6b555d8d-kn822   1/1     Running   0          3m49s
pod/downloads-6b6b555d8d-wp6zc   1/1     Running   0          3m49s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console ClusterIP 172.30.226.112 <none> 443/TCP 3m50s
service/console-redirect ClusterIP 172.30.147.151 <none> 8444/TCP 3m50s
service/downloads ClusterIP 172.30.251.248 <none> 80/TCP 3m50s

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/console 2/2 2 2 3m47s
deployment.apps/downloads 2/2 2 2 3m50s

NAME DESIRED CURRENT READY AGE
replicaset.apps/console-69d88985b 2 2 2 3m42s
replicaset.apps/console-6dbdd487d 0 0 0 3m47s
replicaset.apps/downloads-6b6b555d8d 2 2 2 3m50s

NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/console console-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com console-redirect custom-route-redirect edge/Redirect None
route.route.openshift.io/console-custom console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com console https reencrypt/Redirect None
route.route.openshift.io/downloads downloads-openshift-console.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None
route.route.openshift.io/downloads-custom openshift-downloads-custom.apps.qe-413-0216.qe.devcluster.openshift.com downloads http edge/Redirect None

7. Could not open the console URL successfully. There is error info for the console operator:

# oc get co | grep console
console   4.13.0-0.nightly-2023-02-15-202607   False   False   False   42s   RouteHealthAvailable: route not yet available, https://console-openshift-custom.apps.qe-413-0216.qe.devcluster.openshift.com returns '503 Service Unavailable'

# oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-02-15-202607   True        False         4h48m   Error while reconciling 4.13.0-0.nightly-2023-02-15-202607: the cluster operator console is not available

Expected results:

7. Should be able to access console successfully.

Additional info:


Please review the following PR: https://github.com/openshift/network-tools/pull/105

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When creating a HyperShift cluster in a disconnected environment, the worker node cannot pass the assisted service's validation due to an ignition error.

    

Version-Release number of selected component (if applicable):

4.14.z    

How reproducible:

  100 %  

Steps to Reproduce:

    1. Steps to install HCP cluster is mentioned in documentation: https://hypershift-docs.netlify.app/labs/dual/mce/agentserviceconfig/#assisted-service-customization
    2.
    3.
    

Actual results:

Node addition fails    

Expected results:

Node should get added to the cluster    

Additional info:

    

Description of problem:

When the IPI installer creates a service instance for the user, PowerVS will now report the type as composite_instance rather than service_instance. Fix up destroy cluster to account for this change.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create cluster
    2. Destroy cluster
    3.
    

Actual results:

The newly created service instance is not deleted.
    

Expected results:


    

Additional info:


    

Description of problem:

    The installer now errors out when attempting to use networkType: OpenShiftSDN, but the message still says "deprecated".

Version-Release number of selected component (if applicable):

4.15+    

How reproducible:

100%

Steps to Reproduce:

    1. Attempt to install 4.15+ with networkType: OpenShiftSDN
Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"    

Actual results:

Observe error in logs:

time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"    

Expected results:

A message more like:

Observe error in logs: time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is not supported, please use OVNKubernetes"

Additional info:
See thread

Description of problem:

In this PR, we started using watcher channels to wait for the job finished event from the periodic and on-demand data gathering jobs from IO.
However, as stated in this comment, part of maintaining a watcher is to re-establish it at the last received resource version whenever this channel closes.

This issue is currently causing flakiness in our test suite: the on-demand data gathering job is created, and when the job is about to finish, the watcher channel closes, which causes the DataGather instance associated with the job to never have its insightsReport updated. Therefore the tests fail.

 

Version-Release number of selected component (if applicable):

    

How reproducible:

Sometimes. Very hard to reproduce, as it might have to do with the API resyncing the watcher's cache.

Steps to Reproduce:

    1.Create a data gathering job
    2.You may see a log saying "watcher channel was closed unexpectedly"

    

Actual results:

The DataGather instance will not be updated with the insightsReport    

Expected results:

When the job finishes, the archive is uploaded to ingress and the report is downloaded from the external data pipeline. This report should appear in the DataGather instance.

Additional info:

It's possible but flaky to reproduce with on-demand data gathering jobs but I've seen it happen with periodic ones as well.

Description of problem:

When there is a new update available for the cluster, clicking "Select a version" on the Cluster Settings page has no effect.
    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-19-033450
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Prepare a cluster with available update.
    2.Go to Cluster Settings page, choose a version by clicking on "Select a version" button.
    3.
    

Actual results:

2. There is no response when clicking the button; the user cannot select a version from the page.
    

Expected results:

2. A modal should show up for user to select version after clicking on "Select a version" button
    

Additional info:

screenshot: https://drive.google.com/file/d/1Kpyu0kUKFEQczc5NVEcQFbf_uly_S60Y/view?usp=sharing
    

Description of problem:

There is a problem with the logic change in https://github.com/openshift/machine-config-operator/pull/4196 that is causing Kubelet to fail to start after a reboot on OpenShiftSDN deployments. This is currently breaking all of the v4 metal jobs.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Deploy baremetal cluster with OpenShiftSDN
    2.
    3.
    

Actual results:

    Nodes fail to join cluster

Expected results:

    Successful cluster deployment

Additional info:

    

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/96

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    HyperShift control plane pods that support auditing (i.e. Kubernetes API server, OpenShift API server, and OpenShift oauth API server) maintain auditing log files that may consume many GB of container ephemeral storage in short period of time.

We need to reduce the size of logs in these containers by modifying audit-log-maxbackup and audit-log-maxsize. This should not change the functionality of the audit logs since all we do is output to stdout in the containerd logs.
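For illustration only, these are standard kube-apiserver-style flags; the concrete values below are assumptions, not the values chosen for the fix:

--audit-log-maxbackup=1   # keep at most one rotated audit log file on the container filesystem
--audit-log-maxsize=10    # rotate the current audit log once it reaches 10 MB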

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/99

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/100

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After upgrading from 4.13.x to 4.14.10, the workload images that the customer stored inside the internal registry are lost, resulting in application pods failing with "Back-off pulling image" errors.

Even when pulling manually with podman, it fails with "manifest unknown" because the image can no longer be found in the registry.


- This behavior was found and reproduced 100% on ARO clusters, where the internal registry is by default backed by the Storage Account created by the ARO RP service principal, i.e. the Containers blob service.

- It is not known whether the same behavior occurs on non-managed Azure clusters or any other architecture.

Version-Release number of selected component (if applicable):

4.14.10

How reproducible:

100% with an ARO cluster (Managed cluster)

Steps to Reproduce:  Attached.

The workaround found so far is to rebuild the apps or re-import the images, but those tasks are lengthy and costly, especially on a production cluster.

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/59

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

In a deployment with bonding and VLAN, during the booting of the provisioning image the system loses connectivity (even ping) as soon as two new network interfaces appear on the node, created by the ironic-python-agent; those interfaces are the slave interfaces with the VLAN added.

Version-Release number of selected component (if applicable):

4.12.48

How reproducible:

Always

Steps to Reproduce:

1. Deploy a cluster with bonding + vlan
2.
3.

Actual results:

After investigation from the OpenStack team, it looks like having the option "enable_vlan_interfaces = all" enabled in "/etc/ironic-python-agent.conf" is what triggers the creation of the VLAN interfaces. These new interfaces are what cut the communication.

Expected results:

No extra vlan interfaces created, communication is not lost and installation succeeds.

Additional info:

How the customer crafted the test:

As soon as the node starts pinging, they connect with SSH and set a password for the core user.
Once communication is lost (~1 min after pinging starts), they connect through the KVM interface with the core password.
If the ironic-python-agent is disabled and the created VLAN interfaces are manually removed, communication is restored. Installation works if LLDP is turned off at the switch.

This issue was supposed to be fixed in these versions, according to the original JIRA which I have linked here.

The team lead from that JIRA suggested the issue has to be fixed by re-vendoring ICC in the assisted-service, hence the creation of this JIRA.

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/42

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When deploying to Power VS with endpoint overrides set in the provider status, the operator will ignore the overrides.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Easily

Steps to Reproduce:

    1. Set overrides in platform status
    2. Deploy cluster-image-registry-operator
    3. Endpoints are ignored
    

Actual results:

    Specified endpoints are ignored

Expected results:

    Specified endpoints are used

Additional info:
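A hedged verification sketch (the field path is assumed from the Infrastructure API, not taken from this report) to confirm the overrides are present in the platform status while the operator ignores them:

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.powervs.serviceEndpoints}'
$ oc get configs.imageregistry.operator.openshift.io/cluster -o yaml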

    

Description of problem:

  "[sig-apps][Feature:DeploymentConfig] deploymentconfigs when tagging images should successfully tag the deployed image [apigroup:apps.openshift.io][apigroup:authorization.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" has the following warning: warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-03-07-234116

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2027

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/99

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/70

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/69

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/48

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We need to update packages_ironic.yml to be closer to the current opendev master upper constraints.
After the new packages are created, we'll have to tag them and update the ironic-image configuration.

Description of problem:

Automate E2E tests of Dynamic OVS Pinning. This bug is created for merging 

https://github.com/openshift/cluster-node-tuning-operator/pull/746

Version-Release number of selected component (if applicable):

4.15.0

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When using canary rollout, paused MCPs begin updating when the user triggers the cluster update.

Version-Release number of selected component (if applicable):

 

How reproducible:

Approximately 3/10 times that I have witnessed.

Steps to Reproduce:

1. Install cluster
2. Follow canary rollout strategy: https://docs.openshift.com/container-platform/4.11/updating/update-using-custom-machine-config-pools.html 
3. Start cluster update

Actual results:

Worker nodes in paused MCPs begin update

Expected results:

Worker nodes in paused MCPs will not begin update until cluster admin unpauses the MCPs

Additional info:

This has occurred with my customer in their Azure self-managed cluster and their on-prem cluster in vSphere, as well as my lab cluster in vSphere.
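A hedged sketch of the relevant canary step, assuming a custom pool named workerpool-canary as in the linked documentation: pause the pool, then verify spec.paused is still true after triggering the update:

$ oc patch mcp/workerpool-canary --type merge -p '{"spec":{"paused":true}}'
$ oc get mcp workerpool-canary -o jsonpath='{.spec.paused}{"\n"}'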

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/61

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

Per the latest decision, Red Hat is not going to support installing OCP clusters on Nutanix with nested virtualization. Thus the "Install OpenShift Virtualization" checkbox on the "Operators" page should be disabled when the "Nutanix" platform is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

 

Description of problem:

NodeLogQuery e2e tests are failing with Kubernetes 1.28 bump. Example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1646/pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-ovn/1683472309211369472

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/60

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/364

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

This was discovered during Contrail testing when a large number of additional manifests specific to contrail were added to the openshift/ dir. The additional manifests are here - https://github.com/Juniper/contrail-networking/tree/main/releases/23.1/ocp.

When creating the agent image the following error occurred:
failed to fetch Agent Installer ISO: failed to generate asset \"Agent Installer ISO\": failed to create overwrite reader for ignition: content length (802204) exceeds embed area size (262144)"]

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The YAML sidebar is occupying too much space on some pages   

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-03-140457    

How reproducible:

Always    

Steps to Reproduce:

1. Go to Deployment/DeploymentConfig creation page
2. Choose 'YAML view'
3. (for comparison) Go to other resources YAML page, open the sidebar    

Actual results:

We can see the sidebar occupies too much of the screen compared with other resources' YAML pages

Expected results:

We should reduce the space sidebar occupies

Additional info:

    

Description of problem:

The two tasks will always "UPDATE" the underlying "APIService" resource even when no changes are to be made.
This behavior significantly elevates the likelihood of encountering conflicts, especially during upgrades, with other controllers that are concurrently monitoring the same resources (e.g. the CA controller). Moreover, this consumes resources for unnecessary work.

The tasks rely on CreateOrUpdateAPIService: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745 which always "UPDATE" the APIService resource.

Version-Release number of selected component (if applicable):

    

How reproducible:

Keep a cluster running, then take a look at the audit logs concerning the APIService v1beta1.metrics.k8s.io; you could use: oc adm must-gather -- /usr/bin/gather_audit_logs

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

You would see that every "get" is followed by an "update"

You can also take a look at the code taking care of that: https://github.com/openshift/cluster-monitoring-operator/blob/5e394dd9de305cb6927a23c31b3f9651aa806fb8/pkg/client/client.go#L1725-L1745

Expected results:

"Updates" should be avoided if no changes are to be made.

Additional info:

it'd be even better if we could avoid the "get"s, but that would be another subject to discuss.
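A hedged sketch for confirming the pattern directly from the kube-apiserver audit logs (default audit log path assumed), filtering for update events on the v1beta1.metrics.k8s.io APIService:

$ oc adm node-logs --role=master --path=kube-apiserver/audit.log \
    | grep v1beta1.metrics.k8s.io \
    | jq -r 'select(.verb=="update") | .stageTimestamp + " " + .user.username'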

Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/81

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We've moved to using BigQuery for this; stop pushing and free up some Loki cycles.

This is done with a command in origin: https://github.com/openshift/origin/blob/24e011ba3adf2767b88619351895bb878de3d62a/pkg/cmd/openshift-tests/dev/dev.go#L211

So all this code and probably some libraries could be removed.

But first, remove the invocation of this command in the release repo. (upload-intervals)

To accommodate upgrades from 4.12 to 4.13 on a FIPS cluster, a RHEL 8 binary needed to be included in the RHEL 9-based 4.13 ovn-kubernetes container image. See https://issues.redhat.com/browse/OCPBUGS-15962 for details.

This workaround is not needed on 4.14+ clusters, as minor upgrades from 4.12 will always land on 4.13.

A fix in the ovn-kubernetes repo needs to be accompanied by a config change in ocp-build-data, please coordinate with ART.

Description of the problem:

It is impossible to create an extra partition on the main disk at installation time with OCP 4.15. It works perfectly with 4.14 and earlier.

I supply a custom MachineConfig manifest to do so, and the behavior is that during installation, after the reboot, the screen is blank and the host has no networking (no route to host).

A slack thread explaining the issue with further debugging can be consulted in https://redhat-internal.slack.com/archives/C999USB0D/p1707991107757299

The bug seems to be introduced in https://github.com/openshift/assisted-installer/pull/713, which allows for one less reboot at installation time; to do that, it implements part of the post-reboot code. This code runs BEFORE the extra partition is created, which creates the problem.

How reproducible:

Always

Steps to reproduce:

1. Create a 4.15 cluster with an extra manifest that creates an extra partition at the end of the main disk

Example of machineconfig (change device to match installation disk): 

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 98-extra-partition
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      disks:
        - device: /dev/vda
          partitions:
            - label: javi
              startMiB: 110000 # space left for the CoreOS partition.
              sizeMiB: 0 # Use all available space 

 

2. Proceed with the installation

Actual results:

After reboot, node never comes back up

Expected results:

Cluster installs without problem

Description of problem:

While deploying a cluster with OVNKubernetes or applying a cloud-provider-config change, all OCP nodes end up with a failing unit on them:

$  oc debug -q node/ostest-h9vbm-master-0 -- chroot  /host sudo systemctl list-units --failed
  UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
● afterburn-hostname.service loaded failed failed Afterburn Hostname

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

$ oc debug -q node/ostest-h9vbm-master-0 -- chroot  /host sudo systemctl status afterburn-hostname
× afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2023-04-18 11:48:35 UTC; 2h 26min ago
   Main PID: 1309 (code=exited, status=123)
        CPU: 148ms
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     1: maximum number of retries (10) reached
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     2: failed to fetch
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     3: error sending request for url (http://169.254.169.254/latest/meta-data/hostname): error trying to connect: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     4: error trying to connect: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     5: tcp connect error: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 openstack-afterburn-hostname[1314]:     6: Network is unreachable (os error 101)
Apr 18 11:48:35 ostest-h9vbm-master-0 hostnamectl[2494]: Too few arguments.
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=123/n/a
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Apr 18 11:48:35 ostest-h9vbm-master-0 systemd[1]: Failed to start Afterburn Hostname.


$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot  /host sudo systemctl list-units --failed
  UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
● afterburn-hostname.service loaded failed failed Afterburn Hostname

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

Once the installation of the config change is done, restarting the service resolves the issue:

$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot  /host sudo systemctl restart afterburn-hostname

$ oc debug -q node/ostest-h9vbm-worker-0-fkxdr -- chroot  /host sudo systemctl status afterburn-hostname
○ afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: inactive (dead) since Tue 2023-04-18 14:14:40 UTC; 9s ago
    Process: 171875 ExecStart=/usr/local/bin/openstack-afterburn-hostname (code=exited, status=0/SUCCESS)
   Main PID: 171875 (code=exited, status=0/SUCCESS)
        CPU: 119ms
Apr 18 14:14:32 ostest-h9vbm-worker-0-fkxdr systemd[1]: Starting Afterburn Hostname...
Apr 18 14:14:39 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:39.521 WARN failed to locate config-drive, using the metadata service API instead
Apr 18 14:14:39 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:39.583 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt #1
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:40.237 INFO Fetch successful
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr openstack-afterburn-hostname[171876]: Apr 18 14:14:40.237 INFO wrote hostname ostest-h9vbm-worker-0-fkxdr to /dev/stdout
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr systemd[1]: afterburn-hostname.service: Deactivated successfully.
Apr 18 14:14:40 ostest-h9vbm-worker-0-fkxdr systemd[1]: Finished Afterburn Hostname.
error: non-zero exit code from debug container
[stack@undercloud-0 ~]$ oc debug -q node/ostest-h9vbm-master-0 -- chroot  /host sudo systemctl status afterburn-hostname
× afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2023-04-18 11:48:35 UTC; 2h 26min ago
   Main PID: 1309 (code=exited, status=123)
        CPU: 148ms

Version-Release number of selected component (if applicable):

Observed on 4.13.0-0.nightly-2023-04-13-171034 and 4.12.13

How reproducible:

Always

Additional info:

Increasing the number of retries or spreading them out over a longer period can help resolve this. It seems that with OVN-K the network takes time to become ready, so with the current configuration the retries time out before the network is ready.

Must-gather link provided on private comment.

Description of problem:

 CPMS is supported in 4.15 on the vSphere platform when TechPreviewNoUpgrade is enabled, but after building a cluster with no failure domain (or a single failure domain) set in install-config, the generated ControlPlaneMachineSet contained three duplicated failure domains.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-11-033133

How reproducible:

    Install a cluster with TechPreview enabled and do not set a failure domain (or set a single failure domain) in install-config.

Steps to Reproduce:

    1. Do not configure a failure domain in install-config (or set a single failure domain).
    2. Install a cluster with TechPreview enabled.
    3. Check the CPMS with the command:
       oc get controlplanemachineset -oyaml    

Actual results:

Duplicated failure domains:

        failureDomains:
          platform: VSphere
          vsphere:
          - name: generated-failure-domain
          - name: generated-failure-domain
          - name: generated-failure-domain
        metadata:
          labels:

Expected results:

 The failure domain should not be duplicated when a single failure domain is set in install-config.
 No failure domain should be present when no failure domain is set in install-config.
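For illustration, a sketch of what the generated section would be expected to contain in the single-failure-domain case (the generated-failure-domain name is taken from the actual output above):

        failureDomains:
          platform: VSphere
          vsphere:
          - name: generated-failure-domain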

Additional info:

    

Description of problem:

For certain operations the CEO will check the etcd member health by creating a client directly and waiting for its status report.

In a situation where any member is unreachable for a longer period, we found the CEO constantly getting stuck/deadlocked and unable to move certain controllers forward.

In OCPBUGS-12475 we introduced a health check that dumps the stack and automatically restarts the operator through the deployment health probe.

In a more recent upgrade run we identified the culprit [1]: a missing context during etcd client initialization, which blocks indefinitely:


W0229 02:55:46.820529       1 aliveness_checker.go:33] Controller [EtcdEndpointsController] didn't sync for a long time, declaring unhealthy and dumping stack

goroutine 1426 [select]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth({0x3272768?, 0xc002090310}, {0xc0000a6880, 0x3, 0xc001c98360?})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:64 +0x330
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.(*etcdClientGetter).MemberHealth(0xc000c24540, {0x3272688, 0x4c20080})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:412 +0x18c
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.CheckSafeToScaleCluster({0x324ccd0?, 0xc000b6d5f0?}, {0x3284250?, 0xc0008dda10?}, {0x324e6c0, 0xc000ed4fb0}, {0x3250560, 0xc000ed4fd0}, {0x32908d0, 0xc000c24540})
	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/bootstrap.go:149 +0x28e
github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers.(*QuorumCheck).IsSafeToUpdateRevision(0x2893020?)
	github.com/openshift/cluster-etcd-operator/pkg/operator/ceohelpers/qourum_check.go:37 +0x46
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).syncConfigMap(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x32801b0, 0xc001198540})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:146 +0x5d8
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller.(*EtcdEndpointsController).sync(0xc0002e28c0, {0x32726f8, 0xc0008e60a0}, {0x325d240, 0xc003569e90})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go:66 +0x71
github.com/openshift/cluster-etcd-operator/pkg/operator/health.(*CheckingSyncWrapper).Sync(0xc000f21bc0, {0x32726f8?, 0xc0008e60a0?}, {0x325d240?, 0xc003569e90?})
	github.com/openshift/cluster-etcd-operator/pkg/operator/health/checking_sync_wrapper.go:22 +0x43
github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc00113cd80, {0x32726f8, 0xc0008e60a0}, {0x325d240?, 0xc003569e90?})
	github.com/openshift/library-go@v0.0.0-20240124134907-4dfbf6bc7b11/pkg/controller/factory/base_controller.go:201 +0x43



goroutine 11640 [select]:
google.golang.org/grpc.(*ClientConn).WaitForStateChange(0xc003707000, {0x3272768, 0xc002091260}, 0x3)
	google.golang.org/grpc@v1.58.3/clientconn.go:724 +0xb1
google.golang.org/grpc.DialContext({0x3272768, 0xc002091260}, {0xc003753740, 0x3c}, {0xc00355a880, 0x7, 0xc0023aa360?})
	google.golang.org/grpc@v1.58.3/clientconn.go:295 +0x128e
go.etcd.io/etcd/client/v3.(*Client).dial(0xc000895180, {0x32754a0?, 0xc001785670?}, {0xc0017856b0?, 0x28f6a80?, 0x28?})
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:303 +0x407
go.etcd.io/etcd/client/v3.(*Client).dialWithBalancer(0xc000895180, {0x0, 0x0, 0x0})
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:281 +0x1a9
go.etcd.io/etcd/client/v3.newClient(0xc002484e70?)
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:414 +0x91c
go.etcd.io/etcd/client/v3.New(...)
	go.etcd.io/etcd/client/v3@v3.5.10/client.go:81
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.newEtcdClientWithClientOpts({0xc0017853d0, 0x1, 0x1}, 0x0, {0x0, 0x0, 0x0?})
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/etcdcli.go:127 +0x77d
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x32726f8, 0xc00318ac30}, 0xc002090460)
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:103 +0xc5
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x6c
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth in goroutine 1426
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

  

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

Version-Release number of selected component (if applicable):

any currently supported OCP version    

How reproducible:

Always    

Steps to Reproduce:

    1. create a healthy cluster
    2. Make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet, or block the etcd ports with a firewall).
    3. Wait for the CEO pod to be restarted by the failing health probe and to dump its stack (similar to the one above).
    

Actual results:

CEO controllers get deadlocked; the operator does eventually restart after some time because the health probes fail.

Expected results:

The CEO should mark the member as unhealthy and continue serving without getting deadlocked; it should not need to restart its pod by failing the health probe.

Additional info:

clientv3.New doesn't take a timeout context and keeps trying to establish a connection forever:

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/etcdcli/etcdcli.go#L127-L130

There's a way to pass the "default context" via the client config, which is slightly misleading.
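For illustration only, a minimal Go sketch (not the CEO's actual fix; the endpoint, TLS handling, and timeout values are placeholders) of the two knobs clientv3.Config exposes for bounding connection setup, which is what the missing context currently prevents:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newHealthCheckClient builds a short-lived client whose connection setup is
// bounded: DialTimeout limits the initial connection attempt, and the
// caller-supplied Context cancels in-flight dials/requests instead of letting
// them hang on an unreachable member.
func newHealthCheckClient(ctx context.Context, endpoints []string, tlsCfg *tls.Config) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		TLS:         tlsCfg,
		DialTimeout: 15 * time.Second,
		Context:     ctx, // the client's default context; cancelling it unblocks the dial
	})
}

func main() {
	// Bound the whole health probe, dial included, to 30 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	cli, err := newHealthCheckClient(ctx, []string{"https://10.0.0.1:2379"}, nil)
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	defer cli.Close()

	if _, err := cli.Status(ctx, "https://10.0.0.1:2379"); err != nil {
		fmt.Println("member unhealthy:", err)
	}
}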

Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/68

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/779

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The following test started to fail frequently in the periodic tests:

External Storage [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Sometimes, but way too often in the CI

Steps to Reproduce:

    1. Run the periodic-ci-openshift-release-master-nightly-X.X-e2e-gcp-ovn-csi test
    

Actual results:

    Provisioning of some volumes fails with

time="2024-01-05T02:30:07Z" level=info msg="resulting interval message" message="{ProvisioningFailed  failed to provision volume with StorageClass \"e2e-provisioning-9385-e2e-scw2z8q\": rpc error: code = Internal desc = CreateVolume failed to create single zonal disk pvc-35b558d6-60f0-40b1-9cb7-c6bdfa9f28e7: failed to insert zonal disk: unknown Insert disk operation error: rpc error: code = Internal desc = operation operation-1704421794626-60e299f9dba08-89033abf-3046917a failed (RESOURCE_OPERATION_RATE_EXCEEDED): Operation rate exceeded for resource 'projects/XXXXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-a/disks/pvc-501347a5-7d6f-4a32-b0e0-cf7a896f316d'. Too frequent operations from the source resource. map[reason:ProvisioningFailed]}"

Expected results:

    Test passes

Additional info:

    Looks like we're hitting the API quota limits with the test

Failed test run example:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-csi/1743082616304701440

Link to Sippy:

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Dynamic%20PV%20%28block%20volmode%29&component=Storage%20%2F%20Kubernetes%20External%20Components&confidence=95&environment=ovn%20no-upgrade%20amd64%20gcp%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-08%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-02%2000%3A00%3A00&testId=openshift-tests%3A7845229f6a2c8faee6573878f566d2f3&testName=External%20Storage%20%5BDriver%3A%20pd.csi.storage.gke.io%5D%20%5BTestpattern%3A%20Dynamic%20PV%20%28block%20volmode%29%5D%20provisioning%20should%20provision%20storage%20with%20pvc%20data%20source%20in%20parallel%20%5BSlow%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=standard&variant=standard

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/415

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A cluster with user-provisioned image registry storage accounts fails to upgrade to 4.14.20 because the image-registry operator is degraded.

message: "Progressing: The registry is ready\nNodeCADaemonProgressing: The daemon set node-ca is deployed\nAzurePathFixProgressing: Migration failed: panic: AZURE_CLIENT_ID is required for authentication\nAzurePathFixProgressing: \nAzurePathFixProgressing: goroutine 1 [running]:\nAzurePathFixProgressing: main.main()\nAzurePathFixProgressing: \t/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:25 +0x15c\nAzurePathFixProgressing: "

cmd/move-blobs was introduced due to https://issues.redhat.com/browse/OCPBUGS-29003.  

 

Version-Release number of selected component (if applicable):

4.14.15+

How reproducible:

I have not reproduced this myself, but you would likely hit it every time when upgrading from 4.13 to 4.14.15+ with an Azure UPI image registry.

 

Steps to Reproduce:

    1. Starting on version 4.13, configure the registry for Azure user-provisioned infrastructure - https://docs.openshift.com/container-platform/4.14/registry/configuring_registry_storage/configuring-registry-storage-azure-user-infrastructure.html.

    2.  Upgrade to 4.14.15+
    3.
    

Actual results:

    Upgrade does not complete successfully
$ oc get co
....
image-registry                             4.14.20        True        False         True       617d     AzurePathFixControllerDegraded: Migration failed: panic: AZURE_CLIENT_ID is required for authentication...

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.38   True        True          7h41m   Unable to apply 4.14.20: wait has exceeded 40 minutes for these operators: image-registry

 

Expected results:

Upgrade to complete successfully

Additional info:

    

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/98

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Port security is enabled on the created ports even though it is set to false in the worker MachineSet configuration.

Version-Release number of selected component (if applicable):

    OCP=4.14.14
    RHOSP=17.1

How reproducible:

    NFV Perf lab 
    ShiftonStack Deployment mode = IPI

Steps to Reproduce:

    1. Network configuration resources for the worker node:
$ oc get machinesets.machine.openshift.io -n openshift-machine-api | grep worker
5kqfbl3y0rhocpnfv-wj2jj-worker-0   1         1         1       1           5d23h
$ oc describe machinesets.machine.openshift.io -n openshift-machine-api 5kqfbl3y0rhocpnfv-wj2jj-worker-0
Name:         5kqfbl3y0rhocpnfv-wj2jj-worker-0
Namespace:    openshift-machine-api
Labels:       machine.openshift.io/cluster-api-cluster=5kqfbl3y0rhocpnfv-wj2jj
              machine.openshift.io/cluster-api-machine-role=worker
              machine.openshift.io/cluster-api-machine-type=worker
Annotations:  machine.openshift.io/memoryMb: 47104
              machine.openshift.io/vCPU: 26
API Version:  machine.openshift.io/v1beta1
Kind:         MachineSet
Metadata:
  Creation Timestamp:  2024-03-07T05:24:07Z
  Generation:          3
  Resource Version:    226098
  UID:                 8cb06872-9b62-4c2c-b66b-bf91a03efa2d
Spec:
  Replicas:  1
  Selector:
    Match Labels:
      machine.openshift.io/cluster-api-cluster:     5kqfbl3y0rhocpnfv-wj2jj
      machine.openshift.io/cluster-api-machineset:  5kqfbl3y0rhocpnfv-wj2jj-worker-0
  Template:
    Metadata:
      Labels:
        machine.openshift.io/cluster-api-cluster:       5kqfbl3y0rhocpnfv-wj2jj
        machine.openshift.io/cluster-api-machine-role:  worker
        machine.openshift.io/cluster-api-machine-type:  worker
        machine.openshift.io/cluster-api-machineset:    5kqfbl3y0rhocpnfv-wj2jj-worker-0
    Spec:
      Lifecycle Hooks:
      Metadata:
      Provider Spec:
        Value:
          API Version:        machine.openshift.io/v1alpha1
          Availability Zone:  worker
          Cloud Name:         openstack
          Clouds Secret:
            Name:        openstack-cloud-credentials
            Namespace:   openshift-machine-api
          Config Drive:  true
          Flavor:        sos-worker
          Image:         5kqfbl3y0rhocpnfv-wj2jj-rhcos
          Kind:          OpenstackProviderSpec
          Metadata:
          Networks:
            Filter:
            Subnets:
              Filter:
                Id:  7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34
          Ports:
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p1
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p1
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p2
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p2
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p3
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p3
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    1a892dcf-bf93-46ef-bf37-bda6cf923471
            Name Suffix:    provider3p4
            Network ID:     50a557b5-34c2-4c47-b539-963688f7167c
            Port Security:  false
            Tags:
              sriov
            Trunk:      false
            Vnic Type:  direct
            Fixed I Ps:
              Subnet ID:    76430b9e-302f-428d-916a-77482d9cfb19
            Name Suffix:    provider4p4
            Network ID:     e2106b16-8f83-4e2e-bdbd-20e2c12ec279
            Port Security:  false
            Tags:
              sriov
            Trunk:         false
            Vnic Type:     direct
          Primary Subnet:  7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34
          Security Groups:
            Filter:
            Name:             5kqfbl3y0rhocpnfv-wj2jj-worker
          Server Group Name:  5kqfbl3y0rhocpnfv-wj2jj-worker-worker
          Server Metadata:
            Name:                  5kqfbl3y0rhocpnfv-wj2jj-worker
            Openshift Cluster ID:  5kqfbl3y0rhocpnfv-wj2jj
          Tags:
            openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj
          Trunk:  true
          User Data Secret:
            Name:  worker-user-data
Status:
  Available Replicas:      1
  Fully Labeled Replicas:  1
  Observed Generation:     3
  Ready Replicas:          1
  Replicas:                1
Events:                    <none>
$ oc get nodes
NAME                                     STATUS   ROLES                  AGE     VERSION
5kqfbl3y0rhocpnfv-wj2jj-master-0         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-master-1         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-master-2         Ready    control-plane,master   5d23h   v1.27.10+28ed2d7
5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr   Ready    worker                 5d22h   v1.27.10+28ed2d7
$ oc describe nodes 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Name:               5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=sos-worker
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=worker
                    feature.node.kubernetes.io/network-sriov.capable=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=sos-worker
                    node.openshift.io/os_id=rhcos
                    topology.cinder.csi.openstack.org/zone=worker
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=worker
Annotations:        alpha.kubernetes.io/provided-node-ip: 192.168.0.91
                    csi.volume.kubernetes.io/nodeid: {"cinder.csi.openstack.org":"aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879"}
                    machine.openshift.io/machine: openshift-machine-api/5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-8c613531f97974a9561f8b0ada0c2cd0
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 505735
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
                    tuned.openshift.io/bootcmdline:
                      skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on rcu_nocbs=10-25 tuned.non_isolcpus=000003ff systemd.cpu_affinity=0,1,2,3...
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 07 Mar 2024 06:09:31 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
  AcquireTime:     <unset>
  RenewTime:       Wed, 13 Mar 2024 04:55:28 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:00 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 13 Mar 2024 04:55:33 +0000   Thu, 07 Mar 2024 15:18:05 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.0.91
  Hostname:    5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr
Capacity:
  cpu:                          26
  ephemeral-storage:            104266732Ki
  hugepages-1Gi:                20Gi
  hugepages-2Mi:                0
  memory:                       47264764Ki
  openshift.io/intl_provider3:  4
  openshift.io/intl_provider4:  4
  pods:                         250
Allocatable:
  cpu:                          16
  ephemeral-storage:            95018478229
  hugepages-1Gi:                20Gi
  hugepages-2Mi:                0
  memory:                       25166844Ki
  openshift.io/intl_provider3:  4
  openshift.io/intl_provider4:  4
  pods:                         250
System Info:
  Machine ID:                             aa5cfdcbeb4646d88ac25bb6f0c0d879
  System UUID:                            aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
  Boot ID:                                77573755-0d27-4717-80fe-4579692d9c2c
  Kernel Version:                         5.14.0-284.54.1.el9_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 414.92.202402201520-0 (Plow)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.27.3-6.rhaos4.14.git7eb2281.el9
  Kubelet Version:                        v1.27.10+28ed2d7
  Kube-Proxy Version:                     v1.27.10+28ed2d7
ProviderID:                               openstack:///aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
Non-terminated Pods:                      (19 in total)
  Namespace                               Name                                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                               ----                                                 ------------  ----------  ---------------  -------------  ---
  crucible-rickshaw                       testpmd-host-device-e810-sriov                       10 (62%)      10 (62%)    10000Mi (40%)    10000Mi (40%)  3d13h
  openshift-cluster-csi-drivers           openstack-cinder-csi-driver-node-hnv49               30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         5d22h
  openshift-cluster-node-tuning-operator  tuned-fcjfp                                          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5d22h
  openshift-dns                           dns-default-v7s59                                    60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         5d22h
  openshift-dns                           node-resolver-gkz8b                                  5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         5d22h
  openshift-image-registry                node-ca-p5dn5                                        10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5d22h
  openshift-ingress-canary                ingress-canary-fk59t                                 10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         5d22h
  openshift-machine-config-operator       machine-config-daemon-9qw8z                          40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         5d22h
  openshift-monitoring                    node-exporter-czcmj                                  9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         5d22h
  openshift-monitoring                    prometheus-adapter-7696787779-vj5wk                  1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         5d4h
  openshift-multus                        multus-additional-cni-plugins-l7rpv                  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5d22h
  openshift-multus                        multus-nxr6k                                         10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         5d22h
  openshift-multus                        network-metrics-daemon-tb7sq                         20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         5d22h
  openshift-network-diagnostics           network-check-target-pqtp9                           10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         5d22h
  openshift-openstack-infra               coredns-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr       200m (1%)     0 (0%)      400Mi (1%)       0 (0%)         5d22h
  openshift-openstack-infra               keepalived-5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr    200m (1%)     0 (0%)      400Mi (1%)       0 (0%)         5d22h
  openshift-sdn                           sdn-9mdnb                                            110m (0%)     0 (0%)      220Mi (0%)       0 (0%)         5d22h
  openshift-sriov-network-operator        sriov-device-plugin-tr68w                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5d13h
  openshift-sriov-network-operator        sriov-network-config-daemon-dtf95                    100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         5d22h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                     Requests       Limits
  --------                     --------       ------
  cpu                          10845m (67%)   10 (62%)
  memory                       11928Mi (48%)  10000Mi (40%)
  ephemeral-storage            0 (0%)         0 (0%)
  hugepages-1Gi                8Gi (40%)      8Gi (40%)
  hugepages-2Mi                0 (0%)         0 (0%)
  openshift.io/intl_provider3  4              4
  openshift.io/intl_provider4  4              4
Events:                        <none>

    2. OpenStack Network resource for Worker node
$ openstack server list --all --fit-width
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
| ID                                   | Name                                   | Status | Networks                                                                                                             | Image                         | Flavor     |
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
| aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr | ACTIVE | management=192.168.0.91; provider-3=192.168.177.197, 192.168.177.59, 192.168.177.66, 192.168.177.83;                 | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-worker |
|                                      |                                        |        | provider-4=192.168.178.108, 192.168.178.121, 192.168.178.144, 192.168.178.18                                         |                               |            |
| 1a24baf3-acde-49a0-ab8e-4f4afcc9d3cc | 5kqfbl3y0rhocpnfv-wj2jj-master-2       | ACTIVE | management=192.168.0.62                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
| 3e545ab5-6e28-4189-8d94-9272dfa1cd05 | 5kqfbl3y0rhocpnfv-wj2jj-master-1       | ACTIVE | management=192.168.0.78                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
| 97e5c382-0fb0-4a70-b58e-0469d3869a4e | 5kqfbl3y0rhocpnfv-wj2jj-master-0       | ACTIVE | management=192.168.0.93                                                                                              | 5kqfbl3y0rhocpnfv-wj2jj-rhcos | sos-master |
+--------------------------------------+----------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------+-------------------------------+------------+
$ openstack port list --server aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| ID                                   | Name                                               | MAC Address       | Fixed IP Addresses                                                             | Status |
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
| 0a562c29-4ddc-41c4-82e8-13934d3ee273 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-0           | fa:16:3e:16:9a:c3 | ip_address='192.168.0.91', subnet_id='7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34'    | ACTIVE |
| 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4 | fa:16:3e:15:88:d7 | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
| 1778cb62-5fbf-42be-8847-53a7b092bdf5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2 | fa:16:3e:2a:64:e4 | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471' | ACTIVE |
| 557f205b-2674-4f6e-91a2-643fe1702be2 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p1 | fa:16:3e:56:a3:48 | ip_address='192.168.177.83', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| 721b5f15-2dc9-4509-a4ba-09f364ae8771 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p3 | fa:16:3e:dd:c3:28 | ip_address='192.168.177.59', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| 9da4b1be-27d7-4428-a194-9eb4b02f6ac5 | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p3 | fa:16:3e:fb:06:1b | ip_address='192.168.178.144', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
| a72fcbd2-83d3-4fa9-be3d-e9fbde27d4bf | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p4 | fa:16:3e:a9:28:0e | ip_address='192.168.177.66', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'  | ACTIVE |
| ba5cd10f-c6bc-4bed-b978-3b8a3560ad5c | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p1 | fa:16:3e:33:e4:c4 | ip_address='192.168.178.18', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19'  | ACTIVE |
| bf2ce123-76fc-4e5c-9e4f-0473febbdeac | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p2 | fa:16:3e:ce:91:10 | ip_address='192.168.178.121', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19' | ACTIVE |
+--------------------------------------+----------------------------------------------------+-------------------+--------------------------------------------------------------------------------+--------+
$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | nfv-intel-11.perflab.com                                                                                                  |
| binding_profile         | pci_slot='0000:b1:11.2', pci_vendor_info='8086:1889', physical_network='provider4'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='178'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-03-07T06:03:43Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj                           |
| device_id               | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-178-108.openstacklocal.', hostname='host-192-168-178-108', ip_address='192.168.178.108'                |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.178.108', subnet_id='76430b9e-302f-428d-916a-77482d9cfb19'                                            |
| id                      | 0c1814db-cd4f-4f6a-a0c6-4f8e569b6767                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:15:88:d7                                                                                                         |
| name                    | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4                                                                        |
| network_id              | e2106b16-8f83-4e2e-bdbd-20e2c12ec279                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | True                                                                                                                      |
| project_id              | 927450d0f06647a99d86214acd822679                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 6                                                                                                                         |
| security_group_ids      | f0df9265-c7fd-4f47-875f-d346e5cb5074                                                                                      |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-03-07T06:04:10Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
$ openstack port show --fit-width 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                     |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                                                        |
| allowed_address_pairs   |                                                                                                                           |
| binding_host_id         | nfv-intel-11.perflab.com                                                                                                  |
| binding_profile         | pci_slot='0000:b1:01.1', pci_vendor_info='8086:1889', physical_network='provider3'                                        |
| binding_vif_details     | connectivity='l2', port_filter='False', vlan='177'                                                                        |
| binding_vif_type        | hw_veb                                                                                                                    |
| binding_vnic_type       | direct                                                                                                                    |
| created_at              | 2024-03-07T06:03:41Z                                                                                                      |
| data_plane_status       | None                                                                                                                      |
| description             | Created by cluster-api-provider-openstack cluster openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj                           |
| device_id               | aa5cfdcb-eb46-46d8-8ac2-5bb6f0c0d879                                                                                      |
| device_owner            | compute:worker                                                                                                            |
| device_profile          | None                                                                                                                      |
| dns_assignment          | fqdn='host-192-168-177-197.openstacklocal.', hostname='host-192-168-177-197', ip_address='192.168.177.197'                |
| dns_domain              |                                                                                                                           |
| dns_name                |                                                                                                                           |
| extra_dhcp_opts         |                                                                                                                           |
| fixed_ips               | ip_address='192.168.177.197', subnet_id='1a892dcf-bf93-46ef-bf37-bda6cf923471'                                            |
| id                      | 1778cb62-5fbf-42be-8847-53a7b092bdf5                                                                                      |
| ip_allocation           | None                                                                                                                      |
| mac_address             | fa:16:3e:2a:64:e4                                                                                                         |
| name                    | 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider3p2                                                                        |
| network_id              | 50a557b5-34c2-4c47-b539-963688f7167c                                                                                      |
| numa_affinity_policy    | None                                                                                                                      |
| port_security_enabled   | True                                                                                                                      |
| project_id              | 927450d0f06647a99d86214acd822679                                                                                          |
| propagate_uplink_status | None                                                                                                                      |
| qos_network_policy_id   | None                                                                                                                      |
| qos_policy_id           | None                                                                                                                      |
| resource_request        | None                                                                                                                      |
| revision_number         | 9                                                                                                                         |
| security_group_ids      | f0df9265-c7fd-4f47-875f-d346e5cb5074                                                                                      |
| status                  | ACTIVE                                                                                                                    |
| tags                    | cluster-api-provider-openstack, openshift-machine-api-5kqfbl3y0rhocpnfv-wj2jj, openshiftClusterID=5kqfbl3y0rhocpnfv-wj2jj |
| trunk_details           | None                                                                                                                      |
| updated_at              | 2024-03-07T06:10:42Z                                                                                                      |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
$ openstack network list
+--------------------------------------+-------------+--------------------------------------+
| ID                                   | Name        | Subnets                              |
+--------------------------------------+-------------+--------------------------------------+
| 50a557b5-34c2-4c47-b539-963688f7167c | provider-3  | 1a892dcf-bf93-46ef-bf37-bda6cf923471 |
| e2106b16-8f83-4e2e-bdbd-20e2c12ec279 | provider-4  | 76430b9e-302f-428d-916a-77482d9cfb19 |
| 5fdddf1c-3a71-4752-94bd-bdb5b9674500 | management  | 7fb7d2d6-325d-49e1-b3f8-b4dbb1197e34 |
+--------------------------------------+-------------+--------------------------------------+
$ openstack network show provider-3
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2024-03-01T16:45:48Z                 |
| description               |                                      |
| dns_domain                |                                      |
| id                        | 50a557b5-34c2-4c47-b539-963688f7167c |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 9216                                 |
| name                      | provider-3                           |
| port_security_enabled     | True                                 |
| project_id                | ad4b9a972ac64bd9916ad7ee80288353     |
| provider:network_type     | vlan                                 |
| provider:physical_network | provider3                            |
| provider:segmentation_id  | 177                                  |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | True                                 |
| status                    | ACTIVE                               |
| subnets                   | 1a892dcf-bf93-46ef-bf37-bda6cf923471 |
| tags                      |                                      |
| updated_at                | 2024-03-01T16:45:52Z                 |
+---------------------------+--------------------------------------+
$ openstack network show provider-4
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2024-03-01T16:45:57Z                 |
| description               |                                      |
| dns_domain                |                                      |
| id                        | e2106b16-8f83-4e2e-bdbd-20e2c12ec279 |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 9216                                 |
| name                      | provider-4                           |
| port_security_enabled     | True                                 |
| project_id                | ad4b9a972ac64bd9916ad7ee80288353     |
| provider:network_type     | vlan                                 |
| provider:physical_network | provider4                            |
| provider:segmentation_id  | 178                                  |
| qos_policy_id             | None                                 |
| revision_number           | 2                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | True                                 |
| status                    | ACTIVE                               |
| subnets                   | 76430b9e-302f-428d-916a-77482d9cfb19 |
| tags                      |                                      |
| updated_at                | 2024-03-01T16:46:01Z                 |
+---------------------------+--------------------------------------+
     3.
    

Actual results:

    The SR-IOV provider ports (e.g. provider3p2, provider4p4) are created with port_security_enabled=True even though the MachineSet sets Port Security: false for them (see the openstack port show output above).

Expected results:

    The ports should be created with port security disabled, honoring the Port Security: false setting in the MachineSet.

Additional info:
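A possible manual workaround sketch (unverified for this environment; the port name is taken from the output above) is to disable port security on an affected Neutron port directly and verify the result:

$ openstack port set --no-security-group --disable-port-security 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4
$ openstack port show -c port_security_enabled 5kqfbl3y0rhocpnfv-wj2jj-worker-0-n6crr-provider4p4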

    

Please review the following PR: https://github.com/openshift/images/pull/154

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.


Description of problem:

Following signing-key deletion there is a service CA rotation process which might temporarily disrupt cluster operators, but eventually all of them should recover. In recent 4.14 nightlies, however, this is no longer the case. Following a deletion of the signing key using
oc delete secret/signing-key -n openshift-service-ca
operators will progress for a while, but eventually console as well as monitoring end up Available=False and Degraded=True, which is only recoverable by manually deleting all the pods in the cluster.
console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable' 
monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The same deletion on previous versions (4.14-ec.2 or earlier) does not have this issue, and the cluster is able to recover eventually without any manual pod deletion. I believe this to be a regression.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

How reproducible:

100%

Steps to Reproduce:

1.oc delete secret/signing-key -n openshift-service-ca
2. wait at least 30+ minutes
3. observe oc get co

Actual results:

Console and monitoring are degraded and do not recover.

Expected results:

The cluster recovers eventually, as in previous versions.

Additional info:

By manually deleting all pods, it is possible to recover the cluster from this state as follows:
for I in $(oc get ns -o jsonpath='{range .items[*]} {.metadata.name}{"\n"} {end}'); \
      do oc delete pods --all -n $I; \
      sleep 1; \
      done

 

must-gather:
https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

 

On ipv6-primary dualstack, it is observed that the test:

"[sig-installer][Suite:openshift/openstack][lb][Serial] The Openstack platform should re-use an existing UDP Amphora LoadBalancer when new svc is created on Openshift with the proper annotation"

fails because the CCM considers the Service to be "internal":

I0216 10:13:07.053922       1 loadbalancer.go:2113] "EnsureLoadBalancer" cluster="kubernetes" service="e2e-test-openstack-sprfn/udp-lb-shared2-svc"
E0216 10:13:07.124915       1 controller.go:298] error processing service e2e-test-openstack-sprfn/udp-lb-shared2-svc (retrying with exponential backoff): failed to ensure load balancer: internal Service cannot share a load balancer
I0216 10:13:07.125445       1 event.go:307] "Event occurred" object="e2e-test-openstack-sprfn/udp-lb-shared2-svc" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: internal Service cannot share a load balancer"

However, neither of the Services has the annotation below:

"service.beta.kubernetes.io/openstack-internal-load-balancer": "true" 

Versions:
4.15.0-0.nightly-2024-02-14-052317
RHOS-16.2-RHEL-8-20230510.n.1

After performing an Agent-Based Installation on bare metal, the master node that was initially the rendezvous host does not join the cluster.

Checking podman containers on this node we see that 'assisted-installer' pod appears with 143 exit code after the second master is detected as ready:

2024-04-01T15:21:14.677437000Z time="2024-04-01T15:21:14Z" level=info msg="Found 1 ready master nodes"
2024-04-01T15:21:19.684831000Z time="2024-04-01T15:21:19Z" level=info msg="Found a new ready master node <second-master> with id <master-id>" 

podman pods status:

$ podman ps -a
CONTAINER ID  IMAGE                                                                                                                   COMMAND               CREATED         STATUS                     PORTS       NAMES
20b338ab8906  localhost/podman-pause:4.4.1-1707368644                                                                                                       16 hours ago    Up 16 hours                            d2b97e733b33-infra
0876c611f655  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128  /bin/bash start_d...  16 hours ago    Up 16 hours                            assisted-db
a9a116bed3a7  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128  /assisted-service     16 hours ago    Up 16 hours                            service
0afbe44c2cf2  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27c5328e1d9a0d7db874c6e52efae631ab3c29a3d4da50c50b2e783dcb784128  /usr/local/bin/ag...  16 hours ago    Exited (0) 16 hours ago                apply-host-config
45da1bdf2440  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b3daca74ad515845d5f8dcf384f0e51d58751a2785414edc3f20969a6fc0403  next_step_runner ...  16 hours ago    Up 16 hours                            next-step-runner
8d1306b0ea3a  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79e97d8cbd27e2c7402f7e016de97ca2b1f4be27bd52a981a27e7a2132be1ef4  --role bootstrap ...  16 hours ago    Exited (143) 15 hours ago              assisted-installer
8b0cc08890b4  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f44844c4024dfa35688eac52e5e3d1540311771c4a24fef1ba4a6dccecc0e55  start --node-name...  16 hours ago    Exited (0) 16 hours ago                hungry_varahamihira
4916c14b9f7e  registry.redhat.io/rhel9/support-tools:latest                                                                           /usr/bin/bash         34 seconds ago  Up 34 seconds                          toolbox-core

 

crio pods status:

CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                 ATTEMPT             POD ID              POD
03b89032db0bc       98fc664e8c2aa859c10ec8ea740b083c7c85925d75506bcb85c6c9c640945c36                                                         13 seconds ago      Exited              etcd                 182                 5d42cdad70890       etcd-bootstrap-member-<failed-master-name>.local
01008c6e32e5a       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b38d75b297fa52d1ba29af0715cec2430cd5fda1a608ed0841a09c55c292fb3   16 hours ago        Running             coredns              0                   5f8736b856a0c       coredns-<failed-master-name>
5e00e89ebef34       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e119d0d9f8470dd634a62329d2670602c5f169d0d9bbe5ad25cee07e716c94b   16 hours ago        Exited              render-config        0                   5f8736b856a0c       coredns-<failed-master-name>
f5098d5d27a39       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2e119d0d9f8470dd634a62329d2670602c5f169d0d9bbe5ad25cee07e716c94b   16 hours ago        Running             keepalived-monitor   0                   4fb91cefa8a9e       keepalived-<failed-master-name>
a1e9d4c8cf477       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d24879d39e10fcf00a7c28ab23de1d6cf0c433a1234ff34880f12642b75d4512   16 hours ago        Running             keepalived           0                   4fb91cefa8a9e       keepalived-<failed-master-name>
de21bc99f0d3f       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c74c57f91f0f7ed26bb62f58c7b84c55750e51947fd6cc5711fa18f30b9f68c   16 hours ago        Running             etcdctl              0                   5d42cdad70890       etcd-bootstrap-member-<failed-master-name>

Description of problem:

The 4.16 OCP payload now contains only oc.rhel9; oc.rhel8 can't be found.
oc adm release extract --command='oc.rhel8'  registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64  -a tmp/config.json --to octest/
error: image did not contain usr/share/openshift/linux_amd64/oc.rhel8

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

always    

Steps to Reproduce:

    1.oc adm release extract --command='oc.rhel8'  registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-01-24-133352 --command-os=linux/amd64  -a tmp/config.json --to octest/
    2.
    3.
    

Actual results:

 failed to extract oc.rhel8   

Expected results:

 The OCP payload should contain oc.rhel8.

Additional info:

    

Description of problem:

- Investigate why the `name` and `namespace` properties are passed as arguments to the `k8sCreate` call used by the Create YAML editor function.

- Remove the `name` and `namespace` arguments from that `k8sCreate` call if it does not require a big change.

Problem:
If consoleFetchCommon takes an additional option (argument) and returns the response based on that option, as proposed in the "[Add support for returning response.header in consoleFetchCommon function|https://issues.redhat.com/browse/CONSOLE-3949]" story, the wrong and unused arguments in k8sCreate would cause consoleFetchCommon to return the entire response instead of the response body, which would break the Create Resource YAML functionality.

Code: https://github.com/openshift/console/blob/master/frontend/public/components/edit-yaml.jsx#L334
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

This change is killing payloads and thus the org is blocked at a fairly critical time.

Sample failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1783654894146686976

 

Hitting the test: [OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace

 

Possibly more.

 

Suspected to be related to https://github.com/openshift/origin/pull/28741, which merged Apr 25 20:11 UTC; the sippy DB shows the failures start showing at 21:19, and before that nothing for months.

Looks related to:

Failed to pull image "registry.redhat.io/redhat/redhat-marketplace-index:v4.16": copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/redhat/redhat-marketplace-index@sha256=7ff75c6598abd1a2abe9fa3db8a805fa552798361272b983ea07c9e9ef22d686/signature-2: unrecognized signature format, starting with binary 0x3c

We suspect there is a problem with the images and the failure may be legitimate.

Use the sippy test details to view the pass rate for the test which exposes this day by day.

Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/21

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

1) The customer tagged an image with a # (hashtag) in the tag name:

uk302-img-app-j:v0.6.12-build0000#000

2) When the customer uses OADP to back up images, they get the error below:

error excuting custom action(groupResource=imagestream.image.openshift.io namespace=dbp-p0010001, name=uk302-image-app-j): rpc error: code= Unknown= Invalid destination name udistribution-s3-c9814a92-67a4-4251-bd0d-142dfc4d3c80://dbp-p0010001/uk302-image-app-j:v0.6.12-build0000#00: invalid reference format

3) Checking the source code below, we found that there is a check on the tag name; it seems # (hashtag) is not allowed by the regexp check (see the small sketch after the excerpt):

https://github.com/openshift/openshift-velero-plugin/blob/83f5067b1e04d740cd79ee0046e24283a8d7a184/velero-plugins/imagecopy/imagestream.go#L138

func copyImage(log logr.Logger, src, dest string, copyOptions *copy.Options) ([]byte, error) {
    policyContext, err := getPolicyContext()
    if err != nil {
        return []byte{}, fmt.Errorf("Error loading trust policy: %v", err)
    }
    defer policyContext.Destroy()
    srcRef, err := alltransports.ParseImageName(src)
    if err != nil {
        return []byte{}, fmt.Errorf("Invalid source name %s: %v", src, err)
    }
    destRef, err := alltransports.ParseImageName(dest)
    if err != nil {
        return []byte{}, fmt.Errorf("Invalid destination name %s: %v", dest, err)
    }

https://github.com/containers/image/blob/main/docker/reference/regexp.go#L111

const (
    // alphaNumeric defines the alpha numeric atom, typically a
    // component of names. This only allows lower case characters and digits.
    alphaNumeric = `[a-z0-9]+`

    // separator defines the separators allowed to be embedded in name
    // components. This allow one period, one or two underscore and multiple
    // dashes. Repeated dashes and underscores are intentionally treated
    // differently. In order to support valid hostnames as name components,
    // supporting repeated dash was added. Additionally double underscore is
    // now allowed as a separator to loosen the restriction for previously
    // supported names.
    separator = `(?:[._]|__|[-]*)`

    // repository name to start with a component as defined by DomainRegexp
    // and followed by an optional port.
    domainComponent = `(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])`

    // The string counterpart for TagRegexp.
    tag = `[\w][\w.-]{0,127}`

    // The string counterpart for DigestRegexp.
    digestPat = `[A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*[:][[:xdigit:]]{32,}`

    // The string counterpart for IdentifierRegexp.
    identifier = `([a-f0-9]{64})`

    // The string counterpart for ShortIdentifierRegexp.
    shortIdentifier = `([a-f0-9]{6,64})`
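For illustration, a small check (not the velero plugin's code) that anchors the quoted tag pattern shows that a hashtag makes a tag invalid:

package main

import (
	"fmt"
	"regexp"
)

// tagPattern anchors the TagRegexp string quoted above: a word character
// followed by up to 127 word characters, dots or dashes. '#' is not in that
// class, so tags containing a hashtag do not match.
var tagPattern = regexp.MustCompile(`^[\w][\w.-]{0,127}$`)

func main() {
	for _, tag := range []string{"v0.6.12-build0000", "v0.6.12-build0000#000"} {
		fmt.Printf("%-25s allowed=%v\n", tag, tagPattern.MatchString(tag))
	}
}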
Expected results:

The customer wants to know whether this should be considered a bug: when running `oc tag`, there should be some validation of the tag name to prevent # (hashtag) or other disallowed characters from being set in the image tag, since this causes unexpected issues when using OADP or other tools.

Please have a check, thank you!

Regards
Jacob

Assisted installer agent's api_vip check doesn't accept multiple headers (src). This poses an issue when there are different ignition servers (e.g. hypershift) that expect different headers.

 

Latest use case: Hypershift's ignition server expects these headers: the NodePool name and the target config version hash.
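For illustration only, a generic Go sketch of attaching more than one header to the check request; the URL and header names below are placeholders, not the ones Hypershift actually expects:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder endpoint and header names; the real check targets the cluster
	// API VIP with whatever headers the ignition server requires.
	req, err := http.NewRequest(http.MethodGet, "https://api-vip.example.com:22623/config/worker", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Placeholder-Nodepool", "example-nodepool")
	req.Header.Set("X-Placeholder-Target-Config-Version-Hash", "abc123")
	fmt.Println(req.Header)
}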

Several oc examples are incorrect. These are used in the CLI reference docs, but would also appear in the oc CLI help.

The commands that don't work have been removed manually from the CLI reference docs via this update: https://github.com/openshift/openshift-docs/compare/9907074162999c982a8a97c45665c98913d848c9..441f3419ef460d9863a45e4c2d6914b1c019e1d1

List of commands:

  • oc adm copy-to-node --copy=new-bootstrap-kubeconfig=/etc/kubernetes/kubeconfig
  • oc adm copy-to-node --copy=new-bootstrap-kubeconfig=/etc/kubernetes/kubeconfig -l node-role.kubernetes.io/master
  • oc adm certificates regenerate-leaf -A --all
  • oc adm ocp-certificates regenerate-leaf -n openshift-config-managed kube-controller-manager-client-cert
  • oc adm certificates regenerate-machine-config-server-serving-cert --update-ignition=false
  • oc adm certificates update-ignition-ca-bundle-for-machine-config-server
  • oc adm certificates regenerate-signers -A --all
  • oc adm certificates regenerate-signers -n openshift-kube-apiserver-operator loadbalancer-serving-signer
  • oc adm ocp-certificates remove-old-trust configmaps -A --all
  • oc adm ocp-certificates remove-old-trust -n openshift-config-managed configmaps/kube-apiserver-aggregator-client-ca

For more information, see the feedback on these PRs:

Description of problem:

make verify uses the latest version of setup-envtest, regardless of what go version the repo is currently on

How reproducible:

100%

Steps to Reproduce:

Running `make verify` without a local copy of setup-envtest should cause the issue.

Actual results:

go: sigs.k8s.io/controller-runtime/tools/setup-envtest@latest: sigs.k8s.io/controller-runtime/tools/setup-envtest@v0.0.0-20240323114127-e08b286e313e requires go >= 1.22.0 (running go 1.21.7; GOTOOLCHAIN=local)
Go compliance shim [5685] [rhel-8-golang-1.21][openshift-golang-builder]: Exited with: 1    

Expected results:

make verify should be able to run without build errors

Additional info:

    

Description of problem:

    The olm-operator pod has initialization errors in the logs in a HyperShift deployment. It appears that the --writePackageServerStatusName="" passed in as an argument is being interpreted as \"\" instead of an empty string.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

$ kubectl -n master-coh67vr100a3so6e7erg logs olm-operator-75474cfd48-w2fp5

Actual results:

Several errors that look like this

time="2024-04-19T12:41:32Z" level=error msg="initialization error - failed to ensure name=\"\" - ClusterOperator.config.openshift.io \"\\\"\\\"\" is invalid: metadata.name: Invalid value: \"\\\"\\\"\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')" monitor=clusteroperator

Expected results:

    No errors

Additional info:

    

Description of the problem:

When the function ops.GetEncapsulatedMC takes too long, the 'Writing Image to Disk' host stage may time out.
This may be caused by a timed-out connection to the API VIP when fetching the ignition.

How reproducible:

When the connection to the bootstrap API VIP times out

Steps to reproduce:

Only artificially

Actual results:

When the problem happens, the 'Writing image to disk' host stage times out.

Expected results:

If such a problem happens, the host stage shouldn't time out.

Description of problem:

I've noticed that 'agent-cluster-install.yaml' and 'journal.export' from the agent gather process contain passwords. It's important not to expose password information in any of these generated files.

Version-Release number of selected component (if applicable):

4.15   

How reproducible:

Always    

Steps to Reproduce:

    1. Generate an agent ISO by utilising agent-config and install-config, including platform credentials
    2. Boot the ISO that was created
    3. Run the agent-gather command on the node 0 machine to generate files.

Actual results:

The 'agent-cluster-install.yaml' and 'journal.export' files contain the password information.

Expected results:

Password should be redacted.

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/268

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Seeing failures for SDN periodics running [sig-network][Feature:tuning] sysctl allowlist update should start a pod with custom sysctl only when the sysctl is added to whitelist [Suite:openshift/conformance/parallel] beginning with 4.16.0-0.nightly-2024-01-05-205447

sippy: sysctl allowlist update should start a pod with custom sysctl only when the sysctl is added to whitelist

  Jan  5 23:14:22.066: INFO: At 2024-01-05 23:14:09 +0000 UTC - event for testpod: {kubelet ip-10-0-54-42.us-west-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod_e2e-test-tuning-bzspr_2a9ce6e0-726d-47a6-ac64-71d430926574_0(968a55c5afd81e077b1d15a4129084d5f15002ac3ae6aa9fe32648e841940fe2): error adding pod e2e-test-tuning-bzspr_testpod to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): timed out waiting for the condition

That payload contains OCPBUGS-26222 ("Adds a wait on unix socket readiness"); not sure that is the cause, but we will investigate.

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1033

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/router/pull/546

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

In PowerVS, when I try and deploy a 4.16 cluster, I see the following:

Description of problem:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get pods -n openshift-cloud-controller-manager
NAME                                                READY   STATUS             RESTARTS      AGE
powervs-cloud-controller-manager-6b6fbcc9db-9rhtj   0/1     CrashLoopBackOff   4 (10s ago)   2m47s
powervs-cloud-controller-manager-6b6fbcc9db-wnvck   0/1     CrashLoopBackOff   3 (49s ago)   2m46s
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc logs pod/powervs-cloud-controller-manager-6b6fbcc9db-wnvck -n openshift-cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-ppc64le-2024-01-07-111144

How reproducible:

Always

Steps to Reproduce:

    1. Deploy OpenShift cluster

On the master-0 node, I see:

[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl ps -a
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                               ATTEMPT             POD ID              POD
a048556553827       ec3035a371e09312254a277d5eb9affba2930adbd4018f7557899a2f3d76bc88                                                         18 seconds ago      Exited              kube-rbac-proxy                    7                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
a326f7ec83ddb       60f5c9455518c79a9797cfbeab0b3530dae1bf77554eccc382ff12d99053efd1                                                         11 minutes ago      Running             config-sync-controllers            0                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5
ddaa6999b5b86       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:60eff87ed56ee4761fd55caa4712e6bea47dccaa11c59ba53a6d5697eacc7d32   11 minutes ago      Running             cluster-cloud-controller-manager   0                   0381a589d57cd       cluster-cloud-controller-manager-operator-94dd5b468-kxqw5

The failing pod has this as its log:

[core@rdr-hamzy-test-wdc06-fs5m2-master-0 ~]$ sudo crictl logs a048556553827
Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components
I0108 18:09:12.320332       1 flags.go:64] FLAG: --add-dir-header="false"
I0108 18:09:12.320401       1 flags.go:64] FLAG: --allow-paths="[]"
I0108 18:09:12.320413       1 flags.go:64] FLAG: --alsologtostderr="false"
I0108 18:09:12.320420       1 flags.go:64] FLAG: --auth-header-fields-enabled="false"
I0108 18:09:12.320427       1 flags.go:64] FLAG: --auth-header-groups-field-name="x-remote-groups"
I0108 18:09:12.320435       1 flags.go:64] FLAG: --auth-header-groups-field-separator="|"
I0108 18:09:12.320441       1 flags.go:64] FLAG: --auth-header-user-field-name="x-remote-user"
I0108 18:09:12.320447       1 flags.go:64] FLAG: --auth-token-audiences="[]"
I0108 18:09:12.320454       1 flags.go:64] FLAG: --client-ca-file=""
I0108 18:09:12.320460       1 flags.go:64] FLAG: --config-file="/etc/kube-rbac-proxy/config-file.yaml"
I0108 18:09:12.320467       1 flags.go:64] FLAG: --help="false"
I0108 18:09:12.320473       1 flags.go:64] FLAG: --http2-disable="false"
I0108 18:09:12.320479       1 flags.go:64] FLAG: --http2-max-concurrent-streams="100"
I0108 18:09:12.320486       1 flags.go:64] FLAG: --http2-max-size="262144"
I0108 18:09:12.320492       1 flags.go:64] FLAG: --ignore-paths="[]"
I0108 18:09:12.320500       1 flags.go:64] FLAG: --insecure-listen-address=""
I0108 18:09:12.320506       1 flags.go:64] FLAG: --kubeconfig=""
I0108 18:09:12.320512       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0108 18:09:12.320520       1 flags.go:64] FLAG: --log-dir=""
I0108 18:09:12.320526       1 flags.go:64] FLAG: --log-file=""
I0108 18:09:12.320531       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0108 18:09:12.320537       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0108 18:09:12.320543       1 flags.go:64] FLAG: --logtostderr="true"
I0108 18:09:12.320550       1 flags.go:64] FLAG: --oidc-ca-file=""
I0108 18:09:12.320556       1 flags.go:64] FLAG: --oidc-clientID=""
I0108 18:09:12.320564       1 flags.go:64] FLAG: --oidc-groups-claim="groups"
I0108 18:09:12.320570       1 flags.go:64] FLAG: --oidc-groups-prefix=""
I0108 18:09:12.320576       1 flags.go:64] FLAG: --oidc-issuer=""
I0108 18:09:12.320581       1 flags.go:64] FLAG: --oidc-sign-alg="[RS256]"
I0108 18:09:12.320590       1 flags.go:64] FLAG: --oidc-username-claim="email"
I0108 18:09:12.320595       1 flags.go:64] FLAG: --one-output="false"
I0108 18:09:12.320601       1 flags.go:64] FLAG: --proxy-endpoints-port="0"
I0108 18:09:12.320608       1 flags.go:64] FLAG: --secure-listen-address="0.0.0.0:9258"
I0108 18:09:12.320614       1 flags.go:64] FLAG: --skip-headers="false"
I0108 18:09:12.320620       1 flags.go:64] FLAG: --skip-log-headers="false"
I0108 18:09:12.320626       1 flags.go:64] FLAG: --stderrthreshold="2"
I0108 18:09:12.320631       1 flags.go:64] FLAG: --tls-cert-file="/etc/tls/private/tls.crt"
I0108 18:09:12.320637       1 flags.go:64] FLAG: --tls-cipher-suites="[TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305]"
I0108 18:09:12.320654       1 flags.go:64] FLAG: --tls-min-version="VersionTLS12"
I0108 18:09:12.320661       1 flags.go:64] FLAG: --tls-private-key-file="/etc/tls/private/tls.key"
I0108 18:09:12.320667       1 flags.go:64] FLAG: --tls-reload-interval="1m0s"
I0108 18:09:12.320674       1 flags.go:64] FLAG: --upstream="http://127.0.0.1:9257/"
I0108 18:09:12.320681       1 flags.go:64] FLAG: --upstream-ca-file=""
I0108 18:09:12.320686       1 flags.go:64] FLAG: --upstream-client-cert-file=""
I0108 18:09:12.320692       1 flags.go:64] FLAG: --upstream-client-key-file=""
I0108 18:09:12.320697       1 flags.go:64] FLAG: --upstream-force-h2c="false"
I0108 18:09:12.320703       1 flags.go:64] FLAG: --v="3"
I0108 18:09:12.320709       1 flags.go:64] FLAG: --version="false"
I0108 18:09:12.320719       1 flags.go:64] FLAG: --vmodule=""
I0108 18:09:12.320735       1 kube-rbac-proxy.go:578] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0108 18:09:12.321427       1 kube-rbac-proxy.go:285] Valid token audiences: 
I0108 18:09:12.321473       1 kube-rbac-proxy.go:399] Reading certificate files
E0108 18:09:12.321519       1 run.go:74] "command failed" err="failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory"

When I describe the pod, I see:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe pod/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj -n openshift-cloud-controller-manager
Name:                 powervs-cloud-controller-manager-6b6fbcc9db-9rhtj
Namespace:            openshift-cloud-controller-manager
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cloud-controller-manager
Node:                 rdr-hamzy-test-wdc06-fs5m2-master-2/
Start Time:           Mon, 08 Jan 2024 11:57:45 -0600
Labels:               infrastructure.openshift.io/cloud-controller-manager=PowerVS
                      k8s-app=powervs-cloud-controller-manager
                      pod-template-hash=6b6fbcc9db
Annotations:          operator.openshift.io/config-hash: 09205e81b4dc20086c29ddbdd3fccc29a675be94b2779756a0e748dd9ba91e40
Status:               Running
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/powervs-cloud-controller-manager-6b6fbcc9db
Containers:
  cloud-controller-manager:
    Container ID:  cri-o://4365a326d05ecaac8e4114efabb4a46e01a308459ad30438d742b4829c24a717
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09
    Image ID:      65401afa73528f9a425a9d7f5dee8a9de8d9d3d82c8fd84cd653b16409093836
    Port:          10258/TCP
    Host Port:     10258/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      fi
      exec /bin/ibm-cloud-controller-manager \
      --bind-address=$(POD_IP_ADDRESS) \
      --use-service-account-credentials=true \
      --configure-cloud-routes=false \
      --cloud-provider=ibm \
      --cloud-config=/etc/ibm/cloud.conf \
      --profiling=false \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager \
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \
      --v=2
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 08 Jan 2024 12:35:12 -0600
      Finished:     Mon, 08 Jan 2024 12:35:12 -0600
    Ready:          False
    Restart Count:  12
    Requests:
      cpu:     75m
      memory:  60Mi
    Liveness:  http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3
    Environment:
      POD_IP_ADDRESS:               (v1:status.podIP)
      VPCCTL_CLOUD_CONFIG:         /etc/ibm/cloud.conf
      ENABLE_VPC_PUBLIC_ENDPOINT:  true
    Mounts:
      /etc/ibm from cloud-conf (rw)
      /etc/kubernetes from host-etc-kube (ro)
      /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
      /etc/vpc from ibm-cloud-credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z5xdm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ccm-trusted-ca
    Optional:  false
  host-etc-kube:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:  Directory
  cloud-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cloud-conf
    Optional:  false
  ibm-cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ibm-cloud-credentials
    Optional:    false
  kube-api-access-z5xdm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  38m                    default-scheduler  Successfully assigned openshift-cloud-controller-manager/powervs-cloud-controller-manager-6b6fbcc9db-9rhtj to rdr-hamzy-test-wdc06-fs5m2-master-2
  Normal   Pulling    38m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09"
  Normal   Pulled     37m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" in 36.694s (36.694s including waiting)
  Normal   Started    36m (x4 over 37m)      kubelet            Started container cloud-controller-manager
  Normal   Created    35m (x5 over 37m)      kubelet            Created container cloud-controller-manager
  Normal   Pulled     35m (x4 over 37m)      kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dd2cf78ddeed971d38731d27ce293501547b960cefc3aadaa220186eded8a09" already present on machine
  Warning  BackOff    2m57s (x166 over 37m)  kubelet            Back-off restarting failed container cloud-controller-manager in pod powervs-cloud-controller-manager-6b6fbcc9db-9rhtj_openshift-cloud-controller-manager(bf58b824-b1a2-4d2e-8735-22723642a24a)

Description of problem:

    Sometimes the prometheus-operator's informer will be stuck because it receives objects that can't be converted to *v1.PartialObjectMetadata.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Not always

Steps to Reproduce:

    1. Unknown
    2.
    3.
    

Actual results:

    prometheus-operator logs show errors like

2024-02-09T08:29:35.478550608Z level=warn ts=2024-02-09T08:29:35.478491797Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
2024-02-09T08:29:35.478592909Z level=error ts=2024-02-09T08:29:35.478541608Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"

Expected results:

    No error

Additional info:

    The bug has been introduced in v0.70.0 by https://github.com/prometheus-operator/prometheus-operator/pull/5993 so it only affects 4.16 and 4.15.

Description of problem:

When running the 4.15 installer full-function tests, the arm64 instance family below was detected and verified; it needs to be appended to the installer doc [1]:
- standardBpsv2Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_aarch64.md

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When scaling from zero replicas, the cluster autoscaler can panic if there are taints on the machineset with no "value" field defined.

    

Version-Release number of selected component (if applicable):

4.16/master
    

How reproducible:

always
    

Steps to Reproduce:

    1. create a machineset with a taint that has no value field and 0 replicas
    2. enable the cluster autoscaler
    3. force a workload to scale the tainted machineset
    

Actual results:

a panic like this is observed

I0325 15:36:38.314276       1 clusterapi_provider.go:68] discovered node group: MachineSet/openshift-machine-api/k8hmbsmz-c2483-9dnddr4sjc (min: 0, max: 2, replicas: 0)
panic: interface conversion: interface {} is nil, not string

goroutine 79 [running]:
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.unstructuredToTaint(...)
	/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_unstructured.go:246
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.unstructuredScalableResource.Taints({0xc000103d40?, 0xc000121360?, 0xc002386f98?, 0x2?})
	/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_unstructured.go:214 +0x8a5
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi.(*nodegroup).TemplateNodeInfo(0xc002675930)
	/go/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go:266 +0x2ea
k8s.io/autoscaler/cluster-autoscaler/core/utils.GetNodeInfoFromTemplate({0x276b230, 0xc002675930}, {0xc001bf2c00, 0x10, 0x10}, {0xc0023ffe60?, 0xc0023ffe90?})
	/go/src/k8s.io/autoscaler/cluster-autoscaler/core/utils/utils.go:41 +0x9d
k8s.io/autoscaler/cluster-autoscaler/processors/nodeinfosprovider.(*MixedTemplateNodeInfoProvider).Process(0xc00084f848, 0xc0023f7680, {0xc001dcdb00, 0x3, 0x0?}, {0xc001bf2c00, 0x10, 0x10}, {0xc0023ffe60, 0xc0023ffe90}, ...)
	/go/src/k8s.io/autoscaler/cluster-autoscaler/processors/nodeinfosprovider/mixed_nodeinfos_processor.go:155 +0x599
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000617550, {0x4?, 0x0?, 0x3a56f60?})
	/go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:352 +0xcaa
main.run(0x0?, {0x2761b48, 0xc0004c04e0})
	/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:529 +0x2cd
main.main.func2({0x0?, 0x0?})
	/go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:617 +0x25
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:213 +0x105
    

Expected results:

expect the machineset to scale up
    

Additional info:
I think the e2e test that exercises this is only running on periodic jobs, and as such we missed this error in OCPBUGS-27509.

This search shows some failed results.
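For reference, a minimal sketch (not the autoscaler's actual code) of the comma-ok type assertion that avoids this panic when a taint has no value field:

package main

import "fmt"

// unstructuredTaintValues uses the comma-ok form of the type assertion, which
// yields "" for a missing "value" field instead of panicking with
// "interface conversion: interface {} is nil, not string".
func unstructuredTaintValues(m map[string]interface{}) (key, value, effect string) {
	key, _ = m["key"].(string)
	value, _ = m["value"].(string) // nil -> "", no panic
	effect, _ = m["effect"].(string)
	return
}

func main() {
	taint := map[string]interface{}{"key": "dedicated", "effect": "NoSchedule"} // no "value"
	k, v, e := unstructuredTaintValues(taint)
	fmt.Printf("key=%q value=%q effect=%q\n", k, v, e)
}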

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/48

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/51

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/393

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/origin/pull/28452

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Since the golang.org/x/oauth2 package has been upgraded, GCP installs have been failing with 

level=info msg=Credentials loaded from environment variable "GOOGLE_CLOUD_KEYFILE_JSON", file "/var/run/secrets/ci.openshift.io/cluster-profile/gce.json"
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.gcp.project: Internal error: failed to create cloud resource service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused, : Internal error: failed to create compute service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused]

Version-Release number of selected component (if applicable):

    4.16/master

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    The bump has been introduced by https://github.com/openshift/installer/pull/8020

Rukpak – an alpha tech preview API – has pushed a breaking change upstream. This bug tracks the need for us to disable and then re-enable the cluster-olm-operator and platform-operators components, which both depend on rukpak, in order to push the breaking API change. This bug can be closed once those components are all updated and available on the cluster again.

Description of problem:

    While doing the migration of the Pipeline details page, it expects customData from the Details page - https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pipelines/pipeline-metrics/PipelineMetrics.tsx - but the horizontalnav component exposed to dynamic plugins does not have a customData prop. https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#horizontalnav

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. The Pipeline details page PR needs to be up for testing this [WIP][Story - https://issues.redhat.com/browse/ODC-7525]
    2. Install the Pipelines Operator and don't install Tekton Results
    3. Enable the Pipeline details page in the dynamic plugin
    4. Create a pipeline and go to the Metrics tab on the details page

    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When migrating an OpenShift cluster to Azure AD Workload Identity, the process does not have sufficient permissions to apply the Azure Pod Identity webhook configuration.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

   Always

Steps to Reproduce:

    1. Follow the steps provided in the documentation: https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md#steps-to-in-place-migrate-an-openshift-cluster-to-azure-ad-workload-identity
    2. At step 10, applying the Azure Pod Identity webhook configuration fails.

Actual results:

For step 10:
[hmx@fedora CCO]$ oc replace -f ./CCO-456/output_dir/manifests/azure-ad-pod-identity-webhook-config.yaml
 Error from server (NotFound): error when replacing "./CCO-456/output_dir/manifests/azure-ad-pod-identity-webhook-config.yaml": secrets "azure-credentials" not found
   
[hmx@fedora CCO]$ oc get po -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-594bf555b4-6srcq   2/2     Running   0          3h32m

[hmx@fedora CCO]$ oc logs cloud-credential-operator-594bf555b4-6srcq -n openshift-cloud-credential-operator
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, cloud-credential-operator
Flag --logtostderr has been deprecated, will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components
I0410 06:41:25.490507       1 kube-rbac-proxy.go:285] Valid token audiences: 
I0410 06:41:25.490752       1 kube-rbac-proxy.go:399] Reading certificate files
I0410 06:41:25.491607       1 kube-rbac-proxy.go:447] Starting TCP socket on 0.0.0.0:8443
I0410 06:41:25.492241       1 kube-rbac-proxy.go:454] Listening securely on 0.0.0.0:8443
E0410 06:41:52.996659       1 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
E0410 06:41:52.997568       1 auth.go:47] Unable to authenticate the request due to an error: Unauthorized
E0410 06:42:15.871706       1 webhook.go:154] Failed to make webhook authenticator request: Unauthorized
E0410 06:42:15.871754       1 auth.go:47] Unable to authenticate the request due to an error: Unauthorized

Expected results:

    Apply the azure pod identity webhook configuration successfully.

Additional info:

    

Description of problem:

    We need to make the controllerAvailabilityPolicy field immutable in the HostedCluster spec section to ensure the customer cannot switch from SingleReplica to HighAvailability or vice versa.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When expanding a PVC of unit-less size (e.g., '2147483648'), the Expand PersistentVolumeClaim modal populates the spinner with a unit-less value (e.g., 2147483648) instead of a meaningful value.

Version-Release number of selected component (if applicable):

CNV - 4.14.3

How reproducible:

always

Steps to Reproduce:

1.Create a PVC using the following YAML.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:   
  name: task-pv-claim
spec: 
  storageClassName: gp3-csi
  accessModes:     
    - ReadWriteOnce
  resources: 
    requests:       
      storage: "2147483648" 
apiVersion: v1
kind: Pod
metadata:   
  name: task-pv-pod
spec:   
  securityContext:     
    runAsNonRoot: true
    seccompProfile:       
      type: RuntimeDefault
  volumes:     
    - name: task-pv-storage
      persistentVolumeClaim:         
        claimName: task-pv-claim
  containers:     
    - name: task-pv-container
      image: nginx
      ports:         
        - containerPort: 80
          name: "http-server"
      volumeMounts:         
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage

2. From the newly created PVC details page, Click Actions > Expand PVC.
3. Note the value in the spinner input.

See https://drive.google.com/file/d/1toastX8rCBtUzx5M-83c9Xxe5iPA8fNQ/view for a demo
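For illustration, a small Go sketch using the apimachinery resource package (a suggestion of how the value could be normalized for display; not the console's actual code) shows the unit-less request is simply 2 GiB:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "2147483648" parses as a plain byte count; the modal could show it as 2 GiB
	// (or accept/emit "2Gi") instead of echoing the raw number.
	q := resource.MustParse("2147483648")
	gib := float64(q.Value()) / float64(1<<30)
	fmt.Printf("%s bytes = %.0f GiB\n", q.String(), gib)
}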

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/47

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    If a cluster is installed using a proxy and the username used for connecting to the proxy contains the characters "%40" (encoding a "@" when providing a domain), the installation fails. The failure is because the proxy variables implemented in the file "/etc/systemd/system.conf.d/10-default-env.conf" on the bootstrap node are ignored by systemd. This issue seems to have already been fixed in MCO (BZ 1882674 - fixed in RHOCP 4.7), but it looks like it affects the bootstrap process in 4.13 and 4.14, causing the installation to not start at all.

Version-Release number of selected component (if applicable):

    4.14, 4.13

How reproducible:

    100% always

Steps to Reproduce:

    1. Create an install-config.yaml file with "%40" in the middle of the username used for the proxy.
    2. Start the cluster installation.
    3. The bootstrap will fail because the proxy variables are not used.
    

Actual results:

Installation fails because systemd fails to load the proxy variables if "%" is present in the username.

Expected results:

    Installation should succeed using a username with "%40" for the proxy.

Additional info:

File "/etc/systemd/system.conf.d/10-default-env.conf" for the bootstrap should be generated in a way accepted by systemd.    

Description of the problem:

Per the latest decision, RH is not going to support installing an OCP cluster on vSphere with nested virtualization. Thus the checkbox "Install OpenShift Virtualization" on the "Operators" page should be disabled when the "vSphere" platform is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

 

Description of problem:

The Dev console BuildConfig creation got a [the server does not allow this method on the requested resource] error when metadata.namespace is not set.

How reproducible:

The test case is shown below.

Steps to Reproduce:

Use the YAML below to create a BuildConfig in the GUI page of the
OpenShift console -> Developer -> Builds -> Create BuildConfig -> YAML view


  
~~~
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: mywebsite
  labels:
    name: mywebsite
spec:
  triggers:
  - type: ImageChange
    imageChange: {}
  - type: ConfigChange
  source:
    type: Git
    git:
      uri: https://github.com/monodot/container-up
    contextDir: httpd-hello-world
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile
      from:
        kind: ImageStreamTag
        name: httpd:latest 
        namespace: testbuild
  output:
    to:
      kind: ImageStreamTag
      name: mywebsite:latest 
~~~

Actual results:

Got a [the server does not allow this method on the requested resource] error.

Expected results:

We found that not setting metadata.namespace in CLI mode does not trigger this error, and the customer reports that the 4.11 GUI console also did not trigger it. Does that mean the code changed in 4.13?

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/16

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

As an openshift developer, I want to remove the image openshift-proxy-pull-test-container from the build, so we will not be affected by the possible bugs during the image build.

we requested the ART team to add this image in the ticket https://issues.redhat.com/browse/ART-2961

 

AWS single-node jobs are failing starting with https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.nightly/release/4.16.0-0.nightly-2024-03-27-123853

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial

 

A bunch of operators are degraded; I did notice this but am still investigating:

 

    - lastTransitionTime: '2024-03-27T15:56:02Z'
      message: 'OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve
        route from cache: route.route.openshift.io "oauth-openshift" not found        OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.201.206:443/healthz":
        dial tcp 172.30.201.206:443: connect: connection refused        OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints
        "oauth-openshift" not found        ReadyIngressNodesAvailable: Authentication requires functional ingress which
        requires at least one schedulable and ready node. Got 0 worker nodes, 1 master
        nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).        WellKnownAvailable: The well-known endpoint is not yet available: failed to
        get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap:
        configmap "oauth-openshift" not found (check authentication operator, it is
        supposed to create this)' 

Description of the problem:

Some requests of this type:

  {"x_request_id":"05e66411-7612-46bb-86c2-69bf7096b6da","protocol":"HTTP/1.1","authority":"zoscaru4s08w1mz.api.openshift.com","user_agent":"Go-http-client/2.0","method":"GET","response_flags":"UC","x_forwarded_for":"163.244.72.2,10.128.10.16,23.21.192.204","bytes_rx":0,"duration":13,"bytes_tx":95,"response_code":503,"timestamp":"2024-01-31T15:52:41.418Z","upstream_duration":null,"path":"/api/assisted-install/v2/infra-envs/84596f7d-0138-4f57-ada4-be72aea031a5/hosts/62148a92-8588-b591-5f7e-046bf1136b3b/instructions?timestamp=1706716359"}

are causing 503s because the application crashes. Fortunately only a goroutine crashes, so the main loop is still going and other requests seem unaffected.

How reproducible:

Not sure what conditions we need to meet, but plenty of requests of this type can be found in the prod logs.

 

Steps to reproduce:

1.

2.

3.

Actual results:

2024/01/31 16:41:31 http: panic serving 127.0.0.1:39486: runtime error: invalid memory address or nil pointer dereference
goroutine 575931 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1854 +0xbf
panic({0x4369120, 0x6bee680})
	/usr/local/go/src/runtime/panic.go:890 +0x263
github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).getTangServersFromHostIgnition(0x34?, 0x484da60?)
	/assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:41 +0x3e
github.com/openshift/assisted-service/internal/host/hostcommands.(*tangConnectivityCheckCmd).GetSteps(0xc000a5c440, {0xc00057c1c8?, 0x48adb49?}, 0xc000f5d180)
	/assisted-service/internal/host/hostcommands/tang_connectivity_check_cmd.go:104 +0x112
github.com/openshift/assisted-service/internal/host/hostcommands.(*InstructionManager).GetNextSteps(0xc000ad0780, {0x50cf558, 0xc00444b4a0}, 0xc000f5d180)
	/assisted-service/internal/host/hostcommands/instruction_manager.go:178 +0xa2f
github.com/openshift/assisted-service/internal/host.(*Manager).GetNextSteps(0xc0003e4990?, {0x50cf558?, 0xc00444b4a0?}, 0x0?)
	/assisted-service/internal/host/host.go:548 +0x48
github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).V2GetNextSteps(0xc000da4800, {0x50cf558, 0xc00444b4a0}, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158})
	/assisted-service/internal/bminventory/inventory.go:5357 +0x1b8
github.com/openshift/assisted-service/restapi.HandlerAPI.func54({0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980?, 0xc000fb67e0?})
	/assisted-service/restapi/configure_assisted_install.go:654 +0xf4
github.com/openshift/assisted-service/restapi/operations/installer.V2GetNextStepsHandlerFunc.Handle(0xc00203b300?, {0xc00203b500, 0xc002b9f460, {0xc001daa690, 0x24}, {0xc001daa6f0, 0x24}, 0xc0015c8158}, {0x408d980, 0xc000fb67e0})
	/assisted-service/restapi/operations/installer/v2_get_next_steps.go:19 +0x7a
github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetNextSteps).ServeHTTP(0xc00169e468, {0x50bfe00, 0xc001399d20}, 0xc00203b500)
	/assisted-service/restapi/operations/installer/v2_get_next_steps.go:66 +0x2dd
github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x50bfe00, 0xc001399d20}, 0xc00203b500)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59
net/http.HandlerFunc.ServeHTTP(0x0?, {0x50bfe00?, 0xc001399d20?}, 0x17334b7?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/internal/metrics.Handler.func1.1()
	/assisted-service/internal/metrics/reporter.go:37 +0x31
github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x50bfdd0, 0xc0014ce0e0}, {0x48922ea, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x50d23c0, ...}, ...)
	/assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e
github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x50cdea0?, 0xc0005ec770}, 0xc00203b500)
	/assisted-service/internal/metrics/reporter.go:36 +0x35f
net/http.HandlerFunc.ServeHTTP(0x50cf558?, {0x50cdea0?, 0xc0005ec770?}, 0xc002b9ee80?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x50cdea0, 0xc0005ec770}, 0xc00203b400)
	/assisted-service/pkg/context/param.go:95 +0xc8
net/http.HandlerFunc.ServeHTTP(0xc001735430?, {0x50cdea0?, 0xc0005ec770?}, 0xc0026625f8?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.NewRouter.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257
net/http.HandlerFunc.ServeHTTP(0x7fc8dc2c3820?, {0x50cdea0?, 0xc0005ec770?}, 0xc00009f000?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.Redoc.func1({0x50cdea0, 0xc0005ec770}, 0x418b480?)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242
net/http.HandlerFunc.ServeHTTP(0x1?, {0x50cdea0?, 0xc0005ec770?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/go-openapi/runtime/middleware.Spec.func1({0x50cdea0, 0xc0005ec770}, 0x486ac6f?)
	/assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c
net/http.HandlerFunc.ServeHTTP(0xc001136380?, {0x50cdea0?, 0xc0005ec770?}, 0xc00203b200?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/rs/cors.(*Cors).Handler.func1({0x50cdea0, 0xc0005ec770}, 0xc00203b200)
	/assisted-service/vendor/github.com/rs/cors/cors.go:281 +0x1c4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x50cdea0?, 0xc0005ec770?}, 0x4?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x50cd990, 0xc0012e8540}, 0xc0017358f0?)
	/assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e
net/http.HandlerFunc.ServeHTTP(0x100c0017359e8?, {0x50cd990?, 0xc0012e8540?}, 0x10?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x16f0a19?)
	/assisted-service/pkg/app/middleware.go:32 +0xb0
net/http.HandlerFunc.ServeHTTP(0xc000a26900?, {0x50cd990?, 0xc0012e8540?}, 0xc00444a7e0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0x5094901?)
	/assisted-service/pkg/app/middleware.go:55 +0x162
net/http.HandlerFunc.ServeHTTP(0x50cf4b0?, {0x50cd990?, 0xc0012e8540?}, 0x5094980?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x50aa9c0?, 0xc0008a4eb0?}}, {0x50cd990, 0xc0012e8540}, 0xc00203b100)
	/assisted-service/pkg/requestid/requestid.go:69 +0x1ad
github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x50cd990?, 0xc0012e8540?}, 0xc00203b100?)
	/assisted-service/internal/spec/spec.go:38 +0x9b
net/http.HandlerFunc.ServeHTTP(0xc00124ec35?, {0x50cd990?, 0xc0012e8540?}, 0x170a0ce?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc002cb8f30?}, {0x50cd990, 0xc0012e8540}, 0xc00203b100)
	/usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc00189d320, {0x50cf558, 0xc0011fc240})
	/usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3089 +0x5ed 

Expected results:

Description of problem:

Files like ovs-if-br-ex.nmconnection.J1K8B2 break ovs-configuration.service. Deleting the file fixes the issue.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Looks like we're accidentally passing the JavaScript `window.alert()` method instead of the Prometheus alert object.

Description of the problem:
This issue was discovered while trying to verify the bug opened by Javipolo
MGMT-16966

It seems that the extra-reboot avoidance is not working properly for a 4.15 cluster with a partition on the installation disk.

Here are 3 clusters:
1) test-infra-cluster-7a4cb4cc OCP 4.15 with partition on installation disk
2) test-infra-cluster-066749e2 OCP 4.14 with partition on installation disk
3) test-infra-cluster-f9051a36 OCP 4.15 without partition on installation disk

I see the following indications:
1) test-infra-cluster-7a4cb4cc

2/19/2024, 9:19:53 PM	Node test-infra-cluster-7a4cb4cc-worker-0 has been rebooted 1 times before completing installation
2/19/2024, 9:19:51 PM	Node test-infra-cluster-7a4cb4cc-master-2 has been rebooted 2 times before completing installation
2/19/2024, 9:19:08 PM	Node test-infra-cluster-7a4cb4cc-worker-1 has been rebooted 1 times before completing installation
2/19/2024, 8:49:59 PM	Node test-infra-cluster-7a4cb4cc-master-1 has been rebooted 2 times before completing installation
2/19/2024, 8:49:55 PM	Node test-infra-cluster-7a4cb4cc-master-0 has been rebooted 2 times before completing installation

2) test-infra-cluster-066749e2

2/19/2024, 8:32:36 PM	Node test-infra-cluster-066749e2-master-2 has been rebooted 2 times before completing installation
2/19/2024, 8:32:35 PM	Node test-infra-cluster-066749e2-worker-1 has been rebooted 2 times before completing installation
2/19/2024, 8:32:31 PM	Node test-infra-cluster-066749e2-worker-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:26 PM	Node test-infra-cluster-066749e2-master-0 has been rebooted 2 times before completing installation
2/19/2024, 8:05:25 PM	Node test-infra-cluster-066749e2-master-1 has been rebooted 2 times before completing installation

3) test-infra-cluster-f9051a36

2/18/2024, 5:13:49 PM	Node test-infra-cluster-f9051a36-worker-1 has been rebooted 1 times before completing installation
2/18/2024, 5:10:10 PM	Node test-infra-cluster-f9051a36-worker-0 has been rebooted 1 times before completing installation
2/18/2024, 5:08:46 PM	Node test-infra-cluster-f9051a36-worker-2 has been rebooted 1 times before completing installation
2/18/2024, 5:03:12 PM	Node test-infra-cluster-f9051a36-master-1 has been rebooted 1 times before completing installation
2/18/2024, 4:33:39 PM	Node test-infra-cluster-f9051a36-master-2 has been rebooted 1 times before completing installation
2/18/2024, 4:33:38 PM	Node test-infra-cluster-f9051a36-master-0 has been rebooted 1 times before completing installation

According to Ori's analysis, it seems the MCO reboot skip did not happen for the masters; the ignition was not accessible:

Feb 19 18:38:19 test-infra-cluster-7a4cb4cc-master-0 installer[3403]: time="2024-02-19T18:38:19Z" level=warning msg="failed getting encapsulated machine config. Continuing installation without skipping MCO reboot" error="failed after 240 attempts, last error: unexpected end of JSON input"

How reproducible:

 

Steps to reproduce:

1. Create a cluster with 4.15

2. Add a custom manifest which modifies the ignition to create a partition on the disk

3. Start the installation

Actual results:

It seems that the reboot avoidance did not work properly.

Expected results:

Description of problem:

The `oc adm inspect` command should not collect the previous.log files that do not correspond to the --since/--since-time options.

Version-Release number of selected component (if applicable):

  4.16  

How reproducible:

    always

Steps to Reproduce:

    1.  `oc adm inspect --since-time="2024-01-25T01:35:27Z"  ns/openshift-multus`     

Actual results:

    It also collects the previous.log, which does not correspond to the specified time.

Expected results:

    Only collect the logs after the --since/--since-time.

Additional info:

    

Description of problem:

/var/log/etcd/etcd-health-probe.log exists on control plane nodes, but the code only touches it:
https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L26

etcd's /var/log/etcd/etcd-health-probe.log can be mistaken for an audit log, because there are audit logs in the same directory tree for the apiserver and auth:
/var/log/kube-apiserver/audit-2024-03-21T04-27-49.470.log
/var/log/oauth-server/audit.log

etcd-health-probe.log will confuse users.
    How reproducible: always

    Steps to Reproduce:
    1. Log in to a control plane node
    2. Check /var/log/etcd/etcd-health-probe.log
    3. The file size is always zero

    Actual results:

    An always-empty etcd-health-probe.log is present and can be mistaken for an audit log

    Expected results: remove this file in the code / don't touch this file

    

Additional info:


    

Please review the following PR: https://github.com/openshift/console-operator/pull/823

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Adding automountServiceAccountToken: false to the pod spec removes the SA token in the ovnkube-control-plane pod; this causes it to crash with the following error:

F1212 12:18:13.705048 1 ovnkube.go:136] unable to create kubernetes rest config, err: TLS-secured apiservers require token/cert and CA certificate.

This error is misleading, as the pod does not use the KAS and does not need the SA token.
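For reference, a minimal sketch of the change from the reproduction steps below (an assumed excerpt of the ovnkube-control-plane deployment's pod template, not the full manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ovnkube-control-plane
  namespace: openshift-ovn-kubernetes
spec:
  template:
    spec:
      # disabling SA token automount is what triggers the misleading crash
      automountServiceAccountToken: false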

 

 

Version-Release number of selected component (if applicable):

4.15    

How reproducible:

100%    

Steps to Reproduce:

    1. Add automountServiceAccountToken: false to pod spec in ovnkube-control-plane deployment
    2. Check new pod for error
    3.
    

Actual results:

pod crashes with error:     unable to create kubernetes rest config, err: TLS-secured apiservers require token/cert and CA certificate.

Expected results:

 pod runs without issues

Additional info:

    

Description of problem:

There is a new zone in PowerVS called dal12.  We need to add this zone to the list of supported zones in the installer.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Always
    

Steps to Reproduce:

    1. Deploy OpenShift cluster to the zone
    2.
    3.
    

Actual results:

Fails
    

Expected results:

Works
    

Additional info:


    

Description of problem

The egress-router implementations under https://github.com/openshift/images/tree/master/egress have unit tests alongside the implementations, within the same repository, but the repository does not have a CI job to run those unit tests. We do not have any tests for egress-router in https://github.com/openshift/origin. This means that we are effectively lacking CI test coverage for egress-router.

Version-Release number of selected component (if applicable)

All versions.

How reproducible

100%.

Steps to Reproduce

1. Open a PR in https://github.com/openshift/images and check which CI jobs are run on it.
2. Check the job definitions in https://github.com/openshift/release/blob/master/ci-operator/jobs/openshift/images/openshift-images-master-presubmits.yaml.

Actual results

There are "ci/prow/e2e-aws", "ci/prow/e2e-aws-upgrade", and "ci/prow/images" jobs defined, but no "ci/prow/unit" job.

Expected results

There should be a "ci/prow/unit" job, and this job should run the unit tests that are defined in the repository.

Additional info

The lack of a CI job came up on https://github.com/openshift/images/pull/162.
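A minimal sketch of what the missing presubmit could look like in the repository's ci-operator configuration (the stanza shape follows the usual ci-operator test entry; the exact command is an assumption and would need to match whatever target runs the unit tests in openshift/images):

tests:
- as: unit
  # assumed unit-test entry point; the repository may use a different target
  commands: make test
  container:
    from: src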

Description of problem:

The konnectivity-agent on the data plane needs to resolve its proxy-server-url to connect to the control plane's konnectivity server. Also, these agents are using the default dnsPolicy, which is ClusterFirst.

This creates a dependency on CoreDNS. If CoreDNS is misconfigured or down, agents won't be able to connect to the server, and all konnectivity-related traffic goes down (blocking updates, webhooks, logs, etc.).

The correction would be to use dnsPolicy: Default in the konnectivity-agent daemonset on the data plane, so it would use the name resolution configuration from the node.

This makes sure that the konnectivity-agent's proxy-server-url can be resolved even if CoreDNS is down or misconfigured.

The konnectivity-agent control plane deployment shall not change, as it still needs to use CoreDNS: in that case a ClusterIP Service is configured as the proxy-server-url.
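A minimal sketch of the proposed data plane change (an assumed excerpt of the konnectivity-agent daemonset pod template in kube-system, not the full manifest):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: konnectivity-agent
  namespace: kube-system
spec:
  template:
    spec:
      # use the node's resolver instead of ClusterFirst so the
      # proxy-server-url resolves even when CoreDNS is down
      dnsPolicy: Default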

Version-Release number of selected component (if applicable):

4.14, 4.15

    

How reproducible:

Break coreDNS configuration

Steps to Reproduce:

1. Put an invalid forwarder to the dns.operator/default to fail upstream DNS resolving
2. Rollout restart the konnectivity-agent daemonset in kube-system

Actual results:

kubectl logs is failing

Expected results:

kubectl logs is working

Additional info:

 

Description of problem:
While upgrading a loaded 250-node ROSA cluster from 4.13.13 to 4.14.rc2, the cluster failed to upgrade and was stuck while the network operator was trying to upgrade.
Around 20 multus pods were in CrashLoopBackOff state with the log

oc logs multus-4px8t
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 to /host/opt/cni/bin/
2023-10-10T00:54:34Z [verbose] multus-daemon started
2023-10-10T00:54:34Z [verbose] Readiness Indicator file check
2023-10-10T00:55:19Z [error] have you checked that your default network is ready? still waiting for readinessindicatorfile @ /host/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition

Description of problem:

Since many 4.y releases ago, before 4.11 and therefore in all the minor versions that are still supported, CRI-O has wiped images when it comes up after a node reboot and notices it has a new (minor?) version. This causes redundant pulls, as seen in this 4.11-to-4.12 update run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1732741139229839360/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes/ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4/journal | zgrep 'Starting update from rendered-\|crio-wipe\|Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2'
Dec 07 13:05:42.474144 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
Dec 07 13:05:42.481470 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 191ms CPU time
Dec 07 13:59:51.000686 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1498]: time="2023-12-07 13:59:51.000591203Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=a62bc972-67d7-401a-9640-884430bd16f1 name=/runtime.v1.ImageService/PullImage
Dec 07 14:00:55.745095 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 root[101294]: machine-config-daemon[99469]: Starting update from rendered-worker-ca36a33a83d49b43ed000fd422e09838 to rendered-worker-c0b3b4eadfe6cdfb595b97fa293a9204: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Dec 07 14:05:33.274241 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
Dec 07 14:05:33.289605 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 216ms CPU time
Dec 07 14:14:50.277011 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1573]: time="2023-12-07 14:14:50.276961087Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=1a092fbd-7ffa-475a-b0b7-0ab115dbe173 name=/runtime.v1.ImageService/PullImage

The redundant pulls cost network and disk traffic, and avoiding them should make those update-initiated reboots quicker and cheaper. The lack of update-initiated wipes is not expected to cost much, because the Kubelet's old-image garbage collection should be along to clear out any no-longer-used images if disk space gets tight.

Version-Release number of selected component (if applicable):

At least 4.11. Possibly older 4.y; I haven't checked.

How reproducible:

Every time.

Steps to Reproduce:

1. Install a cluster.
2. Update to a release image with a different CRI-O (minor?) version.
3. Check logs on the nodes.

Actual results:

crio-wipe entries in the logs, with reports of target-release images being pulled before and after those wipes, as I quoted in the Description.

Expected results:

Target-release images pulled before the reboot, and found in the local cache if that image is needed again post-reboot.

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/45

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On an IPv6-primary dualstack cluster, creating an IPv6 egressIP following this procedure:

https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html

does not work. ovnkube-cluster-manager shows the error below:

2024-01-16T14:48:18.156140746Z I0116 14:48:18.156053       1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:48:18.161367817Z I0116 14:48:18.161269       1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:48:18.161416023Z I0116 14:48:18.161357       1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:37.714410622Z I0116 14:49:37.714342       1 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.Service total 8 items received
2024-01-16T14:49:48.155826915Z I0116 14:49:48.155330       1 obj_retry.go:296] Retry object setup: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.156172766Z I0116 14:49:48.155899       1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.168795734Z I0116 14:49:48.168520       1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:48.169400971Z I0116 14:49:48.168937       1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]

The same is observed with an IPv6 subnet in SLAAC mode.

Version-Release number of selected component (if applicable):

  • 4.15.0-0.nightly-2024-01-06-062415
  • RHOS-16.2-RHEL-8-20230510.n.1

How reproducible: Always.
Steps to Reproduce:

Applying below:

$ oc label node/ostest-8zrlf-worker-0-4h78l k8s.ovn.org/egress-assignable=""

$ cat egressip_ipv4.yaml && cat egressip_ipv6.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv4
spec:
  egressIPs:
    - 192.168.192.111
  namespaceSelector:
    matchLabels: 
      app: egress
      
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv6
spec:
  egressIPs:
    - fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333
  namespaceSelector:
    matchLabels: 
      app: egress

$ oc apply -f egressip_ipv4.yaml
$ oc apply -f egressip_ipv6.yaml

But it only shows info about the IPv4 egressIP. The IPv6 port is not even created in OpenStack:

oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-67cbc4bc84-786jm 
I0116 13:15:48.914323       1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:48.928927       1 cloudprivateipconfig_controller.go:357] CloudPrivateIPConfig: "192.168.192.111" will be added to node: "ostest-8zrlf-worker-0-4h78l"
I0116 13:15:48.942260       1 cloudprivateipconfig_controller.go:381] Adding finalizer to CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:48.943718       1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:49.758484       1 openstack.go:760] Getting port lock for portID 8854b2e9-3139-49d2-82dd-ee576b0a0cce and IP 192.168.192.111
I0116 13:15:50.547268       1 cloudprivateipconfig_controller.go:439] Added IP address to node: "ostest-8zrlf-worker-0-4h78l" for CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:50.602277       1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
I0116 13:15:50.614413       1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue

$ openstack port list --network network-dualstack | grep -e 192.168.192.111 -e 6f44:5dd8:c956:f816:3eff:fef0:3333
| 30fe8d9a-c1c6-46c3-a873-9a02e1943cb7 | egressip-192.168.192.111      | fa:16:3e:3c:23:2a | ip_address='192.168.192.111', subnet_id='ae8a4c1f-d3e4-4ea2-bc14-ef1f6f5d0bbe'                         | DOWN   |

Actual results: ipv6 egressIP object is ignored.
Expected results: ipv6 egressIP is created and can be attached to a pod.
Additional info: must-gather linked in private comment.

Owner: Architect:

Story (Required)

As an ODC Helm backend developer, I would like to be able to bump the version of Helm to 3.13 to stay in sync with the version we will ship with OCP 4.15.

Background (Required)

Normal activity we do every time a new OCP version is released, to stay current.

Glossary

NA

Out of scope

NA

Approach(Required)

Bump the version of Helm to 3.13; run, build, and unit test, and make sure everything is working as expected. Last time we had a conflict with the DevFile backend.

Dependencies

Might have dependencies on the DevFile team to move some dependencies forward.

Edge Case

NA

Acceptance Criteria

Console Helm dependency is moved to 3.13

INVEST Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Legend

Unknown
Verified
Unsatisfied

Please review the following PR: https://github.com/openshift/ovirt-csi-driver/pull/132

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When the user clicks on the perspective switcher after a hard refresh, a flicker appears.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-25-100326

How reproducible:

Always, after the user refreshes the console.

Steps to Reproduce:

1. user login to OCP console
2. refresh the whole console then click perspective switcher 
3.
    

Actual results:

There is a flicker when clicking on the perspective switcher.

Expected results:

no flickers

Additional info:

screen recording https://drive.google.com/file/d/1_2tPZ0DXNTapFP9sSz27vKbnwxxdWZSV/view?usp=drive_link 

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Follow-on bug for the story "Add option to enable/disable tailing to Pod log viewer". Issues:
1. The position property for the PF5 Dropdown component doesn't work. To reproduce: add the `position="right"` property to the `Dropdown` component; the position doesn't change in `"@patternfly/react-core": "5.1.0"`.
2. Clicking the `Checkbox` label wrapped with `DropdownItem` doesn't trigger the `onChange` on mobile screens.
3. The Expand button color is not blue on mobile due to replacing Button with DropdownItem.
4. The kebab toggle jumps to the screen top if already opened when resizing.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After updating the cluster to 4.12.42 (from 4.12.15), the customer noticed issues with scheduled pods failing to start on the node.

The initial thought was a multus issue, and then we realised that the script /usr/local/bin/configure-ovs.sh was modified and reverting the modification fixed the issue.

Modification:

>     if nmcli connection show "$vlan_parent" &> /dev/null; then
>       # if the VLAN connection is configured with a connection UUID as parent, we need to find the underlying device
>       # and create the bridge against it, as the parent connection can be replaced by another bridge.
>       vlan_parent=$(nmcli --get-values GENERAL.DEVICES conn show ${vlan_parent})
>     fi

Reference:

Version-Release number of selected component (if applicable):

4.12.42

How reproducible:

Should be reproducible by setting up inactive nmcli connections with the same names as the active ones.

Steps to Reproduce:

Not tested, but this should be something like:
1. Create inactive nmcli connections with the same names as the active ones
2. Run the script

Actual results:

Script failing

Expected results:

The script should manage the connection using the UUID instead of the Name.
Or maybe it's an underlying issue with how nmcli manages the relationship between objects.

Additional info:

The issue may be related to the way that nmcli is working, as it should use the UUID to match the `vlan.parent` as it does with the `connection.master`

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/63

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    manifests are duplicated with cluster-config-api image

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1020

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the tested HCP external OIDC env, when issuerCertificateAuthority is set, console pods are stuck in ContainerCreating status. The reason is the CA configmap is not propagated to openshift-console namespace by the console operator.

Version-Release number of selected component (if applicable):

Latest 4.16 and 4.15 nightly payloads

How reproducible:

Always

Steps to Reproduce:

1. Configure HCP external OIDC env with issuerCertificateAuthority set.
2. Check oc get pods -A

Actual results:

2. Before OCPBUGS-31319 is fixed, console pods are in CrashLoopBackOff status. After OCPBUGS-31319 is fixed, or after manually copying the CA configmap to the openshift-config namespace as a workaround, console pods are stuck in ContainerCreating status until the CA configmap is manually copied to the openshift-console namespace too. Console login is affected.

Expected results:

2. The console operator should be responsible for copying the CA to the openshift-console namespace, and console login should succeed.

Additional info:

In https://redhat-internal.slack.com/archives/C060D1W96LB/p1711548626625499 , HyperShift Dev side Seth requested to create this separate console bug to unblock the PR merge and backport for OCPBUGS-31319 . So creating it

Issue the customer is experiencing:
Despite manually removing the alternate service (old) and saving the configuration from the UI, the alternate service did not get removed from the route, and the changes did not take effect.

From the UI, if they use the Form view, select Remove Alternate Service, and click Save, then refresh the route information, it still shows the route configuration with the alternate service defined (see the spec sketch after the test output below).
If they use the YAML view, remove the entry from there, and save, it's gone properly.
If they use the CLI and edit the route, and remove the alternate service section, it also works properly.

Tests:

I have tested this scenario in my test cluster with OCP v4.13

  • I have created a route with the Alternate Backends:
    ~~~
  1. oc describe routes.route.openshift.io
    Name: httpd-example
    Namespace: test-ab
    Created: 5 minutes ago
    Labels: app=httpd-example
    template=httpd-example
    Annotations: openshift.io/generated-by=OpenShiftNewApp
    openshift.io/host.generated=true
    Requested Host: httpd-example-test-ab.apps.shrocp4upi413ovn.lab.upshift.rdu2.redhat.com
    exposed on router default (host router-default.apps.shrocp4upi413ovn.lab.upshift.rdu2.redhat.com) 5 minutes ago
    Path: <none>
    TLS Termination: <none>
    Insecure Policy: <none>
    Endpoint Port: <all endpoint ports>
    Service: httpd-example <-----------
    Weight: 50 (50%).
    Endpoints: <none>
    Service: pod-b. <-----------
    Weight: 50 (50%)
    Endpoints: <none>
    ~~~
  • Then I tried deleting it from the Console.
  • After removing the Alternate Backend from the console in the Form view, I saved the config.
  • But upon checking the route details again in the CLI, I could see the same Alternate Backend even though I have removed it:
    ~~~
  1. oc describe routes.route.openshift.io
    Name: httpd-example
    Namespace: test-ab
    Created: 12 minutes ago
    Labels: app=httpd-example
    template=httpd-example
    Annotations: openshift.io/generated-by=OpenShiftNewApp
    openshift.io/host.generated=true
    Requested Host: httpd-example-test-ab.apps.shrocp4upi413ovn.lab.upshift.rdu2.redhat.com
    exposed on router default (host router-default.apps.shrocp4upi413ovn.lab.upshift.rdu2.redhat.com) 12 minutes ago
    Path: <none>
    TLS Termination: <none>
    Insecure Policy: <none>
    Endpoint Port: web
    Service: httpd-example. <-----
    Weight: 100 (66%)
    Endpoints: 10.131.0.148:8080
    Service: pod-b <-----
    Weight: 50 (33%)
    Endpoints: <none>
    ~~~
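For reference, a hedged sketch of the Route spec stanza the Form view is expected to edit (field names from the route.openshift.io/v1 API; the weights are illustrative and mirror the test route above). Removing the alternate service should delete the whole alternateBackends list, which is what the YAML view and the CLI edit do correctly:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: httpd-example
  namespace: test-ab
spec:
  to:
    kind: Service
    name: httpd-example
    weight: 100
  # the Form view's "Remove Alternate Service" should drop this stanza entirely
  alternateBackends:
  - kind: Service
    name: pod-b
    weight: 50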

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-azure-sdn-upgrade-4.15-minor-release-openshift-release-analysis-aggregator/1757905312053989376

Aggregator claims these tests only ran 4 times out of what looks like 10 jobs that ran to normal completion:

[sig-network-edge] Application behind service load balancer with PDB remains available using new connections
[sig-network-edge] Application behind service load balancer with PDB remains available using reused connections

However looking at one of the jobs not in the list of passes, we can see these tests ran:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade/1757905303602466816

Why is the aggregator missing this result somehow?

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/46

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description

Seen in 4.15-related update CI:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
      1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
      1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
      2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
      8 console RouteHealth_RouteNotAdmitted console route is not admitted
     16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For example this 4.14 to 4.15 run had:

: [bz-Management Console] clusteroperator/console should not change condition/Available 
Run #0: Failed 	1h25m23s
{  1 unexpected clusteroperator state transitions during e2e test run 

Nov 28 03:42:41.207 - 1s    E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}

While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the console operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.

Version-Release number of selected component

At least 4.15. Possibly other versions; I haven't checked.

How reproducible

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact

Seems like it's primarily minor-version updates that trip this, and in jobs with high run counts, the impact percentage is single-digits.

Steps to reproduce

There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking to see whether impact percentages decrease would be a good way to test fixes post-merge.

Actual results

Lots of console ClusterOperator going Available=False blips in 4.15 update CI.

Expected results

Console goes Available=False if and only if immediate admin intervention is appropriate.

Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Seen in this 4.15 to 4.16 CI run:

: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers	0s
{  event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}

The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.

Version-Release number of selected component (if applicable):

Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match'
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact
pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact

How reproducible:

Looks like ~8% impact.

Steps to Reproduce:

1.  Run ~20 exposed job types.
2. Check for ": [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers" failures with "Back-off restarting failed container cluster-baremetal-operator" messages.

Actual results:

~8% impact.

Expected results:

~0% impact.

Additional info:

Dropping into Loki for the run I'd picked:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"

includes:

E1220 06:04:18.794548       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:05:40.753364       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:05:40.766200       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:05:40.780426       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:05:40.795555       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:08:21.730591       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:08:21.747466       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:08:21.768138       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:08:21.781058       1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"

So some kind of ClusterOperator-modification race?

Description of problem:

After installing an OpenShift IPI vSphere cluster, the coredns-monitor containers in the "openshift-vsphere-infra" namespace continuously report the message: "Failed to read ip from file /run/nodeip-configuration/ipv4" error="open /run/nodeip-configuration/ipv4: no such file or directory". The file "/run/nodeip-configuration/ipv4" present on the nodes is not actually mounted into the coredns pods. This does not appear to have any impact on the functionality of the cluster, but having a "failed" message in the container logs can trigger alarms or research into problems in the cluster.

Version-Release number of selected component (if applicable):

Any 4.12, 4.13, 4.14

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift IPI vSphere cluster
2. Wait for the installation to complete
3. Read the logs of any coredns-monitor container in the "openshift-vsphere-infra" namespace

Actual results:

coredns-monitor continuously reports the failed message, misleading a cluster administrator into searching for a real issue.

Expected results:

coredns-monitor should not report this failed message if there is nothing to fix.

Additional info:

The same issue happens in Baremetal IPI clusters.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/207

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When the user configures the install-config.yaml additionalTrustBundle field (for example, in a disconnected installation using a local registry),
the user-ca-bundle configmap gets populated with more content than strictly required
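For context, a minimal sketch of the relevant install-config.yaml field (the rest of the install-config is omitted and the certificate content is a placeholder); the expectation is that only this certificate ends up in the user-ca-bundle configmap:

additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <local registry CA certificate>
  -----END CERTIFICATE-----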

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Setup a local registry and mirror the content of an ocp release
    2. Configure the install-config.yaml for a mirrored installation. In particular, configure the additionalTrustBundle field with the registry cert
    3. Create the agent ISO, boot the nodes and wait for the installation to complete
    

Actual results:

    The user-ca-bundle configmap does not contain only the registry cert.

Expected results:

user-ca-bundle configmap with just the content of the install-config additionalTrustBundle field

Additional info:

     

Description of problem:

    Before: Warning  FailedCreatePodSandBox  8s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "82187d55b1379aad1e6c02b3394df7a8a0c84cc90902af413c1e0d9d56ddafb0": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-hhvfn/89e6349b-9797-4e03-8828-ebafe224dfaf:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: net.IPNet{IP:net.IP{0x20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}

After:  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               6s    default-scheduler  Successfully assigned default/netshoot-deployment-59898b5dd9-kk2zm to whereabouts-worker
  Normal   AddedInterface          6s    multus             Add eth0 [10.244.2.2/24] from kindnet
  Warning  FailedCreatePodSandBox  6s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "23dd45e714db09380150b5df74be37801bf3caf73a5262329427a5029ef44db1": plugin type="multus" name="multus-cni-network" failed (add): [default/netshoot-deployment-59898b5dd9-kk2zm/142de5eb-9f8a-4818-8c5c-6c7c85fe575e:whereaboutsexample]: error adding container to network "whereaboutsexample": error at storage engine: Could not allocate IP in range: ip: 2000::1 / - 2000::ffff:ffff:ffff:fffe / range: 2000::/64 / excludeRanges: [2000::/32]

Fixed upstream in #366 https://github.com/k8snetworkplumbingwg/whereabouts/pull/366
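For context, a hedged sketch of the kind of whereabouts IPAM configuration that produces these messages (the range and exclude values mirror the log above; the attachment type, master interface, and NetworkAttachmentDefinition shape are illustrative assumptions only):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: whereaboutsexample
spec:
  # the exclude list below covers a superset of the range, so no IP can be allocated
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "whereabouts",
        "range": "2000::/64",
        "exclude": ["2000::/32"]
      }
    }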

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Workload hints test cases get stuck  when the existing profile is similar to changes proposed in some of the test cases
 

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Starting with this 4.16-ci payload, we see these failures (examples shown below).  It happens on aws, azure, and gcp:

4.16.0-0.ci-2024-03-16-025152  Rejected 38 hours ago  03-16T02:51:52Z   https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-03-16-025152

  aggregated-aws-ovn-upgrade-4.16-minor Failed

    https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-upgrade-4.16-minor-release-openshift-release-analysis-aggregator/1768832918135771136

    Failed: suite=[openshift-tests], [sig-auth] all workloads in ns/openshift-must-gather-smq72 must set the 'openshift.io/required-scc' annotation

  aggregated-azure-sdn-upgrade-4.16-minor Failed

    https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-azure-sdn-upgrade-4.16-minor-release-openshift-release-analysis-aggregator/1768832928868995072

    Failed: suite=[openshift-tests], [sig-auth] all workloads in ns/openshift-must-gather-494qg must set the 'openshift.io/required-scc' annotation

 

This looks like the culprit: https://github.com/openshift/origin/pull/28589 ; revert = https://github.com/openshift/origin/pull/28659.

Description of problem:

Test case:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should start and expose a secured proxy and unsecured metrics [apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Example Z Job Link:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864

Z must-gather Link:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-s390x/1754481543524388864/artifacts/ocp-e2e-ovn-remote-libvirt-s390x/gather-libvirt/artifacts/must-gather.tar

Example P Job Link:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480

P must-gather Link:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-ovn-remote-libvirt-ppc64le/1754481543436308480/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/gather-libvirt/artifacts/must-gather.tar

JSON body of error:
{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:383]: Unexpected error:
    <*fmt.wrapError | 0xc001d9c000>: 
    https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get "https://thanos-querier.openshift-monitoring.svc:9091": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host
    {
        msg: "https://thanos-querier.openshift-monitoring.svc:9091: request failed: Get \"https://thanos-querier.openshift-monitoring.svc:9091\": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.38.188:53: no such host",
        err: <*url.Error | 0xc0020e02d0>{
            Op: "Get",
            URL: "https://thanos-querier.openshift-monitoring.svc:9091",
            Err: <*net.OpError | 0xc000b8f770>{
                Op: "dial",
                Net: "tcp",
                Source: nil,
                Addr: nil,
                Err: <*net.DNSError | 0xc0020df700>{
                    Err: "no such host",
                    Name: "thanos-querier.openshift-monitoring.svc",
                    Server: "172.30.38.188:53",
                    IsTimeout: false,
                    IsTemporary: false,
                    IsNotFound: true,
                },
            },
        },
    }

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Observe Nightlies on P and/or Z
    

Actual results:

    Test failing

Expected results:

    Test passing

Additional info:

    

Description of problem:

RHEL8 workers fail to go ready, ovn-controller node component is crashlooping with

2024-03-29T20:41:34.082252221Z + sourcedir=/usr/libexec/cni/
2024-03-29T20:41:34.082269221Z + case "${rhelmajor}" in
2024-03-29T20:41:34.082269221Z + sourcedir=/usr/libexec/cni/rhel8
2024-03-29T20:41:34.082276361Z + cp -f /usr/libexec/cni/rhel8/ovn-k8s-cni-overlay /cni-bin-dir/
2024-03-29T20:41:34.083575440Z cp: cannot stat '/usr/libexec/cni/rhel8/ovn-k8s-cni-overlay': No such file or directory

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    100% since https://github.com/openshift/ovn-kubernetes/pull/2083 merged

Steps to Reproduce:

    1. run periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-workers-rhel8
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    4.15 nightly payloads have been affected by this test multiple times:

: [sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler expand_less0s{ 1 events happened too frequently

event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject
body:
 From: 08:41:08Z To: 08:41:09Z}

In each of the 10 jobs aggregated, 2 to 3 jobs failed with this test. Historically this test passed 100%. But with the past two days test data, the passing rate has dropped to 97% and aggregator started allowing this in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888

The first payload this started appearing is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627.

All the events happened during cluster-operator/kube-scheduler progressing.

For comparison, here is a passed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816

Here is a failed one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216

They both have the same set of probe error events. For the passing jobs, the frequency is lower than 20, while for the failed job, one of those events repeated more than 20 times and therefore results in the test failure. 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/266

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Core CAPI CRDs not deployed on unsupported platforms even when explicitly needed by other operators.

An example of this is on vSphere clusters. CAPI is not yet supported on vSphere clusters, but the CAPI IPAM CRDs are needed by operators other than the usual consumers (cluster-capi-operator and the CAPI controllers).

Version-Release number of selected component (if applicable):

    

How reproducible:

    Launch a techpreview cluster for an unsupported platform (e.g. vsphere/azure). Check that the Core CAPI CRDs are not present.

Steps to Reproduce:

    $ oc get crds | grep cluster.x-k8s.io

Actual results:

    Core CAPI CRDs are not present (only the metal ones)

Expected results:

    Core CAPI CRDs should be present

Additional info:

    

Description of problem:

There is no response when a user clicks on a quickstart item.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-26-013420
browser 122.0.6261.69 (Official Build) (64-bit)
    

How reproducible:

always
    

Steps to Reproduce:

    1.Go to quick starts page by clicking "View all quick starts" on Home -> Overview page.
    2. Click on any quickstart item to check its steps.
    3.
    

Actual results:

2. There is no response.
    

Expected results:

2. Should open quickstart sidepage for installation instructions.
    

Additional info:

The issue doesn't exist on firefox 123.0 (64-bit)
    

Description of problem:

    Failed to upgrade from 4.15 to 4.16 with vSphere UPI due to
03-19 05:58:11.372  network                                    4.16.0-0.nightly-2024-03-13-061822   True        False         True       9h      Error while updating infrastructures.config.openshift.io/cluster: failed to apply / update (config.openshift.io/v1, Kind=Infrastructure) /cluster: Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.apiServerInternalIPs: Invalid value: "null": spec.platformSpec.vsphere.apiServerInternalIPs in body must be of type array: "null", spec.platformSpec.vsphere.ingressIPs: Invalid value: "null": spec.platformSpec.vsphere.ingressIPs in body must be of type array: "null", spec.platformSpec.vsphere.machineNetworks: Invalid value: "null": spec.platformSpec.vsphere.machineNetworks in body must be of type array: "null", <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

Version-Release number of selected component (if applicable):

  upgrade chain 4.11.58-x86_64 - > 4.12.53-x86_64,4.13.37-x86_64,4.14.17-x86_64,4.15.3-x86_64,4.16.0-0.nightly-2024-03-13-061822


How reproducible:

    always

Steps to Reproduce:

    1. Upgrade cluster from 4.15 -> 4.16
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Adding a test case verifying that exceeding the openshift.io/image-tags quota prevents creating new image references in the project.

Version-Release number of selected component (if applicable):

    4.16
pr - https://github.com/openshift/origin/pull/28464
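
For context, a hedged sketch of the kind of quota such a test exercises (the name, namespace, and limit value are illustrative assumptions):

~~~yaml
# Hedged sketch: once the openshift.io/image-tags count is exhausted,
# creating additional image references (tags) in the project should be
# rejected.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: image-tags-quota
  namespace: example-project
spec:
  hard:
    openshift.io/image-tags: "5"
~~~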

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/109

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-29384.

Description of problem:

    Sample job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-data-path-9nodes/1760228008968327168

Version-Release number of selected component (if applicable):

    

How reproducible:

    Anytime there is an error from the move-blobs command

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    A panic is shown followed by the error message

Expected results:

    Only the error message is shown, without a panic

Additional info:

    

==== This Jira covers only haproxy component ====

Description of problem:

Pods running in the namespace openshift-vsphere-infra are overly verbose, printing as INFO messages that should be DEBUG.

This excess of verbosity has an impact on CRI-O, on the node, and also on the Logging system.

For instance, with 71 nodes, the number of log entries coming from this namespace in 1 month was 450,000,000, meaning 1 TB of logs written to disk on the node by CRI-O, read by the Red Hat log collector, and stored in the Log Store.

In addition to the performance impact, it has a financial impact because of the storage needed.

Examples of messages that are better suited to DEBUG than INFO:
```
/// For keepalived, 4 messages are printed per node every 10 seconds; with 71 nodes in this example, that means 284 log entries every 10 seconds, i.e. about 1704 log entries per minute per keepalived pod
$ oc logs keepalived-master.example-0 -c  keepalived-monitor |grep master.example-0|grep 2024-02-15T08:20:21 |wc -l

$ oc logs keepalived-master-example-0 -c  keepalived-monitor |grep worker-example-0|grep 2024-02-15T08:20:21 
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
2024-02-15T08:20:21.733399279Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.733421398Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"

/// For haproxy, 2 log entries are printed every 6 seconds for each master; with 3 masters this means 6 messages (all within the same second), i.e. 60 messages per minute per pod
$ oc logs haproxy-master-0-example -c haproxy-monitor
...
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="Searching for Node IP of master-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x]'."
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="For node master-example-0 selected peer address x.x.x.x using NodeInternalIP"
```

Version-Release number of selected component (if applicable):

OpenShift 4.14
VSphere IPI installation

How reproducible:

Always

Steps to Reproduce:

    1. Install OpenShift 4.14 Vsphere IPI environment
    2. Review the logs of the haproxy pods and keepalived pods running in the namespace `openshift-vsphere-infra`
    

Actual results:

The haproxy-* and keepalived-* pods are overly verbose, printing as INFO messages that should be DEBUG.

Some of the messages are available in the Description of the problem in the present bug.

Expected results:

Only relevant messages are printed as INFO, helping to reduce the verbosity of the pods running in the namespace `openshift-vsphere-infra`

Additional info:

    

 Sometimes user manifests could be the source of problems, and right now they're not included in the logs archive downloaded for an Assisted cluster from the UI.

Currently we can only see the feature usage "Custom manifest" in metadata.json, but that only tells us the user has custom manifests, not which manifests they are or what their contents are.

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/46

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/631

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Creating an OS image without a cpu architecture field is currently allowed.

    - openshiftVersion: "4.12"
      version: "rhcos-412.86.202308081039-0"
      url: "http://registry.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com:8080/images/openshift-v4/amd64/dependencies/rhcos/4.12/4.12.30/rhcos-live.x86_64.iso" 

This results in invalid InfraEnvs being allowed and assisted-image-service returning an empty ISO file
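
For comparison, a hedged sketch of the same osImages entry with the architecture field populated (assuming the AgentServiceConfig cpuArchitecture field); entries without it should be rejected by validation rather than silently producing an empty ISO:

~~~yaml
# Hedged sketch only: the entry above with the architecture set.
- openshiftVersion: "4.12"
  version: "rhcos-412.86.202308081039-0"
  url: "http://registry.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com:8080/images/openshift-v4/amd64/dependencies/rhcos/4.12/4.12.30/rhcos-live.x86_64.iso"
  cpuArchitecture: x86_64
~~~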

Assisted-image-service log (4.12- is missing the architecture):

{"file":"/remote-source/app/pkg/imagestore/imagestore.go:299","func":"github.com/openshift/assisted-image-service/pkg/imagestore.(*rhcosStore).Populate","level":"info","msg":"Finished creating minimal iso for 4.12- (rhcos-412.86.202308081039-0)","time":"2024-04-11T17:04:16Z"} 

InfraEnv conditions:

[
  {
    "lastTransitionTime": "2024-04-11T17:04:47Z",
    "message": "Image has been created",
    "reason": "ImageCreated",
    "status": "True",
    "type": "ImageCreated"
  }
] 

InfraEnv ISODownloadURL:

  isoDownloadURL: https://assisted-image-service-multicluster-engine.apps.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com/byapikey/eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1ZjZhZmZjYy0zMzMwLTQ0NTYtODkxOC1lOThmYTE5ZTU2NGQifQ.T3h-_q6yMr1JvNkWXMspNk_9MFsHOX-CGBlBIlfpgjje9k-Y6RsI_6cWdZgJTPT0nMXRJiEUuvBJZJGPNdK-MQ/4.12/x86_64/minimal.iso 

Actually curling for this URL:

$ curl -kI "https://assisted-image-service-multicluster-engine.apps.ocp-edge-cluster-assisted-0.qe2.e2e.bos.redhat.com/byapikey/eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiI1ZjZhZmZjYy0zMzMwLTQ0NTYtODkxOC1lOThmYTE5ZTU2NGQifQ.T3h-_q6yMr1JvNkWXMspNk_9MFsHOX-CGBlBIlfpgjje9k-Y6RsI_6cWdZgJTPT0nMXRJiEUuvBJZJGPNdK-MQ/4.12/x86_64/minimal.iso"
HTTP/1.1 404 Not Found
content-type: text/plain; charset=utf-8
x-content-type-options: nosniff
date: Thu, 11 Apr 2024 17:09:35 GMT
content-length: 19
set-cookie: 1a4b5ac1ad25c005c048fb541ba389b4=02300906d3489ab71b6417aaeed52390; path=/; HttpOnly; Secure; SameSite=None 

 

 

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/279

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The default channel of 4.15, 4.16 clusters is stable-4.14.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-03-193825
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Install a 4.16 cluster
    2. Check default channel
# oc adm upgrade 
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.nightly-2024-01-03-193825 not found in the "stable-4.14" channel

Cluster version is 4.16.0-0.nightly-2024-01-03-193825

Upgradeable=False

  Reason: MissingUpgradeableAnnotation
  Message: Cluster operator cloud-credential should not be upgraded between minor versions: Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.14

    3.
    

Actual results:

Default channel is stable-4.14 in a 4.16 cluster
    

Expected results:

Default channel should be stable-4.16 in a 4.16 cluster
    

Additional info:

4.15 cluster has the issue as well.
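
As a hedged workaround sketch (not the actual fix, which is about the built-in default), the channel can be set explicitly on the ClusterVersion resource:

~~~yaml
# Hedged workaround sketch only: explicitly set the desired channel
# (e.g. via `oc edit clusterversion version`).
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: stable-4.16
~~~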
    

To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed.  This requires us to change rendering so that it honors the clusterprofile.

 

Then we have to update the installer to match, then update hypershift, then update the manifests.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

I was identifying what remains with:

cat e2e-events_20231204-183144.json | jq '.items[] | select(has("tempSource") | not)'

I think I've cleared all the difficult ones, hopefully these are just simple stragglers.

Description of problem:

Creation of a second hostedcluster in the same namespace fails with the error "failed to set secret's owner reference" in the status of the second hostedcluster's YAML.

~~~
  conditions:
  - lastTransitionTime: "2024-04-02T06:57:18Z"
    message: 'failed to reconcile the CLI secrets: failed to set secret''s owner reference'
    observedGeneration: 1
    reason: ReconciliationError
    status: "False"
    type: ReconciliationSucceeded
~~~

Note that the hosted control plane namespace is still different for both clusters.

Customer is just following the doc - https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/clusters/cluster_mce_overview#creating-a-hosted-cluster-bm for both the clusters and only the hostedcluster CR is created in the same namespace.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    

Steps to Reproduce:

    1. Create a hostedcluster as per the doc https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/clusters/cluster_mce_overview#creating-a-hosted-cluster-bm
    2. Create another hostedcluster in the same namespace where the first hostedcluster was created.
    3. Second hostedcluster fails to proceed with the said error.
    

Actual results:

The hostedcluster creation fails

Expected results:

The hostedcluster creation should succeed

Additional info:

    

revert: https://github.com/openshift/origin/pull/28603

output looks like:

INFO[2024-02-16T00:25:04Z] time="2024-02-16T00:20:41Z" level=error msg="disruption sample failed: error running request: 403 Forbidden" auditID=facdfd31-51e5-4812-a356-6e4b0e30cd38 backend=gcp-network-liveness-reused-connections this-instance="{Disruption map[backend-disruption-name:gcp-network-liveness-reused-connections connection:reused disruption:openshift-tests]}" type=reused

The 403 body is a squid error page ("ERROR: The requested URL could not be retrieved") for http://35.212.33.188/health: "Access Denied. Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect. Your cache administrator is root. Generated Fri, 16 Feb 2024 00:20:41 GMT by ofcir-3329d9226457452fb2040e269776e3a5 (squid/5.2)"

A second sample at the same timestamp fails with the same 403 response; its output is truncated in the original log excerpt.

Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/45

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Installer requires the `s3:HeadBucket` permission even though no such permission exists. The correct permission for the `HeadBucket` action is `s3:ListBucket`.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadBucket.html
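
For reference, a hedged sketch of the policy statement shape the installer should be checking for (shown in YAML form; the JSON equivalent is identical in meaning, and the resource scope here is only illustrative):

~~~yaml
# Hedged illustration: HeadBucket is authorized by s3:ListBucket, so the
# permissions check should test for that action. Resource scope is a
# placeholder, not installer policy.
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - s3:ListBucket
    Resource: "arn:aws:s3:::*"
~~~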

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Install a cluster using a role with limited permissions
    2.
    3.
    

Actual results:

    level=warning msg=Action not allowed with tested creds action=iam:DeleteUserPolicy
level=warning msg=Tested creds not able to perform all requested actions
level=warning msg=Action not allowed with tested creds action=s3:HeadBucket
level=warning msg=Tested creds not able to perform all requested actions
level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: AWS credentials cannot be used to either create new creds or use as-is
Installer exit with code 1

Expected results:

    Installer should check only for s3:ListBucket

Additional info:

    

Description of problem:

OCP 4.15 nightly deployment on bare-metal servers without using the provisioning network is stuck during deployment.

Job history:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g

Deployment is stuck, similar to this:

Upstream job logs:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g/1732520780954079232/artifacts/e2e-telco5g/telco5g-cluster-setup/artifacts/cloud-init-output.log

~~~

level=debug msg=ironic_node_v1.openshift-master-host[2]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[0]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[1]: Creating...level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [10s elapsed]..level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h28m51s elapsed]level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [2h28m51s elapsed]
~~~

Ironic logs from bootstrap node:
~~~
Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''.
Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Failed with result 'exit-code'.
Dec 07 13:10:13 localhost.localdomain systemd[1]: Failed to start Provisioning interface.
Dec 07 13:10:13 localhost.localdomain systemd[1]: Dependency failed for DHCP Service for Provisioning Network.
Dec 07 13:10:13 localhost.localdomain systemd[1]: ironic-dnsmasq.service: Job ironic-dnsmasq.service/start failed with result 'dependency'
~~~

Version-Release number of selected component (if applicable):

4.15 

How reproducible:

Everytime

Steps to Reproduce:

1.Deploy OCP

More information about our setup:
In our environment, we have 3 virtual master nodes, 1 virtual worker, and 1 bare-metal worker. We use the KCLI tool to create the virtual environment and to run the IPI deployment workflow. In our setup we don't use a provisioning network. (The same setup is used for OCP versions up to 4.14 and works fine.)

We have attached our install-config.yaml (for RH employees) and logs from bootstrap node.
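
For context, a minimal sketch of the relevant install-config excerpt for such a setup (addresses are placeholders; the real install-config.yaml is attached to this bug):

~~~yaml
# Hedged excerpt only: the baremetal platform section when no provisioning
# network is used.
platform:
  baremetal:
    provisioningNetwork: "Disabled"
    apiVIPs:
      - 192.168.123.5
    ingressVIPs:
      - 192.168.123.10
~~~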

Actual results:

Deployment is failing

Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''.

Expected results:

Deployment should pass

Additional info:

    

Description of problem:

    HyperShift operator is applying control-plane-pki-operator RBAC resources regardless of whether PKI reconciliation is disabled for the HostedCluster.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    100%

Steps to Reproduce:

    1. Create 4.15 HostedCluster with PKI reconciliation disabled
    2. Unused RBAC resources for control-plane-pki-operator are created
    

Actual results:

    Unused RBAC resources for control-plane-pki-operator are created

Expected results:

RBAC resources for control-plane-pki-operator should not be created if deployment for control-plane-pki-operator itself is not created.

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

OKD/FCOS uses FCOS for its bootimage which lacks several tools and services such as oc and crio that the rendezvous host of the Agent-based Installer needs to set up a bootstrap control plane.

Version-Release number of selected component (if applicable):

4.13.0
4.14.0
4.15.0

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/87

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

AWS HyperShift clusters' nodes cannot join the cluster when a custom domain name is set in the VPC's DHCP option set

Version-Release number of selected component (if applicable):

Any

How reproducible:

100%

Steps to Reproduce:

1. Create a VPC for a HyperShift/ROSA HCP cluster in AWS
2. Replace the VPC's DHCP Option Set with another with a custom domain name (example.com or really any domain of your choice)
3. Attempt to install a HyperShift/ROSA HCP cluster with a nodepool

Actual results:

All EC2 instances will fail to become nodes. They will generate CSRs based on the default domain name - ec2.internal for us-east-1 or ${region}.compute.internal for other regions (e.g. us-east-2.compute.internal)

Expected results:

Either that they become nodes or that we document that custom domain names in DHCP Option Sets are not allowed with HyperShift at this time. There is currently no pressing need for this feature, though customers do use this in ROSA Classic/OCP successfully.

Additional info:

This is a known gap currently in cluster-api-provider-aws (CAPA) https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/1691

Description of problem:


In the self-managed HCP use case, if the on-premise baremetal management cluster does not have nodes labeled with the "topology.kubernetes.io/zone" key, then all HCP pods for a Highly Available cluster are scheduled to a single mgmt cluster node.

This is a result of the way the affinity rules are constructed.

Take the pod affinity/antiAffinity example below, which is generated for a HA HCP cluster. If the "topology.kubernetes.io/zone" label does not exist on the mgmt cluster nodes, then the pod will still get scheduled but that antiAffinity rule is effectively ignored. That seems odd due to the usage of the "requiredDuringSchedulingIgnoredDuringExecution" value, but I have tested this and the rule truly is ignored if the topologyKey is not present.

        podAffinity: 
          preferredDuringSchedulingIgnoredDuringExecution: 
          - podAffinityTerm: 
              labelSelector: 
                matchLabels: 
                  hypershift.openshift.io/hosted-control-plane: clusters-vossel1
              topologyKey: kubernetes.io/hostname
            weight: 100
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector: 
              matchLabels: 
                app: kube-apiserver
                hypershift.openshift.io/control-plane-component: kube-apiserver
            topologyKey: topology.kubernetes.io/zone
In the event that no "zones" are configured for the baremetal mgmt cluster, then the only other pod affinity rule is one that actually colocates the pods together. This results in a HA HCP having all the etcd, apiservers, etc... scheduled to a single node.

Version-Release number of selected component (if applicable):

4.14


How reproducible:

100%

Steps to Reproduce:

1. Create a self-managed HA HCP cluster on a mgmt cluster with nodes that lack the "topology.kubernetes.io/zone" label

Actual results:

all HCP pods are scheduled to a single node.

Expected results:

HCP pods should always be spread across multiple nodes.

Additional info:


A way to address this is to add another anti-affinity rule which prevents every component from being scheduled on the same node as its replicas.
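
A hedged sketch of such a rule for one component, reusing the labels from the example above (an illustration, not the actual HyperShift change):

~~~yaml
# Hedged sketch only: spread replicas of the same component across nodes by
# adding a hostname-scoped anti-affinity rule alongside the zone-scoped one.
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: kube-apiserver
        hypershift.openshift.io/control-plane-component: kube-apiserver
    topologyKey: kubernetes.io/hostname
~~~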

Description of problem:

Version shown for `oc-mirror --v2 version` should be similar to `oc-mirror version`

 

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1) `oc-mirror --v2 -v`

Actual results: 

$ oc-mirror --v2 -v
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
oc-mirror version v2.0.0-dev-01

Expected results:

oc-mirror version --output=yaml
clientVersion:
  buildDate: "2024-03-07T03:46:24Z"
  compiler: gc
  gitCommit: c4f829512107f7d0f52a057cd429de2030b9b3b3
  gitTreeState: clean
  gitVersion: 4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295
  goVersion: go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime
  major: ""
  minor: ""
  platform: linux/amd64

Description of problem:

If the installer using cluster api exits before bootstrap destroy, it may leak processes which continue to run in the background of the host system. These processes may continue to reconcile cloud resources, so the cluster resources would be created and recreated even when you are trying to delete them. 

This occurs because the installer runs kube-apiserver, etcd, and the capi provider binaries as subprocesses. If the installer exits without shutting down those subprocesses, due to an error or user interrupt, the processes will continue to run in the background.

The processes can be identified with the ps command. pgrep and pkill are also useful.

Brief discussion here of this occurring in PowerVS: https://redhat-internal.slack.com/archives/C05QFJN2BQW/p1712688922574429

Version-Release number of selected component (if applicable):

    

How reproducible:

    Often

Steps to Reproduce:

    1. Run capi-based install (on any platform), by specifying fields below in the install config [0]
    2. Wait until CAPI controllers begin to run. This will be easy to identify because the terminal will fill with controller logs. Particularly you should see [1]
    3. Once the controllers are running interrupt with CTRL + C

[0] Install config for capi install
featureGates:
- ClusterAPIInstall=true
featureSet: CustomNoUpgrade    

[1] INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /c/auth/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --metrics-bind-addr=0 --

 

Actual results:

    controllers will leak and continue to run. They can be viewed with ps or pgrep

You may also see INFO Shutting down local Cluster API control plane... 
That means the Shutdown started but did not complete.

Expected results:

    The installer should shutdown gracefully and not leak processes, such as:

^CWARNING Received interrupt signal                    
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: aws infrastructure provider 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create infrastructure manifest: Post "https://127.0.0.1:41441/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities": unexpected EOF 
INFO Local Cluster API system has completed operations 

Additional info:

    

Description of problem:

Following https://issues.redhat.com/browse/CNV-28040
On CNV, when virtual machine, with secondary interfaces connected with bridge CNI, is live migrated we observe disruption at the VM inbound traffic.

The root cause for it is the migration target bridge interface advertise before the migration is completed.

When the migration destination pod is created, an IPv6 NS (Neighbor Solicitation) and NA (Neighbor Advertisement) are sent automatically by the kernel. The forwarding tables of the switches at the endpoints (e.g. the migration destination node) get updated, and the traffic is forwarded to the migration destination before the migration is completed [1].

The solution is to have the bridge CNI create the pod interface in "link-down" state [2], so the IPv6 NS/NA packets are avoided; CNV, in turn, sets the pod interface to "link-up" [3].

CNV depends on bridge CNI with [2] bits, which is deployed by cluster-network-operator.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2186372#c6
[2] https://github.com/kubevirt/kubevirt/pull/11069
[3] https://github.com/containernetworking/plugins/pull/997  

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

CNO deploys CNI bridge w/o an option to set the bridge interface down.

Expected results:

CNO to deploy bridge CNI with [1] changes, from release-4.16 branch. 

[1] https://github.com/containernetworking/plugins/pull/997

Additional info:

More https://issues.redhat.com/browse/CNV-28040

Description of problem:

    On February 27th endpoints were turned off that were being queried for account details. The check is not vital so we are fine with removing it, however it is currently blocking all Power VS installs.
            

Version-Release number of selected component (if applicable):

    4.13.0 - 4.16.0

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy with Power VS
    2. Fail at the platform credentials check
    

Actual results:

    Check fails

Expected results:

    Check should succeed

Additional info:

    

 

Description of problem:

CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to an out-of-sync OperatorCondition.

We see:

$ oc get csv
NAME                                       DISPLAY                    VERSION               REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.1   OpenShift Virtualization   4.14.1                kubevirt-hyperconverged-operator.v4.14.0   Replacing
kubevirt-hyperconverged-operator.v4.15.0   OpenShift Virtualization   4.15.0                kubevirt-hyperconverged-operator.v4.14.1   Pending

And on the v4.15.0 CSV:

$ oc get csv kubevirt-hyperconverged-operator.v4.15.0 -o yaml
....
status:
  cleanup: {}
  conditions:
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
      is outdated'
    phase: Pending
    reason: OperatorConditionNotUpgradeable
  lastTransitionTime: "2023-12-19T01:50:48Z"
  lastUpdateTime: "2023-12-19T01:50:48Z"
  message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
    is outdated'
  phase: Pending
  reason: OperatorConditionNotUpgradeable

and if we check the pending operator condition (v4.14.1) we see:

$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 18
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4116127"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:23Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable

where metadata.generation (18) is not in sync with status.conditions[*].observedGeneration (11).

Even manually editing spec.conditions[*].lastTransitionTime causes a change in metadata.generation (as expected), but this doesn't trigger any reconciliation in OLM, so status.conditions[*].observedGeneration remains at 11.

$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 19
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4147472"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:25Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable

since its observedGeneration is out of sync, this check:
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operatorconditions.go#L44C1-L48

fails and the upgrade never starts.

I suspect (I'm only guessing) that it could be a regression introduced with the memory optimization for https://issues.redhat.com/browse/OCPBUGS-17157 .

Version-Release number of selected component (if applicable):

    OCP 4.15.0-ec.3

How reproducible:

- Not reproducible (with the same CNV bundles) on OCP v4.14.z.
- Pretty high (but not 100%) on OCP 4.15.0-ec.3

 

 

Steps to Reproduce:

    1. Try triggering a CNV v4.14.1 -> v4.15.0 on OCP 4.15.0-ec.3
    2.
    3.
    

Actual results:

    The OLM is not reacting to changes on spec.conditions on the pending operator condition, so metadata.generation is constantly out of sync with status.conditions[*].observedGeneration and so the CSV is reported as 

    message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True"
      is outdated'
    phase: Pending
    reason: OperatorConditionNotUpgradeable

Expected results:

    The OLM correctly reconciles the operatorCondition and the upgrade starts

Additional info:

    Not reproducible with exactly the same bundle (origin and target) on OCP v4.14.z

Description of problem:

In a recently installed cluster running 4.13.29, after configuring the cluster-wide proxy, the "vsphere-problem-detector" is not picking up the proxy configuration.
As the pod cannot reach vSphere, it is failing to run checks:
2024-02-01T09:28:00.150332407Z E0201 09:28:00.150292       1 operator.go:199] failed to run checks: failed to connect to vsphere.local: Post "https://vsphere.local/sdk": dial tcp 172.16.1.3:443: i/o timeout  

The pod doesn't get the cluster proxy settings as expected:
   - name: HTTPS_PROXY
     value: http://proxy.local:3128
   - name: HTTP_PROXY
     value: http://proxy.local:3128

Other storage-related pods get the expected configuration shown above.

This causes the vsphere-problem-detector to fail connections to vSphere, hence failing the health checks.
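
For reference, a minimal sketch of the cluster-wide Proxy object whose settings should be propagated to the operand (hostnames are placeholders matching the values above):

~~~yaml
# Hedged sketch: the httpProxy/httpsProxy values here should end up as
# HTTP_PROXY/HTTPS_PROXY on the vsphere-problem-detector pod.
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.local:3128
  httpsProxy: http://proxy.local:3128
  noProxy: .cluster.local,.svc
~~~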

 

Version-Release number of selected component (if applicable):

  4.13.29 

How reproducible:

   Always

Steps to Reproduce:

    1.Configure cluster-wide proxy in the environment. 
    2. Wait for the change
    3. Check the pod configuration
    

Actual results:

    vSphere health checks failing

Expected results:

    vSphere health checks working through the cluster proxy

Additional info:

    

Description of problem:

    We see failures in this test:

[Jira:"Networking / router"] monitor test service-type-load-balancer-availability setup expand_less 15m1s{ failed during setup error waiting for load balancer: timed out waiting for service "service-test" to have a load balancer: timed out waiting for the condition}

See this https://search.ci.openshift.org/?search=error+waiting+for+load+balancer&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job to find recent ones.

example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade/1754402739040817152

this has failed payloads like:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-01-211543
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-061913
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-02-001913

Version-Release number of selected component (if applicable):

    4.15 and 4.16

How reproducible:

    intermittent as shown in the search.ci query above

Steps to Reproduce:

    1. run the e2e tests on 4.15 and 4.16
    2.
    3.
    

Actual results:

    timeouts on getting load balancer

Expected results:

    no timeout and successful load balancer

Additional info:

    https://issues.redhat.com/browse/TRT-1486 has more info 
thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1707142256956139

Description of the problem:

The ClusterImageSetRef specified in the SiteConfig CR on the hub cluster mismatches the specific image pulled from quay when trying to install 4.12.x managed SNO clusters. This behavior has not been detected when installing SNO 4.14.x.

How reproducible:

Install 4.12.19 from ACM using clusterImageSetNameRef: "img4.12.19-x86-64-appsub"

Actual results:

Image pulled is actually img4.12.19-multi-x86-64-appsub

 

# oc adm release info 4.12.19 | more
Name:           4.12.19
Digest:         sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
Created:        2023-05-24T06:58:32Z
OS/Arch:        linux/amd64
Manifests:      647
Metadata files: 1
 
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:4db028028aae12cb82b784ae54aaca6888f224b46565406f1a89831a67f42030
  Metadata:
    release.openshift.io/architecture: multi
    url: https://access.redhat.com/errata/RHSA-2023:3287
 

 

Expected results:

Image pulled is img4.12.19-x86-64-appsub

# oc adm release info 4.12.19 | more
Name:           4.12.19
Digest:     sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
 
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
  Metadata:     
url: https://access.redhat.com/errata/RHSA-2023:3287  
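
A hedged sketch of a ClusterImageSet pinned to the expected single-arch digest above (assuming the Hive ClusterImageSet API; only the name and digest are taken from this report):

~~~yaml
# Hedged sketch only: pinning the release image by the x86_64 digest shown
# above removes any ambiguity between the single-arch and multi payloads.
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: img4.12.19-x86-64-appsub
spec:
  releaseImage: quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
~~~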

 

 

Description of problem:

    ART is moving the container images to be built by Golang 1.21. We should do the same to keep our build config in sync with ART.

Version-Release number of selected component (if applicable):

    4.16/master

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In three clusters, I am receiving the alert:

"Multiple default storage classes are marked as default. The storage class is chosen for the PVC is depended on version of the cluster.

Starting with OpenShift 4.13, a persistent volume claim (PVC) requesting the default storage class gets the most recently created default storage class if multiple default storage classes exist."

But the alert clearly shows only one default SC:

"Red Hat recommends to set only one storage class as the default one.

Current default storage classes:

ocs-external-storagecluster-ceph-rbd"

This is confirmed with 'oc get sc'

NAME                                             PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ocs-external-storagecluster-ceph-rbd (default)   openshift-storage.rbd.csi.ceph.com      Delete          Immediate           true                   351d
ocs-external-storagecluster-ceph-rbd-windows     openshift-storage.rbd.csi.ceph.com      Delete          Immediate           true                   11d
ocs-external-storagecluster-cephfs               openshift-storage.cephfs.csi.ceph.com   Delete          Immediate           true                   351d
openshift-storage.noobaa.io                      openshift-storage.noobaa.io/obc         Delete          Immediate           false                  351d
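
For reference, a hedged reminder of how the default is marked: exactly one StorageClass should carry the annotation below (the class name and provisioner are taken from the output above):

~~~yaml
# Hedged illustration: the default flag lives in this annotation; the alert
# should only fire when more than one StorageClass sets it to "true".
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-external-storagecluster-ceph-rbd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: openshift-storage.rbd.csi.ceph.com
~~~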

Description of problem:

nothing happens when a user clicks on the 'Configure' button next to the AlertmanagerReceiversNotConfigured alert

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-03-11-041450    

How reproducible:

 Always   

Steps to Reproduce:

1. navigate to Home -> Overview, locate the AlertmanagerReceiversNotConfigured alert in 'Status' card
2. click the 'Configure' button next to AlertmanagerReceiversNotConfigured alert
    

Actual results:

nothing happens    

Expected results:

user should be taken to alert manager configuration page /monitoring/alertmanagerconfig    

Additional info:

    

Description of problem:

When creating an IAM role with a "path" (https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-friendly-names), its "principal", when applied to trust policies or VPC Endpoint Service allowed principals, confusingly does not include the path. That is, for the following rolesRef on a hostedcluster:

spec:
  platform:
    aws:
      rolesRef:
        controlPlaneOperatorARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator
        imageRegistryARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-image-registry-installer-cloud-crede
        ingressARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-ingress-operator-cloud-credentials
        kubeCloudControllerARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-kube-controller-manager
        networkARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cloud-network-config-controller-clou
        nodePoolManagementARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-capa-controller-manager
        storageARN: arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-openshift-cluster-csi-drivers-ebs-cloud-creden 

The actual valid principal that should be added to the VPC Endpoint Service's allowed principals is: 

arn:aws:iam::765374464689:role/ad-int-path1-y4y2-kube-system-control-plane-operator 

instead of

arn:aws:iam::765374464689:role/adecorte/ad-int-path1-y4y2-kube-system-control-plane-operator 

However, for all other cases, the full ARN including the path should be used, e.g. https://github.com/openshift/hypershift/blob/082e880d0a492a357663d620fa58314a4a477730/hypershift-operator/controllers/hostedcluster/internal/platform/aws/aws.go#L237-L273

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

100%

Steps to Reproduce:

ROSA HCP-specific steps:
1. rosa create account-roles --path /anything/ -m auto -y
2. rosa create cluster --hosted-cp
3. ...etc
4a. Observe on the hosted cluster AWS Account that the VPC Endpoint cannot be created with the error: 'failed to create vpc endpoint: InvalidServiceName'
4b. Observe on the management cluster that CPO is failing to update the VPC Endpoint Service's allowed principals with the error: Client.InvalidPrincipal
5. If the contents of .spec.platform.aws.rolesRef.controlPlaneOperatorARN are manually applied to the additional allowed principals with the path component removed, then the problems are largely fixed on the hosted cluster side. VPC Endpoint is created, worker nodes can spin up, etc.

Actual results:

The VPC Endpoint Service is attempting and failing to get this applied to its additional allowed principals:

arn:aws:iam::${ACCOUNT_ID}:role/path/name

Expected results:

The VPC Endpoint Service gets this applied to its additional allowed principals:

arn:aws:iam::${ACCOUNT_ID}:role/name

Additional info:

 

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/246

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The upstream project reorganized the config directory and we need to adapt it for downstream. Until then, upstream->downstream syncing is blocked.

Description of problem:

    A change to how Power VS Workspaces are queried is not compatible with the version of terraform-provider-ibm

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy with Power VS
    2. Fail with an error stating that [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
    

Actual results:

    Fail with [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist

Expected results:

    Install should succeed.

Additional info:

    

 

Description of problem:

Azure File volume mount failed; it happens on an ARM cluster with the multi payload.

$ oc describe pod
  Warning  FailedMount       6m28s (x2 over 95m)  kubelet            MountVolume.MountDevice failed for volume "pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2" : rpc error: code = InvalidArgument desc = GetAccountInfo(wduan-0319b-bkp2k-rg#clusterjzrlh#pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2###wduan) failed with error: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0319b-bkp2k-rg/providers/Microsoft.Storage/storageAccounts/clusterjzrlh/listKeys?api-version=2021-02-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token": dial tcp 20.190.190.193:443: i/o timeout'

 

The node log reports:
W0319 09:41:30.745936 1 azurefile.go:806] GetStorageAccountFromSecret(azure-storage-account-clusterjzrlh-secret, wduan) failed with error: could not get secret(azure-storage-account-clusterjzrlh-secret): secrets "azure-storage-account-clusterjzrlh-secret" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa" cannot get resource "secrets" in API group "" in the namespace "wduan"

 

 
 

Checked the cluster role, which looks good and matches the previous release:
$ oc get clusterrole azure-file-privileged-role -o yaml
...
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-multi-2024-03-13-031451

How reproducible:

2/2

Steps to Reproduce:

    1. Observed in CI: azure-file cases failed due to this
    2. Create a cluster with the same config and payload, then create an azure-file PVC and pod
    3.
    

Actual results:

The pod cannot reach the Running state

Expected results:

Pod should be running 

Additional info:

    

Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/144

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/network-tools/pull/108

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/95

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Component Readiness has found a potential regression in [Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.16
Start Time: 2024-04-18T00:00:00Z
End Time: 2024-04-24T23:59:59Z
Success Rate: 93.14%
Successes: 95
Failures: 7
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 482
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20upgrade-micro%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=azure&sampleEndTime=2024-04-24%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-04-18%2000%3A00%3A00&testId=openshift-tests-upgrade%3A57b9d37e7f1d80cb25d3ba4386abc630&testName=%5BUnknown%5D%5Binvariant%5D%20alert%2FKubePodNotReady%20should%20not%20be%20at%20or%20above%20info%20in%20all%20the%20other%20namespaces&upgrade=upgrade-micro&variant=standard

Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/53

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   Trying to create a second cluster using the same cluster name and base domain as the first cluster fails, as expected, because of the DNS record-set conflicts. However, destroying that failed second cluster leaves the first cluster inaccessible, which is unexpected.

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-01-14-100410

How reproducible:

    Always

Steps to Reproduce:

1. create the first cluster and make sure it succeeds
2. try to create the second cluster, with the same cluster name, base domain, and region, and make sure it failed
3. destroy the second cluster which failed due to "Platform Provisioning Check"
4. check if the first cluster is still healthy     

Actual results:

    The first cluster becomes unhealthy, because its DNS record-sets are deleted by step 3

Expected results:

    The DNS record-sets of the first cluster should stay untouched during step 3, and the first cluster should stay healthy after step 3.

Additional info:

(1) the first cluster was created by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it is initially healthy

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-14-100410   True        False         54m     Cluster version is 4.15.0-0.nightly-2024-01-14-100410
$ oc get nodes
NAME                                                       STATUS   ROLES                  AGE   VERSION
jiwei-0115y-lgns8-master-0.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-master-1.c.openshift-qe.internal         Ready    control-plane,master   73m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-master-2.c.openshift-qe.internal         Ready    control-plane,master   74m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal   Ready    worker                 62m   v1.28.5+c84a6b8
jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal   Ready    worker                 63m   v1.28.5+c84a6b8
$ 

(2) try to create the second cluster; it is expected to fail because the DNS record already exists

$ openshift-install version
openshift-install 4.15.0-0.nightly-2024-01-14-100410
built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5
release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d
release architecture amd64
$ mkdir test1
$ cp install-config.yaml test1
$ yq-3.3.0 r test1/install-config.yaml baseDomain
qe.gcp.devcluster.openshift.com
$ yq-3.3.0 r test1/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0115y
$ yq-3.3.0 r test1/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ openshift-install create cluster --dir test1
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Consuming Install Config from target directory 
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) ["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue 
$ 

(3) delete the second cluster

$ openshift-install destroy cluster --dir test1
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Deleted 2 recordset(s) in zone qe            
INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone 
WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer 
INFO Time elapsed: 37s                            
INFO Uninstallation complete!                     
$ 

(4) check the first cluster status and the dns record-sets

$ oc get clusterversion
Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host
$
$ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2024-01-15T07:22:55.199Z'
description: Created By OpenShift Installer
dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com.
id: '9193862213315831261'
kind: dns#managedZone
labels:
  kubernetes-io-cluster-jiwei-0115y-lgns8: owned
name: jiwei-0115y-lgns8-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network
visibility: private
$ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone
NAME                                          TYPE  TTL    DATA
jiwei-0115y.qe.gcp.devcluster.openshift.com.  NS    21600  ns-gcp-private.googledomains.com.
jiwei-0115y.qe.gcp.devcluster.openshift.com.  SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
$ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y'
Listed 0 items.
$ 

Description of problem:

Update OWNERS file in route-controller-manager repository.

    

Version-Release number of selected component (if applicable):

4.15
    

How reproducible:

n/a
    

Steps to Reproduce:

n/a
    

Actual results:

n/a
    

Expected results:

n/a
    

Additional info:


    

Description of problem:

    OAuth-Proxy breaks when it's using Service Account as an oauth-client as documented in https://docs.openshift.com/container-platform/4.15/authentication/using-service-accounts-as-oauth-client.html

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    100%

Steps to Reproduce:

    1. install an OCP cluster without the ImageRegistry capability
    2. deploy an oauth-proxy that uses an SA as its OAuth2 client
    3. try to login to the oauth-proxy using valid credentials
    

Actual results:

    The login fails, the oauth-server logs:

2024-02-05T13:30:56.059910994Z E0205 13:30:56.059873       1 osinserver.go:91] internal error: system:serviceaccount:my-namespace:my-sa has no tokens

Expected results:

    The login succeeds

Additional info:

    

Description of problem:

    If you allow the installer to provision a Power VS Workspace instead of bringing your own, installation can sometimes fail when creating a network. This is because the Power Edge Router can sometimes take up to a minute to configure.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Infrequent, but it will probably occur within 50-100 runs

Steps to Reproduce:

    1. Install on Power VS with IPI with serviceInstanceGUID not set in the install-config.yaml
    2. Occasionally you'll observe a failure due to the workspace not being ready for networks
    

Actual results:

    Failure

Expected results:

    Success

Additional info:

    Not consistently reproducible
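
Because this is purely a timing issue, the expected mitigation is to wait for the workspace (and its Power Edge Router) to finish configuring before creating the network. A minimal Go sketch of that polling pattern, with workspaceReady standing in for the real Power VS API call:

```
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// workspaceReady is a stand-in for the real Power VS API call that reports
// whether the workspace (and its Power Edge Router) is fully configured.
func workspaceReady(ctx context.Context, workspaceID string) (bool, error) {
	// ... call the Power VS API here ...
	return true, nil
}

// waitForWorkspace polls until the workspace is ready or the timeout expires.
func waitForWorkspace(ctx context.Context, workspaceID string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		ready, err := workspaceReady(ctx, workspaceID)
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for Power VS workspace to become ready")
		case <-ticker.C:
		}
	}
}

func main() {
	if err := waitForWorkspace(context.Background(), "example-workspace-id", 5*time.Minute); err != nil {
		fmt.Println("error:", err)
	}
}
```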

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/55

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In a dualstack cluster, we use IPv4 URLs for callbacks to Ironic and Inspector. If the host only has IPv6 networking, the provisioning will fail. This issue affects both the normal IPI and ZTP with the converged flow.

A similar issue has been fixed as part of METAL-163 where we use the BMC's address family to determine which URL to send to it. This bug is somewhat simpler: we can provide IPA with several URLs and let it decide which one it can use. This way, only small changes to IPA itself, ICC and CBO are required.

The fix will only affect virtual media deployments without provisioning network or with virtualMediaViaExternalNetwork:true. We don't have a good dualstack story around provisioning networks anyway.

Upstream IPA request: https://bugs.launchpad.net/ironic/+bug/2045548
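
The approach described above boils down to handing IPA a list of candidate callback URLs and letting it use whichever one is reachable over its available address family. A minimal Go sketch of that selection logic (independent of the actual IPA/ICC/CBO code; the addresses are examples):

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pickReachableURL returns the first callback URL that answers an HTTP request,
// which is the essence of "give IPA several URLs and let it decide".
func pickReachableURL(candidates []string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, u := range candidates {
		resp, err := client.Get(u)
		if err != nil {
			continue // e.g. no route over this address family
		}
		resp.Body.Close()
		return u, nil
	}
	return "", fmt.Errorf("none of the %d candidate URLs are reachable", len(candidates))
}

func main() {
	urls := []string{
		"http://192.0.2.10:6385/v1",     // IPv4 endpoint (example address)
		"http://[2001:db8::10]:6385/v1", // IPv6 endpoint (example address)
	}
	if u, err := pickReachableURL(urls); err == nil {
		fmt.Println("using callback URL:", u)
	} else {
		fmt.Println(err)
	}
}
```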

Description of problem:

Converting an IPv6-primary dual-stack cluster to IPv6 single-stack causes control plane failures. OVN masters are periodically in CrashLoopBackOff (CLBO) state.

OVN masters logs: http://shell.lab.bos.redhat.com/~anusaxen/convert/

Must-gather is not working as the cluster lands in bad shape. Happy to share the cluster if needed for debugging.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-04-21-084440

How reproducible:

Always

Steps to Reproduce:

1. Bring up a cluster with IPv6-primary dual stack

2. Edit network.config.openshift.io from dual-stack to single-stack as follows


spec:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
    - 172.30.0.0/16
  status:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    - cidr: 10.128.0.0/14
      hostPrefix: 23
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
    - 172.30.0.0/16


TO



 apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Network
  metadata:
    creationTimestamp: "2023-04-27T14:11:37Z"
    generation: 3
    name: cluster
    resourceVersion: "81045"
    uid: 28f15675-e739-4262-9acc-4c2c0df4b38d
  spec:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    externalIP:
      policy: {}
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
  status:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    clusterNetworkMTU: 1400
    networkType: OVNKubernetes
    serviceNetwork:
    - fd02::/112
kind: List
metadata:
  resourceVersion: ""

3. Wait for control plane components to roll out successfully

Actual results:

The cluster fails with network, etcd, Kube API, and ingress failures

Expected results:

The cluster should convert to IPv6 single-stack without any issues

Additional info:

Must-gathers are not working due to various control plane issues restricting them


Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/56

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When installing a cluster, if the CPMS is created with a template without a path, the ControlPlaneMachineSet operator rejects any modification to, or deletion of, the CPMS CR.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

1. Install a cluster with a generated failure domain (FD)
2. Once the cluster is installed, attempt to delete the CPMS
    

Actual results:

Deletion of CPMS is rejected due to invalid template definition

Expected results:

Deletion of CPMS completes without error.

Additional info:

The job "pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-vsphere-operator" is currently failing with:

~~~
control plane machine set should be able to be updated
Expected success, but got an error:
    <*errors.StatusError | 0xc000233c20>: 
    admission webhook "controlplanemachineset.machine.openshift.io" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: "ci-op-7xjyyytp-91aad-zrdm2-rhcos-generated-region-generated-zone": template must be provided as the full path
    {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.template: Invalid value: \"ci-op-7xjyyytp-91aad-zrdm2-rhcos-generated-region-generated-zone\": template must be provided as the full path",
            Reason: "Forbidden",
            Details: nil,
            Code: 403,
        },
    } failed [FAILED] Timed out after 60.000s.
~~~

Description of problem:

    ResourceYAMLEditor doesn't have a readOnly prop, which would allow hiding the Save button in the YAML editor so that users cannot edit the resource.

https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#resourceyamleditor

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/53

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].

We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping:

1 events happened too frequently
event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}


I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time.


[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97

[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368

[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144

Version-Release number of selected component (if applicable):

> 4.15

How reproducible:

Always, by running the test

Steps to Reproduce:

Run the test:

[sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial]     

and observe the event invariant failing on its crash looping

Actual results:

catalogd-controller-manager crash loops and causes our CI jobs to fail

Expected results:

our e2e job is green again and catalogd-controller-manager doesn't crash loop       

Additional info:

 

Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/88

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/227

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Observed in 

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-serial-ovn-ipv6/1786198211774386176

 

There was a delay in creating master-0.

Control plane services started on master-2; at this point (as master-0 wasn't yet in a provisioned state) we had two sets of provisioning services provisioning master-0, presumably stomping over each other.

master-0 never came up.

 

Continued work on the move to structured intervals requires us to replace all uses of the legacy format in origin so we can reclaim the "locator" (and "message") properties for the new structured interval, and stop duplicating a lot of text when we store, upload, and process these.

Description of problem:

The sdn image inherits from the cli image to get the oc binary. Change this to install the openshift-clients rpm instead.
 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
  • Don't presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with "sbr-untriaged"
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/155

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing

doesn't give much detail or suggest next-steps. Expanding it to include at least a more detailed error message would make it easier for the admin to figure out how to resolve the issue.

Version-Release number of selected component (if applicable):

It's in the dev branch, and probably dates back to whenever the canary system was added.

How reproducible:

100%

Steps to Reproduce:

1. Break ingress. FIXME: Maybe by deleting the cloud load balancer, or dropping a firewall in the way, or something.
2. See the canary pods start failing.
3. Ingress operator sets CanaryChecksRepetitiveFailures with a message.

Actual results:

Canary route checks for the default ingress controller are failing

Expected results:

Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}

Additional info:

Plumbing the error message through might be as straightforward as passing probeRouteEndpoint's err through to setCanaryFailingStatusCondition for formatting. Or maybe it's more complicated than that?
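
A minimal Go sketch of that plumbing, using stub types and assumed signatures for probeRouteEndpoint and setCanaryFailingStatusCondition (the real ingress-operator code may differ):

```
package main

import (
	"errors"
	"fmt"
)

// Stubs standing in for the real ingress-operator pieces; the names follow the
// functions mentioned above, but the signatures are assumptions.
type reconciler struct{ condition string }

func probeRouteEndpoint(route string) error {
	return errors.New("canary endpoint probe failed: connection timed out") // stub error
}

func (r *reconciler) setCanaryFailingStatusCondition(msg string) { r.condition = msg }

// checkCanaryRoute shows the plumbing: the probe error is formatted into the
// condition message instead of being dropped.
func (r *reconciler) checkCanaryRoute(route string) {
	if err := probeRouteEndpoint(route); err != nil {
		r.setCanaryFailingStatusCondition(fmt.Sprintf(
			"Canary route checks for the default ingress controller are failing: %v", err))
	}
}

func main() {
	r := &reconciler{}
	r.checkCanaryRoute("canary-route")
	fmt.Println(r.condition)
}
```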

Description of problem:

When a namespace has a Resource Quota applied to it, the workload graphs in the Observe view do not render properly.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a new project/namespace
2. Apply the following Resource Quota (just a sample) to it
```
kind: ResourceQuota
apiVersion: v1
metadata:
  name: staging-workshop-quota
spec:
  hard:
    limits.cpu: '3'
    limits.memory: 3Gi
    pods: '10'
```
3. From the Developer Console, access the Observe view
4. From the Dashboard list, select the `Kubernetes / Compute Resources / Namespaces (Workloads)` option

Actual results:

The Graph is not rendered (see attached screenshot)

Expected results:

The Graph should render even with no data point

Additional info:

When you have a Resource Quota applied to the namespace, you can try this query to see the `NaN` value returned.

```
curl -G -s -k -H "Authorization: Bearer $(oc whoami -t)" 'https://thanos-querier-openshift-monitoring.apps.cluster-your cluster domain here/api/v1/query' --data-urlencode 'query=scalar(kube_resourcequota{cluster="", namespace="user9-staging", type="hard",resource="requests.memory"})'
```

Sample response:
```
{"status":"success","data":{"resultType":"scalar","result":[1682600794.396,"NaN"]}}
```

Description of problem:

On an ARO cluster on the 4.14.5 fast channel, after the upgrade, when the customer tried to add a new node the MachineConfig was not applied and the node never joined the pool. This happens for every node and can only be remediated by SRE, not by the customer.
    

Version-Release number of selected component (if applicable):

4.14.5-candidate
    

How reproducible:

Every time a node is added to the cluster at this version.
    

Steps to Reproduce:

    1. Install an ARO cluster
    2. Upgrade it to 4.14 along fast channel
    3. Add a node
    

Actual results:

~~~
 message: >-
        could not Create/Update MachineConfig: Operation cannot be fulfilled on
        machineconfigs.machineconfiguration.openshift.io
        "99-worker-generated-kubelet": the object has been modified; please
        apply your changes to the latest version and try again
      status: 'False'
      type: Failure
    - lastTransitionTime: '2023-11-29T17:44:37Z'
~~~
    

Expected results:

Node is created and configured correctly. 
    

Additional info:

 MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 15 on node: "aro-cluster-REDACTED-master-0" didn't show up, waited: 4m45s
    

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/328

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A user may provide a DNS domain that resides outside GCP. Once custom DNS is enabled, the installer should skip the DNS zone validation:

level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": failed to get GCP public zone: no matching public DNS Zone found"

    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-02-03-192446
4.16.0-0.nightly-2024-02-03-221256

    

How reproducible:


Always
    

Steps to Reproduce:

1. Enable custom DNS on gcp: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
2. Configure a baseDomain which does not exist on GCP.

    

Actual results:

See description.
    

Expected results:

The installer should skip the validation, as the custom domain may not exist on GCP
    

Additional info:
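A minimal Go sketch of the expected behavior, with hypothetical names for the config field and the zone lookup (the installer's actual types and validation flow differ):

```
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified view of the installer config for illustration.
type gcpPlatform struct {
	UserProvisionedDNS string // "Enabled" or "Disabled"
}

// lookupPublicZone stands in for the real GCP public-zone lookup.
func lookupPublicZone(baseDomain string) (string, error) {
	return "", errors.New("no matching public DNS Zone found")
}

// validatePublicZone skips the lookup entirely when custom DNS is enabled,
// since the base domain may legitimately live outside GCP.
func validatePublicZone(p gcpPlatform, baseDomain string) error {
	if p.UserProvisionedDNS == "Enabled" {
		return nil // the user manages DNS themselves; nothing to validate in GCP
	}
	_, err := lookupPublicZone(baseDomain)
	return err
}

func main() {
	fmt.Println(validatePublicZone(gcpPlatform{UserProvisionedDNS: "Enabled"}, "example.outside-gcp.com"))
	// Output: <nil>
}
```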

    

Description of problem:

[Multi-NIC] Egress traffic connections time out after removing the label from another pod in the same namespace

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-08-024357


How reproducible:

Always

Steps to Reproduce:

1. Label one node as egress node
2. Create an EgressIP object; the egress IP was assigned to the egress node's secondary interface
# oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip-66293"},"spec":{"egressIPs":["172.22.0.190"],"namespaceSelector":{"matchLabels":{"org":"qe"}},"podSelector":{"matchLabels":{"color":"pink"}}}}
    creationTimestamp: "2023-10-08T07:28:04Z"
    generation: 2
    name: egressip-66293
    resourceVersion: "461590"
    uid: f1ca3483-63f1-4f31-99b0-e6a55161c285
  spec:
    egressIPs:
    - 172.22.0.190
    namespaceSelector:
      matchLabels:
        org: qe
    podSelector:
      matchLabels:
        color: pink
  status:
    items:
    - egressIP: 172.22.0.190
      node: worker-0
kind: List
metadata:
  resourceVersion: ""
3. Create a namespace and two pods under it.
% oc get pods -n hrw -o wide
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
hello-pod    1/1     Running   0          6m46s   10.129.2.7    worker-1   <none>           <none>
hello-pod1   1/1     Running   0          6s      10.131.0.14   worker-0   <none>           <none>

4. Add label org=qe to namespace hrw
# oc get ns hrw --show-labels
NAME   STATUS   AGE   LABELS
hrw    Active   21m   kubernetes.io/metadata.name=hrw,org=qe,pod-security.kubernetes.io/audit-version=v1.24,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=v1.24,pod-security.kubernetes.io/warn=restricted

5. At this point, accessing the external endpoint from both pods succeeds.
% oc rsh -n hrw hello-pod 
~ $ curl 172.22.0.1 --connect-timeout 5
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
~ $ exit
 % oc rsh -n hrw hello-pod1 
~ $ curl 172.22.0.1 --connect-timeout 5
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>

6. Add label color=pink to both pods
 % oc label pod hello-pod color=pink -n hrw
pod/hello-pod labeled
 % oc label pod hello-pod1 color=pink -n hrw
pod/hello-pod1 labeled

7. Both pods can access external endpoint.
8. Remove label color=pink from pod hello-pod
% oc label pod hello-pod color- -n hrw     
pod/hello-pod unlabeled



Actual results:


Accessing the external endpoint from the pod that keeps the label gets a connection timeout:
 % oc rsh -n hrw hello-pod1            
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms
~ $ 
~ $ 
~ $ curl 172.22.0.1 --connect-timeout 5
curl: (28) Connection timeout after 5000 ms

Note the label was removed from hello-pod, but the access attempt above is from the other pod, hello-pod1, which should still use the egress IP and be able to reach the external endpoint.

Expected results:

hello-pod1 should still be able to access the external endpoint.

Additional info:


Seeing CI jobs with 

> level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).

 

search shows 65 hits in the last 7 days

https://search.dptools.openshift.org/?search=Got+0+worker+nodes%2C+3+master+nodes&maxAge=168h&context=-1&type=build-log&name=metal-ipi&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none

Please review the following PR: https://github.com/openshift/sdn/pull/600

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

DRA plugins can be installed, but do not really work because the required scheduler plugin DynamicResources isn't enabled.

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

Always

Steps to Reproduce:

1. Install an OpenShift cluster and enable the TechPreviewNoUpgrade feature set either during installation or post-install. The feature set includes the DynamicResourceAllocation feature gate.
2. Install a DRA plugin by any vendor, e.g. by NVIDIA (requires at least one GPU worker with NVIDIA GPU drivers installed on the node, and a few tweaks to allow the plugin to run on OpenShift).
3. Create a resource claim.
4. Create a pod that consumes the resource claim.

Actual results:

The pod remains in ContainerCreating state, the claim in WaitingForFirstConsumer state forever, without any meaningful event or error message.

Expected results:

A resource is allocated according to the resource claim, and assigned to the pod.

Additional info:

The problem is caused by the DynamicResources scheduler plugin not being automatically enabled when the feature flag is turned on. This lets DRA plugins run without errors (the right APIs are available), but they do nothing.
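
A minimal Go sketch of the missing wiring, using hypothetical types: when the DynamicResourceAllocation feature gate is on, the scheduler profile should also have the DynamicResources plugin enabled.

```
package main

import "fmt"

// Hypothetical, simplified scheduler profile for illustration; the real
// kube-scheduler configuration types are more involved.
type schedulerProfile struct {
	EnabledPlugins []string
}

// applyFeatureGates enables the scheduler plugins that a feature gate implies.
// This is the step the bug says is missing: the gate is on, but the plugin
// never gets enabled.
func applyFeatureGates(profile *schedulerProfile, featureGates map[string]bool) {
	if featureGates["DynamicResourceAllocation"] {
		profile.EnabledPlugins = append(profile.EnabledPlugins, "DynamicResources")
	}
}

func main() {
	profile := &schedulerProfile{EnabledPlugins: []string{"NodeResourcesFit"}}
	applyFeatureGates(profile, map[string]bool{"DynamicResourceAllocation": true})
	fmt.Println(profile.EnabledPlugins) // [NodeResourcesFit DynamicResources]
}
```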

Description of problem:

The problem is that the namespace handler, on initial sync, would delete all ports (because the logical port cache it reads LSP UUIDs from wasn't populated yet) and all ACLs (they were simply set to nil). Even though both ports and ACLs are re-added by the corresponding handlers, this may cause disruption.
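
A minimal Go sketch of the kind of guard that would avoid the disruptive update, using hypothetical types (the real ovn-kubernetes sync code is more involved): skip rewriting the port group while the logical-port cache is still empty.

```
package main

import "fmt"

// Hypothetical, simplified port-group sync for illustration only.
type portGroup struct {
	Ports []string
	ACLs  []string
}

type syncer struct {
	logicalPortCache map[string]string // pod -> LSP UUID
}

// syncNamespacePortGroup only rewrites the port group once the cache that
// supplies LSP UUIDs has been populated; otherwise it would momentarily
// empty ports and ACLs, as described above.
func (s *syncer) syncNamespacePortGroup(pg *portGroup, pods []string) {
	if len(s.logicalPortCache) == 0 {
		return // initial sync: cache not populated yet, leave the port group alone
	}
	ports := make([]string, 0, len(pods))
	for _, pod := range pods {
		if uuid, ok := s.logicalPortCache[pod]; ok {
			ports = append(ports, uuid)
		}
	}
	pg.Ports = ports
}

func main() {
	pg := &portGroup{Ports: []string{"55b700e4"}, ACLs: []string{"ab2be619"}}
	s := &syncer{logicalPortCache: map[string]string{}}
	s.syncNamespacePortGroup(pg, []string{"pod-a"})
	fmt.Println(pg.Ports) // [55b700e4] - untouched because the cache was empty
}
```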

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. create a namespace with at least 1 pod and egress firewall in it

2. pick any ovnkube-node pod, find namespace port group UUID in nbdb by external_ids["name"]=<namespace name>, e.g. for "test" namespace

_uuid               : 6142932d-4084-4bc3-bdcb-1990fc71891b
acls                : [ab2be619-1266-41c2-bb1d-1052cb4e1e97, b90a4b4a-ceee-41ee-a801-08c37a9bf3e7, d314fa8d-7b5a-40a5-b3d4-31091d7b9eae]
external_ids        : {name=test}
name                : a18007334074686647077
ports               : [55b700e4-8176-42e7-97a6-8b32a82fefe5, cb71739c-ad6c-4436-8fd6-0643a5417c7d, d8644bf1-6bed-4db7-abf8-7aaab0625324] 

3. restart chosen ovn-k pod

4. Check the logs on restart for an update that sets the chosen port group to zero ports and zero ACLs:

Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:test]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6142932d-4084-4bc3-bdcb-1990fc71891b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUID: UUIDName:}] 

Actual results:

 

Expected results:

On restart the port group stays the same; no extra update with empty ports and ACLs is generated.

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

Description of problem:

Cluster install fails on IBM Cloud; nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

Version-Release number of selected component (if applicable):

from 4.16.0-0.nightly-2023-12-22-210021

last PASS version: 4.16.0-0.nightly-2023-12-20-061023

How reproducible:

Always 

Steps to Reproduce:

    1. Install a cluster on IBM Cloud; we use the auto Flexy template: aos-4_16/ipi-on-ibmcloud/versioned-installer

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          92m     Unable to apply 4.16.0-0.nightly-2023-12-25-200355: an unknown error has occurred: MultipleErrors
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                                                                                                               
baremetal                                                                                                                    
cloud-controller-manager                   4.16.0-0.nightly-2023-12-25-200355   True        False         False      89m     
cloud-credential                                                                                                             
cluster-autoscaler                                                                                                           
config-operator                                                                                                              
console                                                                                                                      
control-plane-machine-set                                                                                                    
csi-snapshot-controller                                                                                                      
dns                                                                                                                          
etcd                                                                                                                         
image-registry                                                                                                               
ingress                                                                                                                      
insights                                                                                                                     
kube-apiserver                                                                                                               
kube-controller-manager                                                                                                      
kube-scheduler                                                                                                               
kube-storage-version-migrator                                                                                                
machine-api                                                                                                                  
machine-approver                                                                                                             
machine-config                                                                                                               
marketplace                                                                                                                  
monitoring                                                                                                                   
network                                                                                                                      
node-tuning                                                                                                                  
openshift-apiserver                                                                                                          
openshift-controller-manager                                                                                                 
openshift-samples                                                                                                            
operator-lifecycle-manager                                                                                                   
operator-lifecycle-manager-catalog                                                                                           
operator-lifecycle-manager-packageserver                                                                                     
service-ca                                                                                                                   
storage                                                                                                                       
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                        STATUS     ROLES                  AGE   VERSION
huliu-ibma-qbg48-master-0   NotReady   control-plane,master   89m   v1.29.0+b0d609f
huliu-ibma-qbg48-master-1   NotReady   control-plane,master   89m   v1.29.0+b0d609f
huliu-ibma-qbg48-master-2   NotReady   control-plane,master   89m   v1.29.0+b0d609f
liuhuali@Lius-MacBook-Pro huali-test % oc describe node huliu-ibma-qbg48-master-0
Name:               huliu-ibma-qbg48-master-0
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huliu-ibma-qbg48-master-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node.openshift.io/os_id=rhcos
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 27 Dec 2023 18:02:21 +0800
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  huliu-ibma-qbg48-master-0
  AcquireTime:     <unset>
  RenewTime:       Wed, 27 Dec 2023 19:32:24 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Addresses:
Capacity:
  cpu:                4
  ephemeral-storage:  104266732Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16391716Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  95018478229
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15240740Ki
  pods:               250
System Info:
  Machine ID:                 0ae21a012be844f18c5871f6eaefb85b
  System UUID:                0ae21a01-2be8-44f1-8c58-71f6eaefb85b
  Boot ID:                    fbe619e2-8ff5-4cdb-b6a4-cd6830ccc568
  Kernel Version:             5.14.0-284.45.1.el9_2.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 416.92.202312250319-0 (Plow)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.28.2-9.rhaos4.15.git6d902a3.el9
  Kubelet Version:            v1.29.0+b0d609f
  Kube-Proxy Version:         v1.29.0+b0d609f
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  NodeHasNoDiskPressure    90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientPID
  Normal  NodeHasSufficientMemory  90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientMemory
  Normal  RegisteredNode           90m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           73m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           53m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           32m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
  Normal  RegisteredNode           12m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller 
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS         AGE
ibm-cloud-controller-manager-787645668b-djqnr   0/1     CrashLoopBackOff   22 (2m29s ago)   90m
ibm-cloud-controller-manager-787645668b-pgkh2   0/1     Error              15 (5m8s ago)    52m
liuhuali@Lius-MacBook-Pro huali-test % oc describe pod ibm-cloud-controller-manager-787645668b-pgkh2 -n openshift-cloud-controller-manager
Name:                 ibm-cloud-controller-manager-787645668b-pgkh2
Namespace:            openshift-cloud-controller-manager
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 huliu-ibma-qbg48-master-2/
Start Time:           Wed, 27 Dec 2023 18:41:23 +0800
Labels:               infrastructure.openshift.io/cloud-controller-manager=IBMCloud
                      k8s-app=ibm-cloud-controller-manager
                      pod-template-hash=787645668b
Annotations:          operator.openshift.io/config-hash: 82a75c6ff86a490b0dac9c8c9b91f1987da0e646a42d72c33c54cbde3c29395b
Status:               Running
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/ibm-cloud-controller-manager-787645668b
Containers:
  cloud-controller-manager:
    Container ID:  cri-o://c56e246f64c770146c30b7a894f6a4d974159551dbb9d1ea31c238e516a0f854
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218
    Image ID:      e494d0d4b28e31170a4a2792bb90701c7f1e81c78c03e3686c5f0e601801937e
    Port:          10258/TCP
    Host Port:     10258/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      fi
      exec /bin/ibm-cloud-controller-manager \
      --bind-address=$(POD_IP_ADDRESS) \
      --use-service-account-credentials=true \
      --configure-cloud-routes=false \
      --cloud-provider=ibm \
      --cloud-config=/etc/ibm/cloud.conf \
      --profiling=false \
      --leader-elect=true \
      --leader-elect-lease-duration=137s \
      --leader-elect-renew-deadline=107s \
      --leader-elect-retry-period=26s \
      --leader-elect-resource-namespace=openshift-cloud-controller-manager \
      --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \
      --v=2
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 27 Dec 2023 19:33:23 +0800
      Finished:     Wed, 27 Dec 2023 19:33:23 +0800
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:     75m
      memory:  60Mi
    Liveness:  http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3
    Environment:
      POD_IP_ADDRESS:           (v1:status.podIP)
      VPCCTL_CLOUD_CONFIG:     /etc/ibm/cloud.conf
      VPCCTL_PUBLIC_ENDPOINT:  false
    Mounts:
      /etc/ibm from cloud-conf (rw)
      /etc/kubernetes from host-etc-kube (ro)
      /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
      /etc/vpc from ibm-cloud-credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cbd4b (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ccm-trusted-ca
    Optional:  false
  host-etc-kube:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:  Directory
  cloud-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cloud-conf
    Optional:  false
  ibm-cloud-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ibm-cloud-credentials
    Optional:    false
  kube-api-access-cbd4b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  52m                    default-scheduler  Successfully assigned openshift-cloud-controller-manager/ibm-cloud-controller-manager-787645668b-pgkh2 to huliu-ibma-qbg48-master-2
  Normal   Pulling    52m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218"
  Normal   Pulled     52m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" in 3.431s (3.431s including waiting)
  Normal   Created    50m (x5 over 52m)      kubelet            Created container cloud-controller-manager
  Normal   Started    50m (x5 over 52m)      kubelet            Started container cloud-controller-manager
  Normal   Pulled     50m (x4 over 52m)      kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" already present on machine
  Warning  BackOff    2m19s (x240 over 52m)  kubelet            Back-off restarting failed container cloud-controller-manager in pod ibm-cloud-controller-manager-787645668b-pgkh2_openshift-cloud-controller-manager(d7f93ecf-cd14-450e-a986-028559a775b3)
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

    cluster install failed on IBMCloud

Expected results:

    cluster install succeeds on IBMCloud

Additional info:

    

Description of problem:

Currently the console frontend and backend use the OpenShift-centric UserKind type. In order for the console to work without the OAuth server, i.e. with an external OIDC provider, it needs to use the Kubernetes UserInfo type, which is retrieved by querying the SelfSubjectReview API.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Console is not working with external OIDC provider

Expected results:

Console will be working with external OIDC provider  

Additional info:

This is mainly an API change.
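As an illustration of the SelfSubjectReview flow mentioned above, here is a minimal sketch using client-go, assuming a recent Kubernetes client that serves authentication.k8s.io/v1. This is not the console's actual code, just the shape of the call that replaces the OpenShift User lookup.

```
package main

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Build a client from the caller's credentials (in the console's case, the
	// user's bearer token issued by the external OIDC provider).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Creating an empty SelfSubjectReview returns the identity of the caller.
	review, err := client.AuthenticationV1().SelfSubjectReviews().Create(
		context.TODO(), &authenticationv1.SelfSubjectReview{}, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// status.userInfo carries the generic Kubernetes identity (username, groups,
	// extra) that replaces the OpenShift-specific User object.
	fmt.Println(review.Status.UserInfo.Username, review.Status.UserInfo.Groups)
}
```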

Upon debugging, nodes are stuck in NotReady state and CNI is not initialised on them.

Seeing the following error log in cluster network operator 

failed parsing certificate data from ConfigMap "openshift-service-ca.crt": failed to parse certificate PEM

CNO operator logs: https://docs.google.com/document/d/1hor1r9ue4gnetkXm9mh8AKa7vm8zNBPhUQqWCbbnnUc/edit?usp=sharing

This is happening on a management cluster that is configured to use legacy service CAs:

$ oc get kubecontrollermanager/cluster -o yaml --as system:admin
apiVersion: operator.openshift.io/v1
kind: KubeControllerManager
metadata:
  name: cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
  unsupportedConfigOverrides: null
  useMoreSecureServiceCA: false 

In newer clusters, useMoreSecureServiceCA is set to true.
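For context on the "failed to parse certificate PEM" error above, here is a minimal sketch (not the CNO's actual code) of loading a CA bundle from a ConfigMap value: the error surfaces when no CERTIFICATE PEM block can be decoded from the data.

```
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
)

// parseCABundle collects every CERTIFICATE block from the bundle and fails
// only when none can be decoded, which is the failure mode reported above.
func parseCABundle(data string) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	rest := []byte(data)
	found := 0
	for {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break
		}
		if block.Type != "CERTIFICATE" {
			continue // tolerate unrelated PEM blocks in the bundle
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, fmt.Errorf("failed to parse certificate PEM: %w", err)
		}
		pool.AddCert(cert)
		found++
	}
	if found == 0 {
		return nil, fmt.Errorf("failed to parse certificate data: no CERTIFICATE PEM blocks found")
	}
	return pool, nil
}

func main() {
	_, err := parseCABundle("not a pem bundle")
	fmt.Println(err)
}
```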

Description of problem:

    A 4.15 control plane can't create a 4.14 node pool due to an issue with the payload

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a Hosted Cluster in 4.15
    2. Create a Node Pool in 4.14
    3. Node pool stuck in provisioning
    

Actual results:

    No node pool is created

Expected results:

    Node pool is created, as we support N-2 versions there

Additional info:

Possibly linked to OCPBUGS-26757    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Original issue reported here: https://issues.redhat.com/browse/ACM-6189 reported by QE and customer.

Using ACM/Hive, customers can deploy OpenShift on vSphere. In the upcoming release of ACM 2.9, we support customers on OCP 4.12 - 4.15. The ACM UI updates the install config as users add configuration details.

This has worked for several releases over the last few years. However in OCP 4.13+ the format has changed and there is now additional validation to check if the datastore is a full path.

As per https://issues.redhat.com/browse/SPLAT-1093, removal of the legacy fields should not happen until later, so any legacy configurations such as relative paths should still work.

 

Version-Release number of selected component (if applicable):

ACM 2.9.0-DOWNSTREAM-2023-10-24-01-06-09
OpenShift 4.14.0-rc.7
OpenShift 4.13.18
OpenShift 4.12.39

How reproducible:

Always

Steps to Reproduce:

1. Deploy OCP 4.12 on vSphere using the legacy field and a relative path without a folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS).
2. Installer passes.
3. Deploy OCP 4.12 on vSphere using the legacy field and a relative path WITH a folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS).
4. Installer fails.
5. Deploy OCP 4.12 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS).
6. Installer fails.

7. Deploy OCP 4.13 on vSphere using the legacy field and a relative path without a folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS).
8. Installer fails.
9. Deploy OCP 4.13 on vSphere using the legacy field and a relative path WITH a folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS).
10. Installer passes.
11. Deploy OCP 4.13 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS).
12. Installer fails.

Actual results:

Default Datastore Value                                          OCP 4.12  OCP 4.13  OCP 4.14
/Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS    No        Yes       Yes
WORKLOAD-DS-Folder/WORKLOAD-DS                                   No        Yes       Yes
WORKLOAD-DS                                                      Yes       No        No

For OCP 4.12.z managed cluster deployments, the name-only path is the only one that works as expected.
For OCP 4.13.z+ managed cluster deployments, only the full path and the relative path with a folder work as expected.

Expected results:

OCP 4.13.z+ accepts a relative path without specifying the folder, like OCP 4.12.z does.

Additional info:
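The sketch below is purely illustrative (not the installer's actual validation code): one way to accept both a bare datastore name and a relative folder path is to normalize the value to a full inventory path before validating it. The datacenter name used here is a placeholder.

```
package main

import (
	"fmt"
	"path"
	"strings"
)

// normalizeDatastorePath expands values like "WORKLOAD-DS" or
// "WORKLOAD-DS-Folder/WORKLOAD-DS" into "/<datacenter>/datastore/<value>",
// while leaving already-absolute inventory paths untouched.
func normalizeDatastorePath(datacenter, value string) string {
	if strings.HasPrefix(value, "/") {
		return value
	}
	return path.Join("/", datacenter, "datastore", value)
}

func main() {
	fmt.Println(normalizeDatastorePath("Workload Datacenter", "WORKLOAD-DS"))
	// Output: /Workload Datacenter/datastore/WORKLOAD-DS
}
```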

 

 

Description of problem:

Release controller > 4.14.2 > HyperShift conformance run > gathered assets:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
     65 hosted-cluster-config-operator-manager
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
2023-11-09T17:17:15.130454Z 409 AlreadyExists
2023-11-09T17:17:15.163256Z 409 AlreadyExists
2023-11-09T17:17:15.198908Z 409 AlreadyExists
2023-11-09T17:17:15.230532Z 409 AlreadyExists
2023-11-09T17:17:22.899579Z 409 AlreadyExists

That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very quick latency in re-creation, perhaps an informing watch? If the controller can handle some re-creation latency, perhaps a quieter poll?

Version-Release number of selected component (if applicable):

4.14.2. I haven't checked other releases.

How reproducible:

Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.

Steps to Reproduce:

1. Dump a hosted cluster.
2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.

Actual results:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
    130 create 409

Expected results:

Zero or rare 409 creation requests from this user-agent.

Additional info:

The user agent seems to be defined here, so likely the fix will involve changes to that manager.
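A minimal sketch of the kind of create-if-missing pattern that avoids the repeated 409s, using a ConfigMap only for brevity (this is not the hosted-cluster-config-operator's actual code, and the real resource here is operator.openshift.io/v1 Storage):

```
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// ensureConfigMap only issues a Create when the object is missing and treats
// AlreadyExists as success, so a hot reconcile loop stops producing 409s.
func ensureConfigMap(ctx context.Context, client kubernetes.Interface, cm *corev1.ConfigMap) error {
	if _, err := client.CoreV1().ConfigMaps(cm.Namespace).Get(ctx, cm.Name, metav1.GetOptions{}); err == nil {
		return nil // already present, nothing to do
	} else if !apierrors.IsNotFound(err) {
		return err
	}
	_, err := client.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil // lost a race with another creator; still fine
	}
	return err
}

func main() {
	client := fake.NewSimpleClientset()
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "example"}}
	fmt.Println(ensureConfigMap(context.TODO(), client, cm)) // <nil>
	fmt.Println(ensureConfigMap(context.TODO(), client, cm)) // <nil>, no second Create, no 409
}
```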

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/359

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

For certain operators, warning info is shown on the operator detail modal and on the installation page when the cluster is an Azure WI/FI cluster. The warning titles are not consistent across these two pages.
    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-21-155123
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Prepare a special operator which shows warning info on an Azure WI/FI cluster.
    2. Log in to the console of the Azure WI/FI cluster and check the warning info title on the operator detail item modal and on the installation page.
    3.
    

Actual results:

2. On the operator detail item modal, the warning title is "Cluster in Azure Workload Identity / Federated Identity Mode", while on the installation page the warning title is "Cluster in Workload Identity / Federated Identity Mode". The word "Azure" is missing from the second page.
    

Expected results:

2. The warning title should be consistent on both pages.
    

Additional info:

screenshot: https://drive.google.com/drive/folders/1alFBEtO1gN4q5_mAtHCNzuLTOe5zXp0K?usp=drive_link
    

Tired of scrolling through alerts and pod states that are seldom useful to get to things that we need every day.

Description of problem:

The PodStartupStorageOperationsFailing alert is not raised when there are no (zero) successful mount/attach operations on a node

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-10-25-185510

How reproducible:

Always

Steps to Reproduce:

1. Install any platform cluster.
2. Create sc, pvc, dep.
3. Check that the deployment pod is stuck in ContainerCreating state and check for the alert

Actual results:

The alert is not raised when there are 0 successful mount/attach operations

Expected results:

The alert should be raised when there are no successful mount/attach operations

Additional info:

Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1697793500890839 

When we use the same alerting expression from 4.12, we can observe the alert on the OCP web console page.

Description of problem:

Port 22 is added to the worker node security group in TF install [1]:

resource "aws_security_group_rule" "worker_ingress_ssh" {
  type          	= "ingress"
  security_group_id = aws_security_group.worker.id
  description   	= local.description

  protocol	= "tcp"
  cidr_blocks = var.cidr_blocks
  from_port   = 22
  to_port 	= 22
}

But it's missing in SDK install [2]


[1] https://github.com/openshift/installer/blob/master/data/data/aws/cluster/vpc/sg-worker.tf#L39-L48
[2] https://github.com/openshift/installer/pull/7676/files#diff-c89a0152f7d51be6e3830081d1c166d9333628982773c154d8fc9a071c8ff765R272
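For reference, a minimal sketch of the equivalent rule using the AWS SDK for Go v2 (this is not the installer's actual code; the security group ID and CIDR values are placeholders):

```
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		panic(err)
	}
	client := ec2.NewFromConfig(cfg)

	workerSGID := "sg-0123456789abcdef0" // placeholder for the worker security group
	machineCIDR := "10.0.0.0/16"         // placeholder for the machine network CIDR

	// Authorize the same worker SSH ingress rule that the Terraform resource creates.
	_, err = client.AuthorizeSecurityGroupIngress(context.TODO(), &ec2.AuthorizeSecurityGroupIngressInput{
		GroupId: aws.String(workerSGID),
		IpPermissions: []ec2types.IpPermission{{
			IpProtocol: aws.String("tcp"),
			FromPort:   aws.Int32(22),
			ToPort:     aws.Int32(22),
			IpRanges:   []ec2types.IpRange{{CidrIp: aws.String(machineCIDR), Description: aws.String("worker SSH")}},
		}},
	})
	if err != nil {
		panic(err)
	}
}
```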


    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-03-31-180021
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a cluster using SDK installation method
    2.
    3.
    

Actual results:

See description.
    

Expected results:

Port 22 is added to the worker node's security group.
    

Additional info:

    

Description of problem:

The PipelineRun list view contains a Task status column, which shows the overall task status of the PipelineRun. In order to render this column we fetch all the TaskRuns of that PipelineRun, so every PipelineRun row needs all of its related TaskRun information, which causes a performance issue in the PipelineRun list view.

The customer is facing UI slowness and rendering problems for a large number of PipelineRuns, with and without results enabled. In both cases significant slowness is observed, which is hampering their daily operations.

How reproducible:

Always

Steps to Reproduce:

1. Create a few PipelineRuns
2. Navigate to the PipelineRun list view

Actual results:

All the TaskRuns are fetched, and the PipelineRun list view renders this column asynchronously with a loading indicator.

 

Expected results:

TaskRuns should not be fetched at all; instead the UI needs to parse the PipelineRun status message string to render this column.

Additional info:

Pipelinerun status message gets updated on every task completion.

pipelinerun.status.conditions:

  • lastTransitionTime: '2023-11-15T07:51:42Z'
    message: 'Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 0'
    reason: Succeeded
    status: 'True'
    type: Succeeded

We can parse the above information to derive the following object and use it for rendering the column; this will significantly improve the performance of this page.

{
 completed: 3, // 3 (total count) - 0 (failed count) - 0 (cancelled count),
 failed: 0,
 cancelled: 0,
 skipped: 0,
 pending: 0 
}

 

 

Slack thread for more details - thread

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/57

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

 [sig-cluster-lifecycle][Feature:Machines] Managed cluster should [sig-scheduling][Early] control plane machine set operator should not have any events [Suite:openshift/conformance/parallel]

Looks like this test is permafailing on 4.16 and 4.15 AWS UPI jobs - does this need to be skipped on UPI?

 

{  fail [github.com/openshift/origin/test/extended/machines/machines.go:191]: Unexpected error:
    <*errors.StatusError | 0xc0031b8f00>: 
    controlplanemachinesets.machine.openshift.io "cluster" not found
    {
        ErrStatus: 
            code: 404
            details:
              group: machine.openshift.io
              kind: controlplanemachinesets
              name: cluster
            message: controlplanemachinesets.machine.openshift.io "cluster" not found
            metadata: {}
            reason: NotFound
            status: Failure,
    }
occurred
Ginkgo exit error 1: exit with code 1} 

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-upi/1758372765364129792

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upi/1758308563689672704

Seeing this in hypershift e2e. I think it is racing with the Infrastructure status being populated and PlatformStatus being nil.

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn/1785458059246571520/artifacts/e2e-aws-ovn/run-e2e/artifacts/TestAutoscaling_Teardown/namespaces/e2e-clusters-rjhhw-example-g6tsn/core/pods/logs/cluster-image-registry-operator-5597f9f4d4-dfvc6-cluster-image-registry-operator-previous.log

I0501 00:13:11.951062       1 azurepathfixcontroller.go:324] Started AzurePathFixController
I0501 00:13:11.951056       1 base_controller.go:73] Caches are synced for LoggingSyncer 
I0501 00:13:11.951072       1 imageregistrycertificates.go:214] Started ImageRegistryCertificatesController
I0501 00:13:11.951077       1 base_controller.go:110] Starting #1 worker of LoggingSyncer controller ...
E0501 00:13:11.951369       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 534 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2d6bd00?, 0x57a60e0})
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x3bcb370?})
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2d6bd00?, 0x57a60e0?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).sync(0xc000003d40)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:171 +0x97
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).processNextWorkItem(0xc000003d40)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:154 +0x292
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).runWorker(...)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:133
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001186820?, {0x3bd1320, 0xc000cace40}, 0x1, 0xc000ca2540)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0011bac00?, 0x3b9aca00, 0x0, 0xd0?, 0x447f9c?)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0xc001385f68?, 0xc001385f78?)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x1e
created by github.com/openshift/cluster-image-registry-operator/pkg/operator.(*AzurePathFixController).Run in goroutine 248
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/azurepathfixcontroller.go:322 +0x1a6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2966e97]

https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/operator/azurepathfixcontroller.go#L171
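The sketch below is illustrative only (not the operator's actual code): it shows the kind of nil guard on the Infrastructure status that avoids the panic above when PlatformStatus has not been populated yet.

```
package main

import (
	"errors"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// azureStatus returns the Azure platform status only when the Infrastructure
// status is fully populated, instead of dereferencing a nil PlatformStatus.
func azureStatus(infra *configv1.Infrastructure) (*configv1.AzurePlatformStatus, error) {
	if infra == nil || infra.Status.PlatformStatus == nil {
		return nil, errors.New("infrastructure status not yet populated, requeueing")
	}
	if infra.Status.PlatformStatus.Type != configv1.AzurePlatformType || infra.Status.PlatformStatus.Azure == nil {
		return nil, errors.New("no Azure platform status available")
	}
	return infra.Status.PlatformStatus.Azure, nil
}

func main() {
	// An Infrastructure object with an empty status triggers the guard, not a panic.
	_, err := azureStatus(&configv1.Infrastructure{})
	fmt.Println(err)
}
```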

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/515

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    See https://issues.redhat.com/browse/OCPBUGS-26053


Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Create an ETP=Local LB Service on LGW for some v6 workload (assign IP to lb with MetalLB or manually)
    2. Set static routes to a node hosting a pod on the client
    3. Attempting to reach the IPv6 Service fails
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Looking at the telemetry data for Nutanix I noticed that the “host_type” for clusters installed with platform nutanix shows as “virt-unknown”. Do you know what needs to happen in the code to tell telemetry about host type being Nutanix? The problem is that we can’t track those installations with platform none, just IPI.

Refer to the slack thread https://redhat-internal.slack.com/archives/C0211848DBN/p1687864857228739.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

Create an OCP Nutanix cluster

Actual results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as “virt-unknown”.

Expected results:

The telemetry data for Nutanix shows the “host_type” for the nutanix cluster as "nutanix".

Additional info:

 

Description of problem:

 No IPsec on the cluster after deletion of the NS MachineConfigs while ipsecConfig mode is `Full`, on a cluster upgraded from 4.14 to a 4.15 build

Version-Release number of selected component (if applicable):

 bot build on https://github.com/openshift/cluster-network-operator/pull/2191

How reproducible:

    Always

Steps to Reproduce:

Steps:
1. Cluster with EW+NS IPsec (4.14), upgraded to the above bot build to check ipsecConfig modes
2. ipsecConfig mode changed to Full
3. Deleted the NS MCs
4. New MCs spawned as `80-ipsec-master-extensions` and `80-ipsec-worker-extensions`
5. Cluster settled with no IPsec at all (no ovn-ipsec-host DaemonSet)
6. Mode still Full

Actual results:

Mode Full effectively replicated the Disabled state after the above steps

Expected results:

Only NS IPsec should have gone away; EW IPsec should have persisted

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/150

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc-mirror picks the wrong index.json path and fails when the ImageSetConfig contains an OCI FBC catalog

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1)  Copy the operator as OCI format to localhost:
`skopeo copy docker://registry.redhat.io/redhat/redhat-operator-index:v4.12 oci:///app1/noo/redhat-operator-index  --remove-signatures`

2)  Use following imagesetconfigure for mirror:
cat config-oci.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  registry:
    imageURL: registryhost:5000/metadata:latest
mirror:
  additionalImages:
   - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6
  platform:
    channels:
    - name: stable-4.12
      type: ocp
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
       - name: elasticsearch-operator
    - catalog: oci:///app1/noo/redhat-operator-index
      packages:
        - name: cluster-kube-descheduler-operator
        - name: odf-operator

`oc-mirror --config config-oci.yaml file://outoci --v2`


Actual results: 

2) In the configuration we use oci:///app1/noo/redhat-operator-index, so oc-mirror should not look for index.json under outoci/working-dir/operator-images/redhat-operator-index/index.json

 oc-mirror --config config-oci.yaml file://outoci --v2
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 
2024/03/25 06:23:06  [INFO]   : mode mirrorToDisk
2024/03/25 06:23:06  [INFO]   : local storage registry will log to /app1/0321/outoci/working-dir/logs/registry.log
2024/03/25 06:23:06  [INFO]   : starting local storage on localhost:55000
2024/03/25 06:23:06  [INFO]   : detected minimum version as 4.12.53
2024/03/25 06:23:06  [INFO]   : detected minimum version as 4.12.53
2024/03/25 06:23:07  [INFO]   : Found update 4.12.53
2024/03/25 06:23:07  [INFO]   : signature b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523
2024/03/25 06:23:07  [WARN]   : signature for b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523 not in cache
2024/03/25 06:23:07  [INFO]   : content {"critical": {"image": {"docker-manifest-digest": "sha256:b584f5458fb946115b0cf0f1793dc9224c5e6a4567e74018f0590805a03eb523"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}}
2024/03/25 06:23:07  [INFO]   : image found : quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64
2024/03/25 06:23:07  [INFO]   : public Key : 567E347AD0044ADE55BA8A5F199E2F91FD431D51
2024/03/25 06:23:07  [INFO]   : copying  quay.io/openshift-release-dev/ocp-release:4.12.53-x86_64
2024/03/25 06:23:12  [INFO]   : copying  cincinnati response to outoci/working-dir/release-filters
2024/03/25 06:23:12  [INFO]   : creating graph data image
2024/03/25 06:23:15  [INFO]   : graph image created and pushed to cache.
2024/03/25 06:23:15  [INFO]   : total release images to copy 185
2024/03/25 06:23:15  [INFO]   : copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.12
2024/03/25 06:23:18  [INFO]   : manifest 7b9891532a76194c1b18698518abad9be4aca7f1152ac73f450aa8bfadef538f
2024/03/25 06:23:18  [INFO]   : label /configs
2024/03/25 06:23:36  [INFO]   : copying operator image oci:///app1/noo/redhat-operator-index
error closing log file registry.log: close outoci/working-dir/logs/registry.log: file already closed
2024/03/25 06:23:36  [ERROR]  : open outoci/working-dir/operator-images/redhat-operator-index/index.json: no such file or directory

Expected results:

2) oc-mirror finds the correct path for index.json and does not fail
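An illustrative sketch of the distinction the report asks for (this is not oc-mirror's code; the working-dir layout used here just mirrors the path from the error message): a catalog given as an oci:// reference already lives on the local filesystem, so its index.json should be read from that path rather than from a working-dir copy.

```
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// indexJSONPath picks where to look for an OCI layout's index.json.
// workingDirCopy is a placeholder for wherever a registry catalog gets copied.
func indexJSONPath(catalogRef, workingDirCopy string) string {
	if strings.HasPrefix(catalogRef, "oci://") {
		return filepath.Join(strings.TrimPrefix(catalogRef, "oci://"), "index.json")
	}
	return filepath.Join(workingDirCopy, "index.json")
}

func main() {
	fmt.Println(indexJSONPath("oci:///app1/noo/redhat-operator-index",
		"outoci/working-dir/operator-images/redhat-operator-index"))
	// Output: /app1/noo/redhat-operator-index/index.json
}
```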

Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/213

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CAPI manifests have the TechPreviewNoUpgrade annotation but are missing the CustomNoUpgrade annotation    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

  The image registry CO never finishes progressing on Azure Hosted Control Planes

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Create an Azure HCP
    2. Create a kubeconfig for the guest cluster
    3. Check the image-registry CO
    

Actual results:

    image-registry co's message is Progressing: The registry is ready...

Expected results:

    image-registry finishes progressing

Additional info:

    I let it go for about 34m

% oc get co | grep -i image
image-registry                             4.16.0-0.nightly-multi-2024-02-26-105325   True        True          False      34m     Progressing: The registry is ready...

% oc get co/image-registry -oyaml
...
  - lastTransitionTime: "2024-02-28T19:10:30Z"
    message: |-
      Progressing: The registry is ready
      NodeCADaemonProgressing: The daemon set node-ca is deployed
      AzurePathFixProgressing: The job does not exist
    reason: AzurePathFixNotFound::Ready
    status: "True"
    type: Progressing

 

Because of the pin in the packages list, the ART pipeline is rebuilding packages all the time. Unfortunately, we need to remove the strong pins and move back to relaxed ones.

Once that's done, we need to merge https://github.com/openshift-eng/ocp-build-data/pull/4097

Background:

CCO was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. CloudCredential was introduced as a new capability in openshift/api. We need to bump the openshift/api dependency in oc to include the CloudCredential capability so that oc adm release extract works correctly.

Description of problem:

Some relevant CredentialsRequests are not extracted by the following command: oc adm release extract --credentials-requests --included --install-config=install-config.yaml ...
where install-config.yaml looks like the following:
...
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - MachineAPI
  - CloudCredential
platform:
  aws:
...

Logs:

...
I1209 19:57:25.968783 79037 extract.go:418] Found manifest 0000_50_cloud-credential-operator_05-iam-ro-credentialsrequest.yaml
I1209 19:57:25.968902 79037 extract.go:429] Excluding Group: "cloudcredential.openshift.io" Kind: "CredentialsRequest" Namespace: "openshift-cloud-credential-operator" Name: "cloud-credential-operator-iam-ro": unrecognized capability names: CloudCredential
...
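An illustrative sketch of the inclusion check that the "unrecognized capability names" message points at (not oc's actual code): a manifest's required capabilities must all be known to the client, i.e. present in the vendored openshift/api capability list, before the enabled/disabled decision is even made. Bumping openshift/api adds CloudCredential to that known set.

```
package main

import (
	"fmt"
	"strings"
)

// shouldExtract reports whether a manifest annotated with "+"-separated
// capability names should be included, failing on names the client does not know.
func shouldExtract(requiredAnnotation string, known, enabled map[string]bool) (bool, error) {
	for _, c := range strings.Split(requiredAnnotation, "+") {
		if !known[c] {
			return false, fmt.Errorf("unrecognized capability names: %s", c)
		}
		if !enabled[c] {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	known := map[string]bool{"MachineAPI": true} // old client: no CloudCredential
	enabled := map[string]bool{"MachineAPI": true, "CloudCredential": true}
	_, err := shouldExtract("CloudCredential", known, enabled)
	fmt.Println(err) // unrecognized capability names: CloudCredential
}
```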

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/272

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

At 17:26:09, the cluster is happily upgrading nodes:

An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config

At 17:26:54, the upgrade starts to reboot master nodes and COs get noisy (this one specifically is OCPBUGS-20061)

An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available

~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes despite not indicating anything is wrong earlier:

An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver

This is only because these operators go briefly degraded during master reboot (which they shouldn't but that is a different story). CVO computes its 40 minutes against the time when it first started to upgrade the given operator so it:

1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
2. These two COs upgrade successfully and the upgrade proceeds
3. Eventually the cluster starts rebooting masters and etcd/KAS go degraded
4. CVO compares the current time against the noted time, discovers it's more than 40 minutes, and starts warning about it (see the toy sketch after this list).
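A toy illustration of the timing issue described above (this is not CVO code): measuring the 40-minute budget from the moment an operator was first upgraded means a later, unrelated blip such as a master reboot instantly trips the warning, while measuring from the last healthy report would not.

```
package main

import (
	"fmt"
	"time"
)

func main() {
	budget := 40 * time.Minute

	firstTouched := time.Now().Add(-55 * time.Minute) // etcd was upgraded ~55m ago
	lastHealthy := time.Now().Add(-2 * time.Minute)   // it only went degraded 2m ago

	fmt.Println("warn (measured from first touch):", time.Since(firstTouched) > budget) // true
	fmt.Println("warn (measured from last healthy):", time.Since(lastHealthy) > budget) // false
}
```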

Version-Release number of selected component (if applicable):

all

How reproducible:

Not entirely deterministic:

1. the upgrade must go for 40m+ between upgrading etcd and upgrading nodes
2. the upgrade must reboot a master that is not running CVO (otherwise there will be a new CVO instance without the saved times, they are only saved in memory)

Steps to Reproduce:

1. Watch oc adm upgrade during the upgrade

Actual results:

Spurious "waiting for over 40m" message pops out of the blue

Expected results:

CVO simply says "waiting up to 40m on" and this eventually goes away as the node goes up and etcd goes out of degraded.

Description of problem:

If two clusters share a single OpenStack project, cloud-provider-openstack won't distinguish type=LoadBalancer Services between them if they have the same namespace name and service name.

https://github.com/kubernetes/cloud-provider-openstack/issues/2241

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Deploy 2 clusters.
2. Create LoadBalancer Services of the same name in default namespaces of both clusters.

Actual results:

cloud-provider-openstack fights over ownership of the LB.

Expected results:

LBs are distinguished.

Additional info:
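A sketch of why the clash happens, based on the linked upstream issue (the exact name format is an assumption, not verified against the shipped cloud-provider-openstack code): the load balancer name is derived from cluster name, namespace, and Service name, so two clusters sharing a project and using the default cluster name produce the same LB name.

```
package main

import "fmt"

// loadBalancerName mimics the naming scheme described in the upstream issue.
func loadBalancerName(clusterName, namespace, service string) string {
	return fmt.Sprintf("kube_service_%s_%s_%s", clusterName, namespace, service)
}

func main() {
	clusterA := loadBalancerName("kubernetes", "default", "my-lb")
	clusterB := loadBalancerName("kubernetes", "default", "my-lb")
	fmt.Println(clusterA == clusterB) // true: both clusters claim the same LB
}
```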

 

Please review the following PR: https://github.com/openshift/installer/pull/7817

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem: MCN lister fires in the operator pod before the CRD exists. This causes API issues and could impact upgrades.

Version-Release number of selected component (if applicable):

    

How reproducible: always

Steps to Reproduce:
    1. upgrade to 4.15 from any version
    2.
    3.
    

Actual results:

I1211 18:44:40.972098       1 operator.go:347] Starting MachineConfigOperator
I1211 18:44:40.982079       1 event.go:298] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"68bc5e8f-b7f5-4506-a870-2eecaa5afd35", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.14.6}] to [{operator 4.15.0-0.nightly-2023-12-11-033133}]
W1211 18:44:41.255502       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:44:41.255587       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:04.915119       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:58:06.425952       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:06.426037       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:58:09.396004       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:09.396068       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:58:14.540488       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:14.540560       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:58:25.293029       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:25.293095       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:58:50.166866       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:58:50.166903       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 18:59:39.950454       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 18:59:39.950523       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:00:23.432005       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:00:23.432038       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:01:13.237298       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:01:13.237382       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:02:02.035555       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:02:02.035628       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:02:52.111260       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:02:52.111332       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:03:38.243461       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:03:38.243499       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
W1211 19:04:27.848493       1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:04:27.848585       1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io)
E1211 19:05:37.064033       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:38.057685       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:39.036638       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:40.039736       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:41.039696       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:42.034840       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:43.044901       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:44.033229       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:45.034792       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
E1211 19:05:46.052866       1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
    Expected results:

    

Additional info:
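The sketch below is illustrative only (not the MCO's actual code): one way to avoid starting a lister/informer for MachineConfigNode before its CRD exists is to wait for the resource to show up in API discovery first.

```
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// waitForMachineConfigNodeAPI polls discovery until the machineconfignodes
// resource is served, so the informer never lists a missing CRD.
func waitForMachineConfigNodeAPI(ctx context.Context, dc discovery.DiscoveryInterface) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			list, err := dc.ServerResourcesForGroupVersion("machineconfiguration.openshift.io/v1alpha1")
			if err != nil {
				return false, nil // group/version not served yet; keep polling
			}
			for _, r := range list.APIResources {
				if r.Name == "machineconfignodes" {
					return true, nil
				}
			}
			return false, nil
		})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc := discovery.NewDiscoveryClientForConfigOrDie(cfg)
	if err := waitForMachineConfigNodeAPI(context.TODO(), dc); err != nil {
		panic(err)
	}
	fmt.Println("MachineConfigNode API available; safe to start the informer")
}
```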


Please review the following PR: https://github.com/openshift/must-gather/pull/409

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Trying to define multiple receivers in a single user-defined AlertmanagerConfig

Version-Release number of selected component (if applicable):

 

How reproducible:

always   

Steps to Reproduce:

#### Monitoring for user-defined projects is enabled
```
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml | head -4
```
```
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
```

#### separate Alertmanager instance for user-defined alert routing is Enabled and Configured
```
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml | head -6
```
```
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```
```
create testing namespace
```
oc new-project libor-alertmanager-testing
```
## TESTING - MULTIPLE RECEIVERS IN ALERTMANAGERCONFIG
Single AlertmanagerConfig
`alertmanager_config_webhook_and_email_rootDefault.yaml`
```
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: libor-alertmanager-testing-email-webhook
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-webhook'
    webhookConfigs:
      - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts'
  - name: 'libor-alertmanager-testing-email'
    emailConfigs:
      - to: USER@USER.CO
        requireTLS: false
        sendResolved: true
  - name: Default
  route:
    groupBy:
    - namespace
    receiver: Default
    groupInterval: 60s
    groupWait: 60s
    repeatInterval: 12h
    routes:
    - matchers:
      - name: severity
        value: critical
        matchType: '='
        continue: true
      receiver: 'libor-alertmanager-testing-webhook'
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      receiver: 'libor-alertmanager-testing-email'
```
Once saved, the continue statement is removed from the object.

The configuration applied to Alertmanager contains `continue: false` statements:
```
oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093
```

```
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
  routes:
  - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/Default
    group_by:
    - namespace
    matchers:
    - namespace="libor-alertmanager-testing"
    continue: true
    routes:
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-webhook
      matchers:
      - severity="critical"
      continue: false  <----
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-email
      matchers:
      - severity="critical"
      continue: false <-----
```
If I update the statements to read `continue: true` and test here: https://prometheus.io/webtools/alerting/routing-tree-editor/, then I get the desired results.

The workaround is to use 2 separate AlertmanagerConfig files; in that case the continue statement is added.

Actual results:

Once saved, the continue statement is removed from the object.

Expected results:

The `continue: true` statement is retained and applied to Alertmanager

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/128

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a continuation of OCPBUGS-23342; now the vmware-vsphere-csi-driver-operator cannot connect to vCenter at all. Tested using invalid credentials.

The operator ends up with no Progressing condition during upgrade from 4.11 to 4.12, and cluster-storage-operator interprets it as Progressing=true.

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/639

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/178

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/397

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Manifests will be removed from the CCO image, so we have to start using the CCA (cluster-config-api) image for bootstrap

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

  KAS bootstrap container fails

Expected results:

    KAS bootstrap container succeeds

Additional info:

    

Description of problem:

The sha256 sum for "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/openshift-install-mac-arm64-4.14.9.tar.gz" does not match what it should be:

# sha256sum openshift-install-mac-arm64-4.14.9.tar.gz
61cccc282f39456b7db730a0625d0a04cd6c1c2ac0f945c4c15724e4e522a073  openshift-install-mac-arm64-4.14.9.tar.gz

This does not match what is posted here: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/sha256sum.txt

It should be:
c765c90a32b8a43bc62f2ba8bd59dc8e620b972bcc2a2e217c36ce139d517e29  openshift-install-mac-arm64-4.14.9.tar.gz
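A hedged verification sketch, assuming the file names in the published sha256sum.txt match the downloaded archive:

~~~
# download the published checksum list and verify the local archive against it
curl -sLO https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.14.9/sha256sum.txt
sha256sum -c --ignore-missing sha256sum.txt
~~~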

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Install a private cluster whose base domain in install-config.yaml is the same as an existing CIS domain name.
After destroying the private cluster, the DNS resource records remain.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note that this domain name is also used by an existing CIS domain.
2. Install a private ibmcloud cluster with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com".
3. Destroy the cluster.
4. Check the remaining DNS records.

Actual results:

$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4
api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com 
*.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com 
api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com

Expected results:

No DNS records for the cluster remain.

Additional info:

$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'   
Name
private-ibmcloud.qe.devcluster.openshift.com 
private-ibmcloud-1.qe.devcluster.openshift.com 
ibmcloud.qe.devcluster.openshift.com  

$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com

When using private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com as the domain there is no such issue; when using ibmcloud.qe.devcluster.openshift.com as the domain the DNS records remain.
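As a hedged manual cleanup sketch (the delete subcommand name is assumed from the IBM Cloud CLI DNS plugin and should be verified; the record ID placeholder is not from the original report):

~~~
# list the leftover records for the destroyed cluster, then delete them one by one by record ID
ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4
ibmcloud dns resource-record-delete 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a <record-id> -i preserved-openshift-qe-private
~~~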

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/114

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/must-gather/pull/406

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    While mirroring with the following command [1], the command fails with error [2] as shown below:
~~~
[1] oc mirror --config=imageSet-config.yaml docker://<registry_url>:<Port>/<repository>
~~~

~~~
[2] error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for <registry_url>:<Port>/<repository>/community-operator-index:v4.15: exit status 1
~~~

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Download `oc mirror` v:4.15.0 binary
    2. Create ImageSet-config.yaml (a minimal sketch follows these steps)
    3. Use the following command:
~~~
oc mirror --config=imageSet-config.yaml docker://<registry_url>:<Port>/<repository>
~~~
    4. Observe the mentioned error
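A minimal ImageSetConfiguration sketch referencing the community catalog named in the error above (the catalog source and storage path are illustrative assumptions, not taken from the original report):

~~~
# write a minimal imageSet-config.yaml for step 2 (values are illustrative)
cat > imageSet-config.yaml <<'EOF'
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: ./oc-mirror-metadata
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
EOF
~~~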

Actual results:

    Command failed to complete with the mentioned error.

Expected results:

   The ICSP and mapping.txt files should be created.

Additional info:

    

Description of problem:

Starting OpenShift 4.8 (https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-notable-technical-changes), all pods are getting bound SA tokens.

Currently, instead of expiring the token, we use `service-account-extend-token-expiration`, which extends a bound token's validity to 1 year and warns when a token is used that would otherwise have expired.

We want to disable this behavior in a future OpenShift release, which would break the OpenShift web console.

Version-Release number of selected component (if applicable):

4.8 - 4.14

How reproducible:

100%

Steps to Reproduce:

1. install a fresh cluster
2. wait ~1hr since console pods were deployed for the token rotation to occur
3. log in to the console and click around
4. check the kube-apiserver audit log events for the "authentication.k8s.io/stale-token" annotation (see the sketch below)
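A sketch of step 4 (the node name placeholder is hypothetical; the grep pattern matches the annotation named above):

~~~
# fetch the kube-apiserver audit log from a control-plane node and look for stale-token annotations
oc adm node-logs <master-node-name> --path=kube-apiserver/audit.log \
  | grep 'authentication.k8s.io/stale-token' | head
~~~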

Actual results:

Many occurrences (I doubt I'll be able to upload a text file, so I'll show a few audit events in the first comment).

Expected results:

The web-console re-reads the SA token regularly so that it never uses an expired token

Additional info:

In a theoretical case where a console pod lasts for a year, it's going to break and won't be able to authenticate to the kube-apiserver.

We are planning on disallowing the use of stale tokens in a future release and we need to make sure that the core platform is not broken so that the metrics we collect from the clusters in the wild are not polluted.

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/108

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Copying BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911 on OCP side (as fix is needed on console).

[UI] In the openshift-storage-client namespace, an RBD PVC with 'RWX' access mode and volume mode 'Filesystem' can be created from the Client. However, this is an invalid combination for RBD PVC creation: in the ODF Operator UI on other platforms, the volume mode selector is not shown when the ceph-rbd storage class and RWX access mode are selected. In the client operator view it is shown, so the PVC can be created and gets stuck in the Pending state.

Version-Release number of selected component (if applicable):

    

How reproducible:

 

Steps to Reproduce:

1. Deploy a Provider/Client setup.
2. From the UI create a PVC: select storage class ceph-rbd and RWX access mode, and check the volume mode. In the case of this bug both 'Filesystem' and 'Block' volume modes are visible in the UI; select volume mode Filesystem and create the PVC (see the manifest sketch below).
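For illustration, a PVC manifest matching the invalid combination described above (the name is hypothetical; the storage class is taken from the error message below):

~~~
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-rbd-filesystem-test          # hypothetical name
  namespace: openshift-storage-client
spec:
  accessModes:
    - ReadWriteMany                      # RWX
  volumeMode: Filesystem                 # invalid together with RWX on RBD
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 1Gi
EOF
~~~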

Actual results:

The PVC is created and stuck in Pending status.
The PVC event shows an error like:
 Generated from openshift-storage-client.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6d9dcb9fc7-vjj22_2bd4ede5-9418-4c8e-80ae-169b5cb4fa8012 times in the last 13 minutes
failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes

Expected results:

The volume mode option should not be visible on the page when RWX access mode and the RBD storage class are selected for the PVC.

Additional info:

Screenshots are attached to the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911

https://bugzilla.redhat.com/show_bug.cgi?id=2250911#c3

Description of problem:

ovnkube-node doesn't issue a CSR to get new certificates when the node is suspended for 30 days.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Setup a libvirt cluster on machine
    2. Disable chronyd on all nodes and host machine
    3. Suspend nodes
    4. Change time on host 30 days forward
    5. Resume nodes
    6. Wait for API server to come up
    7. Wait for all operators to become ready
    

Actual results:

ovnkube-node would attempt to use expired certs:

2024-01-21T01:24:41.576365431+00:00 stderr F I0121 01:24:41.573615    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:08.519622252+00:00 stderr F I0420 01:25:08.516550    8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service
2024-04-20T01:25:08.900228370+00:00 stderr F I0420 01:25:08.898580    8852 services_controller.go:567] Deleting service openshift-operator-lifecycle-manager/packageserver-service
2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137891    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:17.137956433+00:00 stderr F I0420 01:25:17.137933    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:17.137997952+00:00 stderr F I0420 01:25:17.137979    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099057    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-1
2024-04-20T01:25:19.099635059+00:00 stderr F I0420 01:25:19.099080    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-1: 35.077µs
2024-04-20T01:25:22.245550966+00:00 stderr F W0420 01:25:22.242774    8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-controller-manager/controller-manager-5485d88c84-xztxq IP address from the namespace address-set, err: pod openshift-controller-manager/controller-manager-5485d88c84-xztxq: no pod IPs found 
2024-04-20T01:25:22.262446336+00:00 stderr F W0420 01:25:22.261351    8852 base_network_controller_namespace.go:458] Unable to remove remote zone pod's openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9 IP address from the namespace address-set, err: pod openshift-route-controller-manager/route-controller-manager-6b5868f887-n6jj9: no pod IPs found 
2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154744    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-0
2024-04-20T01:25:27.154790226+00:00 stderr F I0420 01:25:27.154770    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-0: 31.72µs
2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168666    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-2
2024-04-20T01:25:27.172301639+00:00 stderr F I0420 01:25:27.168692    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-2: 34.346µs
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194311    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-master-0
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194339    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-master-0: 40.027µs
2024-04-20T01:25:27.196078022+00:00 stderr F I0420 01:25:27.194582    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:27.215435944+00:00 stderr F I0420 01:25:27.215387    8852 master.go:740] Adding or Updating Node "test-infra-cluster-4832ebf8-master-0"
2024-04-20T01:25:35.789830706+00:00 stderr F I0420 01:25:35.789782    8852 egressservice_zone_node.go:110] Processing sync for Egress Service node test-infra-cluster-4832ebf8-worker-1
2024-04-20T01:25:35.790044794+00:00 stderr F I0420 01:25:35.790025    8852 egressservice_zone_node.go:113] Finished syncing Egress Service node test-infra-cluster-4832ebf8-worker-1: 250.227µs
2024-04-20T01:25:37.596875642+00:00 stderr F I0420 01:25:37.596834    8852 iptables.go:358] "Running" command="iptables-save" arguments=["-t","nat"]
2024-04-20T01:25:47.138312366+00:00 stderr F I0420 01:25:47.138266    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:47.138382299+00:00 stderr F I0420 01:25:47.138370    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:25:47.138453866+00:00 stderr F I0420 01:25:47.138440    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:26:17.138583468+00:00 stderr F I0420 01:26:17.138544    8852 obj_retry.go:296] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:26:17.138640587+00:00 stderr F I0420 01:26:17.138629    8852 obj_retry.go:358] Adding new object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp
2024-04-20T01:26:17.138708817+00:00 stderr F I0420 01:26:17.138696    8852 obj_retry.go:370] Retry add failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp, will try again later: failed to obtain IPs to add remote pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: suppressed error logged: pod openshift-operator-lifecycle-manager/collect-profiles-28559595-wfdsp: no pod IPs found 
2024-04-20T01:26:39.474787436+00:00 stderr F I0420 01:26:39.474744    8852 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.EndpointSlice total 130 items received
2024-04-20T01:26:39.475670148+00:00 stderr F E0420 01:26:39.475653    8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: the server has asked for the client to provide credentials (get endpointslices.discovery.k8s.io)
2024-04-20T01:26:40.786339334+00:00 stderr F I0420 01:26:40.786255    8852 reflector.go:325] Listing and watching *v1.EndpointSlice from k8s.io/client-go/informers/factory.go:159
2024-04-20T01:26:40.806238387+00:00 stderr F W0420 01:26:40.804542    8852 reflector.go:535] k8s.io/client-go/informers/factory.go:159: failed to list *v1.EndpointSlice: Unauthorized
2024-04-20T01:26:40.806238387+00:00 stderr F E0420 01:26:40.804571    8852 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Unauthorized
 

Expected results:

ovnkube-node detects that cert is expired, requests new certs via CSR flow and reloads them

Additional info:

CI periodic to check this flow: https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ovn-sno-cert-rotation-suspend-30d
artifacts contain sosreport

Applies to SNO and HA clusters; works as expected when nodes are properly shut down instead of suspended.
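A sketch for checking whether ovnkube-node actually requested new certificates after resume (standard oc CSR commands; the CSR name placeholder is hypothetical):

~~~
# list the most recent CSRs; a fresh ovnkube-node CSR per node is the expected behaviour
oc get csr --sort-by=.metadata.creationTimestamp | tail
# a Pending CSR can be approved manually while debugging
oc adm certificate approve <csr-name>
~~~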

Description of problem:

    The kube-apiserver has a container called audit-logs that keeps audit records stored in the logs of the container (just prints to stdout). We would like the ability to disable this container whenever the None policy is used on the cluster. As of today, this consumes about 1gb of storage for each apiserver pod on the system. As you scale up, that 1gb per master adds up.

https://github.com/openshift/hypershift/issues/3764
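For context, the None audit profile on standalone OpenShift is set on the cluster-scoped APIServer config; a minimal sketch (whether the HostedCluster wires this through to the hosted kube-apiserver is exactly what this request is about, so treat that as an assumption):

~~~
# set the audit profile to None on the cluster-scoped APIServer config
oc patch apiserver cluster --type=merge -p '{"spec":{"audit":{"profile":"None"}}}'
~~~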

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/527

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When external TCP traffic is IP-fragmented with no DF flag set and is targeted at a pod's external IP, the fragmented packets are responded to with a RST and are not delivered to the pod's application socket.
   
Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.14.8
Kustomize Version: v5.0.1
Server Version: 4.14.7
Kubernetes Version: v1.27.8+4fab27b
     
How reproducible:

I built a reproducer for this issue on a KVM-hosted OCP cluster.
I can simulate the same traffic as seen in the customer's network,
so we do have a solid reproducer for the issue.
Details are in the JIRA updates.
     
Steps to Reproduce:
I wrote a simple C-based tcp_server/tcp_client application for testing.
The client simply sends a file towards the server from a networking namespace with
PMTU discovery disabled. The server app runs in a pod, waits for connections, reads the data from the socket, and stores the received file in /tmp.
Along the way from the client namespace there is a veth pair with MTU 1000, while the
path MTU is 1500.
This is enough to get IP packets fragmented along the way from the client to the server.
Details of the setup and testing steps are in the JIRA comments.

Actual results:

$ oc get network.operator -o yaml | grep routingViaHost
          routingViaHost: false
All fragmented packets are responded to with a TCP RST and are not delivered to the
application socket in the pod.

Expected results:

Fragmented packets are delivered to the application socket running in a pod with
$ oc get network.operator -o yaml | grep routingViaHost
          routingViaHost: false
     

Additional info:

There is a workaround that prevents the issue:
$ oc get network.operator -o yaml | grep routingViaHost
          routingViaHost: true
This makes the fragmented traffic arrive at the application socket in the pod.
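A sketch of applying the workaround (the field path is the standard OVN-Kubernetes gateway configuration on the network operator):

~~~
# switch the OVN-Kubernetes gateway mode to local (routingViaHost: true)
oc patch network.operator cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true}}}}}'
~~~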

I can assist with the reproducer and testing on the test env.
Regards Michal Tesar

ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: ignition failed to provision storage: failed to create storage: failed to create bucket: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict 

Description of the problem:

LVMS multi-node requires an additional disk for the operator.

However, I was able to create a 4.15 multi-node cluster, select LVMS, and without adding the additional disk I see that the LVM requirement passes and I am able to continue and start the installation.

How reproducible:

 

Steps to reproduce:

1. Create a 4.15 multi-node cluster

2. Select the LVMS operator

3. Do not attach an additional disk

Actual results:

It is possible to continue to the installation page and start the installation;
the LVM requirement is marked as successful.

Expected results:

The LVM requirement should show as failed;
it should not be possible to proceed to installation before attaching a disk.

Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/73

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    "Oh no! Something went wrong" in Topology -> Observe tab

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-14-115151

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to Topology -> click one deployment and go to the Observe tab
    2.
    3.
    

Actual results:

    The page crashed.
ErrorDescription:Component trace:Copy to clipboardat te (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:31:9773)
    at j (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:12:3324)
    at div
    at s (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:70124)
    at div
    at g (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:6:11163)
    at div
    at d (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:1:174472)
    at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:487478)
    at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:486390)
    at div
    at l (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:106304)
    at div
Expected results:

    The page should not crash

Additional info:

    

Description of problem:

Cluster install fails on ASH, nodes tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-24-133352    

How reproducible:

Always    

Steps to Reproduce:

1. Built a cluster on ASH  
$ oc get node            
NAME                             STATUS     ROLES                  AGE    VERSION
ropatil-261ash1-x9kcj-master-0   NotReady   control-plane,master   7h     v1.29.1+0e0d15b
ropatil-261ash1-x9kcj-master-1   NotReady   control-plane,master   7h1m   v1.29.1+0e0d15b
ropatil-261ash1-x9kcj-master-2   NotReady   control-plane,master   7h1m   v1.29.1+0e0d15b 

$ oc get node -o yaml | grep uninitialized    
      key: node.cloudprovider.kubernetes.io/uninitialized
      key: node.cloudprovider.kubernetes.io/uninitialized
      key: node.cloudprovider.kubernetes.io/uninitialized  

$ oc get po -n openshift-cloud-controller-manager     
NAME                                              READY   STATUS             RESTARTS         AGE
azure-cloud-controller-manager-7b75cbbd64-qzhmm   0/1     CrashLoopBackOff   43 (20s ago)     4h54m
azure-cloud-controller-manager-7b75cbbd64-w5cl8   1/1     Running            70 (2m52s ago)   7h33m
azure-cloud-node-manager-9r8gb                    0/1     CrashLoopBackOff   93 (79s ago)     7h33m
azure-cloud-node-manager-jn8lv                    0/1     CrashLoopBackOff   93 (82s ago)     7h33m
azure-cloud-node-manager-n4vt4                    0/1     CrashLoopBackOff   93 (102s ago)    7h33m  

$ oc -n openshift-cloud-controller-manager logs -f azure-cloud-controller-manager-7b75cbbd64-w5cl8 -c cloud-controller-manager
Error from server: no preferred addresses found; known addresses: []

Actual results:

Cluster install failed on ASH 

Expected results:

Cluster install succeeds on ASH

Additional info:

log-bundle: https://drive.google.com/file/d/1QQwyQ1MxuunZx6AXqOTt6KwYwUk2GW7R/view?usp=sharing 

Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/308

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The customer points out that the max memory value for scale up is based on GB, while the min memory for scale down is based on GiB.

So, setting both values in GB makes scale down fail:

Skipping ocgc4preplatgt-98fwh-worker-c-sk2sz - minimal limit exceeded for [memory] 

While setting both values in GiB makes the scale up fail.

 

  • API reference says that GBs must be used for memory min/max limits:

https://docs.openshift.com/container-platform/4.12/rest_api/autoscale_apis/clusterautoscaler-autoscaling-openshift-io-v1.html#spec-resourcelimits

  • While OCP Cluster Autoscaler documentation points to GiBs:

https://docs.openshift.com/container-platform/4.12/machine_management/applying-autoscaling.html#cluster-autoscaler-cr_applying-autoscaling
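For reference, the fields in question live under spec.resourceLimits.memory of the ClusterAutoscaler resource; a minimal sketch (the numbers are illustrative, and whether they are interpreted as GB or GiB is exactly the ambiguity reported above):

~~~
cat <<'EOF' | oc apply -f -
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  resourceLimits:
    memory:
      min: 8     # illustrative value; GB vs GiB interpretation is the reported inconsistency
      max: 128   # illustrative value
EOF
~~~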

Description of problem:

    The button text for VolumeSnapshotContents is incorrect

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-04-02-182836

How reproducible:

    always

Steps to Reproduce:

    1. Navigate to Storage -> VolumeSnapshotContents page
       /k8s/cluster/snapshot.storage.k8s.io~v1~VolumeSnapshotContent
    2. check the create button text
    3.
    

Actual results:

the text in the button shows 'Create VolumeSnapshot'

Expected results:

the text in the button should be 'Create VolumeSnapshotContents'    

Additional info:

    

Description of problem:

Tested https://github.com/openshift/cluster-monitoring-operator/pull/2187 with the PR included:

launch 4.15,openshift/cluster-monitoring-operator#2187 aws

The "scrape.timestamp-tolerance" setting is not found in the prometheus resource or the prometheus pod; there is no result for the commands below.

$ oc -n openshift-monitoring get prometheus k8s -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep -i "scrape.timestamp-tolerance"
$ oc -n openshift-monitoring get sts  prometheus-k8s  -oyaml | grep -i "scrape.timestamp-tolerance" 

It is not in the Prometheus configuration file either:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml | head
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-monitoring/k8s
    prometheus_replica: prometheus-k8s-0
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
scrape_configs:
- job_name: serviceMonitor/openshift-apiserver-operator/openshift-apiserver-operator/0

Description of problem

Build02, a years old cluster currently running 4.15.0-ec.2 with TechPreviewNoUpgrade, has been Available=False for days:

$ oc get -o json clusteroperator monitoring | jq '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2024-01-14T04:09:52Z",
  "message": "UpdatingMetricsServer: reconciling MetricsServer Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/metrics-server: context deadline exceeded",
  "reason": "UpdatingMetricsServerFailed",
  "status": "False",
  "type": "Available"
}

Both pods had been having CA trust issues. We deleted one pod, and its replacement is happy:

$ oc -n openshift-monitoring get -l app.kubernetes.io/component=metrics-server pods
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-9cc8bfd56-dd5tx   1/1     Running   0          136m
metrics-server-9cc8bfd56-k2lpv   0/1     Running   0          36d

The young, happy pod has occasional node-removed noise, which is expected in this cluster with high levels of compute-node autoscaling:

$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-dd5tx
E0117 17:16:13.492646       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:28.611052       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:56.898453       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": context deadline exceeded" node="build0-gstfj-ci-builds-worker-b-srjk5"

While the old, sad pod is complaining about unknown authorities:

$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-k2lpv
E0117 17:19:09.612161       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.0.3:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-m-2.c.openshift-ci-build-farm.internal"
E0117 17:19:09.620872       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.90:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-ci-prowjobs-worker-b-cg7qd"
I0117 17:19:14.538837       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"

More details in the Additional details section, but the timeline seems to have been something like:

  1. 2023-12-11, metrics-server-* pods come up, and are running happily, scraping kubelets with a CA trust store descended from openshift-config-managed's kubelet-serving-ca ConfigMap.
  2. 2024-01-02, a new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 is created.
  3. 2024-01-04, kubelets rotate their serving CA. Not entirely clear how this works yet outside of bootstrapping, but at least for bootstrapping it uses a CertificateSigningRequest, approved by cluster-machine-approver, and signed by the kubernetes.io/kubelet-serving signing component in the kube-controller-manager-* pods in the openshift-kube-controller-manager namespace.
  4. 2024-01-04, the csr-signer Secret in openshift-kube-controller-manager has the new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 issuing a certificate for kube-csr-signer_@1704338196.
  5. The kubelet-serving-ca ConfigMap gets updated to include a CA for the new kube-csr-signer_@1704338196, signed by the new openshift-kube-controller-manager-operator_csr-signer-signer@1704206554.
  6. Local /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt updated in metrics-server-* containers.
  7. But metrics-server-* pods fail to notice the file change and reload /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, so the existing pods do not trust the new kubelet server certs.
  8. Mysterious time delay. Perhaps the monitoring operator does not notice sad metrics-server-* pods outside of things that trigger DeploymentRollout?
  9. 2024-01-14, monitoring ClusterOperator goes Available=False on UpdatingMetricsServerFailed.
  10. 2024-01-17, deleting one metrics-server-* pod triggers replacement-pod creation, and the replacement pod comes up fine.

So addressing the metrics-server /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt change detection should resolve this use-case. And triggering a container or pod restart would be an aggressive-but-sufficient mechanism, although loading the new data without rolling the process would be less invasive.
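Until change detection is fixed, the aggressive-but-sufficient mitigation from the timeline above amounts to rolling the pods, for example:

~~~
# delete the metrics-server pods so fresh containers re-read the updated CA bundle on startup
oc -n openshift-monitoring delete pod -l app.kubernetes.io/component=metrics-server
~~~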

Version-Release number of selected component (if applicable)

4.15.0-ec.3, which has fast CA rotation, see discussion in API-1687.

How reproducible

Unclear.

Steps to Reproduce

Unclear.

Actual results

metrics-server pods having trouble with CA trust when attempting to scrape nodes.

Expected results

metrics-server pods successfully trusting kubelets when scraping nodes.

Additional details

The monitoring operator sets up the metrics server with --kubelet-certificate-authority=/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, which is the "Path to the CA to use to validate the Kubelet's serving certificates" and is mounted from the kubelet-serving-ca-bundle ConfigMap. But that mount point only contains openshift-kube-controller-manager-operator_csr-signer-signer@... CAs:

$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- cat /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not '
Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-gtctn ...

Removing debug pod ...
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec  3 14:42:33 2023 GMT
            Not After : Feb  1 14:42:34 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec 20 03:16:35 2023 GMT
            Not After : Jan 19 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1703042196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  4 03:16:35 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1704338196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  2 14:42:34 2024 GMT
            Not After : Mar  2 14:42:35 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
unable to load certificate
137730753918272:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE

While actual kubelets seem to be using certs signed by kube-csr-signer_@1704338196 (which is one of the Subjects in /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt):

$ oc get -o wide -l node-role.kubernetes.io/master= nodes
NAME                                                  STATUS   ROLES    AGE      VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
build0-gstfj-m-0.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
build0-gstfj-m-1.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
build0-gstfj-m-2.c.openshift-ci-build-farm.internal   Ready    master   3y240d   v1.28.3+20a5764   10.0.0.3      <none>        Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow)   5.14.0-284.41.1.el9_2.x86_64   cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9
$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- openssl s_client -connect 10.0.0.3:10250 -showcerts </dev/null
Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-ksl2k ...
Can't use SSL_get_servername
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
verify return:1
CONNECTED(00000003)
---
Certificate chain
 0 s:O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
   i:CN = kube-csr-signer_@1704338196
-----BEGIN CERTIFICATE-----
MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm
MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3
MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx
SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp
ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X
PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud
JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2
9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w
ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB
CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/
vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON
rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb
q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2
H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw
JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM
-----END CERTIFICATE-----
---
Server certificate
subject=O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal

issuer=CN = kube-csr-signer_@1704338196

---
Acceptable client certificate CA names
OU = openshift, CN = admin-kubeconfig-signer
CN = openshift-kube-controller-manager-operator_csr-signer-signer@1699022534
CN = kube-csr-signer_@1700450189
CN = kube-csr-signer_@1701746196
CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1691004449
CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1702234292
CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1699642292
OU = openshift, CN = kubelet-bootstrap-kubeconfig-signer
CN = openshift-kube-apiserver-operator_node-system-admin-signer@1678905372
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1902 bytes and written 383 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---
DONE

Removing debug pod ...
$ openssl x509 -noout -text <<EOF 2>/dev/null
> -----BEGIN CERTIFICATE-----
MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm
MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3
MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx
SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp
ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH
A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X
PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud
JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2
9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w
ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB
CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/
vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON
rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb
q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2
H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw
JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM
-----END CERTIFICATE-----
> EOF
...
        Issuer: CN = kube-csr-signer_@1704338196
        Validity
            Not Before: Jan 17 03:14:30 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal
...

The monitoring operator populates the openshift-monitoring kubelet-serving-ca-bundle ConfigMap using data from the openshift-config-managed kubelet-serving-ca ConfigMap, and that propagation is working, but does not contain the kube-csr-signer_ CA:

$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not '
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec  3 14:42:33 2023 GMT
            Not After : Feb  1 14:42:34 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554
            Not Before: Dec 20 03:16:35 2023 GMT
            Not After : Jan 19 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1703042196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  4 03:16:35 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: CN = kube-csr-signer_@1704338196
Certificate:
        Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
            Not Before: Jan  2 14:42:34 2024 GMT
            Not After : Mar  2 14:42:35 2024 GMT
        Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554
unable to load certificate
140531510617408:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | sha1sum 
a32ab44dff8030c548087d70fea599b0d3fab8af  -
$ oc -n openshift-monitoring get -o json configmap kubelet-serving-ca-bundle | jq -r '.data["ca-bundle.crt"]' | sha1sum 
a32ab44dff8030c548087d70fea599b0d3fab8af  -

Flipping over to the kubelet side, nothing in the machine-config operator's template is jumping out at me as a key/cert pair for serving on 10250. The kubelet seems to set up server certs via serverTLSBootstrap: true. But we don't seem to set the beta RotateKubeletServerCertificate, so I'm not clear on how these are supposed to rotate on the kubelet side. But there are CSRs from kubelets requesting serving certs:

$ oc get certificatesigningrequests | grep 'NAME\|kubelet-serving'
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-8stgd   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-xkdw2                           <none>              Approved,Issued
csr-blbjx   9m1s    kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-longtests-worker-b-5w9dz                        <none>              Approved,Issued
csr-ghxh5   64m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-sdwdn                           <none>              Approved,Issued
csr-hng85   33m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-longtests-worker-d-7d7h2                        <none>              Approved,Issued
csr-hvqxz   24m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-fp6wb                           <none>              Approved,Issued
csr-vc52m   50m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-xlmt6                           <none>              Approved,Issued
csr-vflcm   40m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-djpgq                           <none>              Approved,Issued
csr-xfr7d   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-8v4vk                           <none>              Approved,Issued
csr-zhzbs   51m     kubernetes.io/kubelet-serving                 system:node:build0-gstfj-ci-builds-worker-b-rqr68                           <none>              Approved,Issued
$ oc get -o json certificatesigningrequests csr-blbjx
{
    "apiVersion": "certificates.k8s.io/v1",
    "kind": "CertificateSigningRequest",
    "metadata": {
        "creationTimestamp": "2024-01-17T19:20:43Z",
        "generateName": "csr-",
        "name": "csr-blbjx",
        "resourceVersion": "4719586144",
        "uid": "5f12d236-3472-485f-8037-3896f51a809c"
    },
    "spec": {
        "groups": [
            "system:nodes",
            "system:authenticated"
        ],
        "request": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlh6Q0NBUVFDQVFBd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVQwd093WURWUVFERXpSegplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnpMWGR2Y210bGNpMWlMVFYzCk9XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZ3F4ZHNZWkdmQXovTEpoZVgKd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVSUpUN2pCblV1WEdnZktCTQpNRW9HQ1NxR1NJYjNEUUVKRGpFOU1Ec3dPUVlEVlIwUkJESXdNSUlvWW5WcGJHUXdMV2R6ZEdacUxXTnBMV3h2CmJtZDBaWE4wY3kxM2IzSnJaWEl0WWkwMWR6bGtlb2NFQ2dBZ0F6QUtCZ2dxaGtqT1BRUURBZ05KQURCR0FpRUEKMHlRVzZQOGtkeWw5ZEEzM3ppQTJjYXVJdlhidTVhczNXcUZLYWN2bi9NSUNJUURycEQyVEtScHJOU1I5dExKTQpjZ0ZpajN1dVNieVJBcEJ5NEE1QldEZm02UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=",
        "signerName": "kubernetes.io/kubelet-serving",
        "usages": [
            "digital signature",
            "server auth"
        ],
        "username": "system:node:build0-gstfj-ci-longtests-worker-b-5w9dz"
    },
    "status": {
        "certificate": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN6ekNDQWJlZ0F3SUJBZ0lSQUlGZ1NUd0ovVUJLaE1hWlE4V01KcEl3RFFZSktvWklodmNOQVFFTEJRQXcKSmpFa01DSUdBMVVFQXd3YmEzVmlaUzFqYzNJdGMybG5ibVZ5WDBBeE56QTBNek00TVRrMk1CNFhEVEkwTURFeApOekU1TVRVME0xb1hEVEkwTURJd016QXpNVFl6Tmxvd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6Ck1UMHdPd1lEVlFRREV6UnplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnoKTFhkdmNtdGxjaTFpTFRWM09XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZwpxeGRzWVpHZkF6L0xKaGVYd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVCklKVDdqQm5VdVhHZ2ZLT0JrakNCanpBT0JnTlZIUThCQWY4RUJBTUNCNEF3RXdZRFZSMGxCQXd3Q2dZSUt3WUIKQlFVSEF3RXdEQVlEVlIwVEFRSC9CQUl3QURBZkJnTlZIU01FR0RBV2dCVGYzZ0FHNUxiMkxTcXl6MEFxVCtaRAoyV0VuenpBNUJnTlZIUkVFTWpBd2dpaGlkV2xzWkRBdFozTjBabW90WTJrdGJHOXVaM1JsYzNSekxYZHZjbXRsCmNpMWlMVFYzT1dSNmh3UUtBQ0FETUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFBRE5ad0pMdkp4WWNta2RHV08KUm5ocC9rc3V6akJHQnVHbC9VTmF0RjZScml3eW9mdmpVNW5Kb0RFbGlLeHlDQ2wyL1d5VXl5a2hMSElBK1drOQoxZjRWajIrYmZFd0IwaGpuTndxQThudFFabS90TDhwalZ5ZzFXM0VwR2FvRjNsZzRybDA1cXBwcjVuM2l4WURJClFFY2ZuNmhQUnlKN056dlFCS0RwQ09lbU8yTFllcGhqbWZGY2h5VGRZVGU0aE9IOW9TWTNMdDdwQURIM2kzYzYKK3hpMDhhV09LZmhvT3IybTVBSFBVN0FkTjhpVUV0M0dsYzI0SGRTLzlLT05tT2E5RDBSSk9DMC8zWk5sKzcvNAoyZDlZbnYwaTZNaWI3OGxhNk5scFB0L2hmOWo5TlNnMDN4OFZYRVFtV21zN29xY1FWTHMxRHMvWVJ4VERqZFphCnEwMnIKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=",
        "conditions": [
            {
                "lastTransitionTime": "2024-01-17T19:20:43Z",
                "lastUpdateTime": "2024-01-17T19:20:43Z",
                "message": "This CSR was approved by the Node CSR Approver (cluster-machine-approver)",
                "reason": "NodeCSRApprove",
                "status": "True",
                "type": "Approved"
            }
        ]
    }
}
$ oc get -o json certificatesigningrequests csr-blbjx | jq -r '.status.certificate | @base64d' | openssl x509 -noout -text | grep '^Certificate:\|Issuer\|Subject:\|Not '
Certificate:
        Issuer: CN = kube-csr-signer_@1704338196
            Not Before: Jan 17 19:15:43 2024 GMT
            Not After : Feb  3 03:16:36 2024 GMT
        Subject: O = system:nodes, CN = system:node:build0-gstfj-ci-longtests-worker-b-5w9dz

So that's approved by cluster-machine-approver, but signerName: kubernetes.io/kubelet-serving is an upstream Kubernetes component documented here, and the signer is implemented by kube-controller-manager.

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

Description of problem:

Hosted control plane clusters of OCP 4.16 use default catalog sources (redhat-operators, certified-operators, community-operators and redhat-marketplace) pointing to 4.14, so 4.16 operators are not available, and this cannot be updated from within the guest cluster.

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

Always

Steps to Reproduce:

1. Check the .spec.image of the default catalog sources in the openshift-marketplace namespace (see the sketch below).
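A sketch of step 1 (standard oc output formatting; the default catalog source names are those listed in the description above):

~~~
# list the default catalog sources and the image tag each one points at
oc -n openshift-marketplace get catalogsource \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.image
~~~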

Actual results:

the default catalogs are pointing to :v4.14

Expected results:

they should point to :v4.16 instead

Additional info:

    

Description of problem: there is a new feature in Ironic to clear the non-OS disks during BMH installation. It only works for disks with a block size of 512.

Customer says the following:

This is an unlisted new feature (or enhancement) in OCP 4.14. This non-OS disk wiping during BMH installation is not available in 4.12.

    Version-Release number of selected component (if applicable):



The following command is generated by Ironic:
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct"
It fails with:

ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct

ironic-agent[4054]: Exit code: 1

podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.

    

How reproducible:
Repeatable for a BMH with a disk that has a block size larger than 512.

Steps to Reproduce:

    1. This problem will occur on a server with a disk that has a block size greater than 512. For example, a SAMSUNG drive, p/n KR-05RJND-SSK00-389-02DF-A02, has a block size of 4096.
    2. Add a BMH which has a non-OS disk with a block size greater than 512.
      2a. The introspection of the BMH will be fine.
      2b. When the BMH is added (provisioning phase), the OCP installation will try to wipe the non-OS drive using the command "dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct". This can be monitored on the BMH via "journalctl -f -b". In the case of a disk with a block size of 4096, the above dd command will be rejected. The BMH will be rebooted and will attempt step 2b over again.
    3. Manual testing: on a BMH server with disks that have a block size greater than 512 (tested with disks that have a block size of 4096),
      the command "dd bs=512 if=/dev/zero of=/dev/sdb count=33" will fail,
      while the alternate command, which determines the disk's block size for dd, will work:
      "dd bs=$(blockdev --getss /dev/sdb) if=/dev/zero of=/dev/sdb count=33 oflag=direct"

    

Actual results:


    

Expected results:
The disk metadata should be erased regardless of its block size.

Additional logs:

In OCP 4.14.16, there is a new feature in Ironic to clear the non-OS disks during BMH installation. The following commands are generated by Ironic:
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct"
"dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct"

The reason for failure is the alignment restriction, logical block size are different depending disk type.

May be instead of
"dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct".

It could be replaced by:
"dd bs=$(blockdev --getss /dev/sda) if=/dev/zero of=/dev/sda count=33 oflag=direct"

That would work for various type of disks.

~~~~ THE IRONIC ERROR in OCP 4.14.16 ~~~~
-agent[4054]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root [-] Unexpected error dispatching erase_devices_metadata to manager <ironic_python_agent.hardware.GenericHardwareManager object at 0x7f050797f2e0>: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n": ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.240 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root [-] Error performing clean step erase_devices_metadata: ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/clean.py", line 77, in execute_clean_step
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = hardware.dispatch_to_managers(step['step'], node, ports,
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root [-] Command failed: execute_clean_step, error: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n": ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Traceback (most recent call last):
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/base.py", line 174, in run
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = self.execute_method(**self.command_params)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/extensions/clean.py", line 77, in execute_clean_step
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root result = hardware.dispatch_to_managers(step['step'], node, ports,
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root return getattr(manager, method)(*args, **kwargs)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1702, in erase_devices_metadata
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root raise errors.BlockDeviceEraseError(excpt_msg)
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root ironic_python_agent.errors.BlockDeviceEraseError: Error erasing block device: Failed to erase the metadata on the device(s): "/dev/sdb": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sdb count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sdb': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.6049e-05 s, 0.0 kB/s\n"; "/dev/sda": Unexpected error while running command.
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Command: dd bs=512 if=/dev/zero of=/dev/sda count=33 oflag=direct
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stdout: ''
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root Stderr: "dd: error writing '/dev/sda': Invalid argument\n1+0 records in\n0+0 records out\n0 bytes copied, 3.9605e-05 s, 0.0 kB/s\n"
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com podman[4014]: 2024-03-27 04:00:48.242 1 ERROR root
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Exit code: 1
Mar 27 04:00:48 r660hd-0.compactocp802.mavdallab.com ironic-agent[4054]: Stdout: ''

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The node selector for the console deployment requires deploying it on the master nodes, while the replica count is determined by the infrastructureTopology, which primarily tracks the workers' setup.

When an OpenShift cluster is installed with a single master node and multiple workers, this leads the console deployment to request 2 replicas, because infrastructureTopology is set to HighlyAvailable even though controlPlaneTopology is set to SingleReplica as expected.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always    

Steps to Reproduce:

    1. Install an openshift cluster with 1 master and 2 workers

Actual results:

The installation fails, as the replica count for the console deployment is set to 2.

  apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2024-01-18T08:34:47Z"
    generation: 1
    name: cluster
    resourceVersion: "517"
    uid: d89e60b4-2d9c-4867-a2f8-6e80207dc6b8
  spec:
    cloudConfig:
      key: config
      name: cloud-provider-config
    platformSpec:
      aws: {}
      type: AWS
  status:
    apiServerInternalURI: https://api-int.adstefa-a12.qe.devcluster.openshift.com:6443
    apiServerURL: https://api.adstefa-a12.qe.devcluster.openshift.com:6443
    controlPlaneTopology: SingleReplica
    cpuPartitioning: None
    etcdDiscoveryDomain: ""
    infrastructureName: adstefa-a12-6wlvm
    infrastructureTopology: HighlyAvailable
    platform: AWS
    platformStatus:
      aws:
        region: us-east-2
      type: AWS


apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
   .... 
  creationTimestamp: "2024-01-18T08:54:23Z"
  generation: 3
  labels:
    app: console
    component: ui
  name: console
  namespace: openshift-console
spec:
  progressDeadlineSeconds: 600
  replicas: 2


Expected results:

The replica count is set to 1, tracking the ControlPlaneTopology value instead of the infrastructureTopology.
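
As a hedged illustration of this expectation (not the console operator's actual code; the package and function names are assumptions), the replica count could be derived from controlPlaneTopology along these lines:

package console

import configv1 "github.com/openshift/api/config/v1"

// desiredReplicas sketches the expectation above: the console runs on
// control-plane nodes, so its replica count should track controlPlaneTopology
// rather than infrastructureTopology. Illustrative only.
func desiredReplicas(infra *configv1.Infrastructure) int32 {
    if infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode {
        return 1
    }
    return 2
}

With the Infrastructure status shown above, this would yield 1 replica even though infrastructureTopology is HighlyAvailable.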

Additional info:

    

Red Hat OpenShift Container Platform subscriptions are often measured against underlying cores. However, the metrics for cores are unreliable, with some known edge cases. Namely, when virtualization is used, depending on a variety of factors, the hypervisor doesn't report the underlying cores and instead reports a core per "cpu", where "cpu" is a schedulable executor (possibly backed by a single hyperthreaded executor). In order to address this, we assume a ratio of 2 vCPUs to 1 core and divide the "cores" value by 2 to normalize when we detect that hyperthreading information was not reported, when we're on the x86-64 CPU architecture, and when the cluster is not a bare-metal cluster.

At this time, x86-64 virtualized clusters are the ones affected.
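
As a rough sketch of the heuristic described above (the names, rounding, and architecture check are assumptions, not the actual telemetry code):

package main

import "fmt"

// normalizedCores applies the 2-vCPU-to-1-core assumption only when
// hyperthreading information was not reported, the CPU architecture is
// x86-64, and the cluster is not bare metal; otherwise it trusts the
// reported value.
func normalizedCores(reportedCores int, arch string, hyperthreadingReported, bareMetal bool) int {
    if hyperthreadingReported || arch != "amd64" || bareMetal {
        return reportedCores
    }
    // Round up so a single-vCPU guest still counts as one core.
    return (reportedCores + 1) / 2
}

func main() {
    fmt.Println(normalizedCores(16, "amd64", false, false)) // prints 8 for a 16-vCPU guest
}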

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/142

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/api/pull/1700

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On the customer feedback modal, there are 3 links for users to provide feedback to Red Hat; the third link lacks a title.
    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-21-155123
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Login admin console. Click on "?"->"Share Feedback", check the links on the modal
    2.
    3.
    

Actual results:

1. The third link lacks a link title (the link for "Learn about opportunities to ……").
    

Expected results:

1. There is a link title "Inform the direction of Red Hat" in 4.14; it should also exist in 4.15.
    

Additional info:

screenshot for 4.14 page: https://drive.google.com/file/d/19AnPlE0h9WwvIjxV0gLuf5x27jLN7TLS/view?usp=drive_link
screenshot for 4.15 page: https://drive.google.com/file/d/19MRjzNGRWfYnK-zcoMozh7Z7eaDDG2L-/view?usp=drive_link
    

Description of problem:

The default catalog source pod never gets updated; users have to manually recreate it to pick up updates. Here is a must-gather log for debugging: https://drive.google.com/file/d/16_tFq5QuJyc_n8xkDFyK83TdTkrsVFQe/view?usp=drive_link 

I went through the code and found that the `updateStrategy` depends on the `ImageID`; see

https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go#L527-L534 

// imageID returns the ImageID of the primary catalog source container or an empty string if the image ID isn't available yet.
// Note: the pod must be running and the container in a ready status to return a valid ImageID.
func imageID(pod *corev1.Pod) string {
    if len(pod.Status.ContainerStatuses) < 1 {
        logrus.WithField("CatalogSource", pod.GetName()).Warn("pod status unknown")
        return ""
    }
    return pod.Status.ContainerStatuses[0].ImageID
}

But, for those default catalog source pods, their `pod.Status.ContainerStatuses[0].ImageID` will never change, since it is the `opm` image, not the index image.

jiazha-mac:~ jiazha$ oc get pods redhat-operators-mpvzm -o=jsonpath={.status.containerStatuses} |jq
[
  {
    "containerID": "cri-o://115bd207312c7c8c36b63bfd251c085a701c58df2a48a1232711e15d7595675d",
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:965fe452763fd402ca8d8b4a3fdb13587673c8037f215c0ffcd76b6c4c24635e",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:965fe452763fd402ca8d8b4a3fdb13587673c8037f215c0ffcd76b6c4c24635e",
    "lastState": {},
    "name": "registry-server",
    "ready": true,
    "restartCount": 1,
    "started": true,
    "state": {
      "running": {
        "startedAt": "2024-03-26T04:21:41Z"
      }
    }
  }
] 

The imageID() func should return the index image ID for those default catalog sources.

jiazha-mac:~ jiazha$ oc get pods redhat-operators-mpvzm -o=jsonpath={.status.initContainerStatuses[1]} |jq
{
  "containerID": "cri-o://4cd6e1f45e23aadc27b8152126eb2761a37da61c4845017a06bb6f2203659f5c",
  "image": "registry.redhat.io/redhat/redhat-operator-index:v4.15",
  "imageID": "registry.redhat.io/redhat/redhat-operator-index@sha256:19010760d38e1a898867262698e22674d99687139ab47173e2b4665e588635e1",
  "lastState": {},
  "name": "extract-content",
  "ready": true,
  "restartCount": 1,
  "started": false,
  "state": {
    "terminated": {
      "containerID": "cri-o://4cd6e1f45e23aadc27b8152126eb2761a37da61c4845017a06bb6f2203659f5c",
      "exitCode": 0,
      "finishedAt": "2024-03-26T04:21:39Z",
      "reason": "Completed",
      "startedAt": "2024-03-26T04:21:27Z"
    }
  }
} 
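
A rough sketch of one possible fix (not the actual OLM patch): prefer the ImageID of the extract-content init container, which carries the index image, and only fall back to the primary container. The init container name is taken from the pod status above; everything else is an assumption.

package reconciler

import (
    "github.com/sirupsen/logrus"
    corev1 "k8s.io/api/core/v1"
)

// imageID prefers the index-image ID recorded on the "extract-content" init
// container when present, falling back to the primary registry-server
// container. Illustrative sketch only.
func imageID(pod *corev1.Pod) string {
    for _, s := range pod.Status.InitContainerStatuses {
        if s.Name == "extract-content" && s.ImageID != "" {
            return s.ImageID
        }
    }
    if len(pod.Status.ContainerStatuses) < 1 {
        logrus.WithField("CatalogSource", pod.GetName()).Warn("pod status unknown")
        return ""
    }
    return pod.Status.ContainerStatuses[0].ImageID
}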

Version-Release number of selected component (if applicable):

    4.15.2

How reproducible:

    always

Steps to Reproduce:

    1. Install an OCP 4.16.0
    2. Waiting for the redhat-operator catalog source updates
    3.
    

Actual results:

The redhat-operator catalog source never gets updates.

Expected results:

These default catalog sources should get updates according to the `updateStrategy`.

    jiazha-mac:~ jiazha$ oc get catalogsource redhat-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    operatorframework.io/managed-by: marketplace-operator
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  creationTimestamp: "2024-03-20T15:48:59Z"
  generation: 1
  name: redhat-operators
  namespace: openshift-marketplace
  resourceVersion: "12217605"
  uid: cc0fc420-c9d8-4c7d-997e-f0893b4c497f
spec:
  displayName: Red Hat Operators
  grpcPodConfig:
    extractContent:
      cacheDir: /tmp/cache
      catalogDir: /configs
    memoryTarget: 30Mi
    nodeSelector:
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ""
    priorityClassName: system-cluster-critical
    securityContextConfig: restricted
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 120
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 120
  icon:
    base64data: ""
    mediatype: ""
  image: registry.redhat.io/redhat/redhat-operator-index:v4.15
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m
status:
  connectionState:
    address: redhat-operators.openshift-marketplace.svc:50051
    lastConnect: "2024-03-27T06:35:36Z"
    lastObservedState: READY
  latestImageRegistryPoll: "2024-03-27T10:23:16Z"
  registryService:
    createdAt: "2024-03-20T15:56:03Z"
    port: "50051"
    protocol: grpc
    serviceName: redhat-operators
    serviceNamespace: openshift-marketplace

Additional info:

I also checked currentPodsWithCorrectImageAndSpec, but no hash changed because the pod.spec is always the same.

time="2024-03-26T03:22:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace
time="2024-03-26T03:27:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=xW0cW
time="2024-03-26T03:27:01Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=xW0cW
time="2024-03-26T03:27:02Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=vq5VA
time="2024-03-26T03:27:03Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" catalogsource.name=redhat-operators catalogsource.namespace=openshift-marketplace correctHash=true correctImages=true current-pod.name=redhat-operators-mpvzm current-pod.namespace=openshift-marketplace id=vq5VA

This wasn't supposed to have a junit associated with it, but it looks like it did, and it is now killing payloads. It started failing because the LB was reaped, and we do not yet have confirmation that the new one will be preserved.

This should be pulled out until we've got confirmation that (a) there is no junit for the backend, and (b) the LB is on the preserve whitelist.

Description of problem:

The archive tar file size should respect the archiveSize setting when mirroring with the V2 format

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1) With the following ImageSetConfiguration: 
cat config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 8
storageConfig:
  local:
    path: /app1/ocmirror/offline
mirror:
  platform:
    channels:
    - name: stable-4.12                                             
      type: ocp
      minVersion: '4.12.46'
      maxVersion: '4.12.46'
      shortestPath: true
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: advanced-cluster-management                                  
      channels:
      - name: release-2.9             
    - name: compliance-operator
      channels:
      - name: stable
    - name: multicluster-engine
      channels:
      - name: stable-2.4
      - name: stable-2.5
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest                        
  - name: registry.redhat.io/rhel8/support-tools:latest
  - name: registry.access.redhat.com/ubi8/nginx-120:latest
  - name: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.8.0
  - name: registry.k8s.io/sig-storage/csi-resizer:v1.8.0
  - name: quay.io/openshifttest/hello-openshift@sha256:4200f438cf2e9446f6bcff9d67ceea1f69ed07a2f83363b7fb52529f7ddd8a83
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27

2) Run `oc-mirror --config config.yaml file://out --v2`

Actual results: 

2) The archive size is still 49G, not following the archiveSize setting in the ImageSetConfiguration.
ll  out/ -h
total 49G
-rw-r--r--.  1 root root  49G Mar 20 09:03 mirror_000001.tar
drwxr-xr-x. 11 root root 4.0K Mar 20 08:54 working-dir

Expected results:

Multiple tar files, each respecting the 8G archiveSize limit, should be generated.
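
As a minimal sketch of the expected chunking behaviour (not oc-mirror's actual implementation; the function name and sizes are illustrative), blobs would be packed into archives that stay at or below the archiveSize limit, starting a new archive whenever the next blob would exceed it:

package main

import "fmt"

// splitIntoArchives packs blob sizes (in bytes) into archives whose total
// size stays at or below maxGiB, opening a new archive when the next blob
// would exceed the limit.
func splitIntoArchives(blobSizes []int64, maxGiB int64) [][]int64 {
    limit := maxGiB * 1024 * 1024 * 1024
    var archives [][]int64
    var current []int64
    var currentSize int64
    for _, size := range blobSizes {
        if len(current) > 0 && currentSize+size > limit {
            archives = append(archives, current)
            current, currentSize = nil, 0
        }
        current = append(current, size)
        currentSize += size
    }
    if len(current) > 0 {
        archives = append(archives, current)
    }
    return archives
}

func main() {
    gib := int64(1024 * 1024 * 1024)
    // Three blobs of 5G, 4G and 7G with an 8G limit end up in three archives.
    fmt.Println(len(splitIntoArchives([]int64{5 * gib, 4 * gib, 7 * gib}, 8)))
}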

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:

As a ROSA customer, I want to enforce that my workloads follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.

As Red Hat, I would like to follow AWS best-practices by using AWS Regionalized STS Endpoints instead of the global one.

Per AWS docs:

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html 

AWS recommends using Regional AWS STS endpoints instead of the global endpoint to reduce latency, build in redundancy, and increase session token validity. 

https://docs.aws.amazon.com/sdkref/latest/guide/feature-sts-regionalized-endpoints.html

All new SDK major versions releasing after July 2022 will default to regional. New SDK major versions might remove this setting and use regional behavior. To reduce future impact regarding this change, we recommend you start using regional in your application when possible. 

Acceptance Criteria:

Areas where HyperShift creates STS credentials use regionalized STS endpoints, e.g. https://github.com/openshift/hypershift/blob/ae1caa00ff3a2c2bfc1129f0168efc1e786d1d12/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1225-L1228 

Engineering Details:
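
As a hedged illustration (not HyperShift's actual change), the AWS SDK for Go v1 can be asked to use the regional STS endpoint either via the AWS_STS_REGIONAL_ENDPOINTS=regional environment variable or in code; the region value below is only an example:

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/endpoints"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sts"
)

func main() {
    // Opt in to the regional endpoint (e.g. sts.us-east-1.amazonaws.com)
    // instead of the legacy global sts.amazonaws.com endpoint.
    sess := session.Must(session.NewSession(&aws.Config{
        Region:              aws.String("us-east-1"), // example region
        STSRegionalEndpoint: endpoints.RegionalSTSEndpoint,
    }))
    out, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
    if err != nil {
        panic(err)
    }
    fmt.Println(aws.StringValue(out.Arn))
}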

Description of problem:

    No detailed failure information is reported when the signature of the target release payload cannot be validated during an upgrade. It is unclear to the user which action could be taken for the failure, for example checking whether a wrong ConfigMap is set, whether the default store is unavailable, or whether there is an issue with a custom store.
 
# ./oc adm upgrade
Cluster version is 4.15.0-0.nightly-2023-12-08-202155
Upgradeable=False  

  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates

ReleaseAccepted=False  
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat

Upstream: https://amd64.ocp.releases.ci.openshift.org/graph
Channel: stable-4.15
Recommended updates:  
  VERSION                            IMAGE
  4.15.0-0.nightly-2023-12-09-012410 registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7
 
# ./oc -n openshift-cluster-version logs cluster-version-operator-6b7b5ff598-vxjrq|grep "verified"|tail -n4
I1211 09:28:22.755834       1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:22.755974       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817102       1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817488       1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-08-202155

How reproducible:

    always

Steps to Reproduce:

    1. Trigger a fresh installation with TechPreview enabled (no spec.signatureStores property set by default) 

    2. Trigger an upgrade against a nightly build (no signature available in the default signature store)

    3.
    

Actual results:

    No detailed log about the signature verification failure is provided.

Expected results:

    Include detailed failure information about signature verification in the CVO log.

Additional info:

    https://github.com/openshift/cluster-version-operator/pull/1003

Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/75

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
Issue: Profiles are degraded [1] even after being applied, due to the error below [2]:

[1]

$oc get profile -A
NAMESPACE                                NAME                                          TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   worker0    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker1    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker10   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker11   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker12   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker13   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker14   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker15   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker2    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker3    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker4  rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker5    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker6    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker7    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker8   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker9   rdpmc-patch-worker   True      True       5d

[2]

  lastTransitionTime: "2023-12-05T22:43:12Z"
    message: TuneD daemon issued one or more sysctl override message(s) during profile
      application. Use reapply_sysctl=true or remove conflicting sysctl net.core.rps_default_mask
    reason: TunedSysctlOverride
    status: "True"

Looking at the nodes using the rdpmc-patch-master tuned profile:

NAMESPACE                                NAME                                          TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d

We are configuring the following in the rdpmc-patch-master tuned:

$ oc get tuned rdpmc-patch-master -n openshift-cluster-node-tuning-operator -oyaml |less
spec:
  profile:
  - data: |
      [main]
      include=performance-patch-master
      [sysfs]
      /sys/devices/cpu/rdpmc = 2
    name: rdpmc-patch-master
  recommend:

Below is performance-patch-master, which is included in the above tuned:

spec:
  profile:
  - data: |
      [main]
      summary=Custom tuned profile to adjust performance
      include=openshift-node-performance-master-profile
      [bootloader]
      cmdline_removeKernelArgs=-nohz_full=${isolated_cores}

Below (which appears in the error) is in openshift-node-performance-master-profile, included in the above tuned:

net.core.rps_default_mask=${not_isolated_cpumask}

RHEL BUg has been raised for the same https://issues.redhat.com/browse/RHEL-18972

Version-Release number of selected component (if applicable):
4.14
    

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2156

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

openshift-install is unable to generate an aarch64 iso:
              
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100 %

Steps to Reproduce:

    1. Create an install_config.yaml with controlplane.architecture and compute.architecture = arm64 
    2. openshift-install agent create image  --log-level debug     

Actual results:

DEBUG Generating Agent Installer ISO... 
INFO Consuming Install Config from target directory 
DEBUG Purging asset "Install Config" from disk 
INFO Consuming Agent Config from target directory 
DEBUG Purging asset "Agent Config" from disk 
DEBUG initDisk(): start DEBUG initDisk(): regular file 
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file 

Expected results:

    agent.aarch64.iso is created

Additional info:

Seems to be related to this PR:
https://github.com/openshift/installer/pull/7896

boot.catalog is also referenced in the assisted-image-service here:
https://github.com/openshift/installer/blob/master/vendor/github.com/openshift/assisted-image-service/pkg/isoeditor/isoutil.go#L155

Description of problem:

    'Oh no! Something went wrong.' is shown on the Image Manifest Vulnerability page after creating an IMV via the CLI

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2024-02-03-192446

How reproducible:

    Always

Steps to Reproduce:

    1. Install the 'Red Hat Quay Container Security Operator'
    2. Use the command line to create the IMV
       $ oc create -f imv.yaml 
        imagemanifestvuln.secscan.quay.redhat.com/example created
       $ cat IMV.yaml
         apiVersion: secscan.quay.redhat.com/v1alpha1
         kind: ImageManifestVuln
         metadata:
           name: example
           namespace: openshift-operators
         spec: {}
     3. Navigate to page /k8s/ns/openshift-operators/operators.coreos.com~v1alpha1~ClusterServiceVersion/container-security-operator.v3.10.3/secscan.quay.redhat.com~v1alpha1~ImageManifestVuln
     

Actual results:

    Oh no! Something went wrong. will be shown
Description:Cannot read properties of undefined (reading 'replace')Component trace:Copy to clipboardat T (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/container-security-chunk-c75b48f176a6a5981ee2.min.js:1:3465)
    at https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631947
    at tr
    at x (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:630876)
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:82:73479)
    at tbody
    at table
    at g (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:6:199268)
    at l (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-69dd6dd4312fdf07fedf.min.js:10:88631)
    at D (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:632038)
    at div
    at div
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:50:39294)
    at t (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/vendors~main-chunk-e839d85039c974dbb9bb.min.js:49:16122)
    at o (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:642088)
    at div
    at M (https://console-openshift-console.apps.qe-uidaily-0205.qe.devcluster.openshift.com/static/main-chunk-d635f27942d1b5fdaad6.min.js:1:631697)
    at div

Expected results:

    no error issue

Additional info:

    

Description of problem:

    There is no kubernetes service associated with the kube-scheduler, so it does not require a readiness probe.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

# In the control plane:
kubectl get services | grep scheduler
kubectl get deploy kube-scheduler | grep readiness

Actual results:

    Probe exists, but no service

Expected results:

    No probe or service

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

I took a look at Component Readiness today and noticed that "[sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time" is permafailing.  I modified the sample start time to see that it appears to have started around February 19th.

Is this expected with 4.16 or do we have a problem?

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Cluster%20Version%20Operator&confidence=95&environment=ovn%20upgrade-minor%20amd64%20metal-ipi%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-03-04%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-01-16%2000%3A00%3A00&testId=Cluster%20upgrade%3A0bf7638bc532109d8a7a3c395e2867da&testName=%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%20a%20reasonable%20time&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=standard&variant=standard

 

Component Readiness has found a potential regression in [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.16
Start Time: 2024-02-27T00:00:00Z
End Time: 2024-03-04T23:59:59Z
Success Rate: 0.00%
Successes: 0
Failures: 4
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 47
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Cluster%20Version%20Operator&confidence=95&environment=ovn%20upgrade-minor%20amd64%20metal-ipi%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-03-04%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-02-27%2000%3A00%3A00&testId=Cluster%20upgrade%3A0bf7638bc532109d8a7a3c395e2867da&testName=%5Bsig-cluster-lifecycle%5D%20cluster%20upgrade%20should%20complete%20in%20a%20reasonable%20time&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=standard&variant=standard

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/52

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/117

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

A table in a dashboard relies on the order of the metric labels to merge results

How to Reproduce:

Create a dashboard with a table including this query:

label_replace(sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by ( alertstate, alertname)), "aaa", "$1", "alertstate", "(.+)") 

A single row will be displayed as the query is simulating that the first label `aaa` has a single value.

Expected result:

The table should not rely on a single metric label to merge results but should consider all the labels, so that the expected rows are displayed.
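
A minimal sketch of the expected merge behaviour (illustrative only, not the console dashboard code): build the row key from the full, sorted label set so rows that differ in any label remain distinct.

package main

import (
    "fmt"
    "sort"
    "strings"
)

// rowKey derives a merge key from every label of a query result rather than
// from a single label.
func rowKey(labels map[string]string) string {
    names := make([]string, 0, len(labels))
    for name := range labels {
        names = append(names, name)
    }
    sort.Strings(names)
    parts := make([]string, 0, len(names))
    for _, name := range names {
        parts = append(parts, name+"="+labels[name])
    }
    return strings.Join(parts, ",")
}

func main() {
    a := map[string]string{"aaa": "firing", "alertstate": "firing", "alertname": "Watchdog"}
    b := map[string]string{"aaa": "firing", "alertstate": "firing", "alertname": "TargetDown"}
    fmt.Println(rowKey(a) == rowKey(b)) // false: the rows stay separate
}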

 

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Deployment cannot be scaled up/down when an HPA is associated with it.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Create a test deployment
$ oc new-app httpd
2. Create a HPA for the deployment
$ oc autoscale deployment/httpd --min 1 --max 10 --cpu-percent 10 
3. Scale down the deployment via script or manually to 0 replicas.
$ oc scale deployment/httpd --replicas=0
4. The HPA shows the status below, indicating that it cannot scale until the deployment is scaled up.
~~~
     - type: ScalingActive
      status: 'False'
      lastTransitionTime: '2023-10-24T10:00:01Z'
      reason: ScalingDisabled
      message: scaling is disabled since the replica count of the target is zero
~~~  
5. Since scaling up/down is disabled, users will not be able to scale up the deployment from the GUI. The only option is to do it from the CLI.

Actual results:

The scale up/down arrows are disabled and users are unable to start the deployment.

Expected results:

The scale up/down arrows should be enabled, or another option should be provided that can help scale up the deployment.

Additional info:

 

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/274

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

PipelineRun Logs page navigation is broken when navigating through the tasks on the PipelineRun Logs tab.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to PipelineRuns details page and select the Logs tab.
    2. Navigate through the tasks of the PipelineRun
    

Actual results:

- The Details tab becomes active on selection of any task
- The Logs page becomes empty on selection of the Logs tab again
- The last task is not selected for completed PipelineRuns

Expected results:

- The Logs tab should stay active while the user navigates tasks on the Logs tab
- The last task should be selected in the case of completed PipelineRuns

Additional info:

  It is a regression after a change in the tab-selection logic of the HorizontalNav component. 
 

https://github.com/openshift/console/pull/13216/files#diff-267d61f330ad6cd9b0f2d743d9ff27929fbe7001780d73e1ec88599d3778eb96R177-R190

Video- https://drive.google.com/file/d/15fx9GWO2dRh4uaibRmZ4VTk4HFxQ7NId/view?usp=sharing

 

 

Description of problem:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.0-rc.4: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager

Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.


$ oc get pods -n openshift-operator-lifecycle-manager 
NAME                                      READY   STATUS             RESTARTS        AGE
catalog-operator-db86b7466-gdp4g          1/1     Running            0               9h
collect-profiles-28443465-9zzbk           0/1     Completed          0               34m
collect-profiles-28443480-kkgtk           0/1     Completed          0               19m
collect-profiles-28443495-shvs7           0/1     Completed          0               4m10s
olm-operator-56cb759d88-q2gr7             0/1     CrashLoopBackOff   8 (3m27s ago)   20m
package-server-manager-7cf46947f6-sgnlk   2/2     Running            0               9h
packageserver-7b795b79f-thxfw             1/1     Running            1               14d
packageserver-7b795b79f-w49jj             1/1     Running            0               4d17h

Version-Release number of selected component (if applicable):

 

How reproducible:

Unknown

Steps to Reproduce:

Upgrade from 4.15.0-rc.2 to 4.15.0-rc.4    

Actual results:

The upgrade is unable to proceed

Expected results:

The upgrade can proceed

Additional info:
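
A hedged sketch of commands that could capture why olm-operator is crash-looping (the pod name is taken from the output above and will differ per cluster):

oc -n openshift-operator-lifecycle-manager describe pod olm-operator-56cb759d88-q2gr7
oc -n openshift-operator-lifecycle-manager logs olm-operator-56cb759d88-q2gr7 --previous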

    

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/394

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Pod capi-ibmcloud-controller-manager stuck in ContainerCreating on IBM cloud

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

    1. Built a cluster on ibm cloud and enable TechPreviewNoUpgrade
    2.
    3.
    

Actual results:

4.16 cluster
$ oc get po                        
NAME                                                READY   STATUS              RESTARTS      AGE
capi-controller-manager-6bccdc844-jsm4s             1/1     Running             9 (24m ago)   175m
capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh   0/2     ContainerCreating   0             175m
cluster-capi-operator-768c6bd965-5tjl5              1/1     Running             0             3h

  Warning  FailedMount       5m15s (x87 over 166m)  kubelet            MountVolume.SetUp failed for volume "credentials" : secret "capi-ibmcloud-manager-bootstrap-credentials" not found

$ oc get clusterversion               
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-01-21-154905   True        False         156m    Cluster version is 4.16.0-0.nightly-2024-01-21-154905

4.15 cluster
$ oc get po                           
NAME                                                READY   STATUS              RESTARTS        AGE
capi-controller-manager-6b67f7cff4-vxtpg            1/1     Running             6 (9m51s ago)   35m
capi-ibmcloud-controller-manager-54887589c6-6plt2   0/2     ContainerCreating   0               35m
cluster-capi-operator-7b7f48d898-9r6nn              1/1     Running             1 (17m ago)     39m
$ oc get clusterversion           
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-22-160236   True        False         11m     Cluster version is 4.15.0-0.nightly-2024-01-22-160236        

Expected results:

No pod is in ContainerCreating status

Additional info:

must-gather: https://drive.google.com/file/d/1F5xUVtW-vGizAYgeys0V5MMjp03zkSEH/view?usp=sharing 
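
A hedged sketch of checks for the missing bootstrap credentials secret, assuming the CAPI components run in the openshift-cluster-api namespace (the pod name is taken from the 4.16 output above):

oc -n openshift-cluster-api get secret capi-ibmcloud-manager-bootstrap-credentials
oc -n openshift-cluster-api describe pod capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh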

Description of problem:

    Navigation:
    Workloads -> Deployments -> (select any Deployment from list) -> Details -> Volumes -> Remove volume

    Issue:
    Message "Are you sure you want to remove volume audit-policies from Deployment: apiserver?" is in English.

    Observation:
    Translation is present in branch release-4.15 file...
    frontend/public/locales/ja/public.json

Version-Release number of selected component (if applicable):

    4.15.0-rc.3

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Content is in English

Expected results:

    Content should be in selected language

Additional info:

    Reference screenshot attached.

Please review the following PR: https://github.com/openshift/builder/pull/385

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

While attempting to provision 300 clusters every hour of mixed cluster sizes (SNO, compact, and standard), it appears that the metal3 baremetal operator has hit a failure and stops provisioning any clusters. Out of the 1850 attempted clusters, only 282 provisioned successfully (mostly SNO size).

There are many errors in the baremetal operator log, some of which are actual stack traces, but it is unclear whether they are the actual reason why the clusters began to fail to install, with 100% failing to install on the 3rd wave and beyond.


Version-Release number of selected component (if applicable):

Hub OCP - 4.14.0-rc.2
Deployed Cluster OCP - 4.14.0-rc.2
ACM - 2.9.0-DOWNSTREAM-2023-09-27-22-12-46

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Some of the errors found in the logs:
{"level":"error","ts":"2023-09-28T22:39:56Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":{"name":"vm01343","namespace":"compact-00046"},"namespace":"compact-00046","name":"vm01343","reconcileID":"4bbfa52f-12a6-4983-b86b-01086491de9f","error":"action \"provisioning\" failed: failed to provision: failed to change provisioning state to \"active\": Internal Server Error","errorVerbose":"Internal Server Error\nfailed to change provisioning state to \"active\"\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).tryChangeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:740\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).changeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:750\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).Provision\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1604\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1179\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\nfailed to 
provision\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1188\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"provisioning\" 
failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}


{"level":"info","ts":"2023-09-29T16:11:24Z","logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"standard-00241~vm03618","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/","error":"Bad Gateway"}



Description of problem:

During the destroy cluster operation, unexpected results from the IBM Cloud API calls for Disks can cause panics when response data (or entire responses) are missing, resulting in unexpected failures during destroy.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Unknown, dependent on IBM Cloud API responses

Steps to Reproduce:

1. Successfully create IPI cluster on IBM Cloud
2. Attempt to cleanup (destroy) the cluster

Actual results:

Golang panic while attempting to parse an HTTP response that is missing or lacking data.


level=info msg=Deleted instance "ci-op-97fkzvv2-e6ed7-5n5zg-master-0"
E0918 18:03:44.787843      33 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 228 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x6a3d760?, 0x274b5790})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x6a3d760, 0x274b5790})
	/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion.func1()
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:84 +0x12a
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).Retry(0xc000791ce0, 0xc000573700)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:99 +0x73
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion(0xc000791ce0, {{0xc00160c060, 0x29}, {0xc00160c090, 0x28}, {0xc0016141f4, 0x9}, {0x82b9f0d, 0x4}, {0xc00160c060, ...}})
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:78 +0x14f
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyDisks(0xc000791ce0)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:118 +0x485
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction.func1()
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:201 +0x3f
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x7f7801e503c8, 0x18})
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:109 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x227a2f78?, 0xc00013c000?}, 0xc000a9b690?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:154 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x227a2f78, 0xc00013c000}, 0xd0?, 0x146fea5?, 0x7f7801e503c8?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:245 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x227a2f78, 0xc00013c000}, 0x4136e7?, 0x28?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:229 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x100000000000000?, 0x806f00?)
	/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:214 +0x46
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction(0xc000791ce0, {{0x82bb9a3?, 0xc000a9b7d0?}, 0xc000111de0?}, 0x840366?, 0xc00054e900?)
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:198 +0x108
created by github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyCluster
	/go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:172 +0xa87
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

Destroy IBM Cloud Disks during cluster destroy, or provide a useful error message to follow up on.

Additional info:

The ability to reproduce is relatively low, as it requires the IBM Cloud APIs to return specific data (or lack thereof), and it is currently unknown why the HTTP response and/or data is missing.

IBM Cloud already has a PR to attempt to mitigate this issue, as was done with other destroy resource calls. Potentially follow up for additional resources as necessary.
https://github.com/openshift/installer/pull/7515
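
A hedged way to gather more detail when this happens is to re-run the destroy with debug logging and note which disk the uninstaller was polling when the panic occurs; the assets directory is a placeholder:

openshift-install destroy cluster --dir <install-dir> --log-level debug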

Observed during testing of candidate-4.15 image as of 2024-02-08.

This is an incomplete report as I haven't verified the reproducer yet or attempted to get a must-gather. I have observed this multiple times now, so I am confident it's a thing. I can't be confident that the procedure described here reliably reproduces it, or that all the described steps are required.

I have been using MCO to apply machine config to masters. This involves a rolling reboot of all masters.

During a rolling reboot I applied an update to CPMS. I observed the following sequence of events:

  • master-1 was NotReady as it was rebooting
  • I modified CPMS
  • CPMS immediately started provisioning a new master-0
  • CPMS immediately started deleting master-1
  • CPMS started provisioning a new master-1

At this point there were only 2 nodes in the cluster:

  • old master-0
  • old master-2

and machines provisioning:

  • new master-0
  • new master-1
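
A hedged set of commands that can be used to observe this kind of rollout while it happens; this is only an observation aid, not part of the reproducer:

oc get controlplanemachineset -n openshift-machine-api
oc get machines -n openshift-machine-api
oc get nodes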

Description of the problem:

Until the latest release we had a test which tried to set the base DNS name to 11.11.11 and expected the BE to throw an exception; it was succeeding (the BE was throwing the exception).
Since the last release that is no longer the case and the DNS name is being accepted.

How reproducible:

 

Steps to reproduce:

1. Create a cluster and set the base DNS name to 11.11.11

2.

3.

Actual results:

The base DNS name "11.11.11" is accepted and no exception is thrown.

Expected results:

According to the discussion in the thread, a DNS name must start with a letter, so the BE is expected to throw an exception (a hedged sketch of the rule follows).
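
The expected rule (a base domain must start with a letter) could be approximated with a check like the following; the exact backend validation logic is an assumption:

base_domain="11.11.11"
if ! echo "$base_domain" | grep -Eq '^[a-zA-Z]'; then
  echo "invalid base domain: must start with a letter" >&2
fi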

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/159

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/115

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On a hybrid cluster with Windows nodes and CoreOS nodes mixed, egressIP can no longer be applied to CoreOS nodes.
QE testing profile: 53_IPI on AWS & OVN & WindowsContainer 

Version-Release number of selected component (if applicable):

4.14.3

How reproducible:

Always

Steps to Reproduce:

1.  Setup cluster with template aos-4_14/ipi-on-aws/versioned-installer-ovn-winc-ci
2.  Label a CoreOS node as an egress node
% oc describe node ip-10-0-59-132.us-east-2.compute.internal
Name:               ip-10-0-59-132.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m6i.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    k8s.ovn.org/egress-assignable=
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-59-132.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m6i.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2b
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        cloud.network.openshift.io/egress-ipconfig:
                      [{"interface":"eni-0c661bbdbb0dde54a","ifaddr":{"ipv4":"10.0.32.0/19"},"capacity":{"ipv4":14,"ipv6":15}}]
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0629862832fff4ae3"}
                    k8s.ovn.org/host-cidrs: ["10.0.59.132/19"]
                    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.129.2.13
                    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 0a:58:0a:81:02:0d
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-132.us-east-2.compute.internal","mac-address":"06:06:e2:7b:9c:45","ip-address...
                    k8s.ovn.org/network-ids: {"default":"0"}
                    k8s.ovn.org/node-chassis-id: fa1ac464-5744-40e9-96ca-6cdc74ffa9be
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.7/16"}
                    k8s.ovn.org/node-id: 7
                    k8s.ovn.org/node-mgmt-port-mac-address: a6:25:4e:55:55:36
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.132/19"}
                    k8s.ovn.org/node-subnets: {"default":["10.129.2.0/23"]}
                    k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.7/16"}
                    k8s.ovn.org/remote-zone-migrated: ip-10-0-59-132.us-east-2.compute.internal
                    k8s.ovn.org/zone-name: ip-10-0-59-132.us-east-2.compute.internal
                    machine.openshift.io/machine: openshift-machine-api/wduan-debug-1120-vtxkp-worker-us-east-2b-z6wlc
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
                    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 22806
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 20 Nov 2023 09:46:53 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-59-132.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Mon, 20 Nov 2023 14:01:05 +0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:46:53 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 20 Nov 2023 13:57:33 +0800   Mon, 20 Nov 2023 09:47:34 +0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.59.132
  InternalDNS:  ip-10-0-59-132.us-east-2.compute.internal
  Hostname:     ip-10-0-59-132.us-east-2.compute.internal
Capacity:
  cpu:                4
  ephemeral-storage:  125238252Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16092956Ki
  pods:               250
Allocatable:
  cpu:                3500m
  ephemeral-storage:  114345831029
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             14941980Ki
  pods:               250
System Info:
  Machine ID:                             ec21151a2a80230ce1e1926b4f8a902c
  System UUID:                            ec21151a-2a80-230c-e1e1-926b4f8a902c
  Boot ID:                                cf4b2e39-05ad-4aea-8e53-be669b212c4f
  Kernel Version:                         5.14.0-284.41.1.el9_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 414.92.202311150705-0 (Plow)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.27.1-13.1.rhaos4.14.git956c5f7.el9
  Kubelet Version:                        v1.27.6+b49f9d1
  Kube-Proxy Version:                     v1.27.6+b49f9d1
ProviderID:                               aws:///us-east-2b/i-0629862832fff4ae3
Non-terminated Pods:                      (21 in total)
  Namespace                               Name                                                      CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                               ----                                                      ------------  ----------  ---------------  -------------  ---
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-tlw5h                             30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         4h14m
  openshift-cluster-node-tuning-operator  tuned-4fvgv                                               10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h14m
  openshift-dns                           dns-default-z89zl                                         60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         11m
  openshift-dns                           node-resolver-v9stn                                       5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         4h14m
  openshift-image-registry                image-registry-67b88dc677-76hfn                           100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         4h14m
  openshift-image-registry                node-ca-hw62n                                             10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h14m
  openshift-ingress-canary                ingress-canary-9r9f8                                      10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4h13m
  openshift-ingress                       router-default-5957f4f4c6-tl9gs                           100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         4h18m
  openshift-machine-config-operator       machine-config-daemon-h7fx4                               40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         4h14m
  openshift-monitoring                    alertmanager-main-1                                       9m (0%)       0 (0%)      120Mi (0%)       0 (0%)         4h12m
  openshift-monitoring                    monitoring-plugin-68995cb674-w2wr9                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    node-exporter-kbq8z                                       9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    prometheus-adapter-54fc7b9c87-sg4vt                       1m (0%)       0 (0%)      40Mi (0%)        0 (0%)         4h13m
  openshift-monitoring                    prometheus-k8s-1                                          75m (2%)      0 (0%)      1104Mi (7%)      0 (0%)         4h12m
  openshift-monitoring                    prometheus-operator-admission-webhook-84b7fffcdc-x8hsz    5m (0%)       0 (0%)      30Mi (0%)        0 (0%)         4h18m
  openshift-monitoring                    thanos-querier-59cbd86d58-cjkxt                           15m (0%)      0 (0%)      92Mi (0%)        0 (0%)         4h13m
  openshift-multus                        multus-7gjnt                                              10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         4h14m
  openshift-multus                        multus-additional-cni-plugins-gn7x9                       10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4h14m
  openshift-multus                        network-metrics-daemon-88tf6                              20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         4h14m
  openshift-network-diagnostics           network-check-target-kpv5v                                10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         4h14m
  openshift-ovn-kubernetes                ovnkube-node-74nl9                                        80m (2%)      0 (0%)      1630Mi (11%)     0 (0%)         3h51m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                619m (17%)    0 (0%)
  memory             4296Mi (29%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>

 % oc get node -l k8s.ovn.org/egress-assignable=             
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-59-132.us-east-2.compute.internal   Ready    worker   4h14m   v1.27.6+b49f9d1
3.  Create an EgressIP object (a minimal example follows)
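
A minimal sketch of the EgressIP object used in step 3; the name and IP match the output below, while the namespaceSelector label is hypothetical and only needs to match the test namespace:

cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs:
  - 10.0.59.101
  namespaceSelector:
    matchLabels:
      env: qe
EOF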

Actual results:

% oc get egressip        
NAME         EGRESSIPS     ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.59.101        

% oc get cloudprivateipconfig
No resources found

Expected results:

The egressIP should be applied to egress node

Additional info:


Many jobs are failing because route53 is throttling us during cluster creation.
We need to make external-dns make fewer calls.

The theoretical minimum is:
list zones - 1 call
list zone records - (# of records / 100) calls
create 3 records per HC - 1-3 calls depending on how they are batched

Work to setup the endpoint will be handled in another card.

For this one we want to set up a new disruption backend similar to the cluster-network-liveness-probe. We'll poll, submit request IDs, and possibly job identifiers.

This approach gets us free disruption intervals in BigQuery, charting per job, and graphing capabilities from the disruption dashboard.

Description of problem:

In https://github.com/openshift/release/pull/47648 ecr-credentials-provider is built in CI and later included in RHCOS.

In order to make it work on OKD it needs to be included in the payload, so that OKD machine-os can extract the RPM and install it on the host.
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Ref: OCPBUGS-25662

Description of problem:

   Enable KMS v2 in the ibmcloud KMS provider

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/67

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

There's a typo in the openssl commands within the ovn-ipsec-containerized/ovn-ipsec-host daemonsets. The correct parameter is "-checkend", not "-checkedn".

Version-Release number of selected component (if applicable):

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.10   True        False         7s      Cluster version is 4.14.10

How reproducible:

Steps to Reproduce:

1. Enable IPsec encryption

# oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec": 
 {"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'

Actual results:

Examining the initContainer (ovn-keys) logs

# oc logs ovn-ipsec-containerized-7bcd2 -c ovn-keys
...
+ openssl x509 -noout -dates -checkedn 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
x509: Use -help for summary.
# oc get ds
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
ovn-ipsec-containerized   1         1         0       1            0           beta.kubernetes.io/os=linux   159m
ovn-ipsec-host            1         1         1       1            1           beta.kubernetes.io/os=linux   159m
ovnkube-node              1         1         1       1            1           beta.kubernetes.io/os=linux   3h44m
# oc get ds ovn-ipsec-containerized -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then     

# oc get ds ovn-ipsec-host -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
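
For reference, the corrected invocation with the parameter named in the description above would look like this:

openssl x509 -noout -dates -checkend 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem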

Description of problem:

oc-mirror with v2 will create the idms file as output , but the source is like :
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
status: {}

The source should always be the origin registry, e.g. quay.io/openshift-release-dev.

 

Version-Release number of selected component (if applicable):

   

How reproducible:

always  

Steps to Reproduce:

   1. run the command with v2 :
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
      - name: stable-4.14
        minVersion: 4.14.3
        maxVersion: 4.14.3
    graph: true

`oc-mirror --config config.yaml file://out --v2` 
`oc-mirror --config config.yaml --from file://out  --v2 docker://xxxx:5000/ocp2`    
2. check the idms file 

Actual results:

    2. cat idms-2024-01-08t04-19-04z.yaml 
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - xxxx.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - xxxx.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev

Expected results:

The source should not be localhost:55000; it should be the origin registry (a sketch of the expected output follows).
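
A sketch of what the expected IDMS entry for the release images might look like, with the mirror registry host redacted as in the output above:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - xxxx.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev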

Additional info:

    

Please review the following PR: https://github.com/openshift/bond-cni/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We have ClusterOperatorDown and ClusterOperatorDegraded in this space for ClusterOperator conditions. We should wire that up for ClusterVersion as well.

Recently lextudio dropped pyasn1, so we want to be explicit and show that we install pysnmp-lextudio but the regular pyasn1.

Description of the problem:
Before we create a cluster, the UI fetches the list of OpenShift versions with latest_only=false and with latest_only=true.
The CPU architectures are not the same in both responses.

Example:

Latest:
"4.11.59": {
        "cpu_architectures": [
            "x86_64"
        ],
        "display_name": "4.11.59",
        "support_level": "end-of-life"
    },
    "4.12.53": {
        "cpu_architectures": [
            "x86_64"
        ],
        "display_name": "4.12.53",
        "support_level": "maintenance"
    },

from all:

"4.11.59": {
        "cpu_architectures": [
            "x86_64",
            "arm64"
        ],
        "display_name": "4.11.59",
        "support_level": "end-of-life"
    },
 "4.12.53": {
        "cpu_architectures": [
            "x86_64",
            "arm64"
        ],
        "display_name": "4.12.53",
        "support_level": "maintenance"
    },
 

 

How reproducible:

Always

Steps to reproduce:

Before creating a cluster, open the browser debug tools and look at the openshift-versions responses returned, one with latest_only=true and one with latest_only=false.

Actual results:

The cpu_architectures returned are not the same in the two responses (latest_only=true lists only x86_64, while latest_only=false also lists arm64).

Expected results:

Both responses should report the same cpu_architectures for a given version (a sketch of a direct comparison follows).
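
A hedged way to compare the two responses directly; the API host, path, and token handling are assumptions about the environment:

curl -s -H "Authorization: Bearer ${TOKEN}" \
  'https://api.openshift.com/api/assisted-install/v2/openshift-versions?latest_only=true' | jq '.["4.12.53"].cpu_architectures'
curl -s -H "Authorization: Bearer ${TOKEN}" \
  'https://api.openshift.com/api/assisted-install/v2/openshift-versions?latest_only=false' | jq '.["4.12.53"].cpu_architectures'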

Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/144

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/129

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

I noticed this today when looking at component readiness.  A ~5% decrease in the pass rate may seem minor, but these can certainly add up.  This test passed 713 times in a row on 4.14.  You can see today's failure here.

 

Details below:

-------

Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.

Probability of significant regression: 99.96%

Sample (being evaluated) Release: 4.15
Start Time: 2024-01-17T00:00:00Z
End Time: 2024-01-23T23:59:59Z
Success Rate: 94.83%
Successes: 55
Failures: 3
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 713
Failures: 0
Flakes: 4

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=ovn%20upgrade-minor%20amd64%20gcp%20rt&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=gcp&platform=gcp&sampleEndTime=2024-01-23%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-01-17%2000%3A00%3A00&testId=openshift-tests-upgrade%3A37f1600d4f8d75c47fc5f575025068d2&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=rt&variant=rt

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/87

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When deploying with a service ID, the installer is unable to query resource groups.

Version-Release number of selected component (if applicable):

    4.13-4.16

How reproducible:

    Easily

Steps to Reproduce:

    1. Create a service ID with seemingly enough permissions to do an IPI install
    2. Deploy to power vs with IPI
    3. Fail
    

Actual results:

    Fail to deploy a cluster with service ID

Expected results:

    cluster create should succeed

Additional info:
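
A hedged sanity check with the IBM Cloud CLI, using the service ID's API key, to confirm whether that identity can list resource groups at all (the region is a placeholder):

ibmcloud login --apikey "${SERVICE_ID_API_KEY}" -r us-south
ibmcloud resource groups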

    

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/85

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1978

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Topology links between VMs and non-VMs (such as a Pod or Deployment) are not shown

Version-Release number of selected component (if applicable):

4.12.14

How reproducible:

Every time, via UI or annotation

Steps to Reproduce:

1. Create VM
2. Create Pod/Deployment
3. Add annotation or link via UI

Actual results:

Only the annotation is updated

Expected results:

topology shows linkage

Additional info:

 app.openshift.io/connects-to: >-
      [{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master00"},{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master01"},{"apiVersion":"kubevirt.io/v1","kind":"VirtualMachine","name":"es-master02"}]

Please review the following PR: https://github.com/openshift/builder/pull/375

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Installed a private cluster using Azure workload identity, and the installation failed because no worker machines were provisioned.

install-config:
----------------------
platform:
  azure:
    region: eastus
    networkResourceGroupName: jima971b-12015319-rg
    virtualNetwork: jima971b-vnet
    controlPlaneSubnet: jima971b-master-subnet
    computeSubnet: jima971b-worker-subnet
    resourceGroupName: jima971b-rg
publish: Internal
credentialsMode: Manual

A detailed check on the cluster found that the machine-api/ingress/image-registry operators reported permission issues and have no access to the customer vnet.

$ oc get machine -n openshift-machine-api
NAME                                  PHASE     TYPE              REGION   ZONE   AGE
jima971b-qqjb7-master-0               Running   Standard_D8s_v3   eastus   2      5h14m
jima971b-qqjb7-master-1               Running   Standard_D8s_v3   eastus   3      5h14m
jima971b-qqjb7-master-2               Running   Standard_D8s_v3   eastus   1      5h15m
jima971b-qqjb7-worker-eastus1-mtc47   Failed                                      4h52m
jima971b-qqjb7-worker-eastus2-ph8bk   Failed                                      4h52m
jima971b-qqjb7-worker-eastus3-hpmvj   Failed                                      4h52m

Errors on worker machine:
--------------------
  errorMessage: 'failed to reconcile machine "jima971b-qqjb7-worker-eastus1-mtc47":
    network.SubnetsClient#Get: Failure responding to request: StatusCode=403 -- Original
    Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed"
    Message="The client ''705eb743-7c91-4a16-a7cf-97164edc0341'' with object id ''705eb743-7c91-4a16-a7cf-97164edc0341''
    does not have authorization to perform action ''Microsoft.Network/virtualNetworks/subnets/read''
    over scope ''/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima971b-12015319-rg/providers/Microsoft.Network/virtualNetworks/jima971b-vnet/subnets/jima971b-worker-subnet''
    or the scope is invalid. If access was recently granted, please refresh your credentials."'
  errorReason: InvalidConfiguration

After manually creating a custom role with the missing permissions for machine-api/ingress/cloud-controller-manager/image-registry, and assigning it to the machine-api/ingress/cloud-controller-manager/image-registry user-assigned identities on the scope of the customer vnet, the cluster recovered and became running.

Permissions for machine-api/cloud-controller-manager/ingress on customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"

Permissions for image-registry on customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
"Microsoft.Network/virtualNetworks/join/action"

Version-Release number of selected component (if applicable):

    4.15 nightly build

How reproducible:

    always on recent 4.15 payload

Steps to Reproduce:

    1. prepare install-config with private cluster configuration + credentialsMode: Manual
    2. using ccoctl tool to create workload identity
    3. install cluster
    

Actual results:

    Installation failed due to permission issues

Expected results:

    ccoctl also needs to assign the custom role to the machine-api/ccm/image-registry user-assigned identities on the scope of the customer vnet if it is configured in the install-config

Additional info:

Issue is only detected on 4.15, it works on 4.14. 

Please review the following PR: https://github.com/openshift/configmap-reload/pull/58

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

oc-mirror - maxVersion of the imageset config is ignored for operators

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create the 2 imageset configurations that we are using:

_imageset-config-test1-1.yaml:_
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /local/oc-mirror/test1/metadata
mirror:
  platform:
    architectures:
      - amd64
    graph: true
    channels:
      - name: stable-4.12
        type: ocp
        minVersion: 4.12.1
        maxVersion: 4.12.1
        shortestPath: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: cincinnati-operator
          channels:
            - name: v1
              minVersion: 5.0.1
              maxVersion: 5.0.1
~~~

_imageset-config-test1-2.yaml:_
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /local/oc-mirror/test1/metadata
mirror:
  platform:
    architectures:
      - amd64
    graph: true
    channels:
      - name: stable-4.12
        type: ocp
        minVersion: 4.12.1
        maxVersion: 4.12.1
        shortestPath: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
      packages:
        - name: cincinnati-operator
          channels:
            - name: v1
              minVersion: 5.0.1
              maxVersion: 5.0.1
        - name: local-storage-operator
          channels:
            - name: stable
              minVersion: 4.12.0-202305262042
              maxVersion: 4.12.0-202305262042
        - name: odf-operator
          channels:
            - name: stable-4.12
              minVersion: 4.12.4-rhodf
              maxVersion: 4.12.4-rhodf
        - name: rhsso-operator
          channels:
            - name: stable
              minVersion: 7.6.4-opr-002
              maxVersion: 7.6.4-opr-002
    - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.12
      packages:
        - name: k10-kasten-operator-rhmp
          channels:
            - name: stable
              minVersion: 6.0.6
              maxVersion: 6.0.6
  additionalImages:
    - name: registry.redhat.io/rhel8/postgresql-13:1-125
~~~

2. Generate a first .tar file from the first imageset-config file (imageset-config-test1-1.yaml)
oc mirror --config=imageset-config-test1-1.yaml file:///local/oc-mirror/test1

3. Use the first .tar file to populate our registry
oc mirror --from=/root/oc-mirror/test1/mirror_seq1_000000.tar docker://registry-url/oc-mirror1

4.Generate a second .tar file from the second imageset-config file (imageset-config-test1-2.yaml)
oc mirror --config=imageset-config-test1-2.yaml file:///local/oc-mirror/test1

5. Populate the private registry named `oc-mirror1` with the second .tar file:

oc mirror --from=/root/oc-mirror/test1/mirror_seq2_000000.tar docker://registry-url/oc-mirror1

6. Check the catalog index for **odf** and **rhsso** operators

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.7-rhodf
    4.12.8-rhodf
    4.12.4-rhodf
    4.12.5-rhodf
    4.12.6-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002
    7.6.4-opr-003
    7.6.5-opr-001
    7.6.5-opr-002

Actual results:

Check the catalog index for **odf** and **rhsso** operators. oc-mirror is not respecting the minVersion & maxVersion

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.7-rhodf
    4.12.8-rhodf
    4.12.4-rhodf
    4.12.5-rhodf
    4.12.6-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror1/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002
    7.6.4-opr-003
    7.6.5-opr-001
    7.6.5-opr-002

Expected results:

oc-mirror should respect the minVersion & maxVersion

[root@test ~]# oc-mirror list operators --package odf-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable-4.12
    VERSIONS
    4.12.4-rhodf

[root@test ~]# oc-mirror list operators --package rhsso-operator --catalog=registry-url/oc-mirror2/redhat/redhat-operator-index:v4.12 --channel stable
    VERSIONS
    7.6.4-opr-002

Additional info:

 

Description of problem:

    With cloud-credential-operator moving to rhel9 by default, we added rhel8 binaries. However, users currently have no way of downloading them using `oc`

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

When attempting to extract ccoctl.rhel8 

Steps to Reproduce:

    1. oc adm release extract --tools
    2.
    3.
    

Actual results:

    Only contains ccoctl tarball

Expected results:

    Should include ccoctl.rhel8 and ccoctl.rhel9 tarballs

Additional info:

    ccoctl.rhel8 and ccoctl.rhel9 binaries added in https://issues.redhat.com//browse/OCPBUGS-31290

Description of problem:

After enabling user-defined monitoring on an HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Start an hypershift-hosted cluster with cluster-bot
2. Enable user-defined monitoring
3.

Actual results:

PrometheusOperatorRejectedResources alert becomes firing

Expected results:

No alert firing

Additional info:

Need to reach out to the HyperShift folks as the fix should probably be in their code base.
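
A hedged first check is to look for the rejection reason in the prometheus-operator logs of the user-workload-monitoring stack inside the hosted cluster; the namespace is an assumption:

oc -n openshift-user-workload-monitoring logs deploy/prometheus-operator | grep -i reject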

Description of problem:

In https://github.com/openshift/release/pull/47618 there are quite a few warnings from snyk in the presubmit rehearsal jobs that have not been reported in the bugs filed against storage

We need to go through each one and either fix (in the case of legit bugs) or ignore (false positives / test cases) to avoid having a presubmit job that always fails

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

run 'security' presubmit rehearsal jobs in https://github.com/openshift/release/pull/47618    

Actual results:

snyk issues reported

Expected results:

clean test runs

Additional info:

    

In looking at a component readiness test page we see some failures that take a long time to load:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992 (I noticed that this one resulted in messages asking me to restart chrome)
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1767279909555671040
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1766663255406678016
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1765279833048223744

We'd like to understand why it takes a long time to load these jobs and possibly take some action to remediate as much of that slowness as possible.

Taking a long time to load prow jobs will make our TRT tools seem unusable and might make it difficult for managers to inspect Component Readiness failures which would slow down getting them resolved.

Some idea of what to look at:

  • see if the file size of the jobs is any bigger now than before, especially for runs with a lot of failures
  • see if the recent change that cuts the size of the intervals down is still working as expected
  • compare the file size of a passing run vs. one with a lot of failures

Description of the problem:

Prepare a cluster for installation and add an applied configuration via the API code.

When we try to install the cluster it goes back to Ready as expected, but if we fix the configuration and try to install again it ALWAYS fails on the first attempt (not related to timing).

It works only on the second install attempt, without changing the configuration.

 

How reproducible:

Always

Steps to reproduce:

1. Prepare the cluster for installation

2. Create an invalid applied configuration to verify that preparing-for-installation returns to the Ready state as expected.

 
invalid_override_config = {
    "capabilities": {
        "additionalEnabledCapabilities": ["123454baremetal"]
    }
}
3. Start installation; the cluster goes back to Ready as expected.

4. Fix the applied configuration and try to install again -> fails (a hedged example of a corrected configuration follows below).
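
A hedged example of what the corrected applied configuration could look like, assuming the intent was to enable the valid "baremetal" capability (illustrative only; it mirrors the format of the invalid config above):

corrected_override_config = {
    "capabilities": {
        "additionalEnabledCapabilities": ["baremetal"]
    }
}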

Actual results:

On the first attempt after the config change, the installation fails.

It works only on the second try.

Expected results:

 

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/273

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


The following binaries need to get extracted from the release payload for both rhel8 and rhel9:

oc
ccoctl
opm
openshift-install
oc-mirror

The images that contain these should produce artifacts of both kinds in some location, and probably make the artifact matching their architecture available under the normal location in the path. Example:

/usr/share/<binary>.rhel8
/usr/share/<binary>.rhel9
/usr/bin/<binary>

This ticket is about getting "oc adm release extract" to do the right thing in a backwards-compatible way: if both binaries are available, extract those; if not, fall back to the old location.
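
A minimal sketch of that backwards-compatible lookup, assuming the example layout above (illustrative only, not the actual implementation):

# for each tool, prefer the per-RHEL artifacts; otherwise use the old single location
for bin in oc ccoctl opm openshift-install oc-mirror; do
  if [ -e "/usr/share/${bin}.rhel8" ] && [ -e "/usr/share/${bin}.rhel9" ]; then
    echo "extract /usr/share/${bin}.rhel8 and /usr/share/${bin}.rhel9"
  else
    echo "fall back to /usr/bin/${bin}"
  fi
done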


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

BuildRun logs cannot be displayed in the console and show the following error:

The buildrun is created and started using the shp cli (similar behavior is observed when the build is created & started via console/yaml too):

shp build create goapp-buildah \
    --strategy-name="buildah" \
    --source-url="https://github.com/shipwright-io/sample-go" \
    --source-context-dir="docker-build" \
    --output-image="image-registry.openshift-image-registry.svc:5000/demo/go-app" 

The issue occurs on OCP 4.14.6. Investigation showed that this works correctly on OCP 4.14.5.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Today, egress firewall rules with 'nodeSelector' only use the node IP in the OVN ACL rule. However, a node may have secondary IPs in addition to the node IP. We should create the ACL with all possible IPs of the selected node.

Version-Release number of selected component (if applicable):

 

How reproducible:

Create an egress firewall rule with 'nodeSelector'

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:
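
For illustration, a minimal EgressFirewall rule selecting a node by label (the namespace and hostname label value are assumptions; today only the selected node's primary IP ends up in the generated ACL):

cat <<EOF | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test
spec:
  egress:
  - type: Allow
    to:
      nodeSelector:
        matchLabels:
          kubernetes.io/hostname: worker-0
EOF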

 

Description of problem:

When you delete a cluster, or just a BMH, before the installation starts (that is, before Assisted Service takes control), the metal3 operator tries to generate a PreprovisioningImage.

In previous versions, a fix was added so that the creation of the PreprovisioningImage is not invoked during some early installation phases:

https://github.com/openshift/baremetal-operator/pull/262/files#diff-a69d9029388ab766ed36b32180145f52785a9d4a153775510dbddfa928a72e1cR787

It was based on the "StateDeleting" status.

Recently, a new status, "StatePoweringOffBeforeDelete", was added:

https://github.com/openshift/baremetal-operator/commit/6f65d8e75ef6ed921863ebaf793cccda61de8bcb#diff-eeed3703d04e4c23a7d7af8cd0b7931b6b7990f23d826c49bdbc31c5f0a50291

but this status is not covered by the previous fix, and during this new phase the operator should not try to create the image.

The problem with trying to create the PreprovisioningImage when it should not is that it causes problems in ZTP, where the BMH and all the other objects are deleted at the same time and the operator cannot create the image because the namespace is being deleted.

Version-Release number of selected component (if applicable):

4.14    

How reproducible:

    

Steps to Reproduce:

    1. Create a cluster
    2. Wait until the provisioning phase
    3. Delete the cluster
    4. The metal3 operator incorrectly tries to create the PreprovisioningImage.
    

Actual results:

    

Expected results:

    

Additional info:

    

With 4.15, resource watch completes whenever it is interrupted. But for 4.16 jobs, it does not complete until the 1h grace period kicks in and the job is terminated by ci-operator. This means:

  • All 4.16 jobs take 1h longer than normal
  • If an upgrade pushes the overall time close to the limit, sub-jobs from the aggregator will fail at the 4h mark.

 

This was discovered when investigating this slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1705578623724259

 

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/91

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test is a frequent offender in the OpenStack CSI jobs. We're seeing it fail on 4.14 up to 4.16.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.14-e2e-openstack-csi-cinder

Example of failed job.
Example of successful job.

It seems like the 1 min timeout is too short and does not give enough time for the pods backing the service to come up.

https://github.com/openshift/origin/blob/1e6c579/pkg/monitortests/network/disruptionpodnetwork/monitortest.go#L191

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
  • Don't presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"

Description of problem:

The ConsolePluginComponents CMO task depends on the availability of a conversion service that is part of the console-operator Pod. That Pod is not replicated, so when it restarts due to a cluster upgrade or for any other reason, the conversion webhook becomes unavailable and all the ConsolePlugin API queries from that CMO task fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

Create a 4.14 cluster, make the console-operator unmanaged and bring it down, watch the ConsolePluginComponents tasks fail instantly after they're run.

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The ConsolePluginComponents tasks fail instantly after they're run.

Expected results:

The tasks should be more resilient and retry.

The long-term solution is for that ConsolePlugin conversion service to be replicated.

 

Additional info:

For OCP >= 4.15, CMO v1 ConsolePlugin queries no longer rely on the conversion webhook because of https://github.com/openshift/api/pull/1477.

But the retries will keep the task future-proof, and we'll be able to backport the fix.

Description of problem:

The single-line execute markdown reference is not working.

Steps to Reproduce

  1.  Install Web Terminal Operator
  2.  Create a ConsoleQuickStart resource using "copy-execute-demo.yaml" file
  3.  Open "Sample Quick Start" from quick start catalog page

Actual results:

The inline code is not getting rendered properly, specifically for the single-line execute syntax.

Expected results:

The inline code should show a code block with a small execute icon to run the commands in web terminal

 

Description of problem:

A user noticed on cluster deletion that the IPI-generated service instance was not cleaned up. Add more debugging statements to find out why.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create cluster
    2. Delete cluster
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1996624, when the AWS root credential (which must possess the "iam:SimulatePrincipalPolicy" permission) exists on a BM cluster, the CCO Pod crashes when running the secretannotator controller.

Steps to Reproduce:

1. Install a BM cluster
fxie-mac:cloud-credential-operator fxie$ oc get infrastructures.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2024-01-28T19:50:05Z"
  generation: 1
  name: cluster
  resourceVersion: "510"
  uid: 45bc2a29-032b-4c74-8967-83c73b0141c4
spec:
  cloudConfig:
    name: ""
  platformSpec:
    type: None
status:
  apiServerInternalURI: https://api-int.fxie-bm1.qe.devcluster.openshift.com:6443
  apiServerURL: https://api.fxie-bm1.qe.devcluster.openshift.com:6443
  controlPlaneTopology: SingleReplica
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: fxie-bm1-x74wn
  infrastructureTopology: SingleReplica
  platform: None
  platformStatus:
    type: None 

2. Create an AWS user with IAMReadOnlyAccess permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:GenerateCredentialReport",
                "iam:GenerateServiceLastAccessedDetails",
                "iam:Get*",
                "iam:List*",
                "iam:SimulateCustomPolicy",
                "iam:SimulatePrincipalPolicy"
            ],
            "Resource": "*"
        }
    ]
}

3. Create AWS root credentials with a set of access keys of the user above
4. Trigger a reconcile of the secretannotator controller, e.g. via editing cloudcredential/cluster

Logs:

time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:CreateUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteAccessKey" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:DeleteUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:PutUserPolicy" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Action not allowed with tested creds" action="iam:TagUser" controller=secretannotator
time="2024-01-29T04:47:27Z" level=warning msg="Tested creds not able to perform all requested actions" controller=secretannotator
I0129 04:47:27.988535       1 reflector.go:289] Starting reflector *v1.Infrastructure (10h37m20.569091933s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.988546       1 reflector.go:325] Listing and watching *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
I0129 04:47:27.989503       1 reflector.go:351] Caches populated for *v1.Infrastructure from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:233
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a964a0]
 
goroutine 341 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:115 +0x1e5
panic({0x3fe72a0?, 0x809b9e0?})
/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws.LoadInfrastructureRegion({0x562e1c0?, 0xc002c99a70?}, {0x5639ef0, 0xc0001b6690})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/utils/aws/utils.go:72 +0x40
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).validateCloudCredsSecret(0xc0008c2000, 0xc002586000)
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:206 +0x1a5
github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws.(*ReconcileCloudCredSecret).Reconcile(0xc0008c2000, {0x30?, 0xc000680c00?}, {0x4f38a3d?, 0x0?}, {0x4f33a20?, 0x416325?})
/go/src/github.com/openshift/cloud-credential-operator/pkg/operator/secretannotator/aws/reconciler.go:166 +0x605
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x561ff20?, {0x561ff20?, 0xc002ff3b00?}, {0x4f38a3d?, 0x3b180c0?}, {0x4f33a20?, 0x55eea08?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000189360, {0x561ff58, 0xc0007e5040}, {0x4589f00?, 0xc000570b40?})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314 +0x365
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000189360, {0x561ff58, 0xc0007e5040})
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 183
/go/src/github.com/openshift/cloud-credential-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222 +0x565

Actual results:

CCO Pod crashes and restarts in a loop:
fxie-mac:cloud-credential-operator fxie$ oc get po -n openshift-cloud-credential-operator -w
NAME                                         READY   STATUS    RESTARTS        AGE
cloud-credential-operator-657bdffdff-9wzrs   2/2     Running   3 (2m35s ago)   8h

Description of problem:

It was found when testing OCP-71263 and regression-testing OCP-35770 for 4.15.
For GCP in Mint mode, the root credential can be removed after cluster installation.
But after removing the root credential, CCO became degraded.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-25-051548

4.15.0-rc.3

How reproducible:

    
Always    

Steps to Reproduce:

    1.Install a GCP cluster with Mint mode

    2.After install, remove the root credential
jianpingshu@jshu-mac ~ % oc delete secret -n kube-system gcp-credentials
secret "gcp-credentials" deleted     

    3. Wait some time (about 1/2 h to 1 h); CCO becomes degraded
    
jianpingshu@jshu-mac ~ % oc get co cloud-credential
NAME               VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cloud-credential   4.15.0-rc.3   True        True          True       6h45m   6 of 7 credentials requests are failing to sync.

jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort
CredentialsProvisionFailure=False openshift-cloud-network-config-controller-gcp CredentialsProvisionSuccess: successfully granted credentials request
CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-ccm CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-pd-csi-driver-operator CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-image-registry-gcs CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-ingress-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-machine-api-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found

openshift-cloud-network-config-controller-gcp has no failure because it doesn't have a customized role in 4.15.0-rc.3

Actual results:

 CCO becomes degraded

Expected results:

 CCO should not be degraded; only the "Upgradeable" condition should be updated to note the missing root credential

Additional info:

Tested the same case on 4.14.10, no issue 

 

Description of problem:

    When trying to run the console in local development with auth, the run-bridge.sh script fails.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Follow step for local development of console with auth - https://github.com/openshift/console/tree/master?tab=readme-ov-file#openshift-with-authentication
    2.
    3.
    

Actual results:

The run-bridge.sh script fails with:

    
$ ./examples/run-bridge.sh
++ oc whoami --show-server
++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.alertmanagerPublicURL}'
++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.thanosPublicURL}'
+ ./bin/bridge --base-address=http://localhost:9000 --ca-file=examples/ca.crt --k8s-auth=openshift --k8s-mode=off-cluster --k8s-mode-off-cluster-endpoint=https://api.lprabhu-030420240903.devcluster.openshift.com:6443 --k8s-mode-off-cluster-skip-verify-tls=true --listen=http://127.0.0.1:9000 --public-dir=./frontend/public/dist --user-auth=openshift --user-auth-oidc-client-id=console-oauth-client --user-auth-oidc-client-secret-file=examples/console-client-secret --user-auth-oidc-ca-file=examples/ca.crt --k8s-mode-off-cluster-alertmanager=https://alertmanager-main-openshift-monitoring.apps.lprabhu-030420240903.devcluster.openshift.com --k8s-mode-off-cluster-thanos=https://thanos-querier-openshift-monitoring.apps.lprabhu-030420240903.devcluster.openshift.com
W0403 14:25:07.936281   49352 authoptions.go:99] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
F0403 14:25:07.936827   49352 main.go:539] Failed to create k8s HTTP client: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

Expected results:

    Bridge runs fine

Additional info:

    

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/109

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On a nodepool with autoscaling enabled, the "oc scale nodepool" command disables autoscaling but leaves an invalid configuration with autoscaling info that should have been cleared.

Version-Release number of selected component (if applicable):

  (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
Client Version: 4.14.14
Kustomize Version: v5.0.1
Server Version: 4.14.14
Kubernetes Version: v1.27.10+28ed2d7
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ 
  

How reproducible:

  happens all the time  

Steps to Reproduce:

     1. Deploy a hub cluster with 3 master nodes and 0 workers and, on it, a hosted cluster with 6 nodes (I've used this job to deploy: https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2431/)
    
2. (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/replicas"},{"op":"add", "path": "/spec/autoScaling", "value": { "max": 6, "min": 6 }}]'
nodepool.hypershift.openshift.io/hosted-0 patched
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A
NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
clusters    hosted-0   hosted-0                   6               True          False        4.14.14                                      



3. scale to 2 nodes in the nodepool: (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc scale nodepool/hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --replicas=2
nodepool.hypershift.openshift.io/hosted-0 scaled
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A
NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
clusters    hosted-0   hosted-0   2               6               False         False        4.14.14         

4. and after scaledown ends :
(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get nodepool -A
NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
clusters    hosted-0   hosted-0   2               6               False         False        4.14.14                                      

Actual results:

(.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc describe nodepool hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
Name:         hosted-0
Namespace:    clusters
Labels:       <none>
Annotations:  hypershift.openshift.io/nodePoolCurrentConfig: de17bd57
              hypershift.openshift.io/nodePoolCurrentConfigVersion: 84116781
              hypershift.openshift.io/nodePoolPlatformMachineTemplate: hosted-0-52df983b
API Version:  hypershift.openshift.io/v1beta1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2024-03-13T22:39:57Z
  Finalizers:
    hypershift.openshift.io/finalizer
  Generation:  4
  Owner References:
    API Version:     hypershift.openshift.io/v1beta1
    Kind:            HostedCluster
    Name:            hosted-0
    UID:             ec16c5a2-b8dc-4c54-abe8-297020df4442
  Resource Version:  818918
  UID:               671bdaf2-c8f9-4431-9493-476e9fe44d76
Spec:
  Arch:  amd64
  Auto Scaling:
    Max:         6
    Min:         6
  Cluster Name:  hosted-0
  Management:
    Auto Repair:  false
    Replace:
      Rolling Update:
        Max Surge:        1
        Max Unavailable:  0
      Strategy:           RollingUpdate
    Upgrade Type:         InPlace
  Node Drain Timeout:     30s
  Platform:
    Agent:
    Type:  Agent
  Release:
    Image:   quay.io/openshift-release-dev/ocp-release:4.14.14-x86_64
  Replicas:  2

Expected results:

No spec.autoScaling data should remain, only spec.replicas: 2, as it was before enabling autoscaling. Instead, the following stale block is left behind:


Spec:   
    Auto Scaling:     
        Max:         6
        Min:         6 


Additional info:
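
As a hedged workaround sketch (mirroring the patch syntax from step 2 above; not verified on this cluster), explicitly removing autoScaling while setting replicas should leave a consistent spec:

$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/autoScaling"},{"op": "add", "path": "/spec/replicas", "value": 2}]'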

    

Description of problem:

The nodeip-configuration service does not log to the serial console, which makes it difficult to debug problems when networking is not available and there is no access to the node.

Version-Release number of selected component (if applicable):

Reported against 4.13, but present in all releases

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/197

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In Azure Stack, the Azure-Disk CSI Driver node pods go into CrashLoopBackOff:

openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-57rxv                                      1/3     CrashLoopBackOff   33 (3m55s ago)   59m     10.0.1.5       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-m62cj   <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-8wvqm                                      1/3     CrashLoopBackOff   35 (29s ago)     67m     10.0.0.6       ci-op-q8b6n4iv-904ed-kp5mv-master-1              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-97ww5                                      1/3     CrashLoopBackOff   33 (12s ago)     67m     10.0.0.7       ci-op-q8b6n4iv-904ed-kp5mv-master-2              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-9hzw9                                      1/3     CrashLoopBackOff   35 (108s ago)    59m     10.0.1.4       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-gjqmw   <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-glgzr                                      1/3     CrashLoopBackOff   34 (69s ago)     67m     10.0.0.8       ci-op-q8b6n4iv-904ed-kp5mv-master-0              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-hktfb                                      2/3     CrashLoopBackOff   48 (63s ago)     60m     10.0.1.6       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-kdbpf   <none>           <none>
The CSI-Driver container log:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x18ff5db]
goroutine 228 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*Cloud).GetZone(0xc00021ec00, {0xc0002d57d0?, 0xc00005e3e0?})
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_zones.go:182 +0x2db
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).NodeGetInfo(0xc000144000, {0x21ebbf0, 0xc0002d5470}, 0x273606a?)
 /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/nodeserver.go:336 +0x13b
github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler.func1({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320})
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7160 +0x72
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320?}, 0xc0003b0340, 0xc00050ae10)
 /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler({0x1ec2f40?, 0xc000144000}, {0x21ebbf0, 0xc0002d5470}, 0xc000054680, 0x20167a0)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7162 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000530000, {0x21ebbf0, 0xc0002d53b0}, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40, 0xc00052c810, 0x30fa1c8, 0x0)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc000530000, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 260
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:997 +0x145  

 

The registrar container log:
E0321 23:08:02.679727       1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = error reading from server: EOF, restarting registration container. 

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-03-21-152650    

How reproducible:

    Seen in the CI profile; a manual install failed earlier as well.

Steps to Reproduce:

    See Description     

Actual results:

    Azure-Disk CSI Driver node pod CrashLoopBackOff

Expected results:

    Azure-Disk CSI Driver node pod should be running

Additional info:

    See gather-extra and must-gather: 
https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-azure-stack-ipi-proxy-fips-f2/1770921405509013504/artifacts/azure-stack-ipi-proxy-fips-f2/

Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/215

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A regression was identified when creating LoadBalancer services in ARO on new 4.14 clusters (handled for new installations in OCPBUGS-24191).

The same regression has also been confirmed in ARO clusters upgraded to 4.14.

Version-Release number of selected component (if applicable):

4.14.z

How reproducible:

On any ARO cluster upgraded to 4.14.z    

Steps to Reproduce:

    1. Install an ARO cluster
    2. Upgrade to 4.14 from fast channel
    3. oc create svc loadbalancer test-lb -n default --tcp 80:8080

Actual results:

# External-IP stuck in Pending
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
test-lb   LoadBalancer   172.30.104.200   <pending>     80:30062/TCP   15m


# Errors in cloud-controller-manager being unable to map VM to nodes
$ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure  -n openshift-cloud-controller-manager
I1215 19:34:51.843715       1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started
I1215 19:34:51.844474       1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1215 19:34:52.253569       1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success
I1215 19:34:52.253632       1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name
I1215 19:34:52.528579       1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1215 19:34:52.714678       1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
E1215 19:34:52.714888       1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
I1215 19:34:52.714956       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer"
E1215 19:34:52.715005       1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0

Expected results:

# The LoadBalancer gets an External-IP assigned
$ oc get svc test-lb -n default 
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE 
test-lb      LoadBalancer   172.30.193.159   20.242.180.199                         80:31475/TCP   14s

Additional info:

In the cloud-provider-config ConfigMap in the openshift-config namespace, vmType="".

When vmType is changed to "standard" explicitly, the provisioning of the LoadBalancer completes and an External-IP gets assigned without errors.
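
For reference, a hedged sketch of that manual workaround (the exact layout of the embedded config JSON is an assumption; inspect the ConfigMap before editing):

$ oc edit configmap cloud-provider-config -n openshift-config
# within the data.config JSON, set:
#   "vmType": "standard"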

Problem Description:

Installed the Red Hat Quay Container Security Operator on the 4.13.25 cluster.

Below are my test results:

```

sasakshi@sasakshi ~]$ oc version
Client Version: 4.12.7
Kustomize Version: v4.5.7
Server Version: 4.13.25
Kubernetes Version: v1.26.9+aa37255

[sasakshi@sasakshi ~]$ oc get csv -A | grep -i "quay" | tail -1
openshift container-security-operator.v3.10.2 Red Hat Quay Container Security Operator 3.10.2 container-security-operator.v3.10.1 Succeeded

[sasakshi@sasakshi ~]$ oc get subs -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
openshift-operators container-security-operator container-security-operator redhat-operators stable-3.10

[sasakshi@sasakshi ~]$ oc get imagemanifestvuln -A | wc -l
82
[sasakshi@sasakshi ~]$ oc get vuln --all-namespaces | wc -l
82

Console -> Administration -> Image Vulnerabilities : 82

Home -> Overview -> Status -> Image Vulnerabilities : 66
```

Observations from my testing:

  • `oc get vuln --all-namespaces` reports the same count as `oc get imagemanifestvuln -A`
  • Difference in the count is reported in the following:
    ```
    Console -> Administration -> Image Vulnerabilities : 82
    Home -> Overview -> Status -> Image Vulnerabilities : 66
    ```
    Why is there a difference in the reporting of the above two entries?

Kindly refer to the attached screenshots for reference.

Documentation link referred:

https://docs.openshift.com/container-platform/4.14/security/pod-vulnerability-scan.html#security-pod-scan-query-cli_pod-vulnerability-scan
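
Note that piping the default table output to wc -l includes the header row, so the raw counts above are one higher than the actual number of objects. A hedged way to get an exact count:

$ oc get imagemanifestvuln -A --no-headers | wc -l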

Description of the problem:

In .../openshift-versions?only_latest=true, the multi-arch release images are not returned along with the single-arch ones.

How reproducible:

Always

Steps to reproduce:

1. Run master assisted-service

2. curl ".../openshift-versions?only_latest=true"

 

Actual results:

{
  "4.10.67": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.10.67",
    "support_level": "production"
  },
  "4.11.58": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.11.58",
    "support_level": "production"
  },
  "4.12.53": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.12.53",
    "support_level": "production"
  },
  "4.13.38": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.13.38",
    "support_level": "production"
  },
  "4.14.18": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.14.18",
    "support_level": "production"
  },
  "4.15.3": {
    "cpu_architectures": [
      "x86_64"
    ],
    "default": true,
    "display_name": "4.15.3",
    "support_level": "production"
  },
  "4.16.0-ec.4": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.16.0-ec.4",
    "support_level": "beta"
  },
  "4.9.59": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.9.59",
    "support_level": "production"
  }
}

Expected results:

{
  "4.10.67": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.10.67",
    "support_level": "production"
  },
  "4.11.0-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.11.0-multi",
    "support_level": "production"
  },
  "4.11.58": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.11.58",
    "support_level": "production"
  },
  "4.12.53": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.12.53",
    "support_level": "production"
  },
  "4.12.53-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.12.53-multi",
    "support_level": "production"
  },
  "4.13.38": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.13.38",
    "support_level": "production"
  },
  "4.13.38-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.13.38-multi",
    "support_level": "production"
  },
  "4.14.18": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.14.18",
    "support_level": "production"
  },
  "4.14.18-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.14.18-multi",
    "support_level": "production"
  },
  "4.15.3": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "default": true,
    "display_name": "4.15.3",
    "support_level": "production"
  },
  "4.15.3-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.15.3-multi",
    "support_level": "production"
  },
  "4.16.0-ec.4": {
    "cpu_architectures": [
      "x86_64",
      "arm64"
    ],
    "display_name": "4.16.0-ec.4",
    "support_level": "beta"
  },
  "4.16.0-ec.4-multi": {
    "cpu_architectures": [
      "x86_64",
      "arm64",
      "ppc64le",
      "s390x"
    ],
    "display_name": "4.16.0-ec.4-multi",
    "support_level": "beta"
  },
  "4.9.59": {
    "cpu_architectures": [
      "x86_64"
    ],
    "display_name": "4.9.59",
    "support_level": "production"
  }
}

Description of problem:

Configuration files applied via the API do not have any effect on the configuration of the bare metal host.
    

Version-Release number of selected component (if applicable):

OpenShift 4.12.42 with ACM 2.8
    

How reproducible:

Reproducible.
    

Steps to Reproduce:

1. Applied nmstateconfig via 'oc apply', realized after it booted the subnet prefix was incorrect.
2. Deleted bmh and nmstateconfig.
3. Applied correct config  via 'oc apply', machine boots with 1st config still.
4. Deleted bmh and nmstateconfig.
5. Created host via BMC form in GUI with correct config.  Machine boots with correct config.
6. Tested deleting bmh and nmstateconfig, and creating new machine by just applying the bmh file with zero network config, and machine boots again with networking from step 5.
    

Actual results:

Bare metal host does not get config via 'oc apply'.
    

Expected results:

'oc apply -f nmstateconfig.yaml' should work to apply networking configuration.
    

Additional info:

HP Synergy 480 Gen10 (871942-B21)
UEFI boot with redfish virtual media
Static IP with bonding.

    

Description of problem:

`ensureSigningCertKeyPair` and `ensureTargetCertKeyPair` are always updating the secret type. If the secret requires a metadata update, its previous content will not be retained.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Install a 4.6 cluster (or make sure installer-generated secrets have `type: SecretTypeTLS` instead of `type: kubernetes.io/tls`)
    2. Run secret sync
    3. Check secret contents
    

Actual results:

    Secret was regenerated with new content

Expected results:

Existing content should be preserved; the content should not be modified

Additional info:

    This causes an api-int CA update for clusters born in 4.6 or earlier.
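
For illustration, a hedged way to check whether a given installer-generated secret still carries the old type (namespace and secret name are placeholders):

$ oc get secret <secret-name> -n <namespace> -o jsonpath='{.type}{"\n"}'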

Description of problem:

The following "error" shows up when running a gcp destroy:

Invalid instance ci-op-nlm7chi8-8411c-4tl9r-master-0 in target pool af84a3203fc714c64a8043fdc814386f, target pool will not be destroyed" 

It is a bit misleading as this alerts when the resource is simply not part of the cluster.    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

On startup, if release images syncing fails, the service will fail instead of checking whether it can proceed with stale data.

How reproducible:

Always.

Steps to reproduce:

  1. Mock release images service, make it succeed once and then fail.
  2. Restart the service.

Actual results:

assisted-service will fail.

Expected results:

assisted-service will continue with stale data

Today these are in isPodLog in the JavaScript; we'd like them in their own section, preferably charted very close to the node update section.

A recent periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade failure was caused by

: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers expand_less 	0s
{  event [namespace/openshift-machine-api node/ci-op-j666c60n-23cd9-nb7wr-master-1 pod/cluster-baremetal-operator-79b78c4548-n5vrt hmsg/b7cb271b13 - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-79b78c4548-n5vrt_openshift-machine-api(32835332-fc25-4ddf-84ce-d3aa447d3ce0)] happened 25 times}

This shows up in Component Readiness as an unknown component:

/ ovn upgrade-minor amd64 gcp rt > Unknown> [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers

We should update testBackoffStartingFailedContainer to check for / use known namespaces, creating a junit per known namespace indicating pass, fail, or flake.

The code already handles testBackoffStartingFailedContainerForE2ENamespaces

It looks as though we only check for known or e2e namespaces; we need to double-check that, if that is the case, we are OK with events from potentially unknown namespaces getting through.

We should also review Sippy pathological tests for results that don't contain `for ns/namespace` format and review if they need to be broken out as well.

For each test that we break out, we need to map the new namespace-specific test to the correct component in the test mapping repository.

As a user, I want to be able to impersonate Groups from the Groups list page kebab menu or the Group details page Actions menu dropdown so that I can more easily impersonate a Group without having to find the corresponding RoleBinding.

AC:

  • `Impersonate Group` action appears in the Groups list page kebab menu or the Group details page Actions menu dropdown
spec:
  configuration:
    featureGate:
      featureSet: TechPreviewNoUpgrade
$ oc get pod
NAME                                      READY   STATUS             RESTARTS      AGE
capi-provider-bd4858c47-sf5d5             0/2     Init:0/1           0             9m33s
cluster-api-85f69c8484-5n9ql              1/1     Running            0             9m33s
control-plane-operator-78c9478584-xnjmd   2/2     Running            0             9m33s
etcd-0                                    3/3     Running            0             9m10s
kube-apiserver-55bb575754-g4694           4/5     CrashLoopBackOff   6 (81s ago)   8m30s

$ oc logs kube-apiserver-55bb575754-g4694 -c kube-apiserver --tail=5
E0105 16:49:54.411837       1 controller.go:145] while syncing ConfigMap "kube-system/kube-apiserver-legacy-service-account-token-tracking", err: namespaces "kube-system" not found
I0105 16:49:54.415074       1 trace.go:236] Trace[236726897]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:71496035-d1fe-4ee1-bc12-3b24022ea39c,client:::1,api-group:scheduling.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:priorityclasses,scope:resource,url:/apis/scheduling.k8s.io/v1/priorityclasses,user-agent:kube-apiserver/v1.29.0 (linux/amd64) kubernetes/9368fcd,verb:POST (05-Jan-2024 16:49:44.413) (total time: 10001ms):
Trace[236726897]: ---"Write to database call failed" len:174,err:priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request 10001ms (16:49:54.415)
Trace[236726897]: [10.001615835s] [10.001615835s] END
F0105 16:49:54.415382       1 hooks.go:203] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request

Description of problem:

The TaskRun status is not displayed near the TaskRun name on the TaskRun details page

 

All temporal resources like PipelineRuns, Builds, Shipwright BuildRuns, etc. show the status of the resource (succeeded, failed, etc.) near the name on the resource details page.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Fast DataPath released a new major version of Open vSwitch, which is 3.3.
This version is going to be a new LTS and contains performance improvements
and features required for future releases of OVN. Since OCP 4.16 is planned
to have a longer support time frame, it should use this version of OVS.
Moving to newer versions of OVS will also gradually allow FDP to drop support
for older streams not used by any layered products.

Most notable relevant improvements over OVS 3.1 are:
- Improved performance of database operations, most notably the initial read
  of the database file and the database schema conversion on updates.

The plan is to also update the main ovs-vswitchd on the OS level in a separate
issue; this will provide support for flushing CT entries by marks and labels
needed for future versions of OVN. And it's better to keep versions on the
host and inside the container in sync.

The change was discussed with FDP and OVS-QE.

Description of problem:

The default value of the --parallelism option cannot be parsed to an int.

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-02-02-002725
    

How reproducible:

    Reproduce with the copy-to-node command
    

Steps to Reproduce:

    Cmd: "oc --namespace=e2e-test-mco-4zb88 --kubeconfig=/tmp/kubeconfig-3071675436 adm copy-to-node node/ip-10-0-17-85.ec2.internal --copy=/tmp/fetch-w637bgyv=/etc/mco-compressed-test-file",
            StdErr: "error: --parallelism must be either N or N%: strconv.ParseInt: parsing \"10%%\": invalid syntax",
    

Actual results:

The default value of --parallelism cannot be parsed.

    

Expected results:

no error 
    

Additional info:

There is hack code that appends % to the default value:

ref: https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L75-L79

https://github.com/openshift/oc/blob/79dc671bdaeafa74b92f14ad9f6d84e344608034/pkg/cli/admin/pernodepod/command.go#L94

The err var percentParseErr should be used.
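
As a hedged illustration of the flag's documented forms (N or N%), which may serve as a workaround while the default value is broken (not verified; node and paths are placeholders):

$ oc adm copy-to-node node/<node-name> --copy=<local-path>=<remote-path> --parallelism=10%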

Description of problem:

    In accounts with a large number of resources, the destroy code will fail to list all resources. This has revealed some changes that need to be made to the destroy code to handle these situations.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Difficult - but we have an account where we can reproduce it consistently

Steps to Reproduce:

    1. Try to destroy a cluster in an account with a large number of resources.
    2. Fail.
    3.
    

Actual results:

Fail to destroy    

Expected results:

Destroy succeeds

Additional info:

    

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/409

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


$ oc get co machine-config
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.16.0-0.ci-2024-03-01-110656   False       False         True       2m56s   Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]


MCO operator is failing with this error:


218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MachineConfigNodeFailed' Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]
I0301 17:19:12.823035       1 event.go:364] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"c1bad7e7-26ff-47fb-8a2d-a0c03c04d218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigNodeFailed' Failed to resync 4.16.0-0.ci-2024-03-01-110656 because: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-49-207.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]


    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-03-01-110656   True        False         17m     Error while reconciling 4.16.0-0.ci-2024-03-01-110656: the cluster operator machine-config is not available

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable techpreview
 oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'


    

Actual results:


machine-config CO is degraded

    

Expected results:


machine-config CO should not be degraded, no error should happen in MCO operator pod

    

Additional info:


    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When navigating to the Pipelines list page from the Search menu in the Developer perspective, the Pipelines list page crashes

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1.Install Pipelines Operator
    2.Go to Developer perspective
    3.Go to search menu, select Pipeline
    

Actual results:

    The page crashes

Expected results:

    The page should not crash and should show the Pipelines list page

Additional info:

    

Description of problem:

When executing oc mirror using an oci path, you can end up in an error state when the destination is a file://<path> destination (i.e. mirror to disk).

Version-Release number of selected component (if applicable):

4.14.2

How reproducible:

always

Steps to Reproduce:


At IBM we use the ibm-pak tool to generate an OCI catalog, but this bug is reproducible using a simple skopeo copy. Once you've copied the image locally you can move it around using file system copy commands to test this in different ways.

1. Make a directory structure like this to simulate how ibm-pak creates its own catalogs. The problem seems to be related to the path you use, so this represents the failure case:

mkdir -p /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list

2. Make a location where the local storage will live:

mkdir -p /root/.ibm-pak/oc-mirror-storage

3. Next, copy the image locally using skopeo:

skopeo copy docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:8d28189637b53feb648baa6d7e3dd71935656a41fd8673292163dd750ef91eec oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list --all --format v2s2

4. You can copy the OCI catalog content to a location where things will work properly so you can see a working example:

cp -r /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list /root/ibm-zcon-zosconnect-catalog

5. You'll need an ISC... I've included both the oci references in the example (the commented out one works, but the oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list reference fails).

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
  #- catalog: oci:///root/ibm-zcon-zosconnect-catalog
    packages:
    - name: ibm-zcon-zosconnect
      channels:
      - name: v1.0
    full: true
    targetTag: 27ba8e
    targetCatalog: ibm-catalog
storageConfig:
  local:
    path: /root/.ibm-pak/oc-mirror-storage

6. Run oc mirror (remember the ISC has oci refs for good and bad scenarios). You may want to change your working directory to different locations between running the good/bad examples.

oc mirror --config /root/.ibm-pak/data/publish/latest/image-set-config.yaml file://zcon --dest-skip-tls --max-per-registry=6




Actual results:


Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
error: ".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" is not a valid image reference: invalid reference format


Expected results:


Simple example where things were working with the oci:///root/ibm-zcon-zosconnect-catalog reference (this was executed in the same workspace so no new images were detected).

Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
3 related images processed in 668.063974ms
Writing image mapping to zcon/oc-mirror-workspace/operators.1700092336/manifests-ibm-zcon-zosconnect-catalog/mapping.txt
No new images detected, process stopping

Additional info:


I debugged the error that happened and captured one of the instances where the ParseReference call fails. This is only for reference to help narrow down the issue.

github.com/openshift/oc/pkg/cli/image/imagesource.ParseReference (/root/go/src/openshift/oc-mirror/vendor/github.com/openshift/oc/pkg/cli/image/imagesource/reference.go:111)
github.com/openshift/oc-mirror/pkg/image.ParseReference (/root/go/src/openshift/oc-mirror/pkg/image/image.go:79)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:194)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3 (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/operator.go:575)
golang.org/x/sync/errgroup.(*Group).Go.func1 (/root/go/src/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75)
runtime.goexit (/usr/local/go/src/runtime/asm_amd64.s:1594)

Also, I wanted to point out that we use a period in the path (i.e. .ibm-pak), and I wonder if that's causing the issue. This is just a guess and something to consider. FOLLOWUP: I just removed the period from ".ibm-pak" and that seemed to make the error go away.
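For what it's worth, the "invalid reference format" error is consistent with docker-style image reference parsing rejecting a path component that starts with a period. A minimal sketch using the distribution reference library, as an illustration only and not the oc-mirror code path itself:

package main

import (
	"fmt"

	"github.com/distribution/reference"
)

func main() {
	// The leading ".ibm-pak" path component is not a valid repository name
	// component, so parsing fails; without the dot the same reference parses.
	for _, ref := range []string{
		".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c",
		"ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c",
	} {
		_, err := reference.Parse(ref)
		fmt.Printf("%s -> err: %v\n", ref, err)
	}
}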

Please review the following PR: https://github.com/openshift/oauth-proxy/pull/270

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/94

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/egress-router-cni/pull/79

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The convention is a format like node-role.kubernetes.io/role: "", not node-role.kubernetes.io: role; however, ROSA uses the latter format to indicate the infra role. This change makes the node watch code ignore it, as well as other potential variations like node-role.kubernetes.io/.

The current code panics when run against a ROSA cluster:
E0209 18:10:55.533265 78 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
goroutine 233 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7a71840?, 0xc0018e2f48})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1000251f9fe?})
k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x7a71840, 0xc0018e2f48})
runtime/panic.go:884 +0x213
github.com/openshift/origin/pkg/monitortests/node/watchnodes.nodeRoles(0x7ecd7b3?)
github.com/openshift/origin/pkg/monitortests/node/watchnodes/node.go:187 +0x1e5
github.com/openshift/origin/pkg/monitortests/node/watchnodes.startNodeMonitoring.func1(0
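A minimal sketch of the tolerant label handling described above; this illustrates the intent and is not the origin monitor code:

package main

import (
	"fmt"
	"strings"
)

// nodeRoles extracts roles from node labels while tolerating malformed
// variants such as a bare "node-role.kubernetes.io" or a trailing-slash
// "node-role.kubernetes.io/", instead of slicing past the end of the key.
func nodeRoles(labels map[string]string) []string {
	const prefix = "node-role.kubernetes.io/"
	var roles []string
	for k := range labels {
		if !strings.HasPrefix(k, prefix) {
			continue
		}
		if role := k[len(prefix):]; role != "" {
			roles = append(roles, role)
		}
	}
	return roles
}

func main() {
	fmt.Println(nodeRoles(map[string]string{
		"node-role.kubernetes.io/worker": "",
		"node-role.kubernetes.io":        "infra", // ROSA-style label, ignored here
	}))
}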

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

 [sig-builds][Feature:Builds][Slow] update failure status Build status OutOfMemoryKilled should contain OutOfMemoryKilled failure reason and message [apigroup:build.openshift.io] test is failing on 4.15 (e.g. https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1726/pull-ci-openshift-oc-release-4.15-e2e-aws-ovn-builds/1780913191149113344)

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://github.com/containerd/containerd/issues/8180 is likely the reason for the failure: in 4.15 the expected status is OOMKilled, but the test fails after getting an Error status with the correct exit code (137).

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/212

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/90

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When no release image is provided on a HostedCluster, the backend hypershift operator picks the latest OCP release image within the operator's support window.

Today this fails due to how the operator selects this default image. For example, hypershift operator 4.14 does not support 4.15, but 4.15.0-rc.3 is picked as the default release image today.

This is a result of not anticipating how release candidates are reported relative to the latest stable release. The filter used to pick the latest release needs to consider patch-level releases before the next y-stream release.
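A minimal sketch of the kind of filtering described, using a semver library to stay within the operator's supported minor while still preferring the newest patch; the names and the support-window input are assumptions, not the hypershift operator's actual code:

package main

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

// latestSupported returns the newest candidate whose major.minor does not
// exceed the supported minor, so a newer y-stream release candidate
// (e.g. 4.15.0-rc.3) is never picked by a 4.14 operator.
func latestSupported(candidates []string, maxMajor, maxMinor uint64) (string, error) {
	var best *semver.Version
	for _, c := range candidates {
		v, err := semver.NewVersion(c)
		if err != nil {
			continue
		}
		if v.Major() > maxMajor || (v.Major() == maxMajor && v.Minor() > maxMinor) {
			continue
		}
		if best == nil || v.GreaterThan(best) {
			best = v
		}
	}
	if best == nil {
		return "", fmt.Errorf("no candidate within the support window")
	}
	return best.Original(), nil
}

func main() {
	fmt.Println(latestSupported([]string{"4.14.12", "4.14.14", "4.15.0-rc.3"}, 4, 14))
}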

Version-Release number of selected component (if applicable):

4.14    

How reproducible:

100%    

Steps to Reproduce:

    1.create a self managed hcp cluster and do not specify a release image
    

Actual results:

    the hcp will be rejected because the default release image picked does not fall within the support window

Expected results:

    hcp should be created with the latest release image in the support window

Additional info:

    

Description of problem:

The install-config.yaml file lets a user set a server group policy for Control plane nodes, and one for Compute nodes, choosing from affinity, soft-affinity, anti-affinity, or soft-anti-affinity. The installer will then create the server group if it doesn't exist.

The server group policy defined in install-config for Compute nodes is ignored. The worker server group always has the same policy as the Control plane's.
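For reference, the relevant install-config.yaml excerpt for step 2 below might look like this (a minimal sketch, with unrelated fields omitted):

controlPlane:
  name: master
  platform:
    openstack:
      serverGroupPolicy: soft-anti-affinity
compute:
- name: worker
  platform:
    openstack:
      serverGroupPolicy: soft-affinity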
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. openshift-install create install-config
    2. set Compute's serverGroupPolicy to soft-affinity in install-config.yaml
    3. openshift-install create cluster
    4. watch the server groups
    

Actual results:

both master and worker server groups have the default soft-anti-affinity policy
    

Expected results:

the worker server group should have soft-affinity as its policy
    

Additional info:


    

Seen in CI:

I0409 09:52:54.280834       1 builder.go:299] openshift-cluster-etcd-operator version v0.0.0-alpha.0-1430-g3d5483e-3d5483e1
...
E0409 10:08:08.921203       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1581 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x28cd3c0?, 0x4b191e0})
	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016eccd0, 0x1, 0x27036c0?})
	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x28cd3c0?, 0x4b191e0?})
	runtime/panic.go:914 +0x21f
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner.addCertSecretToMap(0x0?, 0x0)
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go:341 +0x27
github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner.(*EtcdCertSignerController).syncAllMasterCertificates(0xc000521ea0, {0x32731e8, 0xc0006fd1d0}, {0x3280cb0, 0xc000194ee0})
	github.com/openshift/cluster-etcd-operator/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go:252 +0xa65
...

It looks like syncAllMasterCertificates needs to skip the addCertSecretToMap calls for certs where EnsureTargetCertKeyPair returned an error.
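A self-contained sketch of that guard; the helper names mirror the stack trace, but the types and signatures are stand-ins, not the operator's real code:

package main

import (
	"errors"
	"fmt"
)

type secret struct{ name string }

// ensureTargetCertKeyPair stands in for the real cert generation call, which
// can fail and then must not contribute a nil secret to the cert map.
func ensureTargetCertKeyPair(name string, fail bool) (*secret, error) {
	if fail {
		return nil, errors.New("could not ensure cert for " + name)
	}
	return &secret{name: name}, nil
}

func addCertSecretToMap(m map[string]string, s *secret) {
	m[s.name] = "cert-data" // dereferencing a nil *secret here is the panic above
}

func main() {
	certMap := map[string]string{}
	var errs []error
	for _, c := range []struct {
		name string
		fail bool
	}{{"etcd-peer", false}, {"etcd-serving", true}} {
		s, err := ensureTargetCertKeyPair(c.name, c.fail)
		if err != nil {
			errs = append(errs, err)
			continue // skip addCertSecretToMap when cert generation failed
		}
		addCertSecretToMap(certMap, s)
	}
	fmt.Println(certMap, errs)
}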

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/630

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/47

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/305

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    pki operator runs even when annotation to turn off PKI is on the hosted control plane

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/47

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When running the 4.15 installer full-function test, the three instance families below were detected and verified; they need to be appended to the installer doc [1]:
- standardHBv4Family
- standardMSMediumMemoryv3Family
- standardMDSMediumMemoryv3Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/; there is no code used from the legacy/ dir and we should remove it.

There are still test manifests in the legacy/ directory that are in use! They need to be moved somewhere else, and Dockerfile.*.test and CI steps must be updated!

Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.

Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/78

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/70

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

We deprecated the default field manager in CNO, but it is still used by default in calls like Patch(). We need to update all calls to use an explicit fieldManager, and add a test to verify that deprecated managers are not used.

Since 4.14 https://github.com/openshift/cluster-network-operator/commit/db57a477b10f517bc4ae501d95cc7b8398a8755c#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6R31 (more specifically, since the sigs.k8s.io/controller-runtime bump) we have been exposed to this bug https://github.com/kubernetes-sigs/controller-runtime/pull/2435/commits/a6b9c0b672c77a79fff4d5bc03221af1e1fe21fa, which made the default fieldManager "Go-http-client" instead of "cluster-network-operator".
It means that the "cluster-network-operator" deprecation doesn't really work, since the manager is called differently. The manager name, when unset, comes from https://github.com/kubernetes/kubernetes/blob/b85c9bbf1ac911a2a2aed2d5c1f5eaf5956cc199/staging/src/k8s.io/client-go/rest/config.go#L498 and is then cropped https://github.com/openshift/cluster-network-operator/blob/5f18e4231f291bf5a01812974b0b4dff19c77f2c/vendor/k8s.io/apiserver/pkg/endpoints/handlers/create.go#L253-L260.
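A minimal sketch of passing an explicit field manager on a controller-runtime call so the server-side manager name is no longer derived from the HTTP client; the object and patch here are placeholders, not CNO code:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchWithOwner applies a merge patch with an explicit field manager instead
// of relying on the default, which ends up being "Go-http-client" when unset.
func patchWithOwner(ctx context.Context, c client.Client, cm *corev1.ConfigMap) error {
	orig := cm.DeepCopy()
	cm.Data = map[string]string{"example": "value"}
	return c.Patch(ctx, cm, client.MergeFrom(orig), client.FieldOwner("cluster-network-operator"))
}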

Identified changes needed (may be more):

  • (status *StatusManager) setAnnotation
  • mysterious message for the newly created clusters only (didn't see on CI runs, only seen once)
    `Depreciated field manager cluster-network-operator for object "operator.openshift.io/v1, Kind=Network" cluster`

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Check CNO logs for deprecated field manager logs
    2. oc logs -l name=network-operator --tail=-1 -n openshift-network-operator|grep "Depreciated field manager"     
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When the installer gathers a log bundle after failure (either automatically or with gather bootstrap), the installer fails to return serial console logs if an SSH connection to the bootstrap node is refused. 

Even if the serial console logs were collected, the installer exits on error if the SSH connection is refused:

time="2024-03-09T20:59:26Z" level=info msg="Pulling VM console logs"
time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-4ygffz3q-be93e-jnn92\":\"owned\"}"
time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"openshiftClusterID\":\"2f9d8822-46fd-4fcd-9462-90c766c3d158\"}"
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-bootstrap" Instance=i-0413f793ffabe9339
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0413f793ffabe9339
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-0" Instance=i-0ab5f920818366bb8
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0ab5f920818366bb8
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-2" Instance=i-0b93963476818535d
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0b93963476818535d
time="2024-03-09T20:59:28Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-1" Instance=i-0797728e092bfbeef
time="2024-03-09T20:59:28Z" level=debug msg="Download complete" Instance=i-0797728e092bfbeef
time="2024-03-09T20:59:28Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/bootstrap-ssh3643557583 to installer's internal agent"
time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/.ssh/ssh-privatekey to installer's internal agent"
time="2024-03-09T21:01:39Z" level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 13.57.212.80:22: connect: connection timed out"

from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/1788/pull-ci-openshift-api-master-e2e-aws-ovn/1766560949898055680

We can see the console logs were downloaded, they should be saved in the log bundle.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Failed install where SSH to bootstrap node fails. https://github.com/openshift/installer/pull/8137 provides a potential reproducer
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

Error handling needs to be reworked here: https://github.com/openshift/installer/blob/master/cmd/openshift-install/gather.go#L160-L190    
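A minimal sketch of the error-handling shape being asked for: keep what was already gathered (the serial console bundle) and only warn when the SSH-based gather fails. These are illustrative stand-ins, not the installer's gather.go:

package main

import (
	"errors"
	"fmt"
	"log"
)

func gatherSerialConsoleLogs(dir string) error { return nil } // stand-in: this step already succeeded above

func gatherBootstrapOverSSH(dir string) error {
	return errors.New("dial tcp 13.57.212.80:22: connect: connection timed out") // stand-in
}

// gather saves the serial console logs even when SSH to the bootstrap node is
// refused, instead of exiting on the first error and discarding them.
func gather(dir string) error {
	consoleErr := gatherSerialConsoleLogs(dir)
	if sshErr := gatherBootstrapOverSSH(dir); sshErr != nil {
		log.Printf("warning: bootstrap log gathering over SSH failed: %v", sshErr)
	}
	return consoleErr
}

func main() {
	fmt.Println("gather:", gather("log-bundle"))
}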

 

Description of problem:

[csi-snapshot-controller-operator] does not create suitable role and roleBinding for csi-snapshot-webhook    

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.14.0-rc.0
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2024-03-28-004801
Kubernetes Version: v1.27.11+749fe1d    

How reproducible:

Always    

Steps to Reproduce:

    1. Create an OpenShift cluster on AWS;
    2. Check the csi-snapshot-webhook logs for errors.

Actual results:

In step 2:
$ oc logs csi-snapshot-webhook-76bf9bd758-cxr7g
I0328 08:02:58.016020       1 certwatcher.go:129] Updated current TLS certificate
W0328 08:02:58.029464       1 reflector.go:424] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope
E0328 08:02:58.029512       1 reflector.go:140] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: Failed to watch *v1.VolumeSnapshotClass: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope
W0328 08:02:58.888397       1 reflector.go:424] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:117: failed to list *v1.VolumeSnapshotClass: volumesnapshotclasses.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumesnapshotclasses" in API group "snapshot.storage.k8s.io" at the cluster scope

Expected results:

In step 2 the csi-snapshot-webhook logs should have no "cannot list resource" errors

Additional info:

The issue exists on 4.15 and 4.16 as well; in addition, since 4.15 the webhook needs additional "VolumeGroupSnapshotClass" list permissions:

$ oc logs csi-snapshot-webhook-794b7b54d7-b8vl9
...
E0328 12:12:06.509158       1 reflector.go:147] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:133: Failed to watch *v1alpha1.VolumeGroupSnapshotClass: failed to list *v1alpha1.VolumeGroupSnapshotClass: volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumegroupsnapshotclasses" in API group "groupsnapshot.storage.k8s.io" at the cluster scope
W0328 12:12:50.836582       1 reflector.go:535] github.com/kubernetes-csi/external-snapshotter/client/v6/informers/externalversions/factory.go:133: failed to list *v1alpha1.VolumeGroupSnapshotClass: volumegroupsnapshotclasses.groupsnapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:default" cannot list resource "volumegroupsnapshotclasses" in API group "groupsnapshot.storage.k8s.io" at the cluster scope
...
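A sketch of the kind of cluster-scoped read permissions the webhook's service account appears to need based on the errors above; the names and the binding target here are illustrative, not the operator's actual manifests:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-snapshot-webhook-snapshotclass-reader
rules:
- apiGroups: ["snapshot.storage.k8s.io"]
  resources: ["volumesnapshotclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["groupsnapshot.storage.k8s.io"]
  resources: ["volumegroupsnapshotclasses"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: csi-snapshot-webhook-snapshotclass-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: csi-snapshot-webhook-snapshotclass-reader
subjects:
- kind: ServiceAccount
  name: default
  namespace: openshift-cluster-storage-operator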

This is a tracker bug for issues discovered when working on https://issues.redhat.com/browse/METAL-940. No QA verification will be possible until the feature is implemented much later.

 Description of problem:

Seen in 4.14 to 4.15 update CI:

: [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available expand_less
Run #0: Failed expand_less	1h34m55s
{  1 unexpected clusteroperator state transitions during e2e test run 

Nov 22 21:48:41.624 - 56ms  E clusteroperator/operator-lifecycle-manager-packageserver condition/Available reason/ClusterServiceVersionNotSucceeded status/False ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceInstallFailed, message: APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"}

While a brief auth failure isn't fantastic, an issue that only persists for 56ms is not long enough to warrant immediate admin intervention. Teaching the operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required. It's also possible that this is an incoming-RBAC vs. outgoing-RBAC race of some sort, and that shifting manifest filenames around could avoid the hiccup entirely.

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/operator-lifecycle-manager-packageserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 20% failed, 400% of failures match = 80% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 6 runs, 67% failed, 75% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 43 runs, 51% failed, 36% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 63% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 25% failed, 200% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 50% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-vsphere-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 178% of failures match = 32% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 60% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 19 runs, 63% failed, 33% of failures match = 21% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 57% of failures match = 27% impact

I'm not sure if all of those are from this system:anonymous issue, or if some of them are other mechanisms. Ideally we fix all of the Available=False noise, while, again, still going Available=False when it is worth summoning an admin immediately. Checking for different reason and message strings in recent 4.15-touching update runs:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/operator-lifecycle-manager-packageserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*message: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: Unauthorized
      3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install timeout
      4 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
      9 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded apiServices not installed
     23 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists
     82 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"

How reproducible:

Lots of hits in the above CI search. Running one of the 100% impact flavors has a good chance at reproducing.

Steps to Reproduce:

1. Install 4.14
2. Update to 4.15
3. Keep an eye on operator-lifecycle-manager-packageserver's ClusterOperator Available.

Actual results:

Available=False blips.

Expected results:

Available=True the whole time, or any Available=False looks like a serious issue where summoning an admin would have been appropriate.

Additional info

Causes also these testcases to fail (mentioning them here for Sippy to link here on relevant component readiness failures):

  • [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

Description of problem:
Based on the discussion in https://issues.redhat.com/browse/OCPBUGS-24044
and the discussion in this Slack thread (https://redhat-internal.slack.com/archives/CBWMXQJKD/p1700510945375019), we need to update our CI and some of the work done for mutable scope in NE-621.

Specifically, we need to

  • modify TestScopeChange and TestUnmanagedDNSToManagedDNSInternalIngressController to delete the service on all platforms, as toggling scope is no longer recommended.
  • modify any special behavior added for platformsWithMutableScope.

Version-Release number of selected component (if applicable):

4.15
    

How reproducible:

100%
    

Steps to Reproduce:

  1. Run CI TestUnmanagedDNSToManagedDNSInternalIngressController
  2. Observe failure in unmanaged-migrated-internal  
    

Actual results:

CI tests fail.
    

Expected results:

CI tests shouldn't fail.
    

Additional info:

This is a change from past behavior, as reported in https://issues.redhat.com/browse/OCPBUGS-24044. Further discussion revealed that the new behavior is currently expected but could be restored in the future. Notes to SRE and release notes are needed for this change in behavior.
    

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/96

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the 4.14 z-stream rollback job, I'm seeing test-case "[sig-network] pods should successfully create sandboxes by adding pod to network " fail. 

The job link is here https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-upgrade-rollback-oldest-supported/1719037590788640768

The error is:

56 failures to create the sandbox

ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3314.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(5b36bc12b2964e85bcdbe60b275d6a12ea68cb18b81f16622a6cb686270c4eb3): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3321.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(3cc0afc5bec362566e4c3bdaf822209377102c2e39aaa8ef5d99b0f4ba795aaf): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: connection refused


Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-30-170011

How reproducible:

Flaky

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

The rollback test works by installing 4.14.0, then upgrading to the latest 4.14 nightly, and at some random point rolling back to 4.14.0.

Description of problem:

    Only customers have a break-glass certificate signer.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    Always

Steps to Reproduce:

    1. Create a CSR with any other signer chosen
    2. It does not work
    3.
    

Actual results:

    does not work

Expected results:

    should work

Additional info:

    

 

This story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/68

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/180

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-ovirt/pull/176

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Pre-test greenboot checks fail during the scenario run due to OVN-K pods reporting a "failed" status.

Version-Release number of selected component (if applicable):

I believe this is only affecting `periodic-ci-openshift-microshift-main-ocp-metal-nightly` jobs.

How reproducible:

Unsure.  Has occurred 2 times in consecutive daily-periodic jobs.

Steps to Reproduce:

n/a

Actual results:

- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753221880392716288
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-ocp-metal-nightly/1753403067350388736

Expected results:

OVN-K Pods should deploy into a healthy state

Additional info:

 

 

Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/58

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

This shows "tls: bad certificate" errors from the kube-apiserver operator. For example, for https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214, I checked its must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/

MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake"
2023-11-27 10:11:50.687 | WARNING  | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate"
2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate"
...
MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35"
2023-11-27 10:12:16.382 | WARNING  | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
openshift-kube-apiserver-operator                 kube-apiserver-operator-f78c754f9-rbhw9                          1/1    Running    2         5h27m  10.129.0.35   ip-10-0-107-238.ec2.internal 

for more information slack - https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309

Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:
The event showing how often a host has been rebooted is now shown only for some of the nodes

11/22/2023, 12:15:10 AM Node test-infra-cluster-57eb6989-master-1 has been rebooted 2 times before completing installation
11/22/2023, 12:00:01 AM Host: test-infra-cluster-57eb6989-master-1, reached installation stage Rebooting
11/21/2023, 11:53:14 PM Host: test-infra-cluster-57eb6989-worker-0, reached installation stage Rebooting
11/21/2023, 11:53:13 PM Host: test-infra-cluster-57eb6989-worker-1, reached installation stage Rebooting
11/21/2023, 11:34:56 PM Host: test-infra-cluster-57eb6989-master-0, reached installation stage Rebooting
11/21/2023, 11:34:26 PM Host: test-infra-cluster-57eb6989-master-2, reached installation stage Rebooting

in this cluster 4 events are missing

11/21/2023, 3:49:34 PM Node test-infra-cluster-164a0f73-master-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-0 has been rebooted 2 times before completing installation
11/21/2023, 3:49:32 PM Node test-infra-cluster-164a0f73-worker-1 has been rebooted 2 times before completing installation
11/21/2023, 3:37:15 PM Host: test-infra-cluster-164a0f73-master-0, reached installation stage Rebooting
11/21/2023, 3:27:34 PM Host: test-infra-cluster-164a0f73-worker-0, reached installation stage Rebooting
11/21/2023, 3:27:30 PM Host: test-infra-cluster-164a0f73-worker-1, reached installation stage Rebooting
11/21/2023, 3:09:40 PM Host: test-infra-cluster-164a0f73-master-2, reached installation stage Rebooting
11/21/2023, 3:09:35 PM Host: test-infra-cluster-164a0f73-master-1, reached installation stage Rebooting

in this cluster 2 events are missing

How reproducible:

 

Steps to reproduce:

1. create cluster

2. start installation

3.

Actual results:
some of the events indicating how often a host has been rebooted are missing
 

Expected results:
for each host there should be an indication event

Description of problem:

The job [sig-node] [Conformance] Prevent openshift node labeling on update by the node TestOpenshiftNodeLabeling [Suite:openshift/conformance/parallel/minimal] uses the `oc debug` command [1]. Occasionally we find that the command fails to run, which ends up failing the test.

 

Failing job example - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/2220/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-single-node/1742564553356480512

 

CI search results - https://search.ci.openshift.org/?search=TestOpenshiftNodeLabeling&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

 

[1] https://github.com/openshift/origin/pull/28296/files#diff-001ed6507a42a0a1689e86c05512e6186c3483488ec96bf5f4354a8f7fa79261R39

 

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Observed in CI jobs mentioned above. 

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

   oc debug command occasionally exits with error 

Expected results:

    oc debug command should not occasionally exit with error 

Additional info:

    

Description of problem:

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/Makefile#L266

@echo Installing tools from hack/tools.go

from https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/hack/tools/tools.go, it should be

@echo Installing tools from hack/tools/tools.go

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

1. see the description
    

Actual results:

hack/tools.go path is wrong in Makefile

Expected results:

should be hack/tools/tools.go

As an OCP user, I want storage operators restarted quickly, and the newly started operator to start leading immediately without a ~3 minute wait.

This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
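A minimal, self-contained sketch of the desired shutdown behavior using client-go leader election with ReleaseOnCancel, cancelling on SIGTERM so the lease is released before the process exits. This is illustrative only; the storage operators wire leader election through their own libraries, and the names below are placeholders:

package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cancel the context on SIGTERM so the elector releases the lease on the way out.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator-lock", Namespace: "example-ns"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // give up the lease on shutdown instead of letting it expire
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { <-ctx.Done() }, // run controllers here
			OnStoppedLeading: func() {},                                  // flush and exit
		},
	})
}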

Steps to reproduce:

  1. Delete an operator Pod (`oc delete pod xyz`).
  2. Wait for a replacement Pod to be created.
  3. Check logs of the replacement Pod. It should contain "successfully acquired lease XYZ" relatively quickly after the Pod start (+/- 1 second?)
  4. Go to 1. and retry few times.

 

This is hack'n'hustle "work", not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).

Currently, the plugin template gives you instructions for running the console using a container image, which is a lightweight way to do development and avoids the need to build the console source code from scratch. The image we reference uses a production version of React, however. This means that you aren't able to use the React browser plugin to debug your application.

We should look at alternatives that allow you to use React Developer Tools. Perhaps we can publish a different image that uses a development build. Or at least we need to better document building console locally instead of using an image to allow development builds.

Description of problem: The [sig-arch] events should not repeat pathologically for ns/openshift-dns test is permafailing in the periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial job

{ 1 events happened too frequently

event happened 114 times, something is wrong: namespace/openshift-dns hmsg/d0c68b9435 service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 From: 17:11:03Z To: 17:11:04Z result=reject }
 

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.15-e2e-openstack-ovn-serial/1732612509958934528

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Because the installer generates some of the keys that will remain present in the cluster (e.g. the signing key for the admin kubeconfig), it should also run in an environment where FIPS is enabled.

Because it is very easy to fail to notice that the keys were generated in a non-FIPS-certified environment, we should enforce this by checking that fips_enabled is true if the target cluster is to have FIPS enabled.

Colin Walters has a patch for this.
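A minimal sketch of such a check; this is an illustration of the idea, not the actual patch:

package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// hostFIPSEnabled reports whether the machine running the installer is in
// FIPS mode, based on the kernel's crypto sysctl.
func hostFIPSEnabled() bool {
	data, err := os.ReadFile("/proc/sys/crypto/fips_enabled")
	return err == nil && strings.TrimSpace(string(data)) == "1"
}

// validateFIPS refuses to generate cluster keys for a FIPS cluster from a
// non-FIPS host.
func validateFIPS(clusterWantsFIPS bool) error {
	if clusterWantsFIPS && !hostFIPSEnabled() {
		return errors.New("install-config requests fips: true, but the installer host does not have FIPS enabled")
	}
	return nil
}

func main() {
	fmt.Println(validateFIPS(true))
}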

Description of problem:

Checking the OperatorHub page, a long catalogsource display name will overflow the operator item tile

Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a catalogsource with a long display name.
    2. Check operator items supplied by the created catalogsource on OperatorHub page
    3.
    

Actual results:

2. The catalogsource display name overflows from the item tile
    

Expected results:

2. Show the catalogsource display name in the item tile dynamically, without overflow.
    

Additional info:

screenshot: https://drive.google.com/file/d/1GOHJOxoBmtZX3QWDsIvc2RT5a2inkpzM/view?usp=sharing
    

Description of problem:

There is an 'Unhealthy Conditions' table on the MachineHealthCheck details page; currently the first column is 'Status', but users care more about Type than its Status

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-16-113018    

How reproducible:

Always    

Steps to Reproduce:

1. go to any MHC details page and check 'Unhealthy Conditions' table
2.
3.
    

Actual results:

in the table, 'Type' is the last column    

Expected results:

we should put 'Type' as the first column since this is the most important factor user care

for comparison, we can check the 'Conditions' table on the ClusterOperators details page; the order is Type -> Status -> other info, which is very user friendly

Additional info:

    

Description of problem:

The e2e-aws-ovn-shared-to-local-gateway-mode-migration and e2e-aws-ovn-local-to-shared-gateway-mode-migration jobs fail about 50% of the time with

+ oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}'
network.operator.openshift.io/cluster patched
+ oc wait co network --for=condition=PROGRESSING=True --timeout=60s
error: timed out waiting for the condition on clusteroperators/network 

Description of problem:

    Missing Source column header in PVC > VolumeSnapshots tab

Version-Release number of selected component (if applicable):

    Cluster 4.10, 4.14, 4.16

How reproducible:

    Yes

Steps to Reproduce:

    1. Create a PVC, e.g. "my-pvc"
    2. Create a Pod and bind it to "my-pvc"
    3. Create a VolumeSnapshot and associate it with "my-pvc"
    4. Go to the PVC details > VolumeSnapshots tab

Actual results:

    The Source column header is not displayed

Expected results:

    
the Source column header should be displayed

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.

Currently the CheckingSyncWrapper defines "alive" only as a sync func that has not returned an error. This can be wrong in scenarios where a member is down and perpetually unreachable.
Instead, we wanted to detect deadlock situations where the sync loop is simply stuck for a prolonged period of time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
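As a rough sketch of the alternative (hypothetical names, not the actual cluster-etcd-operator code): "alive" could instead mean "the sync loop has returned recently", so a perpetually unreachable member shows up as sync errors rather than as a liveness failure:

package health

import (
	"context"
	"sync/atomic"
	"time"
)

// checkingSyncWrapper records when the wrapped sync function last returned,
// whether or not it returned an error.
type checkingSyncWrapper struct {
	syncFn       func(ctx context.Context) error
	lastReturnNs atomic.Int64
}

func (c *checkingSyncWrapper) Sync(ctx context.Context) error {
	err := c.syncFn(ctx)
	c.lastReturnNs.Store(time.Now().UnixNano())
	return err
}

// Alive reports whether the sync loop has returned within maxStall. A stuck
// (deadlocked) loop stops updating the timestamp and eventually reports false,
// while a loop that keeps returning errors still counts as alive.
func (c *checkingSyncWrapper) Alive(maxStall time.Duration) bool {
	last := time.Unix(0, c.lastReturnNs.Load())
	return time.Since(last) < maxStall
}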

Version-Release number of selected component (if applicable):

>4.14

How reproducible:

Always

Steps to Reproduce:

    1. create a healthy cluster
    2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall)
    3. wait for the CEO to restart pod on failing health probe and dump its stack
    

Actual results:

CEO controllers are returning errors, but might not deadlock, which currently results in a restart

Expected results:

CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe

Additional info:

highly related to OCPBUGS-30169

As part of the PatternFly update from 5.1.0 to 5.1.1 it was required to disable some dev console e2e tests.

See https://github.com/openshift/console/pull/13380

We need to re-enable and adapt at least these tests:

In frontend/packages/dev-console/integration-tests/features/addFlow/create-from-devfile.feature and in frontend/packages/dev-console/integration-tests/features/e2e/add-flow-ci.feature

  • Scenario: Deploy git workload with devfile from topology page: A-04-TC01

In frontend/packages/helm-plugin/integration-tests/features/helm-release.feature and frontend/packages/helm-plugin/integration-tests/features/helm/actions-on-helm-release.feature:

  • Scenario: Context menu options of helm release: HR-01-TC01

Can we please also check why both broken tests appear in two feature files? 🤷

Please review the following PR: https://github.com/openshift/thanos/pull/134

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/32

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/100

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When deploying a HCP KubeVirt cluster using the hcp's --node-selector cli arg, that node selector is not applied to the "kubevirt-cloud-controller-manager" pods within the HCP namespace. 

This makes it impossible to pin all of the HCP pods to specific nodes.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    100%

Steps to Reproduce:

    1. deploy an hcp kubevirt cluster with the --node-selector cli option
    2.
    3.
    

Actual results:

    the node selector is not applied to cloud provider kubevirt pod

Expected results:

    the node selector should be applied to cloud provider kubevirt pod. 

Additional info:

    

In CNO, the cni-sysctl-allowlist daemonset needs to be changed to hostNetwork: true so that the multus daemonset and the cni-sysctl-allowlist daemonset upgrade successfully.

Description of problem:

oc should print an error if a single-arch image is specified with a non-matching arch via --filter-by-os

Version-Release number of selected component (if applicable):

 oc version Client Version: 4.16.0-202403121314.p0.gc92b507.assembly.stream-c92b507

How reproducible:

    Always 

Steps to Reproduce:

1) Use `--filter-by-os linux/amd64` for an image that only has arch arm64:
`oc image info  quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64  --filter-by-os linux/amd64`

2) Use an invalid `--filter-by-os linux/invalid` for the image:
`oc image info  quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64  --filter-by-os linux/invalid`

Actual results:

   1) Succeeds with no error or warning:
oc image info  quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64  --filter-by-os linux/amd64
Name:        quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64
Digest:      sha256:0c13de057d9f75c40999778bb924f654be1d0def980acbe8a00096e6bf18cc2a
Media Type:  application/vnd.docker.distribution.manifest.v2+json
Created:     16d ago
Image Size:  155.5MB in 5 layers
Layers:      75.95MB sha256:f90c4920e095dc91c490dd9ed7920d18e0327ddedcf5e10d2887e80ccae94fd7
             42.16MB sha256:a974fa00e888c491ab67f8d63456937bbaffbebb530db5ee2f9f5193fc5bb910
             10.2MB  sha256:c391a61f467f437cf6a0ba00c394aa4dbc107ecf56edd91a018de97ca4cd16bc
             26.07MB sha256:0e78634759d2f9c988dbf5ee73a7ed9a5d3b4ec28dcad5dd9086544826bbde05
             1.115MB sha256:277f2a9ba38386db697a1cbde875c1ec79988a632d006c6d697d0a79911d9955
OS:          linux
Arch:        arm64
Entrypoint:  /usr/bin/cluster-version-operator
Environment: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
             container=oci
             GODEBUG=x509ignoreCN=0,madvdontneed=1
             __doozer=merge
             BUILD_RELEASE=202403070215.p0.g6a76ba9.assembly.stream.el9
             BUILD_VERSION=v4.16.0
             OS_GIT_MAJOR=4
             OS_GIT_MINOR=16
             OS_GIT_PATCH=0
             OS_GIT_TREE_STATE=clean
             OS_GIT_VERSION=4.16.0-202403070215.p0.g6a76ba9.assembly.stream.el9-6a76ba9
             SOURCE_GIT_TREE_STATE=clean
             __doozer_group=openshift-4.16
             __doozer_key=cluster-version-operator
             __doozer_version=v4.16.0
             OS_GIT_COMMIT=6a76ba9
             SOURCE_DATE_EPOCH=1709342193
             SOURCE_GIT_COMMIT=6a76ba95ed441893e1bdf6616c47701c0464b7f4
             SOURCE_GIT_TAG=v1.0.0-1176-g6a76ba95
             SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator
Labels:      io.openshift.release=4.16.0-ec.4
             io.openshift.release.base-image-digest=sha256:fa1b36be29e72ca5c180ce8cc599a1f0871fa5aacd3153ed4cefc84038cd439a 

2) Succeeds with no error or warning:
oc image info  quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64  --filter-by-os linux/invalid
Name:        quay.io/openshift-release-dev/ocp-release:4.16.0-ec.4-aarch64
Digest:      sha256:0c13de057d9f75c40999778bb924f654be1d0def980acbe8a00096e6bf18cc2a
Media Type:  application/vnd.docker.distribution.manifest.v2+json
Created:     16d ago
Image Size:  155.5MB in 5 layers
Layers:      75.95MB sha256:f90c4920e095dc91c490dd9ed7920d18e0327ddedcf5e10d2887e80ccae94fd7
             42.16MB sha256:a974fa00e888c491ab67f8d63456937bbaffbebb530db5ee2f9f5193fc5bb910
             10.2MB  sha256:c391a61f467f437cf6a0ba00c394aa4dbc107ecf56edd91a018de97ca4cd16bc
             26.07MB sha256:0e78634759d2f9c988dbf5ee73a7ed9a5d3b4ec28dcad5dd9086544826bbde05
             1.115MB sha256:277f2a9ba38386db697a1cbde875c1ec79988a632d006c6d697d0a79911d9955
OS:          linux
Arch:        arm64
Entrypoint:  /usr/bin/cluster-version-operator
Environment: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
             container=oci
             GODEBUG=x509ignoreCN=0,madvdontneed=1
             __doozer=merge
             BUILD_RELEASE=202403070215.p0.g6a76ba9.assembly.stream.el9
             BUILD_VERSION=v4.16.0
             OS_GIT_MAJOR=4
             OS_GIT_MINOR=16
             OS_GIT_PATCH=0
             OS_GIT_TREE_STATE=clean
             OS_GIT_VERSION=4.16.0-202403070215.p0.g6a76ba9.assembly.stream.el9-6a76ba9
             SOURCE_GIT_TREE_STATE=clean
             __doozer_group=openshift-4.16
             __doozer_key=cluster-version-operator
             __doozer_version=v4.16.0
             OS_GIT_COMMIT=6a76ba9
             SOURCE_DATE_EPOCH=1709342193
             SOURCE_GIT_COMMIT=6a76ba95ed441893e1bdf6616c47701c0464b7f4
             SOURCE_GIT_TAG=v1.0.0-1176-g6a76ba95
             SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator
Labels:      io.openshift.release=4.16.0-ec.4
             io.openshift.release.base-image-digest=sha256:fa1b36be29e72ca5c180ce8cc599a1f0871fa5aacd3153ed4cefc84038cd439a

[root@localhost Doc]# echo $?
0

Expected results:

1) If the image is not a manifest list, we should print an error, since there is nothing to filter, or at least a warning that this is not a manifest-list image;

2) An error should be printed for an invalid architecture.
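For illustration, a hypothetical helper (not the actual oc code) sketching the requested validation: reject --filter-by-os when the image is not a manifest list and the requested platform does not match the image's single platform:

package info

import "fmt"

// validateFilterByOS rejects --filter-by-os for a single-arch image when the
// requested platform does not match the image's only platform; for manifest
// lists the filter is meaningful and passes through.
func validateFilterByOS(isManifestList bool, imageOS, imageArch, requested string) error {
	if isManifestList {
		return nil
	}
	single := imageOS + "/" + imageArch
	if requested != single {
		return fmt.Errorf("--filter-by-os %q does not match the single-arch image platform %q; nothing to filter", requested, single)
	}
	return nil
}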

Additional info:

    

In 4.15 when the agent installer is run using the openshift-baremetal-installer binary using an install-config containing platform data, it attempts to contact libvirt to validate the provisioning network interfaces for the bootstrap VM. This should never happen, as the agent installer doesn't use the bootstrap VM.

It is possible that users in the process of converting from baremetal IPI to the agent installer might run into this issue, since they would already be using the openshift-baremetal-installer binary.

Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-30600.

When we apply a machine config with additional SSH key info, this action only needs to uncordon the node. While the uncordon is happening, the condition is reported as Cordoned = True, which confuses users. Maybe we can refine this design to show the status of cordon/uncordon separately.

 

lastTransitionTime: '2023-11-28T16:53:58Z'
message: 'Action during previous iteration: (Un)Cordoned node. The node is reporting Unschedulable = false'
reason: UpdateCompleteCordoned
status: 'False'
type: Cordoned

Description of problem:

    The cluster operator "machine-config" is degraded because MachineConfigPool master is not ready, which reports an error like "rendered-master-${hash} not found".

Version-Release number of selected component (if applicable):

    4.15.0-0.nightly-2023-12-11-033133

How reproducible:

    Always. We met the issue in 2 CI profiles, Flexy template "functionality-testing/aos-4_15/upi-on-gcp/versioned-installer-ovn-ipsec-ew-ns-ci", and PROW CI test "periodic-ci-openshift-openshift-tests-private-release-4.15-multi-ec-gcp-ipi-disc-priv-oidc-arm-mixarch-f14".

Steps to Reproduce:

The Flexy template brief steps:
1. "create install-config" and then "create manifests"
2. add manifest file to config ovnkubernetes network for IPSec (please refer to https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blame/master/functionality-testing/aos-4_15/hosts/create_ign_files.sh#L517-530)
3. (optional) "create ignition-config"
4. UPI installation steps (see OCP doc https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-user-infra.html#installation-gcp-user-infra-exporting-common-variables)

Actual results:

    Installation failed, with the machine-config operator degraded

Expected results:

    Installation should succeed.

Additional info:

    The must-gather is at https://drive.google.com/file/d/12xbjWUknDL_DRNSS8T_Z3u4d1KrNlJgT/view?usp=drive_link

Description of the problem:

A non-lowercase hostname provided by DHCP breaks assisted installation.

How reproducible:

100%

Steps to reproduce:

  1. https://issues.redhat.com/browse/AITRIAGE-10248
  2. User did ask for a valid requested_hostname

Actual results:

bootkube fails

Expected results:

bootkube should succeed

 

slack thread

Description of problem:

Found in QE's CI (with the vsphere-agent profile): the storage CO is not available and the vsphere-problem-detector-operator pod is in CrashLoopBackOff with a panic.
(Find the must-gather here: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-agent-disconnected-ha-f14/1734850632575094784/artifacts/vsphere-agent-disconnected-ha-f14/gather-must-gather/)


The storage CO reports "unable to find VM by UUID":
  - lastTransitionTime: "2023-12-13T09:15:27Z"
    message: "VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable:
      unable to find VM ci-op-782gwsbd-b3d4e-master-2 by UUID \nVSphereProblemDetectorDeploymentControllerAvailable:
      Waiting for Deployment"
    reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_vcenter_api_error::VSphereProblemDetectorDeploymentController_Deploying
    status: "False"
    type: Available
(But I did not see the "unable to find VM by UUID" message in the vsphere-problem-detector-operator log in the must-gather.)


The vsphere-problem-detector-operator log:
2023-12-13T10:10:56.620216117Z I1213 10:10:56.620159       1 vsphere_check.go:149] Connected to vcenter.devqe.ibmc.devcluster.openshift.com as ci_user_01@devqe.ibmc.devcluster.openshift.com
2023-12-13T10:10:56.625161719Z I1213 10:10:56.625108       1 vsphere_check.go:271] CountVolumeTypes passed
2023-12-13T10:10:56.625291631Z I1213 10:10:56.625258       1 zones.go:124] Checking tags for multi-zone support.
2023-12-13T10:10:56.625449771Z I1213 10:10:56.625433       1 zones.go:202] No FailureDomains configured.  Skipping check.
2023-12-13T10:10:56.625497726Z I1213 10:10:56.625487       1 vsphere_check.go:271] CheckZoneTags passed
2023-12-13T10:10:56.625531795Z I1213 10:10:56.625522       1 info.go:44] vCenter version is 8.0.2, apiVersion is 8.0.2.0 and build is 22617221
2023-12-13T10:10:56.625562833Z I1213 10:10:56.625555       1 vsphere_check.go:271] ClusterInfo passed
2023-12-13T10:10:56.625603236Z I1213 10:10:56.625594       1 datastore.go:312] checking datastore /DEVQEdatacenter/datastore/vsanDatastore for permissions
2023-12-13T10:10:56.669205822Z panic: runtime error: invalid memory address or nil pointer dereference
2023-12-13T10:10:56.669338411Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23096cb]
2023-12-13T10:10:56.669565413Z 
2023-12-13T10:10:56.669591144Z goroutine 550 [running]:
2023-12-13T10:10:56.669838383Z github.com/openshift/vsphere-problem-detector/pkg/operator.getVM(0xc0005da6c0, 0xc0002d3b80)
2023-12-13T10:10:56.669991749Z     github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:319 +0x3eb
2023-12-13T10:10:56.670212441Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*vSphereChecker).enqueueSingleNodeChecks.func1()
2023-12-13T10:10:56.670289644Z     github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:238 +0x55
2023-12-13T10:10:56.670490453Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker.func1(0xc000c88760?, 0x0?)
2023-12-13T10:10:56.670702592Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:40 +0x55
2023-12-13T10:10:56.671142070Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker(0xc000c78660, 0xc000c887a0?)
2023-12-13T10:10:56.671331852Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:41 +0xe7
2023-12-13T10:10:56.671529761Z github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool.func1()
2023-12-13T10:10:56.671589925Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:28 +0x25
2023-12-13T10:10:56.671776328Z created by github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool
2023-12-13T10:10:56.671847478Z     github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:27 +0x73




Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-11-033133

How reproducible:

 

Steps to Reproduce:

    1. See description
    2.
    3.
    

Actual results:

   vpd panics

Expected results:

   vpd should not panic

Additional info:

   I guess it is a privileges issue, but our pod should not panic.

 

Description of problem:

Users are experiencing an issue with NodePort traffic forwarding where TCP traffic continues to be directed to pods that are in the terminating state, so connections cannot be established successfully. According to the customer, this issue is causing connection disruptions in business transactions.

 

Version-Release number of selected component (if applicable):

OpenShift 4.12.13 with RHEL 8.6 workers in an OVN environment.

 

How reproducible:

Here is the relevant code:
https://github.com/openshift/ovn-kubernetes/blob/dd3c7ed8c1f41873168d3df26084ecbfd3d9a36b/go-controller/pkg/util/kube.go#L360

func IsEndpointServing(endpoint discovery.Endpoint) bool {
	if endpoint.Conditions.Serving != nil {
		return *endpoint.Conditions.Serving
	} else {
		return IsEndpointReady(endpoint)
	}
}

// IsEndpointValid takes as input an endpoint from an endpoint slice and a boolean that indicates whether to include
// all terminating endpoints, as per the PublishNotReadyAddresses feature in kubernetes service spec. It always returns true
// if includeTerminating is true and falls back to IsEndpointServing otherwise.
func IsEndpointValid(endpoint discovery.Endpoint, includeTerminating bool) bool {
	return includeTerminating || IsEndpointServing(endpoint)
}

It looks like the 'IsEndpointValid' function returns endpoints with serving=true; it is not checking for ready=true endpoints.
I see the code was recently changed in this section (the lookup of Ready=true was changed to Serving=true)?

[Check the "Serving" field for endpoints]
https://github.com/openshift/ovn-kubernetes/commit/aceef010daf0697fe81dba91a39ed0fdb6563dea#diff-daf9de695e0ff81f9173caf83cb88efa138e92a9b35439bd7044aa012ff931c0

https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/util/kube.go#L326-L386

            out.Port = *port.Port
            for _, endpoint := range slice.Endpoints {
                // Skip endpoint if it's not valid
                if !IsEndpointValid(endpoint, includeTerminating) {
                    klog.V(4).Infof("Slice endpoint not valid")
                    continue
                }

                for _, ip := range endpoint.Addresses {
                    klog.V(4).Infof("Adding slice %s endpoint: %v, port: %d", slice.Name, endpoint.Addresses, *port.Port)
                    ipStr := utilnet.ParseIPSloppy(ip).String()
                    switch slice.AddressType {
                    case discovery.AddressTypeIPv4:
                        v4ips.Insert(ipStr)
                    case discovery.AddressTypeIPv6:
                        v6ips.Insert(ipStr)
                    default:
                        klog.V(5).Infof("Skipping FQDN slice %s/%s", slice.Namespace, slice.Name)
                    }
                }
            }

Steps to Reproduce:

Here are the customer's sample pods for reference:
mbgateway-st-8576f6f6f8-5jc75   1/1     Running   0          104m    172.30.195.124   appn01-100.app.paas.example.com   <none>           <none>
mbgateway-st-8576f6f6f8-q8j6k   1/1     Running   0          5m51s   172.31.2.97      appn01-202.app.paas.example.com   <none>           <none>

pod yaml:
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 40
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: 9190
      timeoutSeconds: 5
    name: mbgateway-st
    ports:
    - containerPort: 9190
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 40
      periodSeconds: 10
      successThreshold: 1
      tcpSocket:
        port: 9190
      timeoutSeconds: 5
    resources:
      limits:
        cpu: "2"
        ephemeral-storage: 10Gi
        memory: 2G
      requests:
        cpu: 50m
        ephemeral-storage: 100Mi
        memory: 1111M

When the pod mbgateway-st-8576f6f6f8-5jc75 is deleted, check the EndpointSlice status:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:

  • addresses:
      - 172.30.195.124
      conditions:
        ready: false
        serving: true
        terminating: true
      nodeName: appn01-100.app.paas.example.com
      targetRef:
        kind: Pod
        name: mbgateway-st-8576f6f6f8-5jc75
        namespace: lb59-10-st-unigateway
        uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
      zone: AZ61QEBIZ_AZ61QEM02_FD3
  • addresses:
      - 172.31.2.97
      conditions:
        ready: true
        serving: true
        terminating: false
      nodeName: appn01-202.app.paas.example.com
      targetRef:
        kind: Pod
        name: mbgateway-st-8576f6f6f8-q8j6k
        namespace: lb59-10-st-unigateway
        uid: 5bd195b7-e342-4b34-b165-12988a48e445
      zone: AZ61QEBIZ_AZ61QEM02_FD1

After waiting a moment, checking the OVN service LB shows that the endpoint information has not been updated to the latest state:
9349d703-1f28-41fe-b505-282e8abf4c40    Service_lb59-10-    tcp        172.35.0.185:31693      172.30.195.124:9190,172.31.2.97:9190
dca65745-fac4-4e73-b412-2c7530cf4a91    Service_lb59-10-    tcp        172.35.0.170:31693      172.30.195.124:9190,172.31.2.97:9190
a5a65766-b0f2-4ac6-8f7c-cdebeea303e3    Service_lb59-10-    tcp        172.35.0.89:31693       172.30.195.124:9190,172.31.2.97:9190
a36517c5-ecaa-4a41-b686-37c202478b98    Service_lb59-10-    tcp        172.35.0.213:31693      172.30.195.124:9190,172.31.2.97:9190
16d997d1-27f0-41a3-8a9f-c63c8872d7b8    Service_lb59-10-    tcp        172.35.0.92:31693       172.30.195.124:9190,172.31.2.97:9190

After waiting a little longer:
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:

  • addresses:
      - 172.30.195.124
      conditions:
        ready: false
        serving: true
        terminating: true
      nodeName: appn01-100.app.paas.example.com
      targetRef:
        kind: Pod
        name: mbgateway-st-8576f6f6f8-5jc75
        namespace: lb59-10-st-unigateway
        uid: 5e8a375d-ba56-4894-8034-0009d0ab8ebe
      zone: AZ61QEBIZ_AZ61QEM02_FD3
  • addresses:
      - 172.31.2.97
      conditions:
        ready: true
        serving: true
        terminating: false
      nodeName: appn01-202.app.paas.example.com
      targetRef:
        kind: Pod
        name: mbgateway-st-8576f6f6f8-q8j6k
        namespace: lb59-10-st-unigateway
        uid: 5bd195b7-e342-4b34-b165-12988a48e445
      zone: AZ61QEBIZ_AZ61QEM02_FD1
  • addresses:
      - 172.30.132.78
      conditions:
        ready: false
        serving: false
        terminating: false
      nodeName: appn01-089.app.paas.example.com
      targetRef:
        kind: Pod
        name: mbgateway-st-8576f6f6f8-8lp4s
        namespace: lb59-10-st-unigateway
        uid: 755cbd49-792b-4527-b96a-087be2178e9d
      zone: AZ61QEBIZ_AZ61QEM02_FD3

Checking the OVN service LB, the deleted pod's endpoint information is still present:
fceeaf8f-e747-4290-864c-ba93fb565a8a    Service_lb59-10-    tcp        172.35.0.56:31693       172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
bef42efd-26db-4df3-b99d-370791988053    Service_lb59-10-    tcp        172.35.1.26:31693       172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
84172e2c-081c-496a-afec-25ebcb83cc60    Service_lb59-10-    tcp        172.35.0.118:31693      172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190
34412ddd-ab5c-4b6b-95a3-6e718dd20a4f    Service_lb59-10-    tcp        172.35.1.14:31693       172.30.132.78:9190,172.30.195.124:9190,172.31.2.97:9190

 

Actual results:

Service LB endpoint membership is determined by the endpoint's Serving condition.

Expected results:

Service LB endpoint membership should be determined by the endpoint's Ready condition (POD.status.condition[type=Ready]).

 

Additional info:

The ovn-controller determines whether an endpoint should be added to the service load balancer (serviceLB) based on condition.serving. The current issue is that when a pod is in the terminating state, condition.serving remains true, even though its value was originally derived from the pod's Ready condition (POD.status.condition[type=Ready]) being true.

However, when a pod is deleted, the EndpointSlice condition.serving state remains unchanged, and the backend pool of the service LB still includes the IP of the deleted pod. Why doesn't ovn-controller use the condition.ready status to decide whether the pod's IP should be added to the service LB backend pool?

Could the shift-networking experts confirm whether or not this is an OpenShift OVN service LB bug?
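For illustration only, a hypothetical variant (not the actual ovn-kubernetes code) of the endpoint filtering the reporter is asking for, keyed off the Ready condition instead of Serving:

package util

import (
	discovery "k8s.io/api/discovery/v1"
)

// isEndpointReadyOnly admits an endpoint only when its Ready condition is
// true, unless the service explicitly publishes not-ready addresses. Because
// terminating endpoints report ready=false, they would be dropped from the
// service LB backend pool as soon as the pod starts terminating.
func isEndpointReadyOnly(endpoint discovery.Endpoint, includeTerminating bool) bool {
	if includeTerminating {
		return true
	}
	return endpoint.Conditions.Ready != nil && *endpoint.Conditions.Ready
}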

Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/49

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The 4.13 CPO fails to reconcile

{"level":"error","ts":"2024-04-03T18:45:28Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","hostedControlPlane":{"name":"sjenning-guest","namespace":"clusters-sjenning-guest"},"namespace":"clusters-sjenning-guest","name":"sjenning-guest","reconcileID":"35a91dd1-0066-4c81-a6a4-14770ffff61d","error":"failed to update control plane: failed to reconcile router: failed to reconcile router role: roles.rbac.authorization.k8s.io \"router\" is forbidden: user \"system:serviceaccount:clusters-sjenning-guest:control-plane-operator\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:clusters-sjenning-guest\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"security.openshift.io\"], Resources:[\"securitycontextconstraints\"], ResourceNames:[\"hostnetwork\"], Verbs:[\"use\"]}","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Caused by https://github.com/openshift/hypershift/pull/3789

Description of problem:

The IngressController and DNSRecord CRDs were moved to dedicated packages
following the introduction of a new method for generating CRDs in the OpenShift API repository (openshift/api#1803: https://github.com/openshift/api/pull/1803).

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

1. go mod edit -replace=github.com/openshift/api=github.com/openshift/api@ff84c2c732279b16baccf08c7dfc9ff8719c4807
2. go mod tidy
3. go mod vendor
4. make update
    

Actual results:

$ make update
hack/update-generated-crd.sh
--- vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml    1970-01-01 01:00:00.000000000 +0100
+++ manifests/00-custom-resource-definition.yaml    2024-04-17 18:05:05.009605155 +0200
[LONG DIFF]
cp: cannot stat 'vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml': No such file or directory
make: *** [Makefile:39: crd] Error 1

Expected results:

$ make update
hack/update-generated-crd.sh 
hack/update-profile-manifests.sh

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/270

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    ResourceYAMLEditor has no create option. This means it can be used only for editing existing objects.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Use ResourceYAMLEditor in a different page from the details page
    2.
    3.
    

Actual results:
Be able to create the object
See the samples in the sidebar
See 'Create' button instead of 'Save'

Expected results:

    Only a 'Save' button and no samples.
Additional info:
    

Description of problem:

   PublicAndPrivate and Private clusters fail to provision due to missing IngressController RBAC in control plane operator. This RBAC was recently removed from the HyperShift operator. 

Version-Release number of selected component (if applicable):

    4.14.z

How reproducible:

    Always

Steps to Reproduce:

    1. Install hypershift operator from main
    2. Create an AWS PublicAndPrivate cluster using a 4.14.z release
    

Actual results:

    The cluster never provisions because the cpo is stuck

Expected results:

    The cluster provisions successfully

Additional info:

    

[sig-arch][Late] collect certificate data [Suite:openshift/conformance/parallel]

Test is currently making the Unknown component red, but this test should be aligned to the kube-apiserver component. Looks like two others in the same file should be as well.

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/271

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CNO assumes that only the master and worker machine config pools are present on the cluster. While running CI with 24 nodes, it was found that two more pools, infra and workload, are present, so these pools must also be taken into consideration while rolling out the IPsec machine config.

# omg get mcp
NAME      CONFIG                                              UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
infra     rendered-infra-52f7615d8c841e7570b7ab6cbafecac8     True     False     False     3             3                  3                    0                     38m
master    rendered-master-fbb5d8e1337d1244d30291ffe3336e45    True     False     False     3             3                  3                    0                     1h10m
worker    rendered-worker-52f7615d8c841e7570b7ab6cbafecac8    False    True      False     24            12                 12                   0                     1h10m
workload  rendered-workload-52f7615d8c841e7570b7ab6cbafecac8  True     False     False     0             0                  0                    0                     38m

CI run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50740/rehearse-50740-pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.16-nightly-x86-control-plane-ipsec-24nodes/1782308642033242112

 

 

Description of the problem:

Setting up OCP with ODF (compact mode) using AI (stage). I have 3 hosts, each with an install disk (120GB) and a data disk (500GB, disk type: Multipath). Even though we have a non-bootable disk (500GB), the host status is "Insufficient". I could not proceed because the "Next" button is disabled.

Steps to reproduce:

1. Create a new cluster

2. Select "Install OpenShift Data Foundation" in Operators page

3. Take 3 hosts with 1 installation disk and 1 non-installation disk  on each.

4. Add hosts by booting hosts with downloaded iso  

Actual results:

Status of hosts is "Insufficient" and "Next" button is disabled

data.json

 

Expected results:

Status of hosts should be "Ready" and "Next" button should be enabled to proceed with installation

Description of problem:

 

In OCPBUGS-237 we discussed dropping the explicit enabling of memory-trim-on-compaction once it became enabled by default.

On OCP with 4.15.0-0.nightly-2023-12-19-033450 we are at

Red Hat Enterprise Linux CoreOS 415.92.202312132107-0 (Plow)
openvswitch3.1-3.1.0-59.el9fdp.x86_64

This should have memory-trim-on-compaction enabled by default

v3.0.0 - 15 Aug 2022
--------------------
     * Returning unused memory to the OS after the database compaction is now
       enabled by default.  Use 'ovsdb-server/memory-trim-on-compaction off'
       unixctl command to disable.

https://github.com/openvswitch/ovs/commit/e773140ec3f6d296e4a3877d709fb26fb51bc6ee

If it is enabled by default, we should remove the enable loop:

      # set trim-on-compaction
      if ! retry 60 "trim-on-compaction" "ovn-appctl -t ${nbdb_ctl} --timeout=5 ovsdb-server/memory-trim-on-compaction on"; then
        exit 1
      fi

https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/ovn-kubernetes/common/008-script-lib.yaml#L314
 

Version-Release number of selected component (if applicable):

  4.15.0-0.nightly-2023-12-19-033450

How reproducible:

 Always

Steps to Reproduce:

1. Check that memory-trim-on-compaction is enabled by default in OVS.

2. Check the nbdb log files for the memory-trimming message.

Actual results:

 

2023-12-20T18:12:47.053444489Z 2023-12-20T18:12:47.053Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-12-20T18:12:49.001580092Z 2023-12-20T18:12:49.001Z|00003|ovsdb_server|INFO|memory trimming after compaction enabled.

 

Expected results:

 memory-trim-on-compaction should be enabled by default; we don't need to re-enable it.

Affected Platforms:

All

Our test that watches for alerts firing that we've never seen before has picked something up on https://amd64.ocp.releases.ci.openshift.org/releasestream/4.16.0-0.ci/release/4.16.0-0.ci-2024-02-02-031544

[sig-trt][invariant] No new alerts should be firing expand_less 0s

{ Found alerts firing which are new or less than two weeks old, which should not be firing: PrometheusOperatorRejectedResources has no test data, this alert appears new and should not be firing}

It hit about 3-4 out of 10 on both azure and aws agg jobs. Could be a regression, could be something really rare.

Description of problem:

Agent CI jobs (compact and HA) are currently experiencing failures because the control-plane-machine-set operator is degraded, despite the SNO cluster operating normally.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%

Actual results:

level=info msg=Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
level=error msg=Cluster operator control-plane-machine-set Degraded is True with UnmanagedNodes: Found 3 unmanaged node(s)
level=info msg=Cluster operator csi-snapshot-controller EvaluationConditionsDetected is Unknown with NoData:
level=info msg=Cluster operator etcd EvaluationConditionsDetected is Unknown with NoData:
level=info msg=Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
level=info msg=Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator insights SCAAvailable is False with Forbidden: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 403: {"code":"ACCT-MGMT-11","href":"/api/accounts_mgmt/v1/errors/11","id":"11","kind":"Error","operation_id":"dc5b9421-248f-4ac4-9135-ac5bf6bcd2ce","reason":"Account with ID 2DUeKzzTD9ngfsQ6YgkzdJn1jA4 denied access to perform create on Certificate with HTTP call POST /api/accounts_mgmt/v1/certificates"}
level=info msg=Cluster operator kube-apiserver EvaluationConditionsDetected is False with AsExpected: All is well
level=info msg=Cluster operator kube-controller-manager EvaluationConditionsDetected is Unknown with NoData:
level=info msg=Cluster operator kube-scheduler EvaluationConditionsDetected is Unknown with NoData:
level=info msg=Cluster operator network ManagementStateDegraded is False with :
level=info msg=Cluster operator openshift-controller-manager EvaluationConditionsDetected is Unknown with NoData:
level=info msg=Cluster operator storage EvaluationConditionsDetected is Unknown with NoData:
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=				The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=				https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR: Installation failed. Aborting execution.

Expected results:

Install should be successful.

Additional info:

HA must gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-vsphere-agent-ha-f14/1771068123387006976/artifacts/vsphere-agent-ha-f14/gather-must-gather/artifacts/must-gather.tar

Compact must gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/50544/rehearse-50544-periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-vsphere-agent-compact-fips-f14/1775524930515898368/artifacts/vsphere-agent-compact-fips-f14/gather-must-gather/artifacts/must-gather.tar

Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/129

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Failed to create RHCOS image when creating Azure infrastructure

Steps to Reproduce & actual results:

fxie-mac:hypershift fxie$ hypershift create infra azure --name $CLUSTER_NAME --azure-creds $HOME/.azure/osServicePrincipal.json --base-domain $BASE_DOMAIN --infra-id $INFRA_ID --location eastus --output-file $OUTPUT_INFRA_FILE
2024-03-20T14:26:23+08:00	INFO	Using credentials from file	{"path": "/Users/fxie/.azure/osServicePrincipal.json"}
2024-03-20T14:26:30+08:00	INFO	Successfully created resource group	{"name": "fxie-hcp-1-fxie-hcp-1-13639"}
2024-03-20T14:26:32+08:00	INFO	Successfully created managed identity	{"name": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourcegroups/fxie-hcp-1-fxie-hcp-1-13639/providers/Microsoft.ManagedIdentity/userAssignedIdentities/fxie-hcp-1-fxie-hcp-1-13639"}
2024-03-20T14:26:32+08:00	INFO	Assigning role to managed identity, this may take some time
2024-03-20T14:26:51+08:00	INFO	Successfully assigned contributor role to managed identity	{"name": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourcegroups/fxie-hcp-1-fxie-hcp-1-13639/providers/Microsoft.ManagedIdentity/userAssignedIdentities/fxie-hcp-1-fxie-hcp-1-13639"}
2024-03-20T14:26:55+08:00	INFO	Successfully created network security group	{"name": "fxie-hcp-1-fxie-hcp-1-13639-nsg"}
2024-03-20T14:27:01+08:00	INFO	Successfully created vnet	{"name": "fxie-hcp-1-fxie-hcp-1-13639"}
2024-03-20T14:27:35+08:00	INFO	Successfully created private DNS zone	{"name": "fxie-hcp-1-azurecluster.qe.azure.devcluster.openshift.com"}
2024-03-20T14:28:09+08:00	INFO	Successfully created private DNS zone link
2024-03-20T14:28:12+08:00	INFO	Successfully created public IP address for guest cluster egress load balancer
2024-03-20T14:28:15+08:00	INFO	Successfully created guest cluster egress load balancer
2024-03-20T14:28:37+08:00	INFO	Successfully created storage account	{"name": "clusterzw22c"}
2024-03-20T14:28:38+08:00	INFO	Successfully created blob container	{"name": "vhd"}
2024-03-20T14:28:38+08:00	ERROR	Failed to create infrastructure	{"error": "failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error"}
github.com/openshift/hypershift/cmd/infra/azure.NewCreateCommand.func2
	/Users/fxie/Projects/hypershift/cmd/infra/azure/create.go:114
github.com/spf13/cobra.(*Command).execute
	/Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/Users/fxie/Projects/hypershift/main.go:78
runtime.main
	/usr/local/go/src/runtime/proc.go:267
Error: failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error
failed to create RHCOS image: the image source url must be from an azure blob storage, otherwise upload will fail with an `One of the request inputs is out of range` error     

Description of problem:

    https://github.com/openshift/installer/pull/7778 introduced a bug where an error is always returned while retrieving a marketplace image.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Configure marketplace image in the install-config
    2. openshift-install create manifests
    3.
    

Actual results:

    $ ./openshift-install create manifests --dir ipi1 --log-level debug
DEBUG OpenShift Installer 4.16.0-0.test-2023-12-12-020559-ci-ln-xkqmlqk-latest 
DEBUG Built from commit 456ae720a83e39dffd9918c5a71388ad873b6a38 
DEBUG Fetching Master Machines...                  
DEBUG Loading Master Machines...                   
DEBUG   Loading Cluster ID...                      
DEBUG     Loading Install Config...                
DEBUG       Loading SSH Key...                     
DEBUG       Loading Base Domain...                 
DEBUG         Loading Platform...                  
DEBUG       Loading Cluster Name...                
DEBUG         Loading Base Domain...               
DEBUG         Loading Platform...                  
DEBUG       Loading Pull Secret...                 
DEBUG       Loading Platform...                    
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json" 
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>), compute[0].platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>)]  

Expected results:

    Success

Additional info:

    When `errors.Wrap(err, ...)` was replaced by `fmt.Errorf(...)`, there is a slight difference in behavior: `errors.Wrap` returns nil if `err` is nil, but `fmt.Errorf` always returns an error.
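A small standalone illustration of that difference (not the installer code); note that wrapping a nil error with %w is also what produces the "%!w(<nil>)" text seen in the output above:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	var err error // nil: the marketplace image lookup succeeded

	// errors.Wrap returns nil when err is nil, so no error is reported.
	wrapped := errors.Wrap(err, "could not get marketplace image")

	// fmt.Errorf always builds a new error, even when err is nil, and the
	// nil %w operand renders as "%!w(<nil>)".
	formatted := fmt.Errorf("could not get marketplace image: %w", err)

	fmt.Println(wrapped == nil) // true
	fmt.Println(formatted)      // could not get marketplace image: %!w(<nil>)
}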

Description of problem:

In a local setup, this error appears when creating a Deployment with scaling set in the git form page:
`Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32`
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-05-154400
    

How reproducible:

Everytime
    

Steps to Reproduce:

    1. In the local setup go to the git form page
    2. Enter a git repo and select deployment as the resource type
    3. In scaling enter the value as '5' and click on Create button
    

Actual results:

Got this error:
"Deployment in version "v1" cannot be handled as a Deployment: json: cannot unmarshal string into Go struct field DeploymentSpec.spec.replicas of type int32"
    

Expected results:

Deployment should be created

    

Additional info:

This happens with DeploymentConfig creation as well.
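A minimal standalone Go illustration (not the console or API server code) of why the payload is rejected: decoding a quoted number into an int32 field fails with exactly this kind of message, so the form has to submit replicas as a JSON number rather than a string.

package main

import (
	"encoding/json"
	"fmt"
)

type deploymentSpec struct {
	Replicas int32 `json:"replicas"`
}

func main() {
	var spec deploymentSpec

	// The git form sends the scaling value as a string ("5")...
	fmt.Println(json.Unmarshal([]byte(`{"replicas":"5"}`), &spec))
	// json: cannot unmarshal string into Go struct field deploymentSpec.replicas of type int32

	// ...while the API expects a JSON number.
	fmt.Println(json.Unmarshal([]byte(`{"replicas":5}`), &spec)) // <nil>
}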
    

If the user relies on mirror registries, and the clusterimageset is set to a tagged image (e.g. quay.io/openshift-release-dev/ocp-release:4.15.0-multi), as opposed to a by-digest image (e.g. quay.io/openshift-release-dev/ocp-release@sha256:b86422e972b9c838dfdb8b481a67ae08308437d6489ea6aaf150242b1d30fa1c), then `oc` will fail to pull with:

--icsp-file only applies to images referenced by digest and will be ignored for tags

Instead we should probably block it at the reconcile stage, or give the user clearer CR errors, so they don't have to dig through the assisted-service logs to figure out what went wrong.

The oc error is actually much more confusing: oc ignores the ICSP, tries to pull from quay, and runs into issues because mirror registries are typically used in disconnected environments where quay is unreachable or has a different certificate, so there are a lot of red herrings the user will chase until they realize they should have used a digest.
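As a sketch of the reconcile-stage guard suggested above (hypothetical helper, not the actual assisted-service code; a real check would use proper image-reference parsing):

package validation

import (
	"fmt"
	"strings"
)

// validateReleaseImageForMirrors is a hypothetical guard: when mirror
// registries (ICSP/IDMS) are configured, the release image referenced by the
// ClusterImageSet must be pinned by digest, because mirror rules are ignored
// for tag references.
func validateReleaseImageForMirrors(releaseImage string, mirrorsConfigured bool) error {
	if !mirrorsConfigured {
		return nil
	}
	if !strings.Contains(releaseImage, "@sha256:") {
		return fmt.Errorf("release image %q is referenced by tag; mirror registries only apply to digest references, use an image pinned with @sha256:<digest>", releaseImage)
	}
	return nil
}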

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/158

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1006

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

On https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1756168710529224704 we failed three netpol tests on just one result, failure, with no successess. However the other 9 jobs seemed to run fine.

[sig-network] Netpol NetworkPolicy between server and client should allow ingress access from namespace on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-network] Netpol NetworkPolicy between server and client should allow egress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

[sig-network] Netpol NetworkPolicy between server and client should allow ingress access on one named port [Feature:NetworkPolicy] [Skipped:Network/OVNKubernetes] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]

Something seems funky with these tests: they're barely running, and they don't seem to consistently report results. 4.15 has just 3 runs in the last two weeks; 4.16 has just 1, but it passed.

Whatever's going on, it's capable of taking out a payload, though it's not happening 100% of the time. https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2024-02-10-040857

Description of problem:

The status controller of CCO reconciles 500+ times/h on average on a resting 6-node mint-mode OCP cluster on AWS. 

Steps to Reproduce:

1. Install a 6-node mint-mode OCP cluster on AWS
2. Do nothing with it and wait for a couple of hours
3. Plot the following metric in the metrics dashboard of OCP console:
rate(controller_runtime_reconcile_total{controller="status"}[1h]) * 3600     

Actual results:

500+ reconciles/h on a resting cluster

Expected results:

12-50 reconciles/h on a resting cluster
Note: the reconcile() function always requeues after 5min so the theoretical minimum is 12 reconciles/h

Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/150

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

There are built-in cluster roles that provide access to the default OpenShift SCCs. The "hostmount-anyuid" SCC does not have a functioning built-in cluster role, as it appears to have a typo in its name.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Consistent
    

Steps to Reproduce:

    1. Attempt to use "system:openshift:scc:hostmount" cluster role
    2. 
    3.
    

Actual results:

No access is provided, as the name of the SCC is misspelled
    

Expected results:

Access provided to use the SCC
    

Additional info:


    

While implementing MON-3669, we realized that none of the recording rules running on the telemeter server side are tested. Given how complex these rules can be, it's important for us to be confident that future changes won't introduce regressions.

Even though it is not perfect, it's possible to unit test Prometheus rules with the promtool binary (example in CMO: https://github.com/openshift/cluster-monitoring-operator/blob/2ca7067a4d1fc86b31f7a4816c85da6abc0c8abf/Makefile#L218-L221).

DoD

Description of problem:

Found typos in the 4.14/4.15 branches when reviewing PR https://github.com/openshift/cluster-monitoring-operator/pull/2073.

Example typos in the 4.14 branch:

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.14/pkg/manifests/manifests_test.go

1. systemd unit pattern valiation error
valiation should be validation

2. enable systemd collector with invalid units parttern
parttern should be pattern

3. t.Fatalf("invalid secret namepace, got %s, want %s", s.Namespace, "openshift-user-workload-monitoring")
namepace should be namespace

4. spread contraints
should be spread constraints

Version-Release number of selected component (if applicable):

4.14/4.15

How reproducible:

always

I remember we've added golang-lint to the repo; it should find these errors.

Please review the following PR: https://github.com/openshift/bond-cni/pull/60

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-29441.

The CVO-managed manifests that CMO ships lack capability annotations, as defined in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations.

The dashboards should be tied to the console capability so that when CMO deploys on a cluster without the Console capability, CVO doesn't deploy the dashboards configmap.
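
For illustration, the annotation on such a manifest would look roughly like this (names are hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-example
  namespace: openshift-config-managed
  annotations:
    capability.openshift.io/name: Console
data: {}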

Description of problem:

If CCM is disabled on a cloud platform such as AWS, the installation continues until it fails with the ingress operator stuck in LoadBalancerPending

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

1. Build image with pr openshift/cluster-cloud-controller-manager-operator#284,openshift/installer#7546,openshift/cluster-version-operator#979,openshift/machine-config-operator#3999 
2. Install cluster on aws with "baselineCapabilitySet: v4.14"
3.
    

Actual results:

Installation failed, ingress LoadBalancerPending.
$ oc get node              
NAME                                        STATUS   ROLES                  AGE   VERSION
ip-10-0-25-230.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-3-101.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764
ip-10-0-46-198.us-east-2.compute.internal   Ready    control-plane,master   87m   v1.28.3+20a5764
ip-10-0-48-220.us-east-2.compute.internal   Ready    worker                 80m   v1.28.3+20a5764
ip-10-0-79-203.us-east-2.compute.internal   Ready    control-plane,master   86m   v1.28.3+20a5764
ip-10-0-95-83.us-east-2.compute.internal    Ready    worker                 78m   v1.28.3+20a5764
 
$ oc get co          
NAME                            VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False       False         True       85m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal                       4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
cloud-credential                4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      86m
cluster-autoscaler              4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
config-operator                 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
console                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   False       True          False      79m     DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set       4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      81m
csi-snapshot-controller         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
dns                             4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
etcd                            4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      83m
image-registry                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
ingress                                                                                   False       True          True       78m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights                        4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
kube-apiserver                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      71m
kube-controller-manager         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      82m
kube-scheduler                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      82m
kube-storage-version-migrator   4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
machine-api                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      77m
machine-approver                4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
machine-config                  4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
marketplace                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      84m
monitoring                      4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      73m
network                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      86m
node-tuning                     4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
openshift-apiserver             4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      71m
openshift-controller-manager    4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      75m
openshift-samples               4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      78m
service-ca                      4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False         False      85m
storage                         4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest   True        False     

Expected results:

The installer should tell users not to turn CCM off on cloud platforms.
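
If the intent is simply to keep CCM enabled while using an older baseline capability set, one way is to enable it explicitly in install-config.yaml (a sketch, assuming the capability name CloudControllerManager):

capabilities:
  baselineCapabilitySet: v4.14
  additionalEnabledCapabilities:
  - CloudControllerManager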

Additional info:

    

Please review the following PR: https://github.com/openshift/images/pull/156

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When specifying a control plane operator image for dev purposes, the control plane operator pod fails to come up with an InvalidImageRef status.

Version-Release number of selected component (if applicable):

    Mgmt cluster is 4.15, HyperShift control plane operator is latest from main

How reproducible:

Always    

Steps to Reproduce:

    1. Create a hosted cluster with an annotation to override control plane operator image and point it to a non-digest image ref.
    

Actual results:

    The cluster fails to come up with the CPO pod failing with InvalidImageRef

Expected results:

The cluster comes up fine.    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/390

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    A non-functional filter component should not be present in the resources section of the Search page in phone view

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2023-12-15-211129

How reproducible:

    Always

Steps to Reproduce:

    1. Change to phone view for the browser (Browser -> F12 - Toggle device toolbar)
       eg: iPhone 14 Pro Max
    2. Navigate to Home -> Search page, select one resource
       eg: APIRequestCounts
    3. Check the component in the resources panel

Actual results:

   There is a non-functional filter icon under the 'Create APIRequestCount' button

Expected results:

    Remove the filter component in phone view,
 OR make sure the filter works in phone view if customers need it

Additional info:

    https://drive.google.com/file/d/1Fwb8EGznWkA1z3cVVzGcJJjMFkjkuUhK/view?usp=drive_link

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/116

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/telemeter/pull/497

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

iam:TagInstanceProfile is not listed in the official document [1]; an IPI install fails if the iam:TagInstanceProfile permission is missing

level=error msg=Error: creating IAM Instance Profile (ci-op-4hw2rz1v-49c30-zt9vx-worker-profile): AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-4hw2rz1v-49c30-minimal-perm is not authorized to perform: iam:TagInstanceProfile on resource: arn:aws:iam::301721915996:instance-profile/ci-op-4hw2rz1v-49c30-zt9vx-worker-profile because no identity-based policy allows the iam:TagInstanceProfile action
level=error msg=    status code: 403, request id: bb0641f5-d01c-4538-b333-261a804ddb59

[1] https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account

    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-14-115151
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Install a common IPI cluster with the minimal permissions provided in the official document
    2.
    3.
    

Actual results:

Install failed.
    

Expected results:


    

Additional info:

install does a precheck for iam:TagInstanceProfile
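
For reference, the missing permission can be granted with a minimal IAM policy statement like the following (a sketch to be attached to the installer user alongside the documented permissions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:TagInstanceProfile"
      ],
      "Resource": "*"
    }
  ]
}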
    

Description of problem:

    Panic thrown by origin-tests

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Create aws or rosa 4.15 cluster
    2. run origin tests
    3.
    

Actual results:

    time="2024-03-07T17:03:50Z" level=info msg="resulting interval message" message="{RegisteredNode  Node ip-10-0-8-83.ec2.internal event: Registered Node ip-10-0-8-83.ec2.internal in Controller map[reason:RegisteredNode roles:worker]}"
  E0307 17:03:50.319617      71 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
  goroutine 310 [running]:
  k8s.io/apimachinery/pkg/util/runtime.logPanic({0x84c6f20?, 0xc006fdc588})
  	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:75 +0x99
  k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc008c38120?})
  	k8s.io/apimachinery@v0.29.0/pkg/util/runtime/runtime.go:49 +0x75
  panic({0x84c6f20, 0xc006fdc588})
  	runtime/panic.go:884 +0x213
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.nodeRoles(0x0?)
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:251 +0x1e5
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.recordAddOrUpdateEvent({0x96bcc00, 0xc0076e3310}, {0x7f2a0e47a1b8, 0xc007732330}, {0x281d36d?, 0x0?}, {0x9710b50, 0xc000c5e000}, {0x9777af, 0xedd7be6b7, ...}, ...)
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:116 +0x41b
  github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring.func2({0x8928f00?, 0xc00b528c80})
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:65 +0x185
  k8s.io/client-go/tools/cache.(*FakeCustomStore).Add(0x8928f00?, {0x8928f00?, 0xc00b528c80?})
  	k8s.io/client-go@v0.29.0/tools/cache/fake_custom_store.go:35 +0x31
  k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0xe16d020?}, {0x9694a10, 0xc006b00180}, {0x96d2780, 0xc0078afe00}, {0x96f9e28?, 0x8928f00}, 0x0, ...)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:756 +0x603
  k8s.io/client-go/tools/cache.(*Reflector).watch(0xc0005dcc40, {0x0?, 0x0?}, 0xc005cdeea0, 0xc005bf8c40?)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:437 +0x53b
  k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc0005dcc40, 0xc005cdeea0)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:357 +0x453
  k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:291 +0x26
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10?)
  	k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:226 +0x3e
  k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc007974ec0?, {0x9683f80, 0xc0078afe50}, 0x1, 0xc005cdeea0)
  	k8s.io/apimachinery@v0.29.0/pkg/util/wait/backoff.go:227 +0xb6
  k8s.io/client-go/tools/cache.(*Reflector).Run(0xc0005dcc40, 0xc005cdeea0)
  	k8s.io/client-go@v0.29.0/tools/cache/reflector.go:290 +0x17d
  created by github.com/openshift/origin/pkg/monitortests/testframework/watchevents.startEventMonitoring
  	github.com/openshift/origin/pkg/monitortests/testframework/watchevents/event.go:83 +0x6a5
panic: runtime error: slice bounds out of range [24:23] [recovered]
	panic: runtime error: slice bounds out of range [24:23]

Expected results:

    execution of tests

Additional info:
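
The panic points at a slice expression whose lower bound exceeds the string length. Purely as an illustration of the class of fix (hypothetical names, not the origin code), a guarded role lookup would look like this:

package main

import (
	"fmt"
	"strings"
)

const rolePrefix = "node-role.kubernetes.io/"

// roleFromLabel never slices past the key's length, which is the class of
// error behind "slice bounds out of range [24:23]".
func roleFromLabel(key string) (string, bool) {
	if !strings.HasPrefix(key, rolePrefix) {
		return "", false
	}
	role := key[len(rolePrefix):]
	return role, role != ""
}

func main() {
	fmt.Println(roleFromLabel("node-role.kubernetes.io/worker")) // "worker" true
	fmt.Println(roleFromLabel("node-role.kubernetes.io"))        // "" false (no panic)
}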

    

Description of problem:

Adding image configuration to a HyperShift hosted cluster does not work as expected.

Version-Release number of selected component (if applicable):

# oc get clusterversions.config.openshift.io
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.8   True        False         6h46m   Cluster version is 4.13.0-rc.8      

How reproducible:

Always

Steps to Reproduce:

1. Get hypershift hosted cluster detail from management cluster. 

# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r '.items[].metadata.name')  

2. Apply image setting for hypershift hosted cluster. 
#  oc patch hc/$hostedcluster -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["quay.io","registry.redhat.io","image-registry.openshift-image-registry.svc:5000","insecure.com"],"insecureRegistries":["insecure.com"]}}}}}' --type=merge -n clusters     
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched 

# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.image
{
  "registrySources": {
    "allowedRegistries": [
      "quay.io",
      "registry.redhat.io",
      "image-registry.openshift-image-registry.svc:5000",
      "insecure.com"
    ],
    "insecureRegistries": [
      "insecure.com"
    ]
  }
}

3. Check whether the pods or operators restart to apply the configuration changes.

# oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster}
NAME                              READY   STATUS    RESTARTS   AGE
kube-apiserver-67b6d4556b-9nk8s   5/5     Running   0          49m
kube-apiserver-67b6d4556b-v4fnj   5/5     Running   0          47m
kube-apiserver-67b6d4556b-zldpr   5/5     Running   0          51m

#oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster} -l app=openshift-apiserver
NAME                                   READY   STATUS    RESTARTS   AGE
openshift-apiserver-7c69d68f45-4xj8c   3/3     Running   0          136m
openshift-apiserver-7c69d68f45-dfmk9   3/3     Running   0          135m
openshift-apiserver-7c69d68f45-r7dqn   3/3     Running   0          136m  

4. Check image.config in hosted cluster.
# oc get image.config -o yaml
...
  spec:
    allowedRegistriesForImport: []
  status:
    externalRegistryHostnames:
    - default-route-openshift-image-registry.apps.hypershift-ci-32506.qe.devcluster.openshift.com
    internalRegistryHostname: image-registry.openshift-image-registry.svc:5000  

#oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-128-61.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-130-68.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-134-89.us-east-2.compute.internal    Ready    worker   6h42m   v1.26.3+b404935
ip-10-0-138-169.us-east-2.compute.internal   Ready    worker   6h42m   v1.26.3+b404935

# oc debug node/ip-10-0-128-61.us-east-2.compute.internal
Temporary namespace openshift-debug-mtfcw is created for debugging node...
Starting pod/ip-10-0-128-61us-east-2computeinternal-debug-mctvr ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.61
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# cat /etc/containers/registries.conf
unqualified-search-registries = ["registry.access.redhat.com", "docker.io"]
short-name-mode = ""

[[registry]]
  prefix = ""
  location = "registry-proxy.engineering.redhat.com"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

[[registry]]
  prefix = ""
  location = "registry.redhat.io"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

[[registry]]
  prefix = ""
  location = "registry.stage.redhat.io"

  [[registry.mirror]]
    location = "brew.registry.redhat.io"
    pull-from-mirror = "digest-only"

Actual results:

Config changes are not applied on the backend; the operators and pods do not restart.

Expected results:

The configuration should be applied, and the pods and operators should restart after the config change.

Additional info:

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When I create a pod with empty security context as a user that has access to all SCCs, the SCC annotation shows "privileged"

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. create a bare pod with an empty security context
2. look at the "openshift.io/scc" annotation 

Actual results:

privileged

Expected results:

anyuid

Additional info:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  restartPolicy: Never
  containers:
  - name: fedora
    image: fedora:latest
    command:
    - sleep
    args:
    - "infinity"

 

Due to structural changes in openshift/api, our generate make target fails after an API update:

make: *** No rule to make target 'vendor/github.com/openshift/api/monitoring/v1/0000_50_monitoring_01_alertingrules.crd.yaml', needed by 'jsonnet/crds/alertingrules-custom-resource-definition.json'.  Stop.

Related with https://issues.redhat.com/browse/OCPBUGS-23000

By default, the cluster autoscaler evicts all pods, including those coming from DaemonSets.
In the case of EFS CSI drivers, which are mounted as NFS volumes, this causes stale NFS handles, and application workloads are not terminated gracefully.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

CSI pods should not be evicted by the cluster autoscaler (at least not prior to workload termination), as doing so might cause data corruption

Additional info:

It is possible to disable eviction of CSI pods by adding the following annotation to the CSI driver pods:
cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
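
For illustration, the annotation goes on the DaemonSet pod template, roughly like this (names and image are placeholders):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node
  namespace: openshift-cluster-csi-drivers
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver
        image: example.com/efs-csi-driver:latest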

Description of problem:

 The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

  in a statistically significant pattern 

Steps to Reproduce:

    1. Run the OCP test suite enough times for the pattern to show
    

Actual results:

    fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times

Expected results:

Test pass 

Additional info:

Link to the regression dashboard - https://sippy.dptools.openshift.org/sippy-ng/component_readiness/capability?baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=SCC&component=oauth-apiserver&confidence=95&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&pity=5&sampleEndTime=2023-12-11%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2023-12-05%2000%3A00%3A00

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

Description of problem:

When using a custom CNI plugin in a hosted cluster, Multus requires some CSRs to be approved. The component approving these CSRs is network-node-identity. This component only gets the proper RBAC rules configured when networkType is set to Calico.

In the current implementation, there is a condition that applies the required RBAC only if the networkType is set to Calico[1].

When using other CNI plugins, like Cilium, you're supposed to set networkType to Other. With the current implementation, you won't get the required RBAC in place, and as such the required CSRs won't be approved automatically.


[1] https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L139   
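
For illustration only, a minimal sketch of the broadened condition (type and constant names are simplified stand-ins for the hypershift API, not the actual patch):

package main

import "fmt"

// NetworkType mirrors the hypershift API string type (sketch only).
type NetworkType string

const (
	Calico NetworkType = "Calico"
	Other  NetworkType = "Other"
)

// needsThirdPartyCNIRBAC sketches the broadened check: the CSR-approval RBAC
// should be applied for any third-party CNI, not only Calico.
func needsThirdPartyCNIRBAC(t NetworkType) bool {
	return t == Calico || t == Other
}

func main() {
	fmt.Println(needsThirdPartyCNIRBAC(Other)) // true -> RBAC gets applied
}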

Version-Release number of selected component (if applicable):

Latest    

How reproducible:

Always

Steps to Reproduce:

    1. Set hostedcluster.spec.networking.networkType to Other
    2. Wait for the HC to start deploying and for the Nodes to join the cluster
    3. The nodes will remain in NotReady. Multus pods will complain about certificates not being ready.
    4. If you list CSRs you will find pending CSRs.
    

Actual results:

RBAC not properly configured when networkType set to Other

Expected results:

RBAC properly configured when networkType set to Other

Additional info:

Slack discussion:

https://redhat-internal.slack.com/archives/C01C8502FMM/p1704824277049609

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/303

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/60

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/31

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The e2e-gcp-op-layering CI job seems to be continuously and consistently failing during the teardown process. In particular, it appears to be the TestOnClusterBuildRollsOutImage test that is failing whenever it attempts to tear down the node. See: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-layering/1744805949165539328 for an example of a failing job.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

Open a PR to the GitHub MCO repository.

Actual results:

The teardown portion of the TestOnClusterBuildRollsOutImage test fails as follows:

  utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
    utils.go:1098: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1570
            Error:          Received unexpected error:
                            exit status 1
            Test:           TestOnClusterBuildRollsOutImage
    utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
    utils.go:1098: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1312
                                        /usr/lib/golang/src/runtime/panic.go:522
                                        /usr/lib/golang/src/testing/testing.go:980
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1570
            Error:          Received unexpected error:
                            exit status 1
            Test:           TestOnClusterBuildRollsOutImage

Expected results:

This part of the test should pass.

Additional info:

The way the test teardown process currently works is that it shells out to the oc command to delete the underlying Machine and Node. We delete the underlying machine and node so that the cloud provider will provision us a new one due to issues with opting out of on-cluster builds that have yet to be resolved.

At the time this test was written, it was implemented in this way to avoid having to vendor the Machine client and API into the MCO codebase, which has since happened. I suspect the issue is that oc is failing in some way, since we get an exit status 1 from where it is invoked. Now that the Machine client and API are vendored into the MCO codebase, it makes more sense to use those directly instead of shelling out to oc, since we would get more verbose error messages.
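
As a rough sketch of that direction (assuming the vendored openshift/client-go machine clientset; this is not the actual test code), the deletion could go through the typed client so errors surface directly:

package helpers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	machineclient "github.com/openshift/client-go/machine/clientset/versioned"
)

// deleteMachine sketches replacing the "oc delete machine" shell-out with the
// vendored Machine clientset, so failures come back as typed errors instead of
// a bare "exit status 1".
func deleteMachine(ctx context.Context, client machineclient.Interface, name string) error {
	return client.MachineV1beta1().
		Machines("openshift-machine-api").
		Delete(ctx, name, metav1.DeleteOptions{})
}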

Description of problem:


Documentation for using Red Hat subscriptions in builds is missing a few important steps, especially for customers that have not turned on the tech preview feature for the Shared Resource CSI driver.

These are the following:

1. Customer needs Simple Content Access import enabled in the Insights Operator: https://docs.openshift.com/container-platform/4.12/support/remote_health_monitoring/insights-operator-simple-access.html
2. Customer needs to copy the secret data from openshift-config-managed/etc-pki-entitlement to the workspace the build is running in. We should provide oc commands that a cluster admin/platform team can execute.
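
One possible way to do this (a sketch only; it requires cluster-admin and jq, and "my-build-namespace" is a placeholder):

oc get secret etc-pki-entitlement -n openshift-config-managed -o json \
  | jq 'del(.metadata.namespace, .metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp)' \
  | oc apply -n my-build-namespace -f -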

For builds that are running in a network-restricted environment and access RHEL content through Satellite, the documentation must also provide instructions on how to obtain an `rhsm.conf` file for the Satellite instance and mount it into the build container.

Version-Release number of selected component (if applicable):


4.12

How reproducible:


Always

Steps to Reproduce:


Read the documentation for https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html#builds-create-imagestreamtag_running-entitled-builds and execute the commands as is.

Actual results:


Build probably won't run because the required secret is not created.

Expected results:


Customers should be able to run a build that requires RHEL entitlements following the exact steps as described in the doc.

Additional info:


https://docs.openshift.com/container-platform/4.12/cicd/builds/running-entitled-builds.html

Description of problem:

[vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status in a timely manner

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-04-162702

How reproducible:

Always    

Steps to Reproduce:

1. Set up a vSphere cluster with 4.15 nightly;
2. Back up the secret/vmware-vsphere-cloud-credentials to "vmware-cc.yaml";
3. Change the secret/vmware-vsphere-cloud-credentials password to an invalid value under ns/openshift-cluster-csi-drivers by oc edit;
4. Wait for the cluster storage operator to degrade and the driver controller pods to CrashLoopBackOff, then restore the backed-up secret "vmware-cc.yaml" with oc apply;
5. Observe that the driver controller pods go back to Running and the cluster storage operator returns to healthy.
     

Actual results:

In Step 5: The driver controller pods go back to Running, but the cluster storage operator is stuck in a Degraded: True status for almost 1 hour.

$ oc get po
NAME                                                    READY   STATUS    RESTARTS        AGE
vmware-vsphere-csi-driver-controller-664db7d497-b98vt   13/13   Running   0               16s
vmware-vsphere-csi-driver-controller-664db7d497-rtj49   13/13   Running   0               23s
vmware-vsphere-csi-driver-node-2krg6                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-node-2t928                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-45kb8                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-8vhg9                    3/3     Running   1 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-9fh9l                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-operator-5954476ddc-rkpqq     1/1     Running   2 (3h10m ago)   3h17m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-rxdt8      1/1     Running   0               3h16m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-skcbd      1/1     Running   0               3h16m
$ oc get co/storage -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   False       False         True       8m39s   VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: error logging into vcenter: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      0s
$  oc get co/storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      3m41s
 

Expected results:

In Step 5: After the driver controller pods are back to Running, the cluster storage operator should recover to a healthy status immediately.

Additional info:

Comparing with previous CI results, this issue seems to have started after 4.15.0-0.nightly-2023-11-25-110147.

The hypershift ignition endpoint needlessly supports ALPN HTTP/2. In light of CVE-2023-39325, there is no reason to support HTTP/2 if it is not being used.
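
For reference, the usual way to turn this off in a Go TLS server is to set a non-nil, empty TLSNextProto map so "h2" is never offered via ALPN (a generic sketch, not the hypershift code):

package main

import (
	"crypto/tls"
	"net/http"
)

// newIgnitionServer sketches an HTTPS server with HTTP/2 disabled: a non-nil,
// empty TLSNextProto map tells net/http not to negotiate "h2" via ALPN.
func newIgnitionServer(handler http.Handler) *http.Server {
	return &http.Server{
		Addr:         ":9090", // placeholder port
		Handler:      handler,
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){},
	}
}

func main() {
	srv := newIgnitionServer(http.NotFoundHandler())
	_ = srv // srv.ListenAndServeTLS(certFile, keyFile) would be called with real certs
}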

When rolling back from 4.16 to 4.15, rollback changes made to the cluster state to allow the 4.15 version of the managed image pull secret generation to take over again.

Description of problem:

 - Observed that after upgrading to 4.13.30 (from 4.13.24), on all nodes/projects (replicated on two clusters that underwent the same upgrade), traffic routed from host-networked pods (router-default) calling backends intermittently times out / fails to reach its destination.

  • This manifests as the router pods marking backends as DOWN and dropping traffic, but the behavior can be replicated with curl outside of the HAProxy pods by entering a debug shell on a host node (or SSH) and curling the pod IP directly. A significant percentage of packets time out to the target backend on intermittent subsequent calls.
  • We narrowed the behavior down to the moment we applied the NetworkPolicy for `allow-from-ingress` as outlined below; immediately the namespace began to drop packets on a curl loop running from an infra node directly against the pod IP (some 2-3% of all calls timed out).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: testing
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress

Version-Release number of selected component (if applicable):

 

How reproducible:

  • every time, all namespaces with this network policy on this clusterversion (replicated on two clusters that underwent the same upgrade).
     

Steps to Reproduce:

1. Upgrade cluster to 4.13.30

2. Apply test pod running basic HTTP instance at random port

3. Apply the allow-from-ingress NetworkPolicy and begin a curl loop against the target pod directly from an ingress node (or other worker node) at the host chroot level (node IP).

4. Observe that curls time out intermittently --> the replicator curl loop is below (note the --connect-timeout flag, which lets the loop continue more rapidly without waiting for the full 2m connect timeout on a typical SYN failure).

$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done

Actual results:

 - Traffic to all backends is dropped/degraded as a result of this intermittent failure, which marks valid/healthy pods as unavailable due to connection failures to the backends.

Expected results:

 - Traffic should not be impeded, especially when the NetworkPolicy meant to allow said traffic has been applied.

Additional info:

  • This behavior began immediately after the completed upgrade from 4.13.24 to 4.13.30 and has been replicated on two separate clusters.
  • The customer has been forced to reinstall a cluster at a downgraded version to ensure stability/deliverables for their user base; this is a critical-impact outage scenario for them.

– additional required template details in first comment below.

RCA UPDATE:
So the problem is that the host-network namespace is not labeled by the ingress controller, and if router pods are host-networked, a network policy with the `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, we need to run the ingress controller with `EndpointPublishingStrategy=HostNetwork` https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html and then check the host-network namespace labels with

oc get ns openshift-host-network --show-labels
# expected this
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=

# but before the fix you will see 
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=

Another way to verify that this is the same problem (disruptive, only recommended for test environments) is to make the CNO unmanaged

oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment network-operator -n openshift-network-operator --replicas=0

and then label the openshift-host-network namespace manually based on the expected labels above and see if the problem disappears.
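
A sketch of that manual labeling step, based on the expected label set above (test environments only):

oc label namespace openshift-host-network \
  network.openshift.io/policy-group=ingress \
  policy-group.network.openshift.io/ingress= \
  --overwrite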

Potentially affected versions (may need to reproduce to confirm)

4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070

4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293

4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039

Mitigation/support KCS:
https://access.redhat.com/solutions/7055050

Description of problem:

When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade, and the installer applies an incorrect config to the CPMS, resulting in the masters being recreated.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

always

Steps to Reproduce:

1. create install-config.yaml with static IPs following documentation
2. run `openshift-install create cluster`
3. as install progresses, watch the machines definitions
    

Actual results:

new master machines are created

Expected results:

all machines are the same as what was created by the installer.

Additional info:

    

Description of problem:

The ovn-ipsec-host pods are crashlooping on a 24 node cluster.  

Version-Release number of selected component (if applicable):

 4.16.0, master   

How reproducible:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50690/rehearse-50690-pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-control-plane-ipsec-24nodes/1780216294851743744    

Steps to Reproduce:

Running rehearse test for the PR https://github.com/openshift/release/pull/50690

Actual results:

CI lane fails at control-plane-ipsec-24nodes-ipi-install-install step.

Seeing following errors from ipsec pod:

2024-04-16T14:18:01.158407293Z + counter=0
2024-04-16T14:18:01.158407293Z + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
2024-04-16T14:18:01.158512920Z ovnkube-node has configured node.
2024-04-16T14:18:01.158519623Z + echo 'ovnkube-node has configured node.'
2024-04-16T14:18:01.158519623Z + pgrep pluto
2024-04-16T14:18:01.166444142Z pluto is not running, enable the service and/or check system logs
2024-04-16T14:18:01.166465551Z + echo 'pluto is not running, enable the service and/or check system logs'
2024-04-16T14:18:01.166465551Z + exit 2

Expected results:

The step must pass and CI lane should succeed eventually.    

Additional info:

The mcp status for the worker pool contains the following:
status:
  certExpirys:
  - bundle: KubeAPIServerServingCAData
    expiry: "2034-04-14T12:58:49Z"
    subject: CN=admin-kubeconfig-signer,OU=openshift
  - bundle: KubeAPIServerServingCAData
    expiry: "2024-04-17T12:58:51Z"
    subject: CN=kube-csr-signer_@1713274017
  - bundle: KubeAPIServerServingCAData
    expiry: "2024-04-17T12:58:51Z"
    subject: CN=kubelet-signer,OU=openshift
  - bundle: KubeAPIServerServingCAData
    expiry: "2025-04-16T12:58:51Z"
    subject: CN=kube-apiserver-to-kubelet-signer,OU=openshift
  - bundle: KubeAPIServerServingCAData
    expiry: "2025-04-16T12:58:51Z"
    subject: CN=kube-control-plane-signer,OU=openshift
  - bundle: KubeAPIServerServingCAData
    expiry: "2034-04-14T12:58:50Z"
    subject: CN=kubelet-bootstrap-kubeconfig-signer,OU=openshift
  - bundle: KubeAPIServerServingCAData
    expiry: "2025-04-16T13:26:54Z"
    subject: CN=openshift-kube-apiserver-operator_node-system-admin-signer@1713274014
  conditions:
  - lastTransitionTime: "2024-04-16T13:28:53Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2024-04-16T13:34:52Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2024-04-16T13:35:08Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2024-04-16T13:35:08Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-04-16T13:34:52Z"
    message: All nodes are updating to MachineConfig rendered-worker-226a284eb61d46506202285ee1cf4688
    reason: ""
    status: "True"
    type: Updating
  configuration:
    name: rendered-worker-95c2861c75a83c0523dcba922c3b9982
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 98-worker-generated-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 97-worker-generated-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 80-ipsec-worker-extensions
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
  degradedMachineCount: 0
  machineCount: 24
  observedGeneration: 140
  readyMachineCount: 8
  unavailableMachineCount: 1
  updatedMachineCount: 8

Description of problem:

When there is only one server CSR pending approval, we still show two records (one indicating a client CSR requires approval, which is already several hours old, and the other indicating a server CSR requires approval)

Version-Release number of selected component (if applicable):

pre-merge testing of https://github.com/openshift/console/pull/13493     

How reproducible:

Always    

Steps to Reproduce:

1. select one node which is joining the cluster, approve client CSR and do not approve server CSR, wait for some time

=> we can see only one node is pending on server CSR approval
$ oc get csr | grep Pending | grep system:node
csr-54sn4   142m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-7nhb9   65m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-9g22f   4m4s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-bgrdq   35m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-chqnf   50m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-f4sbl   127m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-msnml   157m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-p9qrp   19m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-qp2pw   112m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-qrlnv   96m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending
csr-tk7j4   81m    kubernetes.io/kubelet-serving                 system:node:ip-10-0-49-55.us-east-2.compute.internal                        <none>              Pending

Actual results:

1. on nodes list page, we can see two rows shown for node ip-10-0-49-55.us-east-2.compute.internal

Expected results:

Since the pending client CSR has been there for several hours and the node is now actually waiting for server CSR approval, we should only show one record/row to indicate to the user that it requires server CSR approval.

The pending client CSR associated with ip-10-0-49-55.us-east-2.compute.internal is already 3 hours old
$ oc get csr csr-4d628
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-4d628   3h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending

Additional info:

    

Description of problem:

MachineAutoscaler resources with a minimum replica size of zero will display a hyphen ("-") instead of zero on the list and detail pages.

Version-Release number of selected component (if applicable):

    4.14.7

How reproducible:

    always

Steps to Reproduce:

    1. Create a MachineAutoscaler with the "min: 0" field
    2. Save the record
    3. Navigate to the MachineAutoscalers page under the Compute tab
    

Actual results:

    the min replicas indicates "-"

Expected results:

    min replicas indicates "0"

Additional info:

    attaching a screenshot
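
For reference, a MachineAutoscaler with a zero minimum looks roughly like this (names are hypothetical):

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: my-cluster-worker-a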

Description of problem:

 The HyperShift CLI requires a security group ID when creating a nodepool; otherwise it fails to create the nodepool.

jiezhao-mac:hypershift jiezhao$ ./bin/hypershift create nodepool aws --name=test --cluster-name=jie-test --node-count=3
2024-02-20T11:29:19-05:00 ERROR Failed to create nodepool {"error": "security group ID was not specified and cannot be determined from default nodepool"}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
	/Users/jiezhao/hypershift-test/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
	/Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/Users/jiezhao/hypershift-test/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/Users/jiezhao/hypershift-test/hypershift/main.go:78
runtime.main
	/usr/local/Cellar/go/1.20.4/libexec/src/runtime/proc.go:250
Error: security group ID was not specified and cannot be determined from default nodepool
jiezhao-mac:hypershift jiezhao$

Version-Release number of selected component (if applicable):
    

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    Nodepool creation should succeed without a security group specified in the HyperShift CLI

Additional info:

    

https://github.com/openshift/release/pull/48835

The scatter chart has been very slow to load for the last week. While there are a few hits here and there over the last y days, it looks like this got a lot more common yesterday around noon and has continued ever since.

Suspicious PR: https://github.com/openshift/origin/pull/28587

Description of problem:

  The Azure CSI driver operator cannot run in a HyperShift control plane because it has this node selector: node-role.kubernetes.io/master: ""

Version-Release number of selected component (if applicable):

    4.16 ci latest

How reproducible:

always    

Steps to Reproduce:

    1. Install hypershift
    2. Create azure hosted cluster
    

Actual results:

    azure-disk-csi-driver-operator pod remains in Pending state

Expected results:

    all control plane pods run

Additional info:

    

Description of problem:

When adding parameters to a pipeline, there is an error when trying to save.

It seems a resource[] section is added; this doesn't happen when using YAML resources and the oc client.

Discussed with Vikram Raj

Version-Release number of selected component (if applicable):

    4.14.12

How reproducible:

    Always

Steps to Reproduce:

    1.Create a pipeline
    2.Add a parameter
    3.Save the pipeline
    

Actual results:

    Error shown

Expected results:

    Save successful

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1004

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/images/pull/159

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

New spot VMs fail to be created by machinesets defining providerSpec.value.spotVMOptions in Azure regions without Availability Zones.

Azure-controller logs the error: Azure Spot Virtual Machine is not supported in Availability Set.

A new availabilitySet is created for each machineset in non-zonal regions, but this only works with normal nodes. Spot VMs and availabilitySets are incompatible as per Microsoft docs for this error: You need to choose to either use an Azure Spot Virtual Machine or use a VM in an availability set, you can't choose both.
From: https://learn.microsoft.com/en-us/azure/virtual-machines/error-codes-spot
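
For reference, the combination that triggers the error can be confirmed directly on the machineset (a sketch; the machineset name is inferred from the machine names below and may differ):

# spotVMOptions present together with an empty zone means the controller
# places the VM in an availability set, which cannot host Spot VMs
oc -n openshift-machine-api get machineset mabad-test-l5x58-worker-southindia-spot \
  -o jsonpath='spot={.spec.template.spec.providerSpec.value.spotVMOptions} zone={.spec.template.spec.providerSpec.value.zone}{"\n"}'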

Version-Release number of selected component (if applicable):

    n/a

How reproducible:

    Always

Steps to Reproduce:

1. Follow the instructions to create a machineset to provision spot VMs: 
  https://docs.openshift.com/container-platform/4.12/machine_management/creating_machinesets/creating-machineset-azure.html#machineset-creating-non-guaranteed-instance_creating-machineset-azure

2. New machines will be in Failed state:
$ oc get machines -A
NAMESPACE               NAME                                            PHASE     TYPE              REGION       ZONE   AGE
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-c4qr5   Failed                                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-dtzsn   Failed                                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-tzrhw   Failed                                          7m28s


3. Events in the failed machines show errors creating spot VMs with availabilitySets:
Events:
  Type     Reason             Age                 From                           Message
  ----     ------             ----                ----                           -------
  Warning  FailedCreate       28s                 azure-controller               InvalidConfiguration: failed to reconcile machine "mabad-test-l5x58-worker-southindia-spot-dx78z": failed to create vm mabad-test-l5x58-worker-southindia-spot-dx78z: failure sending request for machine mabad-test-l5x58-worker-southindia-spot-dx78z: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Azure Spot Virtual Machine is not supported in Availability Set. For more information, see http://aka.ms/AzureSpot/errormessages."    

Actual results:

     Machines stay in Failed state and nodes are not created

Expected results:

     Machines get created and new spot VM nodes added to the cluster.

Additional info:

    This problem was identified from a customer alert in an ARO cluster. ICM for ref (requires b- MSFT account): https://portal.microsofticm.com/imp/v3/incidents/incident/455463992/summary

Description of problem:

This is only applicable to systems that install a performance profile

There seems to be a race condition where not all systemd-spawned processes are being moved to /sys/fs/cgroup/cpuset/system.slice.

This is supposed to be done by the one-shot cpuset-configure.service.
Here is a list of processes I see on one lab system that are still in the root cgroup:

/usr/bin/dbus-broker-launch --scope system --audit
dbus-broker --log 4 --controller 9 --machine-id 071fd738af0146859d2c04b7fea6d276 --max-bytes 536870912 --max-fds 4096 --max-matches 131072 --audit
/usr/sbin/NetworkManager --no-daemon
/usr/sbin/dnsmasq -k
/sbin/agetty -o -p -- \u --noclear - linux
sshd: core@pts/0


Version-Release number of selected component (if applicable):

    4.14, 4.15

How reproducible:

    

Steps to Reproduce:

    1. Reboot a SNO with a performance profile applied
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

We need to do a d/s sync with the u/s multus to support exposing the MTU in the network-status annotation.

The PR was merged u/s https://github.com/k8snetworkplumbingwg/multus-cni/pull/1250

Description of problem:

 The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
Upon quick check I've noticed an error:

oc get co -o json | jq -r '.items[].status | select (.conditions) '.conditions | jq -r '.[] | select( (.type == "Degraded") and (.status == "True") )'

    {
      "lastTransitionTime": "2023-12-19T10:25:24Z",
      "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)",
      "reason": "MultipleTasksFailed",
      "status": "True",
      "type": "Degraded"
    } 

i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded

I ran oc get co again and everything looked fine; it seems this timeout condition could be handled better to avoid alerting SRE.

Actual results:

operator degraded

Expected results:

operator retries operation

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#tabledata includes a reference to `pf-c-table__action`, but v1+ of console-dynamic-plugin-sdk requires PatternFly 5, so the reference should be updated to `pf-v5-c-table__action`.

Currently the openshift-baremetal-install binary is dynamically linked to libvirt-client, meaning that it is only possible to run it on a RHEL system with libvirt installed.

A new version of the libvirt bindings, v1.8010.0, allows the library to be loaded only on demand, so that users who do not execute any libvirt code can run the rest of the installer without needing to install libvirt. (See this comment from Dan Berrangé.) In practice, the "rest of the installer" is everything except the baremetal destroy cluster command (which destroys the bootstrap storage pool - though only if the bootstrap itself has already been successfully destroyed - and has probably never been used by anybody ever). The Terraform providers all run in a separate binary.

There is also a pure-go libvirt library that can be used even within a statically-linked binary on any platform, even when interacting with libvirt. The libvirt terraform provider that does almost all of our interaction with libvirt already uses this library.

Description of problem:

When using the oc cli to query information about release images it is not possible to use the --certificate-authority option to specify an alternative CA bundle for verifying connections to the target registry.

Version-Release number of selected component (if applicable): 4.14.5

How reproducible: 100%

Steps to Reproduce:

    1. oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64

Actual results:

error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

Expected results:

Something beginning with:

Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44

Release Metadata:

Additional info:

To fully verify that this was an issue I went through the following steps, which should show that the oc command is not using the CA bundle in the provided file, and that the command would have worked if oc were using the provided bundle.

// show the command works with the system CA bundle

# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
Name:           4.14.9
Digest:         sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44
Created:        2024-01-12T06:48:42Z
OS/Arch:        linux/amd64
Manifests:      680
Metadata files: 1

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:f5eaf0248779a0478cfd83f055d56dc7d755937800a68ad55f6047c503977c44

Release Metadata:

// move the system CA bundle to the local directory

# mv /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem .

// show the same command now fails without that bundle file

# oc adm release info --registry-config ./auth.json quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority

// show using that same bundle file with --certificate-authority doesn't work

# oc adm release info --registry-config ./auth.json --certificate-authority ./tls-ca-bundle.pem quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head
error: unable to read image quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64: Get "https://quay.io/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority


Additionally this also seems to be a problem for at least the following commands as well:
oc image info
oc adm release extract
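
Until the flag is honored, one workaround is to add the bundle to the host trust store so the default verification path picks it up (a sketch, assuming a RHEL-style host like the one used above):

# make the registry CA part of the system trust store, then retry
sudo cp ./tls-ca-bundle.pem /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract
oc adm release info --registry-config ./auth.json \
  quay.io/openshift-release-dev/ocp-release:4.14.9-x86_64 | head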

Description of problem:
After upgrading to 4.16.0-0.nightly-2024-02-23-013505 from 4.15.0-rc.8 (gcp-ipi-disc-priv-oidc-f14), openshift-cloud-network-config-controller went into CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client"; a must-gather is available. The other job (gcp-ipi-oidc-rt-fips-f14) failed with the same error.
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-oidc-rt-fips-f14/1760520933212164096

must-gather:
https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-gcp-ipi-disc-priv-oidc-f14/1761337726575054848/artifacts/gcp-ipi-disc-priv-oidc-f14/gather-must-gather/artifacts/


    

Version-Release number of selected component (if applicable):

 4.16.0-0.nightly-2024-02-23-013505 
    

How reproducible:


    

Steps to Reproduce:

Upgrade from 4.15.0-rc.8 to 4.16.0-0.nightly-2024-02-23-013505 (gcp-ipi-disc-priv-oidc-f14); openshift-cloud-network-config-controller goes into CrashLoopBackOff with "Error building cloud provider client, err: error: cannot initialize google client". The other job (gcp-ipi-oidc-rt-fips-f14) failed with the same error.
    

Actual results:


containerStatuses:
  - containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02a0ea00865bda78b3b04056dc9e4f596dae74996ecc1fcdee7fbe8d603e33f1
    imageID: 9dfa10971dce332900b111bbe6a28df76e1d6e0c5b9c132c3abfff80ea0afa9c
    lastState:
      terminated:
        containerID: cri-o://b7dc826c4004583a4195f953bb7c858f3645b3ba864db65c69282fc8b7a9a9e8
        exitCode: 255
        finishedAt: "2024-02-24T15:20:08Z"
        message: |
          r,UID:,APIVersion:apps/v1,ResourceVersion:,FieldPath:,},Reason:FeatureGatesInitialized,Message:FeatureGates updated to featuregates.Features{Enabled:[]v1.FeatureGateName{\"AlibabaPlatform\", \"AzureWorkloadIdentity\", \"BuildCSIVolumes\", \"CloudDualStackNodeIPs\", \"ExternalCloudProvider\", \"ExternalCloudProviderAzure\", \"ExternalCloudProviderExternal\", \"ExternalCloudProviderGCP\", \"KMSv1\", \"NetworkLiveMigration\", \"OpenShiftPodSecurityAdmission\", \"PrivateHostedZoneAWS\", \"VSphereControlPlaneMachineSet\"}, Disabled:[]v1.FeatureGateName{\"AdminNetworkPolicy\", \"AutomatedEtcdBackup\", \"CSIDriverSharedResource\", \"ClusterAPIInstall\", \"DNSNameResolver\", \"DisableKubeletCloudCredentialProviders\", \"DynamicResourceAllocation\", \"EventedPLEG\", \"GCPClusterHostedDNS\", \"GCPLabelsTags\", \"GatewayAPI\", \"InsightsConfigAPI\", \"InstallAlternateInfrastructureAWS\", \"MachineAPIOperatorDisableMachineHealthCheckController\", \"MachineAPIProviderOpenStack\", \"MachineConfigNodes\", \"ManagedBootImages\", \"MaxUnavailableStatefulSet\", \"MetricsServer\", \"MixedCPUsAllocation\", \"NodeSwap\", \"OnClusterBuild\", \"PinnedImages\", \"RouteExternalCertificate\", \"SignatureStores\", \"SigstoreImageVerification\", \"TranslateStreamCloseWebsocketRequests\", \"UpgradeStatus\", \"VSphereStaticIPs\", \"ValidatingAdmissionPolicy\", \"VolumeGroupSnapshot\"}},Source:EventSource{Component:cloud-network-config-controller-86bc6cf968-54kkg,Host:,},FirstTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,LastTimestamp:2024-02-24 15:20:07.457325229 +0000 UTC m=+0.107570685,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:cloud-network-config-controller-86bc6cf968-54kkg,ReportingInstance:,}"
          F0224 15:20:08.633010       1 main.go:138] Error building cloud provider client, err: error: cannot initialize google client, err: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused
        reason: Error
        startedAt: "2024-02-24T15:20:07Z"
    name: controller
    ready: false
    restartCount: 12
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=controller pod=cloud-network-config-controller-86bc6cf968-54kkg_openshift-cloud-network-config-controller(95a0c264-ad8b-4fb0-9218-5b2b84fb8194)
        reason: CrashLoopBackOff

    

Expected results:

   CNCC won't crash after upgrade

    

Additional info:


    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:
When trying to install a 4.15 cluster with the LVMS, CNV and MCE operators,
on the operators page I am unable to continue with the installation,
since host discovery points out that the hosts require more resources (as if ODF had also been selected).

How reproducible:
 

Steps to reproduce:

1. Create a multi-node 4.15 cluster

2. Make sure to have enough resources for the LVMS, CNV and MCE operators

3. Select the CNV, LVMS and MCE operators

Actual results:

On the operators page it shows that CPU and RAM resources related to ODF (which is not selected) are also required, and the user is unable to start the installation.

Expected results:

Should be able to start installation

Description of problem:

When upgrading clusters to any 4.13 version (from either 4.12 or 4.13), clusters with Hybrid Networking enabled appear to have a few DNS pods (not all) falling into CrashLoopBackOff status. Notably, pods are failing Readiness probes, and when deleted, work without incident. Nearly all other aspects of upgrade continue as expected

Version-Release number of selected component (if applicable):

4.13.x

How reproducible:

Always for systems in use

Steps to Reproduce:

1. Configure initial cluster installation with OVNKubernetes and enable Hybrid Networking on 4.12 or 4.13, e.g. 4.13.13
2. Upgrade cluster to 4.13.z, e.g. 4.13.14

Actual results:

dns-default pods in CrashLoopBackOff status, failing Readiness probes

Expected results:

dns-default pods are rolled out without incident

Additional info:

Appears strongly related to OCPBUGS-13172. CU has kept an affected cluster with the DNS pod issue ongoing for additional investigating, if needed.
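
For reference, identifying and restarting an affected pod looks roughly like this (a sketch; the placeholder pod name must be replaced):

# list the DNS pods and delete one stuck in CrashLoopBackOff; the
# replacement pod typically starts cleanly, as described above
oc -n openshift-dns get pods
oc -n openshift-dns delete pod <crash-looping-dns-default-pod>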

Please review the following PR: https://github.com/openshift/node_exporter/pull/141

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Reviewing 4.15 Install failures (install should succeed: overall) there are a number of variants impacted by recent install failures.

search.ci: Cluster operator console is not available

Jobs like periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial show installation failures due to console-operator that appear to start with 4.15.0-0.nightly-2023-12-07-225558:

ConsoleOperator reconciliation failed: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again

 

 

4.15.0-0.nightly-2023-12-07-225558 contains console-operator/pull/814, noting in case it is related

 

 

Version-Release number of selected component (if applicable):

 4.15   

How reproducible:

    

Steps to Reproduce:

    1. Review link to install failures above
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade

In https://issues.redhat.com/browse/OCPBUGS-24195 Lukasz is working on a solution to a problem both the auth and apiserver operators have where a large number of identical kube events can be emitted. The kube apiserver was granted an exception here, but the linked bug was never fixed.

These OpenShiftAPICheckFailed events are reportedly originating during bootstrap, and if bootstrap takes too long many can be emitted, which can trip a test that watches for this sort of thing.

Ideally the problem should be fixed and it sounds like Lukasz is on the path to one which we hope could be used for the apiserver operator as well. (start a controller monitoring the aggregated API only after the bootstrap is complete)

 Fix here would hopefully be to leverage what comes out of OCPBUGS-24195, apply it for the apiserver operator, and then remove the exception linked above in origin.

Description of problem:

Recently we bumped the hyperkube image [1] to use both RHEL 9 builder and base images.

In order to keep things consistent, we tried to do the same with the "tests" image [2], however, that was not possible because there is currently no "tools" image on RHEL 9. The "tests" image uses "tools" as the base image.

As a result, we decided to keep builder & base images for "tests" in RHEL 8, as this work was not required for the kube 1.28 bump nor the FIPS issue we were addressing.

However, for the sake of consistency, eventually it'd be good to bump the "tests" builder image to RHEL 9. This would also require us to create a "tools" image based on RHEL 9.

[1] https://github.com/openshift/kubernetes/blob/6ab54b8d9a0ea02856efd3835b6f9df5da9ce115/openshift-hack/images/hyperkube/Dockerfile.rhel#L1

[2] https://github.com/openshift/kubernetes/blob/master/openshift-hack/images/tests/Dockerfile.rhel#L1
 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

"tests" image is build and based on a RHEL 8 image.

Expected results:

"tests" image is build and based on a RHEL 9 image.

Additional info:

 

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Following signing-key deletion, there is a service CA rotation process which might temporarily disrupt platform components, but eventually all should use the updated certificates.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

How reproducible:

100%

Steps to Reproduce:

1.oc delete secret/signing-key -n openshift-service-ca
2. reload the management console
3. 

Actual results:

The Observe tab disappears from the menu bar and the monitoring-plugin shows as unavailable.

Expected results:

No disruption

Additional info:

By manually deleting the monitoring-plugin pods it is possible to recover the situation.
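
A rough sketch of that recovery step (the label selector is an assumption and may need adjusting to the actual pod labels):

# recreate the monitoring-plugin pods so they pick up the rotated service CA
oc -n openshift-monitoring delete pods -l app.kubernetes.io/name=monitoring-plugin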

 

Description of problem:

    RHTAP builds are failing because, with the addition of the request serving node scheduler [1], we use the max() function [2], which requires Go 1.21; however, the Containerfile [3] used for building the HO binary in RHTAP is using 1.20.

[1] https://issues.redhat.com/browse/HOSTEDCP-1478 
[2] https://github.com/openshift/hypershift/pull/3776/files#diff-a7f22add63b0067c0a7c9813255519d1432821f431f6eea0c3373d0646d1a855R489
[3] https://github.com/openshift/hypershift/blob/main/Containerfile.operator#L1
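
For context, max became a language builtin only in Go 1.21, which is why a go1.20 builder image cannot compile the code; a minimal self-contained illustration (not the actual HyperShift code, file path is arbitrary):

# 'max' is a builtin since Go 1.21; building this with a Go 1.20 toolchain
# fails with "undefined: max", matching the RHTAP build failure
cat > /tmp/maxcheck.go <<'EOF'
package main

import "fmt"

func main() { fmt.Println(max(3, 5)) }
EOF
go build -o /tmp/maxcheck /tmp/maxcheck.go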

 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

100%    

Steps to Reproduce:

    1.Run rhtap on main branch
    2.
    3.
    

Actual results:

    Fail

Expected results:

    Pass

Additional info:

    

Background

Lightspeed requires adding a link to each alert item's kebab menu so that it can open a panel

Outcomes

An extension point is added to the alert table so new elements can be added to each alert kebab menu

Description of problem:

Recycler pods are not starting on hostedcontrolplane in disconnected environments ( ImagePullBackOff on quay.io/openshift/origin-tools:latest ).

The root cause is that the recycler-pod template (stored in the recycler-config ConfigMap) on hostedclusters is always pointing to `quay.io/openshift/origin-tools:latest` .

The same configMap for the management cluster is correctly pointing to an image which is part of the release payload:
$ oc get cm -n openshift-kube-controller-manager recycler-config -o json | jq -r '.data["recycler-pod.yaml"]' | grep "image"
      image: "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e458f24c40d41c2c802f7396a61658a5effee823f274be103ac22c717c157308"

but on hosted clusters we have:
$ oc get cm -n clusters-guest414a recycler-config -o json | jq -r '.data["recycler-pod.yaml"]' | grep "image" 
    image: quay.io/openshift/origin-tools:latest

This is likely due to:
https://github.com/openshift/hypershift/blob/e1b75598a62a06534fab6385d60d0f9a808ccc52/control-plane-operator/controllers/hostedcontrolplane/kcm/config.go#L80

quay.io/openshift/origin-tools:latest is not part of any mirrored release payload and it's referenced by tag so it will not be available on disconnected environments.

Version-Release number of selected component (if applicable):

    v4.14, v4.15, v4.16

How reproducible:

    100%

Steps to Reproduce:

    1. create an hosted cluster
    2. check the content of the recycler-config configmap in an hostedcontrolplane namespace
    3.
    

Actual results:

image field for the recycler-pod template is always pointing to `quay.io/openshift/origin-tools:latest` which is not part of the release payload

Expected results:

image field for the recycler-pod template is pointing to the right image (which one???) as extracted from the release payload

Additional info:

see: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/64b4c1ba/bindata/assets/kube-controller-manager/recycler-cm.yaml#L21
to compare with cluster-kube-controller-manager-operator on OCP

Description of problem:

Deleting the node with the Ingress VIP using oc delete node causes a keepalived split-brain

Version-Release number of selected component (if applicable):

4.12, 4.14    

How reproducible:

100%

Steps to Reproduce:

1. In an OpenShift cluster installed via vSphere IPI, check the node with the Ingress VIP.
2. Delete the node.
3. Check the discrepancy between machines objects and nodes. There will be more machines than nodes.
4. SSH to the deleted node, and check the VIP is still mounted and keepalived pods are running.
5. Check the VIP is also mounted in another worker.
6. SSH to the node and check the VIP is still present.     
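
For reference, the checks in steps 3-6 can be done roughly like this (a sketch; 192.0.2.10 stands in for the actual Ingress VIP):

# step 3: compare machines against nodes
oc -n openshift-machine-api get machines
oc get nodes

# steps 4-6: on the deleted node (via SSH), check the VIP and keepalived
ip -4 addr show | grep 192.0.2.10
sudo crictl ps | grep keepalived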

Actual results:

The deleted node still has the VIP present and the ingress fails sometimes 

Expected results:

The deleted node should not have the VIP present and the ingress should not fail.

Additional info:

 

Description of problem:

When the replica for a nodepool is set to 0, the message for the nodepool is "NotFound". This message should not be displayed if the desired replica is 0.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Create a nodepool and set the replica to 0

Steps to Reproduce:

    1. Create a hosted cluster
    2. Set the replica for the nodepool to 0
    3.
    

Actual results:

NodePool message is "NotFound"    

Expected results:

NodePool message to be empty    

Additional info:
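
A minimal way to reproduce and inspect the reported message from the CLI (a sketch; namespace and nodepool names are illustrative):

# scale the nodepool down to zero, then check the MESSAGE column
oc -n clusters patch nodepool my-nodepool --type=merge -p '{"spec":{"replicas":0}}'
oc -n clusters get nodepool my-nodepool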

    

Description of problem:

The pipeline operator has been removed from the operator hub so CI has been failing since
https://search.ci.openshift.org/?search=Entire+pipeline+flow+from+Builder+page+%22before+all%22+hook+for+%22Background+Steps%22&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

    To operate HyperShift at high scale, we need an option to disable dedicated request serving isolation, if not used.

Version-Release number of selected component (if applicable):

    4.16, 4.15, 4.14, 4.13

How reproducible:

    100%

Steps to Reproduce:

    1. Install hypershift operator for versions 4.16, 4.15, 4.14, or 4.13
    2. Observe start-up logs
    3. Dedicated request serving isolation controllers are started
    

Actual results:

    Dedicated request serving isolation controllers are started

Expected results:

    Dedicated request serving isolation controllers to not start, if unneeded

Additional info:

    

==== This Jira covers only baremetal-runtimecfg component with respect to node IP detection ====

Description of problem:

Pods running in the namespace openshift-vsphere-infra are too verbose, printing as INFO messages that should be DEBUG.

This excess of verbosity has an impact on CRI-O, on the node, and also on the logging system.

For instance, with 71 nodes, the number of log entries coming from this namespace in 1 month was 450,000,000, meaning 1 TB of logs written to disk on the nodes by CRI-O, read by the Red Hat log collector, and stored in the log store.

In addition to the impact on performance, it has a financial impact because of the storage needed.

Examples of log messages that are better suited to DEBUG than INFO:
```
/// For the keepalived pods, 4 messages are printed per node every 10 seconds; in this example the number of nodes is 71, which means 284 log entries every 10 seconds, i.e. 1704 log entries per minute per keepalived pod
$ oc logs keepalived-master.example-0 -c  keepalived-monitor |grep master.example-0|grep 2024-02-15T08:20:21 |wc -l

$ oc logs keepalived-master-example-0 -c  keepalived-monitor |grep worker-example-0|grep 2024-02-15T08:20:21 
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
2024-02-15T08:20:21.733399279Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.733421398Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"

/// For the haproxy logs, 2 lines are printed every 6 seconds for each master; this means 6 messages every 6 seconds, i.e. 60 messages/minute per pod
$ oc logs haproxy-master-0-example -c haproxy-monitor
...
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="Searching for Node IP of master-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x]'."
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="For node master-example-0 selected peer address x.x.x.x using NodeInternalIP"
```

Version-Release number of selected component (if applicable):

OpenShift 4.14
VSphere IPI installation

How reproducible:

Always

Steps to Reproduce:

    1. Install OpenShift 4.14 Vsphere IPI environment
    2. Review the logs of the haproxy and keepalived pods running in the namespace `openshift-vsphere-infra`
    

Actual results:

The haproxy-* and keepalived-* pods are too verbose, printing as INFO messages that should be DEBUG.

Some of the messages are available in the Description of the problem in the present bug.

Expected results:

Only relevant messages are printed as INFO, helping to reduce the verbosity of the pods running in the namespace `openshift-vsphere-infra`

Additional info:
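
A rough way to measure the log rate of one of these containers (pod name is illustrative):

# count keepalived-monitor log lines produced in the last minute
oc -n openshift-vsphere-infra logs keepalived-master-example-0 \
  -c keepalived-monitor --since=1m | wc -l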

    

Description of problem:

Changes made for faster risk cache-warming (the OCPBUGS-19512 series) introduced an unfortunate cycle:

1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:

    • (a) Before the cache-warming changes, and also after this bug's fix, Clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
    • (b) Clusters with the cache-warming changes but without this bug's fix say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.

Version-Release number of selected component (if applicable):

The regression went back via:

Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.

How reproducible:

Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.

Steps to Reproduce:

1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in OTA-520. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .

Actual results:

Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).

Expected results:

Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).

Additional info:

Identification

To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:

$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))'
{
  "lastTransitionTime": "2023-12-15T22:00:45Z",
  "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958",
  "reason": "EvaluationFailed",
  "status": "Unknown",
  "type": "Recommended"
} 

To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:36:39.783530       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:36:39.831358       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:40:19.674925       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:40:19.727998       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:43:59.567369       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:43:59.620315       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:47:39.457582       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:47:39.509505       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:51:19.348286       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:51:19.401496       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"

showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:50:10.165101       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:11.166170       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:12.166314       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:13.166517       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:14.166847       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:15.167737       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:16.168486       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:17.169417       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:18.169576       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:19.170544       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail
...no hits...

Recovery

If bitten, the remediation is to address the invalid PromQL. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. And after that the local cluster administrator should restart their CVO, such as with:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:


The signing test assumes a RHEL 8 base image with the corresponding selection of repositories. It should automatically do the right thing.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Jose added a few other rendering tests that use different inputs and outputs. make render-sync should be able to prepare them.

Description of problem:

Bootstrap process failed due to coredns.yaml manifest generation issue:

Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: I0204 05:14:34.966343       1 bootstrap.go:188] manifests/on-prem/coredns.yaml
Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: F0204 05:14:34.966513       1 bootstrap.go:188] error rendering bootstrap manifests: failed to execute template: template: manifests/on-prem/coredns.yaml:34:32: executing "manifests/on-prem/coredns.yaml" at <onPremPlatformAPIServerInternalIPs .ControllerConfig>: error calling onPremPlatformAPIServerInternalIPs: invalid platform for API Server Internal IP
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.

    

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-02-03-192446
4.16.0-0.nightly-2024-02-03-221256
    

How reproducible:

Always
    

Steps to Reproduce:

    1. 1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
    2.
    3.
    

Actual results:

coredns.yaml can not be generated, bootstrap failed.
    

Expected results:

Bootstrap process succeeds.
    

Additional info:
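
For reference, step 1 corresponds to an install-config fragment roughly like this (a sketch based only on the fields named above):

# relevant install-config.yaml fields for custom DNS on GCP
cat <<'EOF'
featureSet: TechPreviewNoUpgrade
platform:
  gcp:
    userProvisionedDNS: Enabled
EOF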

    

Description of problem:

    When deploying a cluster on Power VS, you need to wait for a short period after the workspace is created to facilitate the network configuration. This period is ignored by the DHCP service.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Deploy a cluster on Power VS with an installer provisioned workspace
    2. Observe that the terraform logs ignore the waiting period
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

Currently, assisted-service performs validation of the user pull secret during actions like registering / updating a cluster / infraenv, in which it checks that the pull secret contains the tokens for all release images.  Therefore, currently we can't add nightly release images to stage environment as it blocks users without the right token from installing clusters with different release images.

How reproducible:

Always

Steps to reproduce:

1. Add nightly image to stage. (not recommended)

Actual results:

https://redhat-internal.slack.com/archives/C02RD175109/p1705916732425399?thread_ts=1705915614.824009&cid=C02RD175109

Expected results:

Register cluster successfully.

Description of problem:

The bubble box is rendered with the wrong layout

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-11-16-110328

How reproducible:

Always

Steps to Reproduce:

1. Make sure there is no pod under the project you are using
2. navigate to Networking -> NetworkPolicies -> Create NetworkPolicy page, click the 'affected pods' in Pod selector section
3. Check the layout in the bubble component

Actual results:

The layout is incorrect (shared file: https://drive.google.com/file/d/1I8e2ZkiFO2Gu4nSt9kJ6JmRG3LdvkE-u/view?usp=drive_link)

Expected results:

The layout should be correct

Additional info:

 

Description of problem:

When creating an alerting silence from the RHOCP UI without specifying the "Creator" field, the error "createdBy in body is required" is shown even though the "Creator" field is not marked as mandatory.

Version-Release number of selected component (if applicable):

4.15.5

How reproducible:

100%

Steps to Reproduce:

    1. Login to webconsole (Admin view)
    2. Observe > Alerting
    3. Select the alert to silence
    4. Click Create Silence.
    5. in Info section, update the "Comment" field and skip the "Creator" field. Now, click on Create button.
    6. It will throw an error "createdBy in body is required". 
    

Actual results:

Able to create alerting silence without specifying "Creator" field.

Expected results:

User should not be able to create silences without specifying the "Creator" field, as it should be mandatory.

Additional info:

The steps work well on versions prior to RHOCP 4.15 (tested on 4.14)

Description of problem:

    VolumeSnapshots data  is not displayed in PVC >  VolumeSnapshots tab

Version-Release number of selected component (if applicable):

    4.16.0-0.ci-2024-01-05-050911

How reproducible:

    

Steps to Reproduce:

    1. Create a PVC, e.g. "my-pvc"
    2. Create a Pod and bind it to "my-pvc"
    3. Create a VolumeSnapshot and associate it with "my-pvc" (see the sketch after these steps)
    4. Go to the PVC detail > VolumeSnapshots tab
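
A minimal sketch of step 3, creating a VolumeSnapshot associated with the PVC (the snapshot class name is an assumption; use one available in the cluster):

oc apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-pvc-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: my-pvc
EOF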

Actual results:

  VolumeSnapshots data  is not displayed in PVC >  VolumeSnapshots tab

Expected results:

 VolumeSnapshots data should be displayed in PVC >  VolumeSnapshots tab    

Additional info:

    

 

Description of problem:

When cloning a PVC of 60 GiB, the system autofills the clone size to 8192 PiB. This size cannot be changed in the UI before starting the clone.

Version-Release number of selected component (if applicable):

CNV - 4.14.3

How reproducible:

always

Steps to Reproduce:

1. Create a VM with a PVC of 60 GiB
2. Power off the VM
3. As a cluster admin, clone the 60 GiB PVC (Storage -> PersistentVolumeClaims -> kebab menu next to the PVC)

Actual results:

The system tries to clone the 60 GiB PVC as a 8192 PeB

Expected results:

A new pvc of the 60 GiB

Additional info:

This seems like the closed BZ 2177979. I will upload a screenshot of the UI.
Here is the yaml for the original pvc.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Succeeded
    cdi.kubevirt.io/storage.populator.progress: 100.0%
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.usePopulator: "true"
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-05T17:34:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  - provisioner.storage.kubernetes.io/cloning-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    kubevirt.io/created-by: 60f46f91-2db3-4118-aaba-b1697b29c496
  name: win2k19-base
  namespace: base-images
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: win2k19-base
    uid: 8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resourceVersion: "697047"
  uid: fccb0aa9-8541-4b51-b49e-ddceaa22b68c
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  dataSourceRef:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resources:
    requests:
      storage: "64424509440"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-dbfc9fe9-5677-469d-9402-c2f3a22dab3f
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 60Gi
  phase: Bound



Here is the yaml for the cloning pvc.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-06T14:24:07Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: win2k19-base-clone
  namespace: base-images
  resourceVersion: "1551054"
  uid: f72665c3-6408-4129-82a2-e663d8ecc0cc
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  dataSourceRef:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: "9223372036854775807"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
status:
  phase: Pending
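
As a CLI-side workaround sketch, the clone can be created from a manifest with an explicit size instead of relying on the UI's autofilled request (based on the two PVCs above):

oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: win2k19-base-clone
  namespace: base-images
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: 60Gi           # explicit size matching the source PVC
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
EOF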


  1. When an ESXi host is in maintenance mode, the installer is unable to query the host's version, causing validation to fail.

time="2024-01-04T05:30:45-05:00" level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to generate asset \"Platform Provisioning Check\": platform.vsphere: Internal error: vCenter is failing to retrieve config product version information for the ESXi host: "

https://github.com/openshift/installer/blob/0d56a06e02343e6603128e3f58c6c9bbc2edea3d/pkg/asset/installconfig/vsphere/validation.go#L247-L319

ZTP manifests generated by the openshift-install agent create cluster-manifests command do not contain the correct Group/Version/Kind type metadata.

This means that they could not be applied to an OpenShift cluster to use with ZTP as intended.

The node-exporter pods throw the following errors if `symbolic_name` is not present or not provided by the fibre channel vendor.

$ oc logs node-exporter-m6lbc -n openshift-monitoring -c node-exporter | tail -2
2023-09-27T12:13:39.403106561Z ts=2023-09-27T12:13:39.403Z caller=collector.go:169 level=error msg="collector failed" name=fibrechannel duration_seconds=0.000249813 err="error obtaining FibreChannel class info: failed to read file \"/host/sys/class/fc_host/host0/symbolic_name\": open /host/sys/class/fc_host/host0/symbolic_name: no such file or directory"

https://github.com/prometheus/node_exporter/blob/master/collector/fibrechannel_linux.go#L116C28-L116C28

The ibmvfc kernel module does not supply `symbolic_name`.

    https://github.com/torvalds/linux/blob/master/drivers/scsi/ibmvscsi/ibmvfc.c#L6308

  1. grep -v "zZzZ" -H /sys/class/fc_host/host*/port_state
    /sys/class/fc_host/host0/port_state:Online
    /sys/class/fc_host/host1/port_state:Online

sh-5.1# cd  /sys/class/fc_host/host0
sh-5.1# ls -ltr
total 0
-r--r--r--. 1 root root 65536 Sep 28 19:43 speed
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_type
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_state
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_name
-r--r--r--. 1 root root 65536 Sep 28 19:43 port_id
-r--r--r--. 1 root root 65536 Sep 28 19:43 node_name
-r--r--r--. 1 root root 65536 Sep 28 19:43 fabric_name
-rw-r--r--. 1 root root 65536 Sep 28 19:43 dev_loss_tmo
-rw-r--r--. 1 root root 65536 Oct  3 09:24 uevent
-rw-r--r--. 1 root root 65536 Oct  3 09:24 tgtid_bind_type
-r--r--r--. 1 root root 65536 Oct  3 09:24 supported_classes
lrwxrwxrwx. 1 root root     0 Oct  3 09:24 subsystem -> ../../../../../../class/fc_host
drwxr-xr-x. 2 root root     0 Oct  3 09:24 power
-r--r--r--. 1 root root 65536 Oct  3 09:24 maxframe_size
--w-------. 1 root root 65536 Oct  3 09:24 issue_lip
lrwxrwxrwx. 1 root root     0 Oct  3 09:24 device -> ../../../host0

Description of problem:

The MCO's gcp-e2e-op-single-node job https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node has been failing consistently since early Jan.

It always fails on TestKernelArguments but that happens to be the first time where it gets the node to reboot, after which the node never comes up, so we don't get must-gather and (for some reason) don't get any console gathers either.

This is only 4.16 and only single node. Doing the same test on HA GCP clusters yields no issues. The test itself doesn't seem to matter, as the next test would fail the same way if it was skipped.

This can be reproduced so far only via a 4.16 clusterbot cluster.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%

Steps to Reproduce:

    1. install SNO 4.16 cluster
    2. run MCO's TestKernelArguments
    3.
    

Actual results:

Node never comes back up

Expected results:

Test passes

Additional info:

    

Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.

It is possible to set a custom TLSSecurityProfile without minTLSversion:

$ oc edit apiserver cluster
...
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-ECDSA-AES128-GCM-SHA256

This causes the controller to crash loop:

$ oc get pods -n openshift-cluster-csi-drivers
NAME                                             READY   STATUS             RESTARTS       AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2   6/11    CrashLoopBackOff   10 (18s ago)   37s
...

because the `${TLS_MIN_VERSION}` placeholder is never replaced:

        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}

The observed config in the ClusterCSIDriver shows an empty string:

$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
  "targetcsiconfig": {
    "servingInfo":

{       "cipherSuites": [         "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",         "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"       ],       "minTLSVersion": ""     }

  }
}

which means minTLSVersion is empty when we get to this line, and the string replacement is not done:

[https://github.com/openshift/library-go/blob/c7f15dcc10f5d0b89e8f4c5d50cd313ae158de20/pkg/operator/csi/csidrivercontrollerservicecontroller/helpers.go#L234]

So it seems we have a couple of options:

1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
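
In the meantime, a workaround sketch is to set minTLSVersion explicitly in the custom profile so the placeholder gets a value (VersionTLS12 is an assumption; choose whatever minimum is appropriate):

oc patch apiserver cluster --type=merge -p '{
  "spec": {
    "tlsSecurityProfile": {
      "type": "Custom",
      "custom": {
        "minTLSVersion": "VersionTLS12",
        "ciphers": [
          "ECDHE-ECDSA-CHACHA20-POLY1305",
          "ECDHE-ECDSA-AES128-GCM-SHA256"
        ]
      }
    }
  }
}'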

Please review the following PR: https://github.com/openshift/route-override-cni/pull/54

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    In the service details page, under the Revision and Route tabs, the user sees a "No resource found" message although a Revision and a Route have been created for that service

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    Always

Steps to Reproduce:

    1.Install serverless operator
    2.Create serving instance
    3.Create knative service/ function
    4.Go to details page
    

Actual results:

    User is not able to see Revision and Route created for the service

Expected results:

     User should be able to see Revision and Route created for the service

Additional info:

    

Description of problem:

This bug is to get the fixes from another PR in:
https://github.com/openshift/cluster-etcd-operator/pull/1235

Namely, we were relying on a race condition in the fake library to sync the informer and client lister in order to generate the certificates.
The fix introduces a lister that goes directly via the client, avoiding the informer.

Version-Release number of selected component (if applicable):

4.16.0    

How reproducible:

always, when reordering the statements in the code    

Steps to Reproduce:

Reordering the code blocks as in https://github.com/openshift/cluster-etcd-operator/pull/1235/files#diff-273071b77ba329777b70cb3c4d3fb2e33bc8abf45cb3da28cbee512d591ab9ee 

will immediately expose the race condition in unit tests.     

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    A recently introduced validation in the HyperShift operator's webhook conflicts with the UI's ability to create HCP clusters. Previously the pull secret did not have to be posted before an HC or NP, but with a recent change it is required, because the pull secret is used to validate the release image payload.

This issue is isolated to 4.15

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    100%. Attempt to post an HC before the pull secret is posted and the HC will be rejected. 

The expected outcome is that it should be possible to post the pull secret for an HC after the HC is posted, and the controller should be eventually consistent with this change.

Description of problem:

In OCP 4.14 the catalog pods in openshift-marketplace were defined as:

$ oc get pods -n openshift-marketplace redhat-operators-4bnz4 -o yaml
apiVersion: v1
kind: Pod
metadata:
...
  labels:
    olm.catalogSource: redhat-operators
    olm.pod-spec-hash: 658b699dc
  name: redhat-operators-4bnz4
  namespace: openshift-marketplace
...
spec:
  containers:
  - image: registry.redhat.io/redhat/redhat-operator-index:v4.14
    imagePullPolicy: Always



Now on OCP 4.15 they are defined as:
apiVersion: v1
kind: Pod
metadata:
...
  name: redhat-operators-44wxs
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: CatalogSource
    name: redhat-operators
    uid: 3b41ac7b-7ad1-4d58-a62f-4a9e667ae356
  resourceVersion: "877589"
  uid: 65ad927c-3764-4412-8d34-82fd856a4cbc
spec:
  containers:
  - args:
    - serve
    - /extracted-catalog/catalog
    - --cache-dir=/extracted-catalog/cache
    command:
    - /bin/opm
...
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7259b65d8ae04c89cf8c4211e4d9ddc054bb8aebc7f26fac6699b314dc40dbe3
    imagePullPolicy: Always
...
  initContainers:
...
  - args:
    - --catalog.from=/configs
    - --catalog.to=/extracted-catalog/catalog
    - --cache.from=/tmp/cache
    - --cache.to=/extracted-catalog/cache
    command:
    - /utilities/copy-content
    image: registry.redhat.io/redhat/redhat-operator-index:v4.15
    imagePullPolicy: IfNotPresent
...



Because the initContainer that extracts the content of the index image (referenced by tag) uses `imagePullPolicy: IfNotPresent`, the catalogs are never really updated. 
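
A quick way to confirm this on an affected 4.15 cluster is to print the pull policy of the catalog pod's initContainers (a sketch; the pod name is taken from the excerpt above and will differ per cluster):

$ oc get pod redhat-operators-44wxs -n openshift-marketplace \
    -o jsonpath='{range .spec.initContainers[*]}{.name}{"\t"}{.image}{"\t"}{.imagePullPolicy}{"\n"}{end}'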


    

Version-Release number of selected component (if applicable):

    OCP 4.15.0    

How reproducible:

    100%    

Steps to Reproduce:

    1. wait for the next version of a released operator on OCP 4.15
    2.
    3.
    

Actual results:

    Operator catalogs are never really refreshed due to  imagePullPolicy: IfNotPresent for the index image

Expected results:

    Operator catalogs are periodically (every 10 minutes by default) refreshed

Additional info:

    

Description of problem:

Openshift Console shows "Info alert: Non-printable file detected. File contains non-printable characters. Preview is not available." while editing XML file type ConfigMaps.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create configmap from file:
# oc create cm test-cm --from-file=server.xml=server.xml
configmap/test-cm created

2. If we try to edit the configmap in the OCP console we see the following error:

Info alert:Non-printable file detected.
File contains non-printable characters. Preview is not available.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/127

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

e2e tests are unable to create a prometheus client when legacy service account API tokens are not auto-generated.

Description of problem:

[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get co/image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry             False       True          True       50m     Available: The deployment does not exist...
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe co/image-registry
...
    Message:               Progressing: Unable to apply resources: unable to sync storage configuration: cos region corresponding to a powervs region wdc not found
...

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-ppc64le-2024-01-10-083055

How reproducible:

Always

Steps to Reproduce:

    1. Deploy a PowerVS cluster in wdc06 zone

Actual results:

See above error message

Expected results:

Cluster deploys

Description of problem:

    The HCP CSR flow allows any CN in the incoming CSR.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    Using the CSR flow, any name you add to the CN in the CSR will be your username against the Kubernetes API server - check your username using the SelfSubjectReview API (kubectl auth whoami)

Steps to Reproduce:

    1. Create a CSR with CN=whatever
    2. Get the CSR signed and create a kubeconfig from the issued certificate
    3. Using that kubeconfig, kubectl auth whoami should show the arbitrary CN (see the sketch below)
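
A minimal sketch of these steps; the signer name and file names below are hypothetical, and the signing/approval step is assumed to happen out of band:

$ openssl genrsa -out whatever.key 2048
$ openssl req -new -key whatever.key -subj "/CN=whatever" -out whatever.csr
$ cat <<EOF | oc apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: whatever
spec:
  signerName: hypershift.openshift.io/example-signer   # hypothetical signer name
  request: $(base64 -w0 whatever.csr)
  usages: ["client auth"]
EOF
# once the CSR is approved and signed, build a kubeconfig from the issued
# certificate and key, then check the reported identity:
$ kubectl --kubeconfig=whatever.kubeconfig auth whoami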
    

Actual results:

    any CN in CSR is the username against the cluster

Expected results:

    we should only allow CNs with some known prefix (system:customer-break-glass:...)

Additional info:

    

 

In the `DoHTTPProbe` function located at `github.com/openshift/router/pkg/router/metrics/probehttp/probehttp.go`, logging of the HTTP response object at verbosity level 4 results in a serialisation error due to non-serialisable fields within the `http.Response` object. The error logged is `<<error: json: unsupported type: func() (io.ReadCloser, error)>>`, pointing towards an inability to serialise the `Body` field, which is of type `io.ReadCloser`.

This function is designed to check if a GET request to the specified URL succeeds, logging detailed response information at higher verbosity levels for diagnostic purposes.

Steps to Reproduce:
1. Increase the logging level to 4.
2. Perform an operation that triggers the `DoHTTPProbe` function.
3. Review the logging output for the error message.

Expected Behaviour:

The logger should gracefully handle or exclude non-serialisable fields like `Body`, ensuring clean and informative logging output that aids in diagnostics without encountering serialisation errors.

Actual Behaviour:

Non-serialisable fields in the `http.Response` object lead to the error `<<error: json: unsupported type: func() (io.ReadCloser, error)>>` being logged. This diminishes the utility of logs for debugging at higher verbosity levels.

Impact:

The issue is considered of low severity since it only appears at logging level 4, which is beyond standard operational levels (level 2) used in production. Nonetheless, it could hinder effective diagnostics and clutter logs with errors when high verbosity levels are employed for troubleshooting.

Suggested Fix:

Modify the logging functionality within `DoHTTPProbe` to either filter out non-serialisable fields from the `http.Response` object or implement a custom serialisation approach that allows these fields to be logged in a more controlled and error-free manner.

Description of problem:

    A recent [PR](https://github.com/openshift/hypershift/commit/c030ab66d897815e16d15c987456deab8d0d6da0) updated the kube-apiserver service port to `6443`. That change causes a small outage when upgrading from a 4.13 cluster in IBMCloud. We need to keep the service port as 2040 for IBM Cloud Provider to avoid the outage.
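
For reference, a quick way to check which port the hosted cluster's kube-apiserver Service currently exposes (a sketch; replace the placeholder with the actual hosted control plane namespace):

$ oc get svc kube-apiserver -n <hosted-control-plane-namespace> \
    -o jsonpath='{.spec.ports[0].port}{"\n"}'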

Version-Release number of selected component (if applicable):

    

How reproducible:

 

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In TRT-1476, we created a VM that served as an endpoint where we can test connectivity in gcp.
We want one for AWS.

In TRT-1477, we created some code in origin to send HTTP GETs to that endpoint as a test to ensure connectivity remains working. Do this also for AWS.

TRT members already have an AWS account so we don't need to request one.

Component Readiness has found a potential regression in [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel].

Probability of significant regression: 98.46%

Sample (being evaluated) Release: 4.15
Start Time: 2023-12-29T00:00:00Z
End Time: 2024-01-04T23:59:59Z
Success Rate: 83.33%
Successes: 15
Failures: 3
Flakes: 0

Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 98.36%
Successes: 120
Failures: 2
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=Other&component=Unknown&confidence=95&environment=sdn%20no-upgrade%20amd64%20metal-ipi%20serial&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=sdn&network=sdn&pity=5&platform=metal-ipi&platform=metal-ipi&sampleEndTime=2024-01-04%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2023-12-29%2000%3A00%3A00&testId=openshift-tests%3A9ff4e9b171ea809e0d6faf721b2fe737&testName=%5Bsig-arch%5D%5BLate%5D%20operators%20should%20not%20create%20watch%20channels%20very%20often%20%5Bapigroup%3Aapiserver.openshift.io%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial&variant=serial

Please review the following PR: https://github.com/openshift/coredns/pull/111

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The whereabouts reconciler is responsible for reclaiming dangling IPs and freeing them so they can be allocated to new pods.
This is crucial for scenarios where the number of addresses is limited and dangling IPs prevent whereabouts from successfully allocating new IPs to new pods.

The reconciliation schedule is currently hard-coded to run once a day, with no user-friendly way to configure it.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Create a Whereabouts reconciler daemon set, not able to configure the reconciler schedule.

Steps to Reproduce:

    1. Create a Whereabouts reconciler daemonset
       instructions: https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network

     2. Run `oc get pods -n openshift-multus | grep whereabouts-reconciler`

     3. Run `oc logs whereabouts-reconciler-xxxxx`      

Actual results:

    You can't configure the cron-schedule of the reconciler.

Expected results:

    Be able to modify the reconciler cron schedule.

Additional info:

    The fix for this bug is in two places: whereabouts and cluster-network-operator.
    For this reason, in order to verify correctly we need to use both fixed components.
    Please read below for more details about how to apply the new configurations.

How to Verify:

    Create a whereabouts-config ConfigMap with a custom value, and check in the
    whereabouts-reconciler pods' logs that it is updated, and triggering the clean up.

Steps to Verify:

    1. Create a Whereabouts reconciler daemonset
    2. Wait for the whereabouts-reconciler pods to be running. (takes time for the daemonset to get created).
    3. See in logs: "[error] could not read file: <nil>, using expression from flatfile: 30 4 * * *"
       This means it uses the hardcoded default value. (Because no ConfigMap yet)
    4. Run: oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/2 * * * *" (or apply the equivalent manifest shown below)
    5. Check in the logs for: "successfully updated CRON configuration" 
    6. Check that in the next 2 minutes the reconciler runs: "[verbose] starting reconciler run"
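
The declarative equivalent of step 4, as a sketch (same ConfigMap name, namespace, and key as above):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: whereabouts-config
  namespace: openshift-multus
data:
  reconciler_cron_expression: "*/2 * * * *"
EOF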

 

 

Description of problem:

  • We're seeing the following problem on two baremetal nodes where `routingViaHost=true` is enabled (with ipForwarding properly set to Global):
  • They have set a NodeIP hint to force OVN to bind to a secondary interface at `bond1.2039`
  • Specific pods that are hostNetworked can't reach the default kubernetes service IP address and are failing to initialize as a result (CLBO).

~~~
F0120 03:20:42.221327 879146 driver.go:131] failed to get node "wb02.pdns-edtnabtf-arch01.nfvdev.tlabs.ca" information: Get "https://192.168.0.1:443/api/v1/nodes/wb02.pdns-edtnabtf-arch01.nfvdev.tlabs.ca": dial tcp 192.168.0.1:443: i/o timeout
~~~

Other pods on the affected node with the above config can reach the target service; however, pods that are hostNetworked appear to be failing:

$ oc get pod csi-rbdplugin-kpz7n -o yaml | grep hostNetwork
hostNetwork: true

 

Version-Release number of selected component (if applicable):

4.14

  • bare-metal  

How reproducible:

  • new cluster, every time

Steps to Reproduce:

We have redeployed the cluster and have
routingViaHost and ipForwarding both enabled.

We also pushed out a NODEIP_HINT configuration to all the nodes to make sure the SDN is overlaid on the correct interface.

The default gateway has been moved to bond1.2039 on the 2 x baremetal worker nodes.

wb01

wb02

observe that hostNetworked pods crashloop backoff
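
For reference, a sketch of how routingViaHost and ipForwarding are typically enabled on the cluster Network operator config; this mirrors the configuration described above rather than the customer's exact CR:

$ oc patch network.operator.openshift.io cluster --type=merge -p \
  '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true,"ipForwarding":"Global"}}}}}'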
 

Actual results:

  • hostnetworked pods cannot call the default kube service address  

Expected results:

  • hostnetworked pods should be able to do so.
     

Additional info:

See the first comment for data samples + must-gathers + sosreports

Description of problem:

    Increase MAX_NODES_LIMIT to 300 for 4.16 and 200 for 4.15 so that users don't see the alert "Loading is taking longer than expected" on the Topology page

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Create more than 100 nodes in a namespace
    

Additional info:

    

Platform:

IPI on Baremetal

What happened?

In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754

What did you expect to happen?

Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.

However, since we know the BMH name in the image-customization-controller, it would be possible to configure the ignition to set a default hostname if we don't have one from DHCP/DNS.

If not, we should at least fail the installation with a specific error message to this situation.

----------
30/01/22 - adding how to reproduce
----------

How to Reproduce:

1) Prepare an installation with day-1 static IP.

Add to install-config under one of the nodes:
networkConfig:
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.123.1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - 192.168.123.1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv4:
      address:
      - ip: 192.168.123.110
        prefix-length: 24
      enabled: true

2) Ensure a DNS PTR record for the address IS NOT configured.

3) Create manifests and cluster from install-config.yaml

The installation should either:
1) fail as early as possible and provide some sort of feedback that no hostname was provided, or
2) derive the hostname from the BMH or the ignition files

Description of problem:

    1. The TaskRuns list page is loading constantly for all projects
    2. The Archive icon is not displayed for some tasks on the TaskRun list page
    3. On changing the namespace to All Projects, PipelineRuns and TaskRuns do not load properly

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    Always

Steps to Reproduce:

    1.Create some TaskRun
    2.Go to TaskRun list page
    3.Select all project in project dropdown
    

Actual results:

The screen keeps loading

Expected results:

     Should load TaskRuns from all projects

Additional info:

    

Description of problem:

After a manual crash of an OCP node, the OSPD VM running on that node is stuck in terminating state

Version-Release number of selected component (if applicable):

OCP 4.12.15 
osp-director-operator.v1.3.0
kubevirt-hyperconverged-operator.v4.12.5

How reproducible:

Log in to an OCP 4.12.15 node running a VM.
Manually crash the master node.
After the reboot the VM stays in terminating state.

Steps to Reproduce:

    1. ssh core@masterX 
    2. sudo su
    3. echo c > /proc/sysrq-trigger     

Actual results:

After the reboot the VM stays in terminating state


$ omc get node|sed -e 's/modl4osp03ctl/model/g' | sed -e 's/telecom.tcnz.net/aaa.bbb.ccc/g'
NAME                               STATUS   ROLES                         AGE   VERSION
model01.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model02.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model03.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08


$ omc get pod -n openstack 
NAME                                                        READY   STATUS         RESTARTS   AGE
openstack-provision-server-7b79fcc4bd-x8kkz                 2/2     Running        0          8h
openstackclient                                             1/1     Running        0          7h
osp-director-operator-controller-manager-5896b5766b-sc7vm   2/2     Running        0          8h
osp-director-operator-index-qxxvw                           1/1     Running        0          8h
virt-launcher-controller-0-9xpj7                            1/1     Running        0          20d
virt-launcher-controller-1-5hj9x                            1/1     Running        0          20d
virt-launcher-controller-2-vhd69                            0/1     NodeAffinity   0          43d

$ omc describe  pod virt-launcher-controller-2-vhd69 |grep Status:
Status:                    Terminating (lasts 37h)

$ xsos sosreport-xxxx/|grep time
...
  Boot time: Wed Nov 22 01:44:11 AM UTC 2023
  Uptime:    8:27,  0 users
  

Expected results:

The VM restarts automatically OR does not stay in Terminating state 

Additional info:

The issue has been seen two times.

The first time, a kernel crash occurred and the associated VM on the node ended up in terminating state.

The second time we tried to reproduce the issue by manually crashing the kernel and got the same result.
The VM running on the OCP node stays in terminating state.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When I test PR https://github.com/openshift/machine-config-operator/pull/4083, there is a machineset that does not have any machine linked. 

$ oc get machineset/rioliu-1220c-bz2gp-worker-f -n openshift-machine-api
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
rioliu-1220c-bz2gp-worker-f   0         0                             3h47m

Many errors found in MCD log like below

I1220 09:15:59.743704       1 machine_set_boot_image_controller.go:211] Error syncing machineset openshift-machine-api/rioliu-1220c-bz2gp-worker-f: failed to fetch architecture type of machineset rioliu-1220c-bz2gp-worker-f, err: could not find any machines linked to machineset, error: %!w(<nil>)

the machineset patch is skipped in the reconcile loop due to the above error, so the boot image info cannot be patched even though the machineset does not have any machine provisioned.

Version-Release number of selected component (if applicable):

 

How reproducible:

Consistently

Steps to Reproduce:

https://github.com/openshift/machine-config-operator/pull/4083#issuecomment-1864226629

Actual results:

the machineset is skipped in the reconcile loop due to the above error, so the boot image info cannot be patched

Expected results:

the machineset should be updated even when no linked machine is found, because it may simply have been scaled down to 0 replicas

Additional info:

    

Description of problem:

   In-cluster clients should be able to talk directly to the node-local apiserver IP address and, as a best practice, should all be configured to use it. In cloud environments this load balancer provides the added benefit of health-checking the path from the machine to the load balancer fronting the kube-apiserver. It becomes more crucial in baremetal/on-prem environments where there may not be a load balancer, just 3 unique endpoints directly to redundant kube-apiservers. In that case, if using just DNS, intermittent traffic failures will be experienced when a control plane instance goes down; using the node-local load balancer, there will be no traffic disruption.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100%

Steps to Reproduce:

    1. Schedule a pod on any hypershift cluster node
    2. In the pod run curl -v -k https://172.20.0.1:6443
    3. The verbose output will show that the kube-apiserver cert does not have the node-local client load balancer IP address in its IP SANs and therefore will not allow valid HTTPS requests on that address (see the check below)
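
A sketch of how to inspect the served certificate's SANs from inside the same pod (assuming openssl is available in the pod image):

$ openssl s_client -connect 172.20.0.1:6443 </dev/null 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'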
    

Actual results:

    Secure HTTPS requests cannot be made to the kube-apiserver

Expected results:

     Secure HTTPS requests can be made to the kube-apiserver (no need to run -k when specifying proper CA bundle)

Additional info:

    

The ci/prow/test.local pipeline is currently broken because the build04 cluster assigned to it in the build farm is a bit slow, making github.com/thanos-io/thanos/pkg/store go over the default 900s timeout.

panic: test timed out after 15m0s
running tests:
	TestTSDBStoreSeries (4m3s)
	TestTSDBStoreSeries/1SeriesWith10000000Samples (2m58s)

Extending it makes the test pass

 ok github.com/thanos-io/thanos/pkg/store 984.344s

 

We'll be addressing this alongside a follow-up issue to make the timeout configurable with an env var in upstream Thanos.

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/101

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When trying to deploy with an Internal publish strategy, DNS will fail because the proxy VM cannot launch.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

    1. Set publishStrategy: Internal
    2. Fail
    3.
    

Actual results:

    terraform fails

Expected results:

    private cluster launches

Additional info:

    

Description of problem:

The bootstrap process fails. When attempting to gather logs, the gathering also fails: the SSH connection was refused.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always, when the bootstrap process fails

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

We need to backport https://github.com/cri-o/cri-o/pull/7744 into 1.28 of crio. CI is failing on upgrades due to a feature not in 1.28.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1187

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1190

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Snapshots taken to gather deprecation information from bundles are from the Subscription namespace instead of the CatalogSource namespace. That means that if the Subscription is in a different namespace then no bundles will be present in the snapshot. 

How reproducible:

100% 

Steps to Reproduce:

1. Create a CatalogSource with olm.deprecation entries
2. Create a Subscription in a different namespace targeting a package with deprecations (see the sketch below)
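
A sketch of step 2, with hypothetical names; the only important part is that metadata.namespace differs from spec.sourceNamespace:

$ cat <<EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-sub             # hypothetical name
  namespace: my-workloads       # hypothetical; NOT the CatalogSource namespace
spec:
  channel: stable
  name: example-package         # a package carrying olm.deprecation entries
  source: example-catalog       # hypothetical CatalogSource name
  sourceNamespace: openshift-marketplace
EOF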

Actual results:

No Deprecation Conditions will be present.

Expected results:

Deprecation Conditions should be present.

Please review the following PR: https://github.com/openshift/prometheus/pull/187

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The shutdown-delay-duration argument for the openshift-oauth-apiserver is set to 3s in hypershift, but set to 15s in core openshift. Hypershift should update the value to match.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

This is an issue that IBM Cloud found and it likely affects Power VS. See https://issues.redhat.com/browse/OCPBUGS-28870

Install a private cluster whose base domain set in install-config.yaml is the same as an existing CIS domain name. 
After destroying the private cluster, the DNS resource records remain. 

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note, this domain name is also being used in another existing CIS domain.
2. Install a private ibmcloud cluster, with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com"
3. Destroy the cluster
4. Check the remaining DNS records     

Actual results:

$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4
api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com 
*.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com 
api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com

Expected results:

No more dns records about the cluster

Additional info:

$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'   
Name
private-ibmcloud.qe.devcluster.openshift.com 
private-ibmcloud-1.qe.devcluster.openshift.com 
ibmcloud.qe.devcluster.openshift.com  

$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com

When using private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com as the domain there is no such issue; when using ibmcloud.qe.devcluster.openshift.com as the domain, the DNS records remain. 

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/11

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The customer is asking for this flag so they can keep using the v1 code even when v2 becomes the default.

Description of the problem:

When installing a spoke cluster earlier than 4.14 with a mirror registry config, assisted does not create the required ImageContentSourcePolicy needed to pull images from a custom registry.

How reproducible:

4/4

Steps to reproduce:

1. Install 4.12 spoke cluster with ACM 2.10 using a mirror registry config

Actual results:

Spoke installation fails because the master cannot pull the images needed to run assisted-installer-controller

Expected results:

ICSP created and installation finishes successfully

Description of problem:

When the cluster is upgrading, scrolling the sidebar to the bottom on the Cluster Settings page leaves blank space at the bottom.

Version-Release number of selected component (if applicable):

upgrade 4.14.0-0.nightly-2023-09-09-164123 to 4.14.0-0.nightly-2023-09-10-184037

How reproducible:

Always

Steps to Reproduce:

1.Launch a 4.14 cluster, trigger an upgrade.
2.Go to "Cluster Settings"->"Details" page during upgrade, scroll down the right sidebar to the bottom.
3.

Actual results:

2. It's blank at the bottom

Expected results:

2. Should not show blank.

Additional info:

screenshot: https://drive.google.com/drive/folders/1DenrQTX7K0chbs9hG9ZbSZyY-viRRy1k?ths=true
https://drive.google.com/drive/folders/10dgToTxZf7gOfmL2Mp5gAMVQnM06XvAf?usp=sharing

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/102

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Kube-apiserver operator is trying to delete a prometheus rule that does not exist, leading to a huge amount of unwanted audit logs. 

With the introduction of the change made as part of BUG-2004585, the kube-apiserver SLO rules are split into two groups, kube-apiserver-slos-basic and kube-apiserver-slos-extended. kube-apiserver-operator keeps trying to delete /apis/monitoring.coreos.com/v1/namespaces/openshift-kube-apiserver/prometheusrules/kube-apiserver-slos, which no longer exists in the cluster.

Version-Release number of selected component (if applicable):

4.12
4.13
4.14

How reproducible:

    Its easy to reproduce

Steps to Reproduce:

    1. install a cluster with 4.12
    2. enable cluster logging 
    3. forward the audit log to internal or external logstore using below config

apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  pipelines: 
  - name: all-to-default
    inputRefs:
    - infrastructure
    - application
    - audit
    outputRefs:
    - default     

    4. Check the audit logs in kibana, it will show the logs like below image

Actual results:

    Kube-apiserver-operator is trying to delete a prometheus rule that does not exist in the cluster
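
For reference, a quick check that the old rule is indeed gone while the split rules described above exist:

# should return NotFound:
$ oc get prometheusrule kube-apiserver-slos -n openshift-kube-apiserver
# the split rules introduced by the change above:
$ oc get prometheusrule kube-apiserver-slos-basic kube-apiserver-slos-extended -n openshift-kube-apiserver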

Expected results:

If the rule is not present in the cluster, it should not be targeted for deletion

Additional info:

    

Description of problem:

We should be checking whether `currentVersion` and `desiredVersion` are empty.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-type Service not going "ready", which most likely means it is not getting an IP address.

Tests so far affected are:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges

For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:

operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
         Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]

Where DNSReady:False and LoadBalancerReady:False.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

10% of the time

Steps to Reproduce:

1. Run e2e-gcp-operator many times until you see one of these failures

Actual results:

Test Failure

Expected results:

Not failure

Additional info:

Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer 

This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.

Currently the konnectivity agent has the following update strategy:

```
updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 0

```

We (IBM) suggest to update it to the following:
```
updateStrategy:
  rollingUpdate:
    maxUnavailable: 10%
  type: RollingUpdate
```

In a big cluster, it would speed up the konnectivity-agent update. As the agents are independent, this would not hurt the service.

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/101

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-image/pull/438

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

Description of problem:

We are continuously observing the following test case failure in 4.14 to 4.15 and 4.15 to 4.16 upgrade CI runs.
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available 

 

JobLink:https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le/1746834772249808896

4.14 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.14.0-0.nightly-ppc64le-2024-01-15-085349
4.15 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.15.0-0.nightly-ppc64le-2024-01-15-042536

Please review the following PR: https://github.com/openshift/images/pull/158

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

the example namespaced page is not working   

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-04-22-023835    

How reproducible:

Always    

Steps to Reproduce:

1. Deploy console-demo-plugin manifests, enable the plugin
$ oc apply -f https://raw.githubusercontent.com/openshift/console/master/dynamic-demo-plugin/oc-manifest.yaml 
$ oc patch console.operator cluster --type='json' -p='[{"op": "add", "path": "/spec/plugins/-", "value":"console-demo-plugin"}]'

2. Change to Demo perspective, click on `Example Namespaced Page` menu 

Actual results:

2. an error page returned
Cannot destructure property 'ns' of '(intermediate value)(intermediate value)(intermediate value)' as it is undefined.
    

Expected results:

2. a page with namespace dropdown menu should be rendered    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1597

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When using the modal dialogs in a hook as part of the actions hook (i.e. useApplicationActionsProvider), the console will throw an error since the console framework passes null objects as part of the render cycle. According to Jon Jackson, the console should be safe from null objects, but it looks like the code for useDeleteModal and getGroupVersionKindForResource is not safe.

Version-Release number of selected component (if applicable):

    

How reproducible:

   Always 

Steps to Reproduce:

    1. Use one of the modal APIs in an actions provider hook
    2.
    3.
    

Actual results:

    Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'split')
    at i (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
    at u (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
    at useApplicationActionsProvider (useApplicationActionsProvider.tsx:23:43)
    at ApplicationNavPage (ApplicationDetails.tsx:38:67)
    at na (vendors~main-chunk-8…87b.min.js:174297:1)
    at Hs (vendors~main-chunk-8…87b.min.js:174297:1)
    at Sc (vendors~main-chunk-8…87b.min.js:174297:1)
    at Cc (vendors~main-chunk-8…87b.min.js:174297:1)
    at _c (vendors~main-chunk-8…87b.min.js:174297:1)
    at pc (vendors~main-chunk-8…87b.min.js:174297:1) 

Expected results:

    Works with no error

Additional info:

    

After NM introduced the dns-change event, we are creating an infinite loop of on-prem-resolv-prepender.service runs. This is because our prepender script ALWAYS runs `nmcli general reload dns-rc`, no matter whether the change is actually needed or not.

Because of this, we have the following

1) NM changes DNS
2) the dispatcher script appends a server to /etc/resolv.conf
3) the dispatcher is invoked again on a new dns-change event
4) the dispatcher checks again and creates a new /etc/resolv.conf, identical to the old one
5) NM changes DNS, a dns-change event is invoked
6) goto 3

As a fix, the prepender script should check whether the newly generated file differs from the existing /etc/resolv.conf and only apply the change (and reload) if needed.
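
A minimal sketch of that check, assuming the script assembles its candidate file at a hypothetical temporary path before reloading:

# NEW_RESOLV is a hypothetical path for the freshly generated file
NEW_RESOLV=/run/resolv-prepender/resolv.conf.new
if ! cmp -s "$NEW_RESOLV" /etc/resolv.conf; then
    cp "$NEW_RESOLV" /etc/resolv.conf
    nmcli general reload dns-rc
fi
# if the files are identical, nothing is written and no reload is triggered,
# so no new dns-change event is emitted and the loop is broken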

UDP packets are subject to SNAT in a self-managed OCP 4.13.13 cluster on Azure (OVN-K as CNI) using a LoadBalancer Service with `externalTrafficPolicy: Local`. UDP packets correctly arrive at the node hosting the pod, but the source IP seen by the pod is the OVN GW router of the node.

I've reproduced the customer scenario with the following steps:

This issue is very critical because it is blocking the customer's business.
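
For illustration only, a minimal sketch of the kind of Service involved (hypothetical names and ports, not the customer's manifest):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Service
metadata:
  name: udp-echo            # hypothetical workload
  namespace: udp-test       # hypothetical namespace
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: udp-echo
  ports:
  - protocol: UDP
    port: 5000
    targetPort: 5000
EOF

With externalTrafficPolicy: Local the pod is expected to see the original client source IP, which is exactly what the report says is not happening.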

 

: [bz-Routing] clusteroperator/ingress should not change

Has been failing for over a month in the e2e-metal-ipi-sdn-bm-upgrade jobs 

 

I think this is because there are only two worker nodes in the BM environment and some HA services lose redundancy when one of the workers is rebooted. 

In the medium term I hope to add another node to each cluster, but in the short term we should skip the test.

Description of problem:

On the OCP console, if we edit a parameter related to VMware, add the same value back again, and click Save, the nodes are rebooted.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    

Steps to Reproduce:

    1. On any 4.14+ cluster go to ocp console page
    2. Click on the vmware plugin
    3. Edit any parameter and add the same value again.
    4. Click on save
    

Actual results:

    The nodes reboot to pick up the change

Expected results:

 nodes should not reboot if the same values are entered

Additional info:

    

Yesterday a major DPCR and thus Loki outage took the system down entirely. One test would fail as a result:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1762573177050894336

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "KubeDaemonSetRolloutStuck",
          "alertstate": "firing",
          "container": "kube-rbac-proxy-main",
          "daemonset": "loki-promtail",
          "endpoint": "https-main",
          "job": "kube-state-metrics",
          "namespace": "openshift-e2e-loki",
          "prometheus": "openshift-monitoring/k8s",
          "service": "kube-state-metrics",
          "severity": "warning"
        },
        "value": [
          1709071917.851,
          "1"
        ]
      }
    ]

The query this test uses should be adapted to omit everything in openshift-e2e-loki.

Ideally, backports would be good here, but we could just fix it going forward also if this is too cumbersome.

Description of problem:

When bootstrap logs are collected (e.g. as part of a CI run when bootstrapping fails), they no longer contain most of the Ironic services. These used to run in standalone pods, but after a recent refactoring they are systemd services.

The story is to track the i18n upload/download routine tasks which are performed every sprint.

A.C.

  • Upload strings to Memsource at the start of the sprint and reach out to the localization team
  • Download translated strings from Memsource when it is ready
  • Review the translated strings and open a pull request
  • Open a followup story for next sprint

Description of problem:

When we create a new HostedCluster with HyperShift, the OLM pods on the management cluster cannot be created correctly.
Regardless of whether multi or amd64 images are used, the OLM pods complain:
exec /bin/opm: exec format error
All other pods are running correctly. The nodes on the management cluster are amd64.

    

Version-Release number of selected component (if applicable):

4.15.z
    

How reproducible:

Trigger a rehearsal of this example PR: https://github.com/openshift/release/pull/51141

    

Steps to Reproduce:

    1. Trigger the rehearsal on the PR above: /pj-rehearse periodic-ci-opendatahub-io-ai-edge-main-test-ai-edge-periodic
    2. Locate the cluster name in the log of the Pod test-ai-edge-periodic-hypershift-hostedcluster-create-hostedcluster
    3. Login https://console-openshift-console.apps.hosted-mgmt.ci.devcluster.openshift.com/
    4. Enter the namespace for the ephemeral cluster created by the reherasal
    5. Check Pod, looking for marketplace related pods, like certified-operators-catalog-58f7bd7467-4l2s2
    

Actual results:

The Pods are either CrashLoop or ErrPullImage
    

Expected results:

The Pods are Running
    

Additional info:


    

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/102

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/179

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We shut down the bootstrap node before the control plane hosts are provisioned:

Apr 24 17:30:05 localhost.localdomain master-bmh-update.sh[10498]: openshift-machine-api   openshift-4                                            true             8m24s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[4461]: Waiting for 2 masters to become provisioned
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: NAMESPACE               NAME          STATE          CONSUMER                  ONLINE   ERROR   AGE
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api   openshift-0   provisioning   cluster4-59zbh-master-0   true             8m46s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api   openshift-1   provisioning   cluster4-59zbh-master-1   true             8m45s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api   openshift-2   provisioning   cluster4-59zbh-master-2   true             8m45s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api   openshift-3                                            true             8m44s
Apr 24 17:30:25 localhost.localdomain master-bmh-update.sh[10602]: openshift-machine-api   openshift-4                                            true             8m44s
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[4461]: Stopping provisioning services...
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10708]: deactivating
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[4461]: Unpause all baremetal hosts
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-0 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-1 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-2 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-3 annotated
Apr 24 17:30:45 localhost.localdomain master-bmh-update.sh[10724]: baremetalhost.metal3.io/openshift-4 annotated
Apr 24 17:30:45 localhost.localdomain systemd[1]: Finished Update master BareMetalHosts with introspection data. 

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/106

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When using the monitoring plugin with the console dashboards plugin, if a custom datasource defined in a dashboard is not found, the default in-cluster Prometheus is used to fetch data. This gives the user the false impression that the custom dashboard is working when it should actually fail.

 

How to reproduce:

  • In OpenShift 4.16
  • Install COO
  • Enable the console dashboards plugin as documented here
  • Create a dashboard that uses custom datasources as documented here. Do not create a datasource so the bug can be reproduced
  • Go to monitoring -> dashboards and select the dashboard created above

Expected result

The dashboard should display an error as the custom datasource was not found

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/66

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The security-groups.yaml playbook runs the IPv6 security group rule creation tasks regardless of the os_subnet6 value.
The when clause does not take the os_subnet6 [1] value into account, so the tasks are always executed.

It works with:

  - name: 'Create security groups for IPv6'
    block:
    - name: 'Create master-sg IPv6 rule "OpenShift API"'
    [...]
    when: os_subnet6 is defined

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-11-033133

How reproducible:

Always

Steps to Reproduce:

1. Don't set the os_subnet6 in the inventory file [2] (so it's not dual-stack)
2. Deploy 4.15 UPI by running the UPI playbooks

Actual results:

IPv6 security group rules are created

Expected results:

IPv6 security group rules shouldn't be created

Additional info:
[1] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/security-groups.yaml#L375
[2] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/inventory.yaml#L77

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/135

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api/pull/191

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

HyperShift-managed components use the default RevisionHistoryLimit of 10. This significantly impacts etcd load and scalability on the management cluster.

Version-Release number of selected component (if applicable):

4.9, 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16    

How reproducible:

100% (may vary depending on resource availability on the management cluster)

Steps to Reproduce:

    1. Create 375+ HostedCluster
    2. Observe etcd performance on management cluster
    3.
    

Actual results:

etcd hitting storage space limits    

Expected results:

Able to manage HyperShift control planes at scale (375+ HostedClusters)    

Additional info:
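
A minimal sketch, assuming the fix is simply to set an explicit revisionHistoryLimit on the HyperShift-managed Deployments; the helper name and the value 2 below are illustrative, not the actual change:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// limitRevisionHistory caps how many old ReplicaSets a Deployment keeps around,
// which directly reduces the number of objects stored in etcd per rollout.
func limitRevisionHistory(d *appsv1.Deployment, limit int32) {
    d.Spec.RevisionHistoryLimit = &limit // unset means the Kubernetes default of 10
}

func main() {
    d := &appsv1.Deployment{}
    limitRevisionHistory(d, 2)
    fmt.Println(*d.Spec.RevisionHistoryLimit) // 2
}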

    

The command does not honor Windows path separators.

Related to https://issues.redhat.com//browse/OCPBUGS-28864 (access restricted and not publicly visible). This report serves as a target issue for the fix and its backport to older OCP versions. Please see more details in https://issues.redhat.com//browse/OCPBUGS-28864.

Caught by the test: Undiagnosed panic detected in pod

Sample job run:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1783981854974545920

Error message

{  pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:02.367266       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)
pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:03.368403       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)
pods/openshift-controller-manager_controller-manager-6b66bf5587-6ghjk_controller-manager.log.gz:E0426 23:06:04.370157       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3c6a2a0), concrete:(*abi.Type)(0x3e612c0), asserted:(*abi.Type)(0x419cdc0), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)}

Sippy indicates it's happening a small percentage of the time since around Apr 25th.

Took out the last payload so labeling trt-incident for now.

See the linked OCPBUG for the actual component.
 

HyperShift's ignition server expects certain headers. Assisted needs to gather this data and pass it along when fetching the ignition from HyperShift's ignition server.

Work items:

  1. Modify the host record in the DB to contain the ignition information
  2. Modify the agent kube API and agent controller to get the additional information needed (target config hash and node pool name) and store it in the DB host record
  3. Use these headers when fetching the ignition (see the sketch below)
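
A minimal sketch of item 3, assuming the extra data is passed as HTTP request headers; the header names, URL, and function signature below are placeholders, not HyperShift's actual contract:

package main

import (
    "fmt"
    "io"
    "net/http"
)

// fetchIgnition fetches an ignition payload while passing the extra metadata
// the ignition server expects as HTTP headers. The header names are assumed
// placeholders for the node pool name and target config hash.
func fetchIgnition(url, token, nodePool, targetConfigHash string) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+token)
    req.Header.Set("NodePool", nodePool)                        // assumed header name
    req.Header.Set("TargetConfigVersionHash", targetConfigHash) // assumed header name

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("ignition server returned %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    _, _ = fetchIgnition("https://ignition.example.com/ignition", "token", "my-nodepool", "abc123")
}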

Description of problem:

    The ValidatingAdmissionPolicy admission plugin is set in OpenShift 4.14+ kube-apiserver config, but is missing from the HyperShift config. It should be set.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    4.15: https://github.com/openshift/hypershift/blob/release-4.15/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L293-L341

    4.14: https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L283-L331

Expected results:

    Expect to see ValidatingAdmissionPolicy

Additional info:
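
For illustration only, a sketch of what the fix amounts to: making sure "ValidatingAdmissionPolicy" is part of the --enable-admission-plugins list that the HyperShift control plane renders. The plugin list below is abbreviated and the code is not the actual config generator:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Abbreviated plugin set; the real list in the config generator is much longer.
    plugins := []string{
        "NamespaceLifecycle",
        "ServiceAccount",
        "ValidatingAdmissionWebhook",
        "ValidatingAdmissionPolicy", // the entry currently missing from the HyperShift config
    }
    fmt.Println("--enable-admission-plugins=" + strings.Join(plugins, ","))
}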

    

Description of problem:

The new HyperShift scheduler is not replacing '.' with ',' in subnet label values, resulting in invalid subnet annotations for load balancer services.

Version-Release number of selected component (if applicable):

 4.16.0

How reproducible:

 Always

Steps to Reproduce:

    1. Using new hypershift request serving node scheduler, create a HostedCluster.
    2. Use nodes that are labeled with subnets separated by periods instead of commas.
    

Actual results:

    HostedCluster fails to roll out because router services are not deployed.

Expected results:

    HostedCluster provisions successfully.

Additional info:
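
A minimal sketch of the conversion the scheduler needs to perform, assuming the subnets are stored period-separated in a node label and must be comma-separated in the load balancer annotation (not the actual HyperShift code):

package main

import (
    "fmt"
    "strings"
)

// subnetAnnotationFromLabel converts a period-separated subnet list (label
// values cannot contain commas) into the comma-separated form expected by the
// load balancer service annotation.
func subnetAnnotationFromLabel(labelValue string) string {
    return strings.ReplaceAll(labelValue, ".", ",")
}

func main() {
    fmt.Println(subnetAnnotationFromLabel("subnet-aaa.subnet-bbb.subnet-ccc"))
    // Output: subnet-aaa,subnet-bbb,subnet-ccc
}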

    

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/59

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The number of control plane replicas defined in install-config.yaml (or agent-cluster-install.yaml) should be validated to check that it is set to 3, or to 1 in the case of SNO. If it is set to another value, the "create image" command should fail.

We recently had a case where the number of replicas was set to 2 and the installation failed. It would be good to catch this misconfiguration prior to the install.
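
A minimal sketch of the requested validation, under the assumption that it runs before "create image"; the function name is illustrative:

package main

import (
    "errors"
    "fmt"
)

// validateControlPlaneReplicas accepts only 3 replicas (HA) or 1 replica (SNO);
// anything else should make "create image" fail fast.
func validateControlPlaneReplicas(replicas int64) error {
    if replicas == 3 || replicas == 1 {
        return nil
    }
    return errors.New("controlPlane.replicas must be 3, or 1 for single-node clusters")
}

func main() {
    fmt.Println(validateControlPlaneReplicas(2)) // error: misconfiguration caught before install
}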

Description of problem

The API documentation for the status.componentRoutes.currentHostnames field in the ingress config API has developer notes from the Go definition.

Version-Release number of selected component (if applicable)

OpenShift 4.11 and all subsequent versions of OpenShift so far.

How reproducible

100%.

Steps to Reproduce

1. Read the documentation for the API field: oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1

Actual results

The ingresses.config.openshift.io CRD has developer notes in the description of the status.componentRoutes.currentHostnames field:

% oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1
KIND:     Ingress
VERSION:  config.openshift.io/v1

FIELD:    currentHostnames <[]string>

DESCRIPTION:
     currentHostnames is the list of current names used by the route. Typically,
     this list should consist of a single hostname, but if multiple hostnames
     are supported by the route the operator may write multiple entries to this
     list.

     Hostname is an alias for hostname string validation. The left operand of
     the | is the original kubebuilder hostname validation format, which is
     incorrect because it allows upper case letters, disallows hyphen or number
     in the TLD, and allows labels to start/end in non-alphanumeric characters.
     See https://bugzilla.redhat.com/show_bug.cgi?id=2039256.
     ^([a-zA-Z0-9\p{S}\p{L}]((-?[a-zA-Z0-9\p{S}\p{L}]{0,62})?)|([a-zA-Z0-9\p{S}\p{L}](([a-zA-Z0-9-\p{S}\p{L}]{0,61}[a-zA-Z0-9\p{S}\p{L}])?)(\.)){1,}([a-zA-Z\p{L}]){2,63})$
     The right operand of the | is a new pattern that mimics the current API
     route admission validation on hostname, except that it allows hostnames
     longer than the maximum length:
     ^(([a-z0-9][-a-z0-9]{0,61}[a-z0-9]|[a-z0-9]{1,63})[\.]){0,}([a-z0-9][-a-z0-9]{0,61}[a-z0-9]|[a-z0-9]{1,63})$
     Both operand patterns are made available so that modifications on ingress
     spec can still happen after an invalid hostname was saved via validation by
     the incorrect left operand of the | operator.

Expected results

The second paragraph should be omitted from the CRD:

% oc explain ingresses.status.componentRoutes.currentHostnames --api-version=config.openshift.io/v1
KIND:     Ingress
VERSION:  config.openshift.io/v1

FIELD:    currentHostnames <[]string>

DESCRIPTION:
     currentHostnames is the list of current names used by the route. Typically,
     this list should consist of a single hostname, but if multiple hostnames
     are supported by the route the operator may write multiple entries to this
     list.

Additional info

The API field was introduced in OpenShift 4.8: https://github.com/openshift/api/pull/852/commits/c53c57f3d465f28b27ee4fad48763f049228486e

The developer note was added in OpenShift 4.11: https://github.com/openshift/api/pull/1120/commits/1fec415423985530a8925a5fd8c87e1741d8c2fb

In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:

E0125 11:04:58.158222       1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
panic: send on closed channel

goroutine 15608 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

which unfortunately is an incomplete log file. The operator recovered itself by restarting, but we should fix the panic nonetheless.

Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288
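
For context, a generic sketch (not the actual cluster-etcd-operator code) of how this class of panic is avoided: the results channel is closed only after every health-check goroutine has finished sending, so a check that outlives the caller's deadline can no longer hit a closed channel:

package main

import (
    "fmt"
    "sync"
    "time"
)

type healthCheck struct {
    member string
    err    error
}

// checkMembers fans out per-member health checks. The channel is closed by a
// dedicated goroutine only after every sender has finished, so a slow check
// (for example one that hits "context deadline exceeded") can never send on a
// closed channel.
func checkMembers(members []string) []healthCheck {
    results := make(chan healthCheck, len(members))
    var wg sync.WaitGroup
    for _, m := range members {
        wg.Add(1)
        go func(member string) {
            defer wg.Done()
            time.Sleep(10 * time.Millisecond) // stand-in for the real health probe
            results <- healthCheck{member: member}
        }(m)
    }
    go func() {
        wg.Wait()
        close(results) // safe: all sends have completed
    }()

    var out []healthCheck
    for r := range results {
        out = append(out, r)
    }
    return out
}

func main() {
    fmt.Println(len(checkMembers([]string{"m0", "m1", "m2"})))
}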

Description of the problem:

Installation of a cluster using OCP image 4.15.0-rc.0 with an HTTP proxy configuration failed with:

"Control plane was not installed

3/3 control plane nodes failed to install. Please check the installation logs for more information."
and
"error Host master-0-1: updated status from installing-in-progress to error (Host failed to install because its installation stage Waiting for control plane took longer than expected 1h0m0s)"
After looking at the master-0-1 node journalctl log, the following error was found:
"Dec 21 00:54:29 master-0-1 kubelet.sh[5111]: E1221 00:54:29.290568    5111 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd pod=etcd-bootstrap-member-master-0-1_openshift-etcd(97cad44a9feb70b1091eaa3fb1e565ca)\"" pod="openshift-etcd/etcd-bootstrap-member-master-0-1" podUID="97cad44a9feb70b1091eaa3fb1e565ca"
"

The HTTP proxy configuration works fine with OCP images 4.14.5 and 4.13.26
Steps to reproduce:

1. Setup HTTP Proxy server on hypervisor using quay.io/sameersbn/squid:3.5.27-2

2. Create cluster and got Host Discovery

3. Press Add Host. Fill out SSH public key

4. Select Show proxy settings and fill out

HTTP proxy URL  and No proxy domains 


5. Generate the ISO image, download it, and boot 3 master and 2 worker nodes.

6. Continue regular cluster installation

Actual results:

Installation failed on 69%

Expected results:

Installation passed

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/264

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    In the PipelineRun list page, while fetching TaskRuns for a particular PipelineRun, show a loading indicator if the TaskRuns have not been fetched yet

Version-Release number of selected component (if applicable):

4.15.z    

How reproducible:

    Sometimes

Steps to Reproduce:

    1. Create a failed PipelineRun
    2. Check the Task Status field
    3.
    

Actual results:

    Sometimes TaskRun Status value is -

Expected results:

    Should show status bars

Additional info:

    

When the metal3 plugin is enabled, it adds Disks and NICs tabs to the Node details page.

We would like to remove these tabs because:

  • The same tabs are already added to the Bare Metal Hosts page, so removing them here eliminates the duplication.

Description of problem:

    When adding another IP address to br-ex, geneve traffic sent from this node may be sent with the new IP address rather than the one configured for this tunnel. This will cause traffic to be dropped by the destination with the error:

[root@ovn-control-plane openvswitch]# cat  ovs-vswitchd.log  | grep fc00:f853:ccd:e793::4
2024-04-17T16:47:02.146Z|00012|tunnel(revalidator10)|WARN|receive tunnel port not found (tcp6,tun_id=0xff0003,tun_src=0.0.0.0,tun_dst=0.0.0.0,tun_ipv6_src=fc00:f853:ccd:e793:ffff::1,tun_ipv6_dst=fc00:f853:ccd:e793::3,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_erspan_ver=0,gtpu_flags=0,gtpu_msgtype=0,tun_flags=csum|key,in_port=5,vlan_tci=0x0000,dl_src=0a:58:2b:22:eb:86,dl_dst=0a:58:92:3f:71:e5,ipv6_src=fc00:f853:ccd:e793::4,ipv6_dst=fd00:10:244:1::7,ipv6_label=0x630b1,nw_tos=0,nw_ecn=0,nw_ttl=63,nw_frag=no,tp_src=8080,tp_dst=59130,tcp_flags=syn|ack)

This is more likely to occur on IPv6 than IPv4, due to IP address ordering on the NIC and the Linux rules used to determine the source IP for host-originated traffic.

Version-Release number of selected component (if applicable):

    All versions

How reproducible:

    Always

 

To work around this with IPv6, set preferred_lft 0 on the address, which causes it to become deprecated so that Linux chooses an alternative source address. Alternatively, set external_ids:ovn-set-local-ip="true" in Open vSwitch on each node, which forces OVN to use the configured geneve-encap-ip. Related OVN issue: https://issues.redhat.com/browse/FDP-570

Description of problem:

Agent-based installation is stuck on the booting screen for the arm64 SNO cluster.

The installer should validate the architecture set by the user in install-config.yaml against the payload image being used.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

100%

Steps to Reproduce:

[Fixed original version]

1. Create agent ISO with the amd64 payload
2. Boot the created ISO on arm64 server
3. Monitor the booting screen for error

[Generalized]

1. Set the install-config.yaml controlPlane.architecture to arm64
2. Try to install with an

Actual results:

The installation is currently stuck on the initial booting screen.

Expected results:

The SNO cluster should be installed without any issues.

Additional info:

Compact cluster installation was successful, here is the prow ci link: 

https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-arm64-nightly-baremetal-compact-agent-ipv4-static-connected-p1-f7/1665833590451081216/artifacts/baremetal-compact-agent-ipv4-static-connected-p1-f7/baremetal-lab-agent-install/build-log.txt 
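
A minimal sketch of the validation being asked for, assuming the installer can compare the install-config architecture with the architecture reported by the release payload; the names are illustrative, not the installer's actual API:

package main

import (
    "fmt"
)

// validateArchitecture fails "agent create image" early when the architecture
// in install-config.yaml does not match the architecture of the release payload.
func validateArchitecture(installConfigArch, payloadArch string) error {
    if installConfigArch != payloadArch {
        return fmt.Errorf("install-config architecture %q does not match release payload architecture %q",
            installConfigArch, payloadArch)
    }
    return nil
}

func main() {
    fmt.Println(validateArchitecture("arm64", "amd64"))
}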

Description of problem:

On SNO with the DU profile (RT kernel), the TuneD profile is always degraded because the net.core.busy_read, net.core.busy_poll, and kernel.numa_balancing sysctls do not exist in the RT kernel.

Version-Release number of selected component (if applicable):

4.14.1

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with DU profile(RT kernel)
2. Check tuned profile

Actual results:

oc -n openshift-cluster-node-tuning-operator get profile -o yaml
apiVersion: v1
items:
- apiVersion: tuned.openshift.io/v1
  kind: Profile
  metadata:
    creationTimestamp: "2023-11-09T18:26:34Z"
    generation: 2
    name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
    namespace: openshift-cluster-node-tuning-operator
    ownerReferences:
    - apiVersion: tuned.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Tuned
      name: default
      uid: 4e7c05a2-537e-4212-9009-e2724938dec9
    resourceVersion: "287891"
    uid: 5f4d5819-8f84-4b3b-9340-3d38c41501ff
  spec:
    config:
      debug: false
      tunedConfig: {}
      tunedProfile: performance-patch
  status:
    conditions:
    - lastTransitionTime: "2023-11-09T18:26:39Z"
      message: TuneD profile applied.
      reason: AsExpected
      status: "True"
      type: Applied
    - lastTransitionTime: "2023-11-09T18:26:39Z"
      message: 'TuneD daemon issued one or more error message(s) during profile application.
        TuneD stderr: net.core.rps_default_mask'
      reason: TunedError
      status: "True"
      type: Degraded
    tunedProfile: performance-patch
kind: List
metadata:
  resourceVersion: ""

Expected results:

Not degraded

Additional info:

Looking at the TuneD log, the following errors show up, which are probably causing the profile to enter the degraded state:

2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_read', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option net.core.busy_read will not be set, failed to read the original value.
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'net.core.busy_poll', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option net.core.busy_poll will not be set, failed to read the original value.
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: Failed to read sysctl parameter 'kernel.numa_balancing', the parameter does not exist
2023-11-09 18:30:49,287 ERROR    tuned.plugins.plugin_sysctl: sysctl option kernel.numa_balancing will not be set, failed to read the original value.

These sysctl parameters seem not to be available with RT kernel.

Description of problem:

    The node-network-identity deployment should be configured to assist in a controlled rollout of the microservice pods. The general goal is for a microservice pod to report Ready to Kubernetes only once it has completed initialization and is stable enough to handle tasks.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:
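
A generic sketch of the pattern being requested (not the node-network-identity code): the container answers its readiness probe with 503 until initialization has completed, so Kubernetes only marks the pod Ready once it can actually handle work. The port and endpoint path are assumptions:

package main

import (
    "net/http"
    "sync/atomic"
    "time"
)

func main() {
    var ready atomic.Bool

    // Readiness endpoint: report Ready only after initialization completes.
    http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if !ready.Load() {
            http.Error(w, "initializing", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })

    go func() {
        time.Sleep(2 * time.Second) // stand-in for real initialization work
        ready.Store(true)
    }()

    _ = http.ListenAndServe(":8080", nil)
}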

    

Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/70

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In ROSA/OCP 4.14.z, attaching the AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case it was attached to the ManagedOpenShift-Worker-Role, which is assigned by the installer to all the worker nodes) has no effect on ECR image pulls. The user gets an authentication error. Ideally, attaching the policy should avoid the need to provide an image pull secret; however, the error is resolved only if the user also provides an image pull secret.
This is proven to work correctly in 4.12.z, so something seems to have changed in recent OCP versions.

Version-Release number of selected component (if applicable):

4.14.2 (ROSA)

How reproducible:

The issue is reproducible using the below steps.

Steps to Reproduce:

    1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository
    2. The image pulling will fail with Error: ErrImagePull & authentication required errors
    3.

Actual results:

The image pull fails with "Error: ErrImagePull" & "authentication required" errors. However, the image pull is successful only if the user provides an image-pull-secret to the deployment.

Expected results:

The image should be pulled successfully by virtue of the ECR-read-only policy attached to the worker node role; without needing an image-pull-secret. 

Additional info:


In other words:

In OCP 4.13 (and below), if a user adds the ECR:* permissions to the worker instance profile, the user can specify ECR images and authentication of the worker node to ECR is done using the instance profile. In 4.14 this no longer works.

Providing a pull secret in a deployment is not a sufficient alternative, because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers who, until OCP 4.13, did not have to rotate pull secrets constantly.

The experience in 4.14 should be the same as in 4.13 with ECR.

 

The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecr:BatchGetImage",
                "ecr:GetLifecyclePolicy",
                "ecr:GetLifecyclePolicyPreview",
                "ecr:ListTagsForResource",
                "ecr:DescribeImageScanFindings"
            ],
            "Resource": "*"
        }
    ]
} 

 

 

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/277

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem: In an environment with the following zones, topology was disabled while it should have been enabled by default

$ openstack availability zone list --compute
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| AZ-0      | available   |
| AZ-1      | available   |
| AZ-2      | available   |
+-----------+-------------+

$ openstack availability zone list --volume
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| nova      | available   |
| AZ-0      | available   |
| AZ-1      | available   |
| AZ-2      | available   |
+-----------+-------------+
    

We have a check that verifies that the number of zones is identical for compute and volumes. This check should be removed. However, we still want to ensure that every compute zone has a matching volume zone (a sketch of that check follows below).
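
A minimal sketch of the desired check, assuming plain zone-name matching; the function name is illustrative, not the installer's actual code:

package main

import (
    "fmt"
)

// checkZones: instead of requiring the compute and volume zone lists to be
// identical in size, only require that every compute zone has a matching
// volume zone. Extra volume zones (such as "nova") are allowed.
func checkZones(computeZones, volumeZones []string) error {
    volumes := make(map[string]struct{}, len(volumeZones))
    for _, z := range volumeZones {
        volumes[z] = struct{}{}
    }
    for _, z := range computeZones {
        if _, ok := volumes[z]; !ok {
            return fmt.Errorf("compute availability zone %q has no matching volume availability zone", z)
        }
    }
    return nil
}

func main() {
    compute := []string{"AZ-0", "AZ-1", "AZ-2"}
    volume := []string{"nova", "AZ-0", "AZ-1", "AZ-2"}
    fmt.Println(checkZones(compute, volume)) // <nil>: extra volume zones are fine
}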

Description of problem:

A NetworkAttachmentDefinition always gets created in the default namespace when using the form method from the console.
It also does not honor the selected name and creates the NAD object with a different name (the selected name plus a random suffix).

Version-Release number of selected component (if applicable):

OCP 4.15.5

How reproducible:

From the console, under the Networking section, select NetworkAttachmentDefinitions and create a NAD using the Form method and not the YAML one.

Actual results:

The NAD gets created in the wrong namespace (always ends up in the default namespace) and with the wrong name.

Expected results:

The NAD resource gets created in the currently selected namespace with the chosen name

Description of problem:

The customer uses the Azure File CSI driver, and without this they cannot make use of the Azure Workload Identity work, which was one of the banner features of OCP 4.14. This feature is currently available in 4.16; however, it will take the customer 3-6 months to validate 4.16 and start its rollout, putting their plans to complete a large migration to Azure by the end of 2024 at risk.
Could you please backport either the 1.29.3 feature for Azure Workload Identity or rebase our Azure File CSI driver in 4.14 and 4.15 to at least 1.29.3, which includes the desired feature.
    

Version-Release number of selected component (if applicable):

azure-file-csi-driver in 4.14 and 4.15
- In 4.14, azure-file-csi-driver is version 1.28.1
- In 4.15, azure-file-csi-driver is version 1.29.2
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Install ocp 4.14 with Azure Workload Managed Identity
    2. Try to configure Managed Workload Identity with the Azure File CSI driver

https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/workload-identity-static-pv-mount.md
    

Actual results:

Is not usable
    

Expected results:

Azure Workload Identity should be manageable with the Azure File CSI driver as part of the whole feature
    

Additional info:

    

Description of problem:

The degradation of the storage operator occurred because it couldn't locate the node by UUID. I noticed that the providerID was present for node 0, but it was blank for the other nodes. A successful installation can be achieved on day 2 by executing step 4 after step 7 from this document: https://access.redhat.com/solutions/6677901. Additionally, if we provide credentials from the install-config, it is necessary to add the uninitialized taint to the node (oc adm taint node "$NODE" node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule) after bootstrap has completed.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

100%

Steps to Reproduce:

    1. Create an agent ISO image
    2. Boot the created ISO on vSphere VM    

Actual results:

Installation is failing due to storage operator unable to find the node by UUID.

Expected results:

Storage operator should be installed without any issue.

Additional info:

Slack discussion: https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1702893456002729

Description of problem:

To bump some dependencies for CVE fixes, we added `replace` directives in the go.mod file. These dependencies have since moved way past the pinned version.
We should drop the replaces before we run into problems from having deps pinned to versions that are too old. For example, I've seen PRs with the following diff:

# golang.org/x/net v0.23.0 => golang.org/x/net v0.5.0

which is not really what we want.    

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Some dependencies are not upgraded because they are pinned.

Expected results:

    

Additional info:

    

Description of the problem:

API tests, running from test-infra, set OPENSHIFT_VERSION=4.15.

We expect the service to return the latest stable version (x.y.z).

The returned version is 4.15.8-multi, which is not from the stable stream but from the candidate stream, and should not be chosen.

This behaviour breaks the API tests because we expect the latest stable release to be picked when sending Major.Minor.
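
A minimal sketch of the expected selection logic (not the assisted-service implementation), assuming stable releases are plain x.y.z versions without a pre-release suffix:

package main

import (
    "fmt"

    "golang.org/x/mod/semver"
)

// latestStable picks the highest stable z-stream for a given major.minor,
// skipping anything with a pre-release suffix such as "-multi".
func latestStable(majorMinor string, available []string) string {
    best := ""
    for _, v := range available {
        sv := "v" + v
        if !semver.IsValid(sv) || semver.Prerelease(sv) != "" {
            continue // not a stable x.y.z release
        }
        if semver.MajorMinor(sv) != "v"+majorMinor {
            continue
        }
        if best == "" || semver.Compare(sv, "v"+best) > 0 {
            best = v
        }
    }
    return best
}

func main() {
    fmt.Println(latestStable("4.15", []string{"4.15.8-multi", "4.15.6", "4.15.7", "4.14.20"}))
    // Output: 4.15.7
}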

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

CI is flaky and causing issues for the in-cluster team.

Integration tests need a clean-up and should be made more robust.

Description of the problem:

Non-Nutanix node was successfully added to Nutanix day1 cluster

How reproducible:

100%

Steps to reproduce:

1. Deploy a Nutanix day-1 cluster

2. Try to add a non-Nutanix day-2 node to the Nutanix cluster

Actual results:

Day-2 node installation started and host installed

Expected results:

Day-2 node doesn't pass pre-installation checks

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/149

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ARO supplies a platform kubeletconfig to enable certain features, currently we use this to enable node sizing or enable autoSizingReserved. Customers want the ability to customize podPidsLimit and we have directed them to configure a second kubeletconfig.

When these kubeletconfigs are rendered into machineconfigs, the order of their application is nondeterministic: the MCs are suffixed by an increasing serial number based on the order the kubeletconfigs were created. This makes it impossible for the customer to ensure their PIDs limit is applied while still allowing ARO to maintain our platform defaults.

We need a way of supplying platform defaults while still allowing the customer to make supported modifications in a way that does not risk being reverted during upgrades or other maintenance.

This issue has manifested in two different ways: 

During an upgrade from 4.11.31 to 4.12.40, a cluster had the order of kubeletconfig rendered machine configs reverse. We think that in older versions, the initial kubeletconfig did not get an mc-name-suffix annotation applied, but rendered to "99-worker-generated-kubelet" (no suffix). The customer-provided kubeletconfig rendered to the suffix "-1". During the upgrade, MCO saw this as a new kubeletconfig and assigned it the suffix "-2", effectively reversing their order. See the RCS document https://docs.google.com/document/d/19LuhieQhCGgKclerkeO1UOIdprOx367eCSuinIPaqXA

ARO wants to make updates to the platform defaults. We are changing from a kubeletconfig "aro-limits" to a kubeletconfig "dynamic-node". We want to be able to do this while still keeping it as defaults and if the customer has created their own kubeletconfig, the customer's should still take precedence. What we see is that the creation of a new kubeletconfig regardless of source overrides all other kubeletconfigs, causing the customer to lose their customization.

Version-Release number of selected component (if applicable):

4.12.40+

ARO's older kubeletconfig "aro-limits":

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    aro.openshift.io/limits: ""
  name: aro-limits
spec:
  kubeletConfig:
    evictionHard:
      imagefs.available: 15%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    systemReserved:
      memory: 2000Mi
  machineConfigPoolSelector:
    matchLabels:
      aro.openshift.io/limits: ""

ARO's newer kubeletconfig, "dynamic-node"

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/mco-built-in
      operator: Exists

 

Customer's desired kubeletconfig:

 

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    arogcd.arogproj.io/instance: cluster-config
  name: default-pod-pids-limit
spec:
  kubeletConfig:
    podPidsLimit: 2000000
  machineConfigPoolSelector:
    matchExpressions:
    - key: pools.operator.machineconfiguration.io/worker
      operator: Exists

 

Description of problem:

Installing an IPI cluster from a 4.15 nightly build on Azure MAG, on Azure Stack Hub, or with Azure workload identity results in the image-registry cluster operator being degraded with different errors.

On MAG:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       5h44m   AzurePathFixControllerDegraded: Migration failed: panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host...

$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-ssn5w                               0/1     Error     0               5h47m
cluster-image-registry-operator-86cdf775c7-7brn6   1/1     Running   1 (5h50m ago)   5h58m
image-registry-5c6796b86d-46lvx                    1/1     Running   0               5h47m
image-registry-5c6796b86d-9st5d                    1/1     Running   0               5h47m
node-ca-48lsh                                      1/1     Running   0               5h44m
node-ca-5rrsl                                      1/1     Running   0               5h47m
node-ca-8sc92                                      1/1     Running   0               5h47m
node-ca-h6trz                                      1/1     Running   0               5h47m
node-ca-hm7s2                                      1/1     Running   0               5h47m
node-ca-z7tv8                                      1/1     Running   0               5h44m

$ oc logs azure-path-fix-ssn5w -n openshift-image-registry
panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such hostgoroutine 1 [running]:
main.main()
    /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:49 +0x125

The blob storage endpoint does not look correct; it should be:
$ az storage account show -n imageregistryjima41xvvww -g jima415a-hfxfh-rg --query primaryEndpoints
{
  "blob": "https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/",
  "dfs": "https://imageregistryjima41xvvww.dfs.core.usgovcloudapi.net/",
  "file": "https://imageregistryjima41xvvww.file.core.usgovcloudapi.net/",
  "internetEndpoints": null,
  "microsoftEndpoints": null,
  "queue": "https://imageregistryjima41xvvww.queue.core.usgovcloudapi.net/",
  "table": "https://imageregistryjima41xvvww.table.core.usgovcloudapi.net/",
  "web": "https://imageregistryjima41xvvww.z2.web.core.usgovcloudapi.net/"
}

On Azure Stack Hub:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       3h32m   AzurePathFixControllerDegraded: Migration failed: panic: open : no such file or directory...

$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-8jdg7                               0/1     Error     0               3h35m
cluster-image-registry-operator-86cdf775c7-jwnd4   1/1     Running   1 (3h38m ago)   3h54m
image-registry-658669fbb4-llv8z                    1/1     Running   0               3h35m
image-registry-658669fbb4-lmfr6                    1/1     Running   0               3h35m
node-ca-2jkjx                                      1/1     Running   0               3h35m
node-ca-dcg2v                                      1/1     Running   0               3h35m
node-ca-q6xmn                                      1/1     Running   0               3h35m
node-ca-r46r2                                      1/1     Running   0               3h35m
node-ca-s8jkb                                      1/1     Running   0               3h35m
node-ca-ww6ql                                      1/1     Running   0               3h35m

$ oc logs azure-path-fix-8jdg7 -n openshift-image-registry
panic: open : no such file or directorygoroutine 1 [running]:
main.main()
    /go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:36 +0x145

On a cluster with Azure workload identity:
Some operators' PROGRESSING condition is True
image-registry                             4.15.0-0.nightly-2024-02-16-235514   True        True          False      43m     Progressing: The deployment has not completed...

The pod azure-path-fix is in CreateContainerConfigError status, and the following error appears in its events.

"state": {
    "waiting": {
        "message": "couldn't find key REGISTRY_STORAGE_AZURE_ACCOUNTKEY in Secret openshift-image-registry/image-registry-private-configuration",
        "reason": "CreateContainerConfigError"
    }
}                

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-02-16-235514    

How reproducible:

    Always

Steps to Reproduce:

    1. Install IPI cluster on MAG or Azure Stack Hub or config Azure workload identity
    2.
    3.
    

Actual results:

    Installation failed and image-registry operator is degraded

Expected results:

    Installation is successful.

Additional info:

    The issue seems to be related to https://github.com/openshift/image-registry/pull/393
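
For illustration, a sketch of building the blob endpoint from the cloud environment instead of hard-coding the public-cloud suffix; the environment names and suffixes are standard Azure values, while the function itself is an assumption rather than the move-blobs code:

package main

import "fmt"

// blobEndpoint builds the blob endpoint from the storage account name and the
// cloud-specific suffix instead of assuming the public-cloud "core.windows.net".
func blobEndpoint(account, cloud string) string {
    suffixes := map[string]string{
        "AzurePublicCloud":       "blob.core.windows.net",
        "AzureUSGovernmentCloud": "blob.core.usgovcloudapi.net",
        "AzureChinaCloud":        "blob.core.chinacloudapi.cn",
    }
    suffix, ok := suffixes[cloud]
    if !ok {
        // Azure Stack Hub endpoints are not fixed; they would need to come from
        // the stack's own metadata rather than a static table.
        suffix = "blob.core.windows.net"
    }
    return fmt.Sprintf("https://%s.%s/", account, suffix)
}

func main() {
    fmt.Println(blobEndpoint("imageregistryjima41xvvww", "AzureUSGovernmentCloud"))
    // https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/
}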

When baselineCapabilitySet is set to None, an SA named `deployer-controller` is still present in the cluster.

steps to Reproduce:

=================

1. Install 4.15 cluster with baselineCapabilitySet to None

2. Run command `oc get sa -A | grep deployer`

 

Actual Results:

================

[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m

Expected Results:

==================

No SA related to deployer should be returned

Description of problem:

The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1].
This task fails with:
- ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in OSP 16 env (RHEL 8.5) or
- openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in OSP 17 env (RHEL 9.2)

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-22-160236

How reproducible:

Always

Steps to Reproduce:

1. Set the os_subnet6 in the inventory file for setting dualstack
2. Run the 4.15 network.yaml playbook

Actual results:

Playbook fails:
TASK [Add IPv6 subnet to the external router] ********************************** fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}

Expected results:

Successful playbook execution

Additional info:

The router can be created in two different tasks, the playbook [2] worked for me.

[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml

Description of problem:

    If there is a TaskRun with the same name in 2 different namespaces, then the TaskRuns list page for all namespaces shows only one record because of the name collision

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Create TaskRun using https://gist.github.com/karthikjeeyar/eb1bbdf9157431f5c875eb55ce47580c in 2 different namespace
    2. Go to TaskRun list page
    3. Select All Projects
    

Actual results:

    Only one entry is shown 

Expected results:

    Both entries should be visible

Additional info:

    

Description of problem:

    Re-enable the e2e tests now that the Red Hat OpenShift Pipelines operator is available in OperatorHub.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2190

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Run OLM on 4.15 cluster
    2.
    3.
    

Actual results:

    OLM pod will panic

Expected results:

    Should run just fine

Additional info:

    This issue is caused by a failure to initialize a new map when it is nil
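
A minimal illustration of that failure mode in Go (the variable and key names here are made up): writing to a nil map panics, so the map has to be initialized before its first use.

package main

import "fmt"

func main() {
    var labels map[string]string // nil until initialized

    if labels == nil {
        labels = make(map[string]string) // the missing initialization
    }
    labels["managed-by"] = "olm" // without make(), this line panics: assignment to entry in nil map

    fmt.Println(labels)
}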

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/102

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/596

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/multus-cni/pull/212

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/88

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Unable to run oc commands in a FIPS-enabled OCP cluster on PowerVS

Version-Release number of selected component (if applicable):

4.15.0-ec2

How reproducible:

Deploy OCP cluster with FIPS enabled

Steps to Reproduce:

1. Enable the var in var.tfvars - fips_compliant      = true
2. Deploy the cluster
3. run oc commands

Actual results:

[root@rdr-swap-fips-syd05-bastion-0 ~]# oc version
FIPS mode is enabled, but the required OpenSSL library is not available

[root@rdr-swap-fips-syd05-bastion-0 ~]# oc debug node/syd05-master-0.rdr-swap-fips.ibm.com
FIPS mode is enabled, but the required OpenSSL library is not available

[root@rdr-swap-fips-syd05-bastion-0 ~]# fips-mode-setup --check
FIPS mode is enabled.

Expected results:

# oc debug node/syd05-master-0.rdr-swap-fips1.ibm.com
Temporary namespace openshift-debug-dns7d is created for debugging node...
Starting pod/syd05-master-0rdr-swap-fips1ibmcom-debug-hs4dr ...
To use host binaries, run `chroot /host`
Pod IP: 193.168.200.9

Additional info:

Not able to collect must gather logs due to the issue

links - https://access.redhat.com/solutions/7034387

Please review the following PR: https://github.com/openshift/installer/pull/7818

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When we run opm on RHEL 8, we get the following error:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)

Note: this happened with 4.15.0-ec.3.
I tried 4.14 and it works.
I also tried compiling it from the latest code, and that also works.

Version-Release number of selected component (if applicable):

    4.15.0-ec.3

How reproducible:

    always

Steps to Reproduce:

[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# ./opm version
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
[root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# opm version
Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"}
[root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/
[root@preserve-olm-env2 operator-framework-olm]# git branch
  gs
* master
  release-4.10
  release-4.11
  release-4.12
  release-4.13
  release-4.8
  release-4.9
[root@preserve-olm-env2 operator-framework-olm]# git pull origin master
remote: Enumerating objects: 1650, done.
remote: Counting objects: 100% (1650/1650), done.
remote: Compressing objects: 100% (831/831), done.
remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0
Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done.
Resolving deltas: 100% (727/727), completed with 468 local objects.
From github.com:openshift/operator-framework-olm
 * branch master -> FETCH_HEAD
   639fc1203..85c579f9b master -> origin/master
Updating 639fc1203..85c579f9b
Fast-forward
 go.mod | 120 +-
 go.sum | 240 ++--
 manifests/0000_50_olm_00-pprof-secret.yaml
...
 create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go
[root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm
[root@preserve-olm-env2 operator-framework-olm]# make build/opm
make bin/opm
make[1]: Entering directory '/data/kuiwang/operator-framework-olm'
go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm
make[1]: Leaving directory '/data/kuiwang/operator-framework-olm'
[root@preserve-olm-env2 operator-framework-olm]# which opm
/data/kuiwang/operator-framework-olm/bin/opm
[root@preserve-olm-env2 operator-framework-olm]# opm version
Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The E2E test is failing

1. Clone oc-mirror repository
git clone https://github.com/openshift/oc-mirror.git && cd oc-mirror

2. Find the oc-mirror image in the release: https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.14.0-rc.2/ppc64le/release.txt
oc-mirror quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531

3. Pull container
podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531

4. Extract binary
mkdir bin
container_id=$(podman create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fff150b00081ed565169de24cfc82481c5017de73986552d15d129530b62e531)
podman cp ${container_id}:usr/bin/oc-mirror bin/oc-mirror

5. Confirm the file
[root@rdr-ani-014-bastion-0 oc-mirror]# file bin/oc-mirror
bin/oc-mirror: ELF 64-bit LSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV), dynamically linked, interpreter /lib64/ld64.so.2, for GNU/Linux 3.10.0, Go BuildID=HuBgap--bII0r0Nw0GxI/SOZCyTWk4pH5ciuQtUO8/ib6uaSW-eAJl24Zzk-G2/O4yxlKreHK_BaH9F4RU6, BuildID[sha1]=c018e70301e18c23f2c119ba451a32aff980d618, with debug_info, not stripped, too many notes (256)
   

6. Build go-toolset and run e2e test
[root@rdr-ani-014-bastion-0 oc-mirror]# podman build -f Dockerfile -t local/go-toolset:latest
Successfully tagged localhost/local/go-toolset:latest
bf24f160059d7ae2ef99a77e6680cdac30e3ba942911b88c7e60dca88fd768f7

[root@rdr-ani-014-bastion-0 oc-mirror]# podman run -it -v $(pwd):/build:z --entrypoint /bin/bash local/go-toolset:latest ./test/e2e/e2e-simple.sh bin/oc-mirror | tee oc-mirror-e2e.log  /build/test/e2e/operator-test.28124 /build
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 49.0M  100 49.0M    0     0  60.2M      0 --:--:-- --:--:-- --:--:--  106M
go: downloading github.com/google/go-containerregistry v0.16.1
go: downloading github.com/docker/cli v24.0.0+incompatible
go: downloading github.com/opencontainers/image-spec v1.1.0-rc3
go: downloading github.com/spf13/cobra v1.7.0
go: downloading github.com/mitchellh/go-homedir v1.1.0
go: downloading golang.org/x/sync v0.2.0
go: downloading github.com/opencontainers/go-digest v1.0.0
go: downloading github.com/docker/distribution v2.8.2+incompatible
go: downloading github.com/google/go-cmp v0.5.9
go: downloading github.com/containerd/stargz-snapshotter/estargz v0.14.3
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/klauspost/compress v1.16.5
go: downloading github.com/vbatts/tar-split v0.11.3
go: downloading github.com/pkg/errors v0.9.1
go: downloading github.com/docker/docker v24.0.0+incompatible
go: downloading golang.org/x/sys v0.8.0
go: downloading github.com/docker/docker-credential-helpers v0.7.0
go: downloading github.com/sirupsen/logrus v1.9.1
bin/registry
/build
INFO: Running 22 test cases
INFO: Running full_catalog
.
.
.
sha256:17de509b5c9e370d501951850ba07f6cbefa529f598f3011766767d1181726b3 localhost.localdomain:5001/skhoury/oc-mirror-dev:4138bec2
info: Mirroring completed in 40ms (119.4kB/s)
worker 0 stopping
worker 1 stopping
worker 5 stopping
worker 3 stopping
worker 2 stopping
worker 3 stopping
worker 2 stopping
worker 4 stopping
work queue exiting
No images specified for pruning
Unpack release signatures
worker 1 stopping
work queue exiting
worker 0 stopping
Wrote release signatures to oc-mirror-workspace/results-1695964813
rebuilding catalog images
Rendering catalog image "localhost.localdomain:5001/skhoury/oc-mirror-dev:test-catalog-latest" with file-based catalog
error: error rebuilding catalog images from file-based catalogs: error regenerating the cache for localhost.localdomain:5001/skhoury/oc-mirror-dev:test-catalog-latest: fork/exec oc-mirror-workspace/images.1753960055/catalogs/localhost.localdomain:5000/skhoury/oc-mirror-dev/test-catalog-latest/bin/opm: exec format error

Version-Release number of selected component (if applicable):

4.14.0-rc.2

How reproducible:

Always

Steps to Reproduce

 Same as details provided in description

Actual results:

E2E test is getting terminated in between the execution

Expected results:

E2E testing should pass with no errors

Additional info:

E2E logs are provided here:
oc-mirror-e2e.log - https://github.ibm.com/redstack-power/project-mgmt/issues/3284#issuecomment-63722862
re-oc-mirror-e2e.log - https://github.ibm.com/redstack-power/project-mgmt/issues/3284#issuecomment-63806863  

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/223

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/183

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/774

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  • The hosted cluster credentialsMode is not Manual, so secrets cannot be created.
  • Currently the control plane credentialsMode is the same as the management cluster's, but for this feature it should be Manual on the hosted cluster no matter what the credentialsMode of the management cluster is (see the check sketched below).
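
For reference, the mode reported inside the hosted cluster can be checked like this (a hedged sketch; the resource name follows the cloud-credential-operator convention, and "Manual" is the value this feature expects):

$ oc get cloudcredential cluster -o jsonpath='{.spec.credentialsMode}'
Manual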

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

   
 1.Creates CredentialsRequest including the spec.providerSpec.stsIAMRoleARN string. 
   
 2.Cloud Credential Operator could not populate Secret based on CredentialsRequest.   

$ oc get secret -A | grep test-mihuang
#Secret not found.  

$ oc get CredentialsRequest -n openshift-cloud-credential-operator
NAME                                                  AGE
...
test-mihuang                                               44s
    3.
    

Actual results:

    Secret is not created successfully.

Expected results:

    Successfully created the secret on the hosted cluster.

Additional info:

    

Description of problem:

the checkbox should be displayed on a single row
eg: for 'Deny all ingress traffic' & 'Deny all egress traffic' in Create NetworkPolicy page
for 'Secure Route' in Create route page

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-11-14-082209

How reproducible:

Always

Steps to Reproduce:

1. Go to Networking -> NetworkPolicies page, click the 'Create NetworkPolicy' button
2. Check the Policy type section, check if the checkbox of 'Deny all ingress traffic' & "Deny all egress traffic" is displayed in a single row
3. Check the same things in 'Create route' page,

Actual results:

not in a single row

Expected results:

in a single row 

Additional info:

https://drive.google.com/file/d/1xgEe-CuuRYrY9tBFmIa-7o5Rcn7iCr1e/view?usp=drive_link

Description of problem:

click on any node status popover, the popover dialog will always move to the first line 

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-14-100410

How reproducible:

Always    

Steps to Reproduce:

1. goes to Nodes list page, mark any worker as unschedulable
2. click on the 'Ready/Scheduling disabled' text, a popover dialog is opened 
3.
    

Actual results:

2. the popover dialog always jumps to the first row, so it looks like the popover dialog is showing the wrong status/info, and it is difficult for the user to perform node actions

Expected results:

2. popover dialog should be shown exactly next to the correct node     

Additional info:

    

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/265

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/must-gather/pull/395

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/32

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During live OVN migration, network operator show the error message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.   
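
For reference, the undo suggested in that message can also be done non-interactively with a patch along these lines (a hedged sketch; the field path matches the Default Network / Type shown in the operator output below):

$ oc patch network.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"type":"OpenShiftSDN"}}}'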

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

1. Create 4.15 nightly SDN ROSA cluster
2. oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
3. oc edit featuregate cluster to enable featuregates 
4. Wait for all node rebooting and back to normal
5. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

Actual results:

[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
[weliang@weliang ~]$ oc edit featuregate cluster
[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
network.config.openshift.io/cluster patched
[weliang@weliang ~]$ oc get co network
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.15.0-0.nightly-2023-12-18-220750   True        False         True       105m    Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
[weliang@weliang ~]$ oc describe Network.config.openshift.io cluster
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  network.openshift.io/network-type-migration: 
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:13:39Z
  Generation:          3
  Resource Version:    119899
  UID:                 6a621b88-ac4f-4918-a7f6-98dba7df222c
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:  OVNKubernetes
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:               10.128.0.0/14
    Host Prefix:        23
  Cluster Network MTU:  8951
  Network Type:         OpenShiftSDN
  Service Network:
    172.30.0.0/16
Events:  <none>
[weliang@weliang ~]$ oc describe Network.operator.openshift.io cluster
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  operator.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-20T15:15:37Z
  Generation:          275
  Resource Version:    120026
  UID:                 278bd491-ac88-4038-887f-d1defc450740
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Default Network:
    Openshift SDN Config:
      Enable Unidling:          true
      Mode:                     NetworkPolicy
      Mtu:                      8951
      Vxlan Port:               4789
    Type:                       OVNKubernetes
  Deploy Kube Proxy:            false
  Disable Multi Network:        false
  Disable Network Diagnostics:  false
  Kube Proxy Config:
    Bind Address:      0.0.0.0
  Log Level:           Normal
  Management State:    Managed
  Observed Config:     <nil>
  Operator Log Level:  Normal
  Service Network:
    172.30.0.0/16
  Unsupported Config Overrides:  <nil>
  Use Multi Network Policy:      false
Status:
  Conditions:
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2023-12-20T16:58:58Z
    Message:               Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
    Reason:                InvalidOperatorConfig
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-12-20T15:15:37Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2023-12-20T16:52:11Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-12-20T15:15:45Z
    Status:                True
    Type:                  Available
  Ready Replicas:          0
  Version:                 4.15.0-0.nightly-2023-12-18-220750
Events:                    <none>
[weliang@weliang ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-18-220750   True        False         84m     Error while reconciling 4.15.0-0.nightly-2023-12-18-220750: the cluster operator network is degraded
[weliang@weliang ~]$ 

    

Expected results:

Migration success

Additional info:

Get same error message from ROSA and GCP cluster. 

Description of problem:

'kubeadmin' user unable to logout when logged with 'kube:admin' IDP, clicking on 'Log out' does nothing

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-04-06-020637    

How reproducible:

Always    

Steps to Reproduce:

1. Login to console with 'kube:admin' IDP, type username 'kubeadmin' and its password
2. Try to Log out from console
    

Actual results:

2. unable to log out successfully    

Expected results:

2. any user should be able to log out successfully    

Additional info:

    

Description of the problem:

BE master ~2.30 - in the feature-support API, VIP_AUTO_ALLOC is reported as dev_preview for 4.15 - it should be unavailable

How reproducible:

100%

Steps to reproduce:

1. GET https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64

2. BE response support level 

3.

Actual results:

VIP_AUTO_ALLOC is dev_preview for 4.15

Expected results:
VIP_AUTO_ALLOC should be unavailable
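
A quick way to confirm the support level the BE reports (a hedged sketch; the exact shape of the JSON response is assumed):

$ curl -s "https://<SERVICE_ADDRESS>/api/assisted-install/v2/support-levels/features?openshift_version=4.15&cpu_architecture=x86_64" | jq '.features.VIP_AUTO_ALLOC'
"dev-preview"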

Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/41

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/egress-router-cni/pull/80

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Currently, the openshift-enterprise-tests image depends on the openstack repository for the x86_64 and ppc64le architectures. The python-cinder package gets installed to allow the openstack end-to-end tests (https://github.com/openshift/release/blob/60fed3474509bff9c5585a736554739e8ec4f017/ci-operator/step-registry/openstack/test/e2e/openstack-test-e2e-chain.yaml#L5) to run (https://github.com/openshift/openstack-test/). 

The python-cinder package is not made available for rhel9 on ppc64le. To move the tests image to rhel9, OCP probably should follow upstream's decision to not support ppc64le. 

Description of problem:

For years, the TechPreviewNoUpgrade alert has used:

cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0

But recently testing 4.12.19, I saw the alert pending with name="LatencySensitive". Alerting on that not-useful-since-4.6 feature set is fine (OCPBUGS-14497), but TechPreviewNoUpgrade isn't a great name when the actual feature set is LatencySensitive. And the summary and description don't apply to LatencySensitive either.

Version-Release number of selected component (if applicable):

The buggy expr / alertname pair shipped in 4.3.0.

How reproducible:

All the time.

Steps to Reproduce:

1. Install a cluster like 4.12.19.
2. Set the LatencySensitive feature set:

$ oc patch featuregate cluster --type=json --patch='[{"op":"add","path":"/spec/featureSet","value":"LatencySensitive"}]'

3. Check alerts with /monitoring/alerts?rowFilter-alert-source=platform&resource-list-text=TechPreviewNoUpgrade in the web console.

Actual results:

TechPreviewNoUpgrade is pending or firing.

Expected results:

Something appropriate to LatencySensitive, like a generic alert that covers all non-default feature sets, is pending or firing.
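
A generic rule along those lines might look like this (a hedged sketch; the alert name and wording are hypothetical, and the expression simply reuses the metric quoted above):

- alert: ClusterNonDefaultFeatureSet     # hypothetical name
  expr: cluster_feature_set{name!="", namespace="openshift-kube-apiserver-operator"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: The non-default feature set "{{ $labels.name }}" is enabled on the cluster.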

Description of problem:

A capi machine cannot be deleted by the installer during cluster destroy. Checking the GCP console, the machine lacks the label (kubernetes-io-cluster-clusterid: owned); if this label is added manually on the GCP console for the machine, then the machine can be deleted by the installer during cluster destroy.
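
For reference, the manual fix described above corresponds to a CLI call like the following (a hedged sketch; the zone and instance name are taken from the output below, the infra ID is a placeholder):

$ gcloud compute instances add-labels capi-gcp-machine-template-gcw9t \
    --zone=us-central1-a \
    --labels=kubernetes-io-cluster-<infra-id>=owned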

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

Always

Steps to Reproduce:

1.Follow the steps here https://bugzilla.redhat.com/show_bug.cgi?id=2107999#c9 to create a capi machine

liuhuali@Lius-MacBook-Pro huali-test % oc get machine.cluster.x-k8s.io      -n openshift-cluster-api
NAME             CLUSTER            NODENAME   PROVIDERID                                                         PHASE         AGE   VERSION
capi-ms-mtchm    huliu-gcpx-c55vm              gce://openshift-qe/us-central1-a/capi-gcp-machine-template-gcw9t   Provisioned   51m   

2.Destroy the cluster
The cluster destroyed successfully, but checked on GCP console, found the capi machine is still there.

labels of capi machine

labels of mapi machine

Actual results:

capi machine cannot be deleted by installer during cluster destroy

Expected results:

capi machine should be deleted by installer during cluster destroy

Additional info:

Also checked on aws, the case worked well, and found there is tag(kubernetes.io/cluster/clusterid:owned) for capi machines.

Description of problem:

A node fails to join the cluster as its CSR contains an incorrect hostname
oc describe csr csr-7hftm
Name:               csr-7hftm
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Tue, 24 Oct 2023 10:22:39 -0400
Requesting User:    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
Signer:             kubernetes.io/kube-apiserver-client-kubelet
Status:             Pending
Subject:
         Common Name:    system:node:openshift-worker-1
         Serial Number:
         Organization:   system:nodes
Events:  <none>
oc get csr csr-7hftm -o yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2023-10-24T14:22:39Z"
  generateName: csr-
  name: csr-7hftm
  resourceVersion: "96957"
  uid: 84b94213-0c0c-40e4-8f90-d6612fbdab58
spec:
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:openshift-machine-config-operator
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlIN01JR2lBZ0VBTUVBeEZUQVRCZ05WQkFvVERITjVjM1JsYlRwdWIyUmxjekVuTUNVR0ExVUVBeE1lYzNsegpkR1Z0T201dlpHVTZiM0JsYm5Ob2FXWjBMWGR2Y210bGNpMHhNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBECkFRY0RRZ0FFMjRabE1JWGE1RXRKSGgwdWg2b3RVYTc3T091MC9qN0xuSnFqNDJKY0dkU01YeTJVb3pIRTFycmYKOTFPZ3pOSzZ5Z1R0Qm16NkFOdldEQTZ0dUszMlY2QUFNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRFhHMlFVWQoxMnVlWXhxSTV3blArRFBQaE5oaXhiemJvaTBpQzhHci9kMXRBaUVBdEFDcVVwRHFLYlFUNWVFZXlLOGJPN0dlCjhqVEI1UHN1SVpZM1pLU1R2WG89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  signerName: kubernetes.io/kube-apiserver-client-kubelet
  uid: c3adb2e0-6d60-4f56-a08d-6b01d3d3c065
  usages:
  - digital signature
  - client auth
  username: system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
status: {}
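
The hostname embedded in the request can be inspected directly with standard tooling (a hedged sketch; it should print the CN system:node:openshift-worker-1 reported above):

$ oc get csr csr-7hftm -o jsonpath='{.spec.request}' | base64 -d | openssl req -noout -subject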

Version-Release number of selected component (if applicable):

4.14.0-rc.6

How reproducible:

So far only on one setup

Steps to Reproduce:

1. Deploy dualstack baremetal cluster with day1 networking with static DHCP hostnames
2.
3.

Actual results:

A node fails to join the cluster

Expected results:

All nodes join the cluster

Description of problem:

In LGW (local gateway) mode, when a pod is selected by an EgressIP that is hosted by an interface other than the default interface, connections to the node IP fail

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Agent CI jobs started to fail on a pretty regular basis, especially the compact ones.
Jobs time out due to either the console or authentication operators remaining in a degraded state.
From the log analysis, they are not able to get a route. Both the apiserver and etcd component logs report connection-refused messages, possibly indicating an underlying network problem.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Pretty frequently

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The image quay.io/centos7/httpd-24-centos7 used in TestMTLSWithCRLs and TestCRLUpdate is no longer being rebuilt, and has had its 'latest' tag removed. Containers using this image fail to start, and cause the tests to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    Run 'TEST="(TestMTLSWithCRLs|TestCRLUpdate)" make test-e2e' from the cluster-ingress-operator repo

Actual results:

    Both tests and all their subtests fail

Expected results:

    Tests pass

Additional info:

    

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

Per the latest decision, RH is not going to support installing an OCP cluster on Nutanix with nested virtualization. Thus the checkbox "Install OpenShift Virtualization" on the "Operators" page should be disabled when the platform "Nutanix" is selected on the "Cluster Details" page.

Slack discussion thread

https://redhat-internal.slack.com/archives/C0211848DBN/p1706640683120159

Nutanix
https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000XeiHCAS

 

Description of problem:

While testing oc adm upgrade status against b02, I noticed some COs do not have any annotations, while I expected them to have the include/exclude.release.openshift.io/* ones (to recognize COs that come from the payload).

$ b02 get clusteroperator etcd -o jsonpath={.metadata.annotations}
$ ota-stage get clusteroperator etcd -o jsonpath={.metadata.annotations}
{"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}

CVO does not reconcile CO resources: it only precreates them and does not touch them once they exist. Build02 does not have COs with reconciled metadata because it was born as 4.2, which (AFAIK) is before OCP started to use the exclude/include annotations.
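
For reference, the missing annotations from the ota-stage output above could be re-applied manually with something like the following (a hedged sketch of a workaround, not the proposed fix):

$ oc annotate clusteroperator etcd \
    exclude.release.openshift.io/internal-openshift-hosted=true \
    include.release.openshift.io/self-managed-high-availability=true \
    include.release.openshift.io/single-node-developer=true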

Version-Release number of selected component (if applicable):

4.16 (development branch)

How reproducible:

deterministic

Steps to Reproduce:

1. delete an annotation on a ClusterOperator resource

Actual results:

The annotation won't be recreated

Expected results:

The annotation should be recreated

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/thanos/pull/142

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Whenever I click on a card in the Operator Hub or Developer Hub, the console window refreshes.
    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Go to operator hub or developer hub
    2.  Select any card
    

Actual results:

Window refreshes
    

Expected results:

The window should not refresh and show the side panel for the card
    

Additional info:

    

Description of problem:

    The following test "[sig-apps][Feature:DeploymentConfig] deploymentconfigs when tagging images should successfully tag the deployed image [apigroup:apps.openshift.io][apigroup:authorization.openshift.io][apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" fails. One pod is stuck in Pending because it requests 3G memory that the node doesn't have (There are 2 pods of 3G)

Version-Release number of selected component (if applicable):

    4.14.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

It seems something might be wrong with the logic for the new defaultChannel property. After initially syncing an operator to a tarball, subsequent runs complain the catalog is invalid, as if defaultChannel was never set.

Version-Release number of selected component (if applicable):

I tried oc-mirror v4.14.16 and v4.15.2

How reproducible:

100%

Steps to Reproduce:

1. Write this yaml config to an isc.yaml file in an empty dir. (It is worth noting that right now the default channel for this operator is of course something else – currently `latest`.)

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: ./operator-images
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: openshift-pipelines-operator-rh
          defaultChannel: pipelines-1.11
          channels:
            - name: pipelines-1.11
              minVersion: 1.11.3
              maxVersion: 1.11.3

2. Using oc-mirror v4.14.16 or v4.15.2, run:

oc-mirror -c ./isc.yaml file://operator-images

3. Without the defaultChannel property and a recent version of oc-mirror, that would have failed. Assuming it succeeds, run the same command a second time (with or without the --dry-run option) and note that it now fails. It seems nothing can be done. oc-mirror says the catalog is invalid.
 

Actual results:

$ oc-mirror -c ./isc.yaml file://operator-images
Creating directory: operator-images/oc-mirror-workspace/src/publish
Creating directory: operator-images/oc-mirror-workspace/src/v2
Creating directory: operator-images/oc-mirror-workspace/src/charts
Creating directory: operator-images/oc-mirror-workspace/src/release-signatures
No metadata detected, creating new workspace
wrote mirroring manifests to operator-images/oc-mirror-workspace/operators.1711523827/manifests-redhat-operator-index
To upload local images to a registry, run:

        oc adm catalog mirror file://redhat/redhat-operator-index:v4.14 REGISTRY/REPOSITORY
<dir>
  openshift-pipelines/pipelines-chains-controller-rhel8
    blobs:
      registry.redhat.io/openshift-pipelines/pipelines-chains-controller-rhel8 sha256:b06cce9e748bd5e1687a8d2fb11e5e01dd8b901eeeaa1bece327305ccbd62907 11.51KiB
      registry.redhat.io/openshift-pipelines/pipelines-chains-controller-rhel8 sha256:e5897b8264878f1f63f6eceed870b939ff39993b05240ce8292f489e68c9bd19 11.52KiB
...
  stats: shared=12 unique=274 size=24.71GiB ratio=0.98
info: Mirroring completed in 9m45.86s (45.28MB/s)
Creating archive operator-images/mirror_seq1_000000.tar


$ oc-mirror -c ./isc.yaml file://operator-images
Found: operator-images/oc-mirror-workspace/src/publish
Found: operator-images/oc-mirror-workspace/src/v2
Found: operator-images/oc-mirror-workspace/src/charts
Found: operator-images/oc-mirror-workspace/src/release-signatures
The current default channel was not valid, so an attempt was made to automatically assign a new default channel, which has failed.
The failure occurred because none of the remaining channels contain an "olm.channel" priority property, so it was not possible to establish a channel to use as the default channel.

This can be resolved by one of the following changes:
1) assign an "olm.channel" property on the appropriate channels to establish a channel priority
2) modify the default channel manually in the catalog
3) by changing the ImageSetConfiguration to filter channels or packages in such a way that it will include a package version that exists in the current default channel

The rendered catalog is invalid.

Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information.

error: error generating diff: the current default channel "latest" for package "openshift-pipelines-operator-rh" could not be determined... ensure that your ImageSetConfiguration filtering criteria results in a package version that exists in the current default channel or use the 'defaultChannel' field

Expected results:

It should NOT throw that error and instead should either update (if you've added more to the imagesetconfig) or gracefully print the "No new images" message.

 

Description of problem:

Results of the -hypershift-aws-e2e-external CI jobs do not contain an obvious reason why a test failed. For example, this TestCreateCluster is listed as failed, but all failures in TestCreateCluster look like errors from dumping the cluster after the failure.

It should show that "storage operator did not become Available=true", or even say that "pod cluster-storage-operator-6f6d69bf89-fx2d2 in the hosted control plane XYZ is in CrashLoopBackOff".

The PR under test had a simple typo leading to a crashloop, and it should be more obvious what went wrong.

Version-Release number of selected component (if applicable):

4.15.0-0.ci.test-2023-10-03-040803

 

Description of problem:

Recently, the passing rate for test "static pods should start after being created" has dropped significantly for some platforms: 

https://sippy.dptools.openshift.org/sippy-ng/tests/4.15/analysis?test=%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-node%5D%20static%20pods%20should%20start%20after%20being%20created%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

Take a look at this example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072

The test failed with the following message:
{  static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ci-op-2z99zzqd-7f99c-rfp4q-master-0" didn't show up, waited: 3m0s}

Seemingly revision 6 was never reached. But if we look at the log from kube-controller-manager-operator, it jumps from revision 5 to revision 7: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-techpreview/1712803313642115072/artifacts/e2e-azure-sdn-techpreview/gather-extra/artifacts/pods/openshift-kube-controller-manager-operator_kube-controller-manager-operator-7cd978d745-bcvkm_kube-controller-manager-operator.log

The log also indicates that there is a possibility of race:

W1013 12:59:17.775274       1 staticpod.go:38] revision 7 is unexpectedly already the latest available revision. This is a possible race!

This might be a static pod controller issue, but I am starting with the kube-controller-manager component for this case. Feel free to reassign. 

Here is a slack thread related to this:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1697472297510279

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The CNO-managed component (network-node-identity) should conform to the HyperShift control plane expectation that all secrets are mounted without global read access: change the secret mount mode from 420 (0644) to 416 (0640).
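
A minimal sketch of what that change looks like in a pod spec volume (the volume and secret names are hypothetical; the point is only the defaultMode value):

volumes:
- name: webhook-cert                          # hypothetical volume name
  secret:
    secretName: network-node-identity-cert    # hypothetical secret name
    defaultMode: 416                          # 0640 instead of the default 420 (0644)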

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/56

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/435

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When running an Azure install, the installer noticeably hangs for a long time when running create manifests or create cluster. It will sit unresponsive for almost 2 minutes at:

DEBUG OpenShift Installer unreleased-master-9741-gbc9836aa9bd3a4f10d229bb6f87981dddf2adc92 
DEBUG Built from commit bc9836aa9bd3a4f10d229bb6f87981dddf2adc92 
DEBUG Fetching Metadata...                         
DEBUG Loading Metadata...                          
DEBUG   Loading Cluster ID...                      
DEBUG     Loading Install Config...                
DEBUG       Loading SSH Key...                     
DEBUG       Loading Base Domain...                 
DEBUG         Loading Platform...                  
DEBUG       Loading Cluster Name...                
DEBUG         Loading Base Domain...               
DEBUG         Loading Platform...                  
DEBUG       Loading Pull Secret...                 
DEBUG       Loading Platform...                    
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 

This could also be related to failures we see in CI such as this:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8123/pull-ci-openshift-installer-master-e2e-azure-ovn/1773611162923962368

 level=info msg=Consuming Worker Machines from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=fatal msg=failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": error connecting to Azure client: failed to list SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'read tcp 10.128.117.2:43870->4.150.240.10:443: read: connection reset by peer' 

If the call takes too long and the context timeout is canceled, we might potentially see this error.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Run azure install
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://github.com/openshift/installer/pull/8134
has a partial fix    

The agent-based installer and assisted-installer create a Deployment named assisted-installer-controller in the assisted-installer namespace. This deployment is responsible for running the assisted-installer-controller to finalise the installation, mainly by updating the status of the Nodes in the assisted-service API. It's also required to be able to install platform:vsphere without credentials in 4.13 and above.

We want the logs for this pod to be included in the must-gather file, so that we can easily debug any installation issues caused by this process. Currently it is not.
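
Until that is added, the same logs can be pulled manually with something like the following (a hedged sketch using the namespace and Deployment names given above):

$ oc logs -n assisted-installer deployment/assisted-installer-controller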

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Not able to reproduce it manually, but it frequently happens when running automated scripts.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-10-05-195247

How reproducible:


Steps to Reproduce:

1. Label the worker-0 node as an egress node and create an egressIP object; the egressIP was assigned to the worker-0 node successfully on the secondary NIC

2. Block port 9107 on the worker-0 node and label worker-1 as an egress node

3.

Actual results:

EgressIP was not moved to second node
 % oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196


40m         Warning   EgressIPConflict          egressip/egressip-66330       Egress IP egressip-66330 with IP 172.22.0.196 is conflicting with a host (worker-0) IP address and will not be assigned
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:1c:cf:40:5d:25 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.109/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 76sec preferred_lft 76sec
    inet6 fe80::21c:cfff:fe40:5d25/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

Expected results:

EgressIP should move to second egress node

Additional info:

Workaround: deleted it and recreated it works
% oc get egressip
NAME             EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-66330   172.22.0.196                   
% oc delete egressip --all
egressip.k8s.ovn.org "egressip-66330" deleted
 % oc create -f ../data/egressip/config1.yaml 
egressip.k8s.ovn.org/egressip-3 created
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-3   172.22.0.196   worker-1        172.22.0.196

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Icons which were formally blue are no longer blue.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

 

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

 After setting an invalid release image on a HostedCluster, it is not possible to fix it by editing the HostedCluster and setting a valid release image.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Create a HostedCluster with an invalid release image
    2. Edit HostedCluster and specify a valid release image
    3.
    

Actual results:

    HostedCluster does not start using the new valid release image

Expected results:

    HostedCluster starts using the valid release image.

Additional info:

    

Description of problem:
Given this nmstate inside the agent-config

        - name: bond0.10
          type: vlan
          state: up
          vlan:
            base-iface: bond0
            id: 10
          ipv4:
            address:
              - ip: 10.10.10.116
                prefix-length: 24
            dhcp: false
            enabled: true
          ipv6:
            enabled: true
            autoconf: true
            dhcp: true
            auto-dns: false
            auto-gateway: true
            auto-routes: true

The installation fails due to the assisted-service validation

    "message": "No connectivity to the majority of hosts in the cluster"

The L2 connectivity check appears to be missing for the IPv6 part (??)
Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:


TuneD unnecessarily restarts twice when the current TuneD profile changes and a new TuneD profile is selected at the same time.

    

Version-Release number of selected component (if applicable):


All NTO versions are affected.

    

How reproducible:


Depends on the order of k8s object updates (races), but nearly 100% reproducible.

    

Steps to Reproduce:

    1. Install SNO 
    2. Label your SNO node with label "profile"
    3. Create the following CR:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile 1
      include=openshift-node
      [sysctl]
      kernel.pty.max=4096
    name: openshift-profile-1
  - data: |
      [main]
      summary=Custom OpenShift profile 2
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-2
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-1

    4. Apply the following CR:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile 1
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-1
  - data: |
      [main]
      summary=Custom OpenShift profile 2
      include=openshift-node
      [sysctl]
      kernel.pty.max=8192
    name: openshift-profile-2
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile-2

    

Actual results:


You'll see two restarts/applications of the openshift-profile-1

$ cat tuned-operand.log |grep "profile-1' applied"
2024-04-19 06:10:54,685 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied
2024-04-19 06:13:23,627 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied


    

Expected results:


Only 1 application of openshift-profile-1:

$ cat tuned-operand.log |grep "profile-1' applied"
2024-04-19 07:20:31,600 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-profile-1' applied


    

Additional info:


    

Description of problem
Listing DeploymentConfigs triggers a warning notification which is not required for the Display Warning Policy feature. This warning response is set in the cluster by default. See the warning response below:

299 - "apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+"

Version-Release number of selected component (if applicable):
4.16

How reproducible:

Steps to Reproduce:
    1. Click on Deployment Config sub nav
    2. The Admission Webhook notification is displayed
    3.

Additional info:
I think this is fine since the CLI behaves like that too. Will discuss this behavior in the next stand-up. 

Actual results:


Expected results:


Additional info:

    

Description of problem:

In a 4.16.0-ec.1 cluster, scaling up a MachineSet with publicIP:true fails with:

$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort  | uniq -c
      1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden

Version-Release number of selected component

Seen in 4.16.0-ec.1. Not noticed in 4.15.0-ec.3.  Fix likely needs a backport to 4.15 to catch up with OCPBUGS-26406.

How reproducible

Seen in the wild in a cluster after updating from 4.15.0-ec.3 to 4.16.0-ec.1. Reproduced in Cluster Bot on the first attempt, so likely very reproducible.

Steps to Reproduce

launch 4.16.0-ec.1 gcp Cluster Bot cluster (logs).

$ oc adm upgrade
Cluster version is 4.16.0-ec.1

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.16 (available channels: candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-q4d8y8t-72292-msmgw-worker-a   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-b   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-c   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-f   0         0                             60m
$ oc -n openshift-machine-api get -o json machinesets | jq -c '.items[].spec.template.spec.providerSpec.value.networkInterfaces' | sort | uniq -c
      4 [{"network":"ci-ln-q4d8y8t-72292-msmgw-network","subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api edit machineset ci-ln-q4d8y8t-72292-msmgw-worker-f  # add publicIP
$ oc -n openshift-machine-api get -o json machineset ci-ln-q4d8y8t-72292-msmgw-worker-f | jq -c '.spec.template.spec.providerSpec.value.networkInterfaces'
[{"network":"ci-ln-q4d8y8t-72292-msmgw-network","publicIP":true,"subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api scale --replicas 1 machineset ci-ln-q4d8y8t-72292-msmgw-worker-f
$ sleep 300
$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort  | uniq -c

Actual results

      1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden

Expected results

Successfully created machines.

Additional info

I would expect the CredentialsRequest to ask for this permission, but it doesn't seem to. The old roles/compute.admin includes it, and it probably just needs to be added explicitly. Not clear how many other permissions might also need explicit listing.
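
A hedged sketch of what listing the permission explicitly in the machine-api GCP CredentialsRequest could look like (the metadata names are approximate and only the permissions list is the point):

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-machine-api-gcp
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    permissions:
    - compute.subnetworks.useExternalIp   # the permission the 403 above asks for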

Description of problem:

    The test implementation in https://github.com/openshift/origin/commit/5487414d8f5652c301a00617ee18e5ca8f339cb4#L56 assumes there is just one kubelet service, or at least that it is always the first one in the MCP. This just changed in https://github.com/openshift/machine-config-operator/pull/4124, so the test is failing.

Version-Release number of selected component (if applicable):

master branch of 4.16    

How reproducible:

always during test  

Steps to Reproduce:

    1. Test with https://github.com/openshift/machine-config-operator/pull/4124 applied

Actual results:

Test detects a wrong service and fails    

Expected results:

Test finds the proper kubelet.service and passes

Additional info:

    

The OKD build-image job in ironic-agent-image is failing with the error message:

Complete!
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    14  100    14    0     0     73      0 --:--:-- --:--:-- --:--:--    73
  File "<stdin>", line 1
    404: Not Found
    ^
SyntaxError: illegal target for annotation
INFO[2024-02-29T08:06:27Z] Ran for 4m3s                                 
ERRO[2024-02-29T08:06:27Z] Some steps failed:                           
ERRO[2024-02-29T08:06:27Z] 
  * could not run steps: step ironic-agent failed: error occurred handling build ironic-agent-amd64: the build ironic-agent-amd64 failed after 1m57s with reason DockerBuildFailed: Dockerfile build strategy has failed. 
INFO[2024-02-29T08:06:27Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image' 

Description of problem:

    'Oh no! Something went wrong' will be shown when the user goes to MultiClusterEngine details -> YAML tab

Version-Release number of selected component (if applicable):

    4.14.0-0.nightly-2023-07-20-215234

How reproducible:

    Always

Steps to Reproduce:

1. Install 'multicluster engine for Kubernetes' operator in the cluster
2. Use the default value to create a new MultiClusterEngine
3. Navigate to the MultiClusterEngine details -> Yaml Tab 

   

Actual results: 
‘Oh no! Something went wrong.’ error will be shown with the below details:
TypeError
Description: Cannot read properties of null (reading 'editor')

Expected results:

    no error 

Additional info:

    This bug fix is in conjunction with https://issues.redhat.com/browse/OCPBUGS-22778

Please review the following PR: https://github.com/openshift/route-controller-manager/pull/36

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/multus-cni/pull/202

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Ran into a problem with our testing this morning on a newly created ROKS cluster:
```
Error running /usr/bin/oc --namespace=e2e-test-oc-service-p4fz2 --kubeconfig=/tmp/configfile2694323048 create service nodeport mynodeport --tcp=8080:7777 --node-port=30000:
StdOut> error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
StdErr> error: failed to create NodePort service: Service "mynodeport" is invalid: spec.ports[0].nodePort: Invalid value: 30000: provided port is already allocated
exit status 1
```

The port was already in use by a different service. We would like to request that the test pick the node port dynamically, so that if the hard-coded port is already taken it can choose an available one.
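One way to implement that (a sketch of the requested behaviour; the NS variable is a placeholder) is to omit --node-port so the API server allocates a free port from the NodePort range, then read the assigned port back:

# let the API server pick a free node port instead of hard-coding 30000
$ oc --namespace="${NS}" create service nodeport mynodeport --tcp=8080:7777
# read back the port that was actually allocated
$ oc --namespace="${NS}" get service mynodeport -o jsonpath='{.spec.ports[0].nodePort}'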

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/250

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/122

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    For high scalability, we need an option to disable unused machine management control plane components.

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Create HostedCluster/HostedControlPlane
    2. 
    3.
    

Actual results:

    Machine management components (cluster-api, machine-approver, auto-scaler, etc) are deployed

Expected results:

    There should be an option to disable them, as in some use cases they provide no utility.

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/135

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

There are several test cases in the conformance test suite that are failing due to the openshift-multus configuration.

We are running the conformance test suite as part of our OpenShift on OpenStack CI, just to confirm correct functionality of the cluster. The command we are using to run the test suite is:

openshift-tests run  --provider '{\"type\":\"openstack\"}' openshift/conformance/parallel 

The names of the tests that failed are:
1. [sig-arch] Managed cluster should ensure platform components have system-* priority class associated [Suite:openshift/conformance/parallel]

Reason is:

6 pods found with invalid priority class (should be openshift-user-critical or begin with system-):
openshift-multus/whereabouts-reconciler-6q6h7 (currently "")
openshift-multus/whereabouts-reconciler-87dwn (currently "")
openshift-multus/whereabouts-reconciler-fvhwv (currently "")
openshift-multus/whereabouts-reconciler-h68h5 (currently "")
openshift-multus/whereabouts-reconciler-nlz59 (currently "")
openshift-multus/whereabouts-reconciler-xsch6 (currently "")

2. [sig-arch] Managed cluster should only include cluster daemonsets that have maxUnavailable or maxSurge update of 10 percent or maxUnavailable of 33 percent [Suite:openshift/conformance/parallel]
Reason is:

fail [github.com/openshift/origin/test/extended/operators/daemon_set.go:105]: Sep 23 16:12:15.283: Daemonsets found that do not meet platform requirements for update strategy:
  expected daemonset openshift-multus/whereabouts-reconciler to have maxUnavailable 10% or 33% (see comment) instead of 1, or maxSurge 10% instead of 0
Ginkgo exit error 1: exit with code 1

3. [sig-arch] Managed cluster should set requests but not limits [Suite:openshift/conformance/parallel]

Reason is:

fail [github.com/openshift/origin/test/extended/operators/resources.go:196]: Sep 23 16:12:17.489: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted:
  apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on cpu of 50m which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[cpu]")
  apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on memory of 100Mi which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[memory]")
Ginkgo exit error 1: exit with code 1

4. [sig-node][apigroup:config.openshift.io] CPU Partitioning cluster platform workloads should be annotated correctly for DaemonSets [Suite:openshift/conformance/parallel]

Reason is:

fail [github.com/openshift/origin/test/extended/cpu_partitioning/pods.go:159]: Expected
    <[]error | len:1, cap:1>: [
        <*errors.errorString | 0xc0010fa380>{
            s: "daemonset (whereabouts-reconciler) in openshift namespace (openshift-multus) must have pod templates annotated with map[target.workload.openshift.io/management:{\"effect\": \"PreferredDuringScheduling\"}]",
        },
    ]
to be empty

How reproducible: Always
Steps to Reproduce: Run the conformance test suite:
https://github.com/openshift/origin/blob/master/test/extended/README.md

Actual results: Test cases failing
Expected results: Test cases passing
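For illustration only (a sketch; the DaemonSet is managed by the cluster-network-operator, so any direct patch would be reconciled away and the real fix belongs in the rendered whereabouts-reconciler manifest), the properties the conformance suite expects look roughly like this:

# 1. a system-* priority class on the pod template
$ oc -n openshift-multus patch daemonset whereabouts-reconciler --type merge \
    -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
# 2. a percentage-based maxUnavailable in the update strategy
$ oc -n openshift-multus patch daemonset whereabouts-reconciler --type merge \
    -p '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":"10%"}}}}'
# 3. requests only, no cpu/memory limits (assuming the whereabouts container is at index 0)
$ oc -n openshift-multus patch daemonset whereabouts-reconciler --type json \
    -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits"}]'
# 4. the CPU partitioning workload annotation on the pod template
$ oc -n openshift-multus patch daemonset whereabouts-reconciler --type merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"target.workload.openshift.io/management":"{\"effect\": \"PreferredDuringScheduling\"}"}}}}}'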

Description of problem:

    The apiserver-url.env file is a dependency of all CCM components. These mostly run on the masters; however, on Azure they also run on workers.

A recent change in Kubernetes (https://github.com/kubernetes/kubernetes/pull/121028) fixes a previous bug, with the side effect that workers no longer bootstrap, since the kubelet no longer sets an IP address.

To resolve this, the CNM needs to be able to talk to the KAS outside of the CNI. This already works on masters, but the URL env file is missing on workers, so they get stuck.
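For reference, the gap can be checked per node (a sketch, assuming the conventional /etc/kubernetes/apiserver-url.env path consumed by the CCM components; node names are placeholders):

# present on masters
$ oc debug node/<master-node> -- chroot /host cat /etc/kubernetes/apiserver-url.env
# missing on Azure workers, which is why the CNM there cannot reach the KAS
$ oc debug node/<worker-node> -- chroot /host cat /etc/kubernetes/apiserver-url.env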

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Multi-arch compute clusters have an issue where the cluster version's image ref is single-arch, so this change resolves the image ref without spinning up a pod.
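For context (a sketch; the release pullspec below is only a placeholder), a manifest-listed release image can be resolved per architecture directly from the registry, without scheduling a pod:

# inspect the arm64 manifest of a multi-arch release payload
$ oc adm release info --filter-by-os=linux/arm64 quay.io/openshift-release-dev/ocp-release:4.16.0-rc.1-multi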

Description of problem:

Test case failure: OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations
The obtained response appears to have unmarshalling errors.

Failed to fetch alerting rules: unable to parse response

invalid character 's' after object key

Expected output: the response should parse cleanly and the unmarshalling should succeed

OpenShift Version: 4.13 & 4.14

Cloud Provider/Platform: PowerVS

Prow Job Link/Must-gather path: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs/1700992665824268288/artifacts/ocp-e2e-ovn-ppc64le-powervs/
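The parsing problem can be reproduced outside the test with something like the following (a sketch, assuming the in-cluster Thanos querier route; the test itself may query a different endpoint):

# jq -e exits non-zero if the body is not valid JSON, mirroring the unmarshal failure
$ TOKEN=$(oc -n openshift-monitoring create token prometheus-k8s)
$ curl -sk -H "Authorization: Bearer ${TOKEN}" \
    "https://$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')/api/v1/rules" \
    | jq -e . >/dev/null || echo "rules response is not valid JSON"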

Description of problem:

When an OpenShift Container Platform cluster is installed on Bare Metal with RHACM, the "metal3-plugin" for the OpenShift Console is installed automatically.

The "Nodes view (`<console>/k8s/cluster/core~v1~Node`) uses the `BareMetalNodesTable` which has very limited columns. However in the meantime OCP improved their Nodes table and added more features (like metrics) and we havent done any work in metal3. Customers are missing information like metrics or Pods, which are present in the standard Node view.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.33

How reproducible:

Always

Steps to Reproduce:

    1. Install a cluster using RHACM on Bare Metal
    2. Ensure the "metal3-plugin" is enabled
    3. Navigate to the "Nodes" view in the OpenShift Container Platform Console (`<console>/k8s/cluster/core~v1~Node`)

Actual results:

Limited columns (Name, Status, Role, Machine, Management Address) are visible. Columns like Memory, CPU, Pods, Filesystem, and Instance Type are missing.

Expected results:

All the columns from the standard view are visible, plus the "Management Address" column

Additional info:

* Issue was discussed here: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552957981989
* Screenshot of non-metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552980878029?thread_ts=1702552957.981989&cid=C027TN14SGJ
* Screenshot of metal3 cluster: https://redhat-internal.slack.com/archives/C027TN14SGJ/p1702552995363389?thread_ts=1702552957.981989&cid=C027TN14SGJ

Description of problem

CI is flaky because the TestHostNetworkPort test fails:

=== NAME  TestAll/serial/TestHostNetworkPortBinding
    operator_test.go:1034: Expected conditions: map[Admitted:True Available:True DNSManaged:False DeploymentReplicasAllAvailable:True LoadBalancerManaged:False]
         Current conditions: map[Admitted:True Available:True DNSManaged:False Degraded:False DeploymentAvailable:True DeploymentReplicasAllAvailable:False DeploymentReplicasMinAvailable:True DeploymentRollingOut:True EvaluationConditionsDetected:False LoadBalancerManaged:False LoadBalancerProgressing:False Progressing:True Upgradeable:True]
    operator_test.go:1034: Ingress Controller openshift-ingress-operator/samehost status: {
          "availableReplicas": 0,
          "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=samehost",
          "domain": "samehost.ci-op-xlwngvym-43abb.origin-ci-int-aws.dev.rhcloud.com",
          "endpointPublishingStrategy": {
            "type": "HostNetwork",
            "hostNetwork": {
              "protocol": "TCP",
              "httpPort": 9080,
              "httpsPort": 9443,
              "statsPort": 9936
            }
          },
          "conditions": [
            {
              "type": "Admitted",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "Valid"
            },
            {
              "type": "DeploymentAvailable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentAvailable",
              "message": "The deployment has Available status condition set to True"
            },
            {
              "type": "DeploymentReplicasMinAvailable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentMinimumReplicasMet",
              "message": "Minimum replicas requirement is met"
            },
            {
              "type": "DeploymentReplicasAllAvailable",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentReplicasNotAvailable",
              "message": "0/1 of replicas are available"
            },
            {
              "type": "DeploymentRollingOut",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "DeploymentRollingOut",
              "message": "Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n"
            },
            {
              "type": "LoadBalancerManaged",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer",
              "message": "The configured endpoint publishing strategy does not include a managed load balancer"
            },
            {
              "type": "LoadBalancerProgressing",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "LoadBalancerNotProgressing",
              "message": "LoadBalancer is not progressing"
            },
            {
              "type": "DNSManaged",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "UnsupportedEndpointPublishingStrategy",
              "message": "The endpoint publishing strategy doesn't support DNS management."
            },
            {
              "type": "Available",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z"
            },
            {
              "type": "Progressing",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "IngressControllerProgressing",
              "message": "One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n)"
            },
            {
              "type": "Degraded",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z"
            },
            {
              "type": "Upgradeable",
              "status": "True",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "Upgradeable",
              "message": "IngressController is upgradeable."
            },
            {
              "type": "EvaluationConditionsDetected",
              "status": "False",
              "lastTransitionTime": "2024-02-26T17:25:39Z",
              "reason": "NoEvaluationCondition",
              "message": "No evaluation condition is detected."
            }
          ],
          "tlsProfile": {
            "ciphers": [
              "ECDHE-ECDSA-AES128-GCM-SHA256",
              "ECDHE-RSA-AES128-GCM-SHA256",
              "ECDHE-ECDSA-AES256-GCM-SHA384",
              "ECDHE-RSA-AES256-GCM-SHA384",
              "ECDHE-ECDSA-CHACHA20-POLY1305",
              "ECDHE-RSA-CHACHA20-POLY1305",
              "DHE-RSA-AES128-GCM-SHA256",
              "DHE-RSA-AES256-GCM-SHA384",
              "TLS_AES_128_GCM_SHA256",
              "TLS_AES_256_GCM_SHA384",
              "TLS_CHACHA20_POLY1305_SHA256"
            ],
            "minTLSVersion": "VersionTLS12"
          },
          "observedGeneration": 1
        }
    operator_test.go:1036: failed to observe expected conditions for the second ingresscontroller: timed out waiting for the condition
    operator_test.go:1059: deleted ingresscontroller samehost
    operator_test.go:1059: deleted ingresscontroller hostnetworkportbinding

This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1017/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1762147882179235840. Search.ci shows another failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/48873/rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1762576595890999296. The test has failed sporadically in the past, beyond what search.ci is able to search.

TestHostNetworkPort is marked as a serial test in TestAll and marked with t.Parallel() in the test itself. It is not clear whether this is what is causing the new failure seen in this test, but something is incorrect.

Version-Release number of selected component (if applicable)

The test failures have been observed recently on 4.16 as well as on 4.12 (https://github.com/openshift/cluster-ingress-operator/pull/828#issuecomment-1292888086) and 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/914#issuecomment-1526808286). The logic error was introduced in 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc).

How reproducible

The logic error is self-evident. The test failure is very rare. The failure has been observed sporadically over the past couple years. Presently, search.ci shows two failures, with the following impact, for the past 14 days:

rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact

Steps to Reproduce

N/A.

Actual results

The TestHostNetworkPort test fails. The test is marked as both serial and parallel.

Expected results

Test should be marked as either serial or parallel, and it should pass consistently.

Additional info

When TestAll was introduced, TestHostNetworkPortBinding was initially marked parallel in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc. After some discussion, it was moved to the serial list in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a449e497e35fafeecbee9ea656e0631393182f70, but the commit to remove t.Parallel() evidently got inadvertently dropped.

Description of the problem:

Staging UI 2.31.1, BE 2.31.0 - clicking on "create new cluster" shows a spinning wheel in the UI, but nothing loads and no error can be found.

Edit:
BE v2/openshift-versions response is empty
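The empty response can be confirmed directly against the backend (a sketch; the host is a placeholder and the path assumes the standard assisted-service REST prefix):

$ curl -s "https://${ASSISTED_SERVICE_HOST}/api/assisted-install/v2/openshift-versions" | jq .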

How reproducible:

100%

Steps to reproduce:

1. Click on create new cluster

2.

3.

Actual results:

 

Expected results:

Description of problem:

[Azuredisk-csi-driver] allocatable volumes count incorrect in csinode for Standard_B4as_v2 instance types

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-02-132842

How reproducible:

Always

Steps to Reproduce:

1. Install Azure OpenShift cluster use the Standard_B4as_v2 instance type
2. Check the csinode object allocatable volumes count
3. Create a pod with the max allocatable volumes count pvcs(provision by azuredisk-csi-driver)

Actual results:

In step 2 the allocatable volumes count is 16.
$ oc get csinode pewang-0908s-r6lwd-worker-southcentralus3-tvwwr -ojsonpath='{.spec.drivers[?(@.name=="disk.csi.azure.com")].allocatable.count}'
16

In step 3 the pod is stuck at ContainerCreating because attaching the volume failed with:
09-07 22:38:28.758        "message": "The maximum number of data disks allowed to be attached to a VM of this size is 8.",\r

Expected results:

In step 2 the allocatable volumes count should be 8.
In step 3 the pod should be Running and all volumes should be readable and writable.

Additional info:

$ az vm list-skus -l eastus --query "[?name=='Standard_B4as_v2']"| jq -r '.[0].capabilities[] | select(.name =="MaxDataDiskCount")'
{
  "name": "MaxDataDiskCount",
  "value": "8"
}

Currently in 4.14 we use the v1.28.1 driver. I checked the upstream issues and PRs; the issue is fixed in v1.28.2:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.2

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/53

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/223

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This bug just focuses on denoising WellKnown_NotReady. More generic Available=False denoising is tracked in https://issues.redhat.com/browse/OCPBUGS-20056.

Description of problem:

Reviving bugzilla#2010539, the authentication ClusterOperator occasionally blips Available=False with reason=WellKnown_NotReady. For example, this run includes:

: [bz-apiserver-auth] clusteroperator/authentication should not change condition/Available	47m21s
{  1 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 19:11:20.502 - 245ms E clusteroperator/authentication condition/Available reason/WellKnown_NotReady status/False WellKnownAvailable: The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://10.0.0.3:6443/.well-known/oauth-authorization-server: dial tcp 10.0.0.3:6443: i/o timeout

While a dial timeout for the Kube API server isn't fantastic, an issue that only persists for 245ms is not long enough to warrant immediate admin intervention. Teaching the authentication operator to stay Available=True through this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is actually required.

Version-Release number of selected component (if applicable):

4.8, 4.10, and 4.15. Likely all supported versions of the authentication operator have this exposure.

How reproducible:

It looks like 10 to 50% of 4.15 runs have some kind of issue with authentication going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/authentication+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 44% failed, 13% of failures match = 6% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 67% failed, 17% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-azure-ovn-heterogeneous (all) - 18 runs, 56% failed, 30% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ovn-serial-aws-arm64 (all) - 9 runs, 33% failed, 33% of failures match = 11% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-serial-ovn-ppc64le-powervs (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 18 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 86% of failures match = 36% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 21% failed, 76% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn-techpreview-serial (all) - 7 runs, 29% failed, 100% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 28% failed, 36% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 39% failed, 123% of failures match = 48% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 71 runs, 49% failed, 80% of failures match = 39% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 75% of failures match = 43% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 34% failed, 4% of failures match = 1% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-azure-sdn (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial (all) - 7 runs, 43% failed, 67% of failures match = 29% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-serial-ovn-ipv6 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 100% failed, 57% of failures match = 57% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 12 runs, 58% failed, 14% of failures match = 8% impact

Digging into reason and message frequency in 4.15-releated update CI:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's/[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*/x.x.x.x/g;s|[.]apps[.][^/]*|.apps.../|g' | sort | uniq -c | sort -n
      1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
      1 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/oauth.openshift.io/v1: Get "https://x.x.x.x:8443/apis/oauth.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      1 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      1 authentication APIServices_Error rpc error: code = Unavailable desc = the connection is draining
      1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp: lookup oauth-openshift.apps.../
      1 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": dial tcp x.x.x.x:443: connect: connection refused
      1 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://[fd02::410f]:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      1 Nov 28 09:09:40.407 - 1s    E clusteroperator/authentication condition/Available reason/APIServerDeployment_PreconditionNotFulfilled status/False
      2 authentication APIServerDeployment_NoPod no .openshift-oauth-apiserver pods available on any node.
      2 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      2 authentication APIServices_Error rpc error: code = Unknown desc = malformed header: missing HTTP content-type
      4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
      4 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: failing or missing response from https://x.x.x.x:8443/apis/user.openshift.io/v1: Get "https://x.x.x.x:8443/apis/user.openshift.io/v1": dial tcp x.x.x.x:8443: i/o timeout
      6 authentication OAuthServerDeployment_NoDeployment deployment/openshift-authentication: could not be retrieved
      7 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.oauth.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
      7 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: i/o timeout (Client.Timeout exceeded while awaiting headers)
      8 authentication APIServerDeployment_NoPod no apiserver.openshift-oauth-apiserver pods available on any node.
      9 authentication APIServerDeployment_NoDeployment deployment/openshift-oauth-apiserver: could not be retrieved
      9 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": EOF
     11 authentication WellKnown_NotReady The well-known endpoint is not yet available: failed to GET kube-apiserver oauth endpoint https://x.x.x.x:6443/.well-known/oauth-authorization-server: dial tcp x.x.x.x:6443: i/o timeout
     23 authentication APIServices_Error "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
     26 authentication APIServices_Error "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
     29 authentication APIServices_Error apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https"
     29 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable Get "https://oauth-openshift.apps...//healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
     30 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": dial tcp x.x.x.x:443: connect: connection refused
     34 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable Get "https://x.x.x.x:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

And simplifying by looking only at reason:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/authentication.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
      1 authentication APIServerDeployment_PreconditionNotFulfilled
      6 authentication OAuthServerDeployment_NoDeployment
      8 authentication APIServerDeployment_NoDeployment
     10 authentication APIServerDeployment_NoPod
     11 authentication WellKnown_NotReady
     36 authentication OAuthServerRouteEndpointAccessibleController_EndpointUnavailable
     43 authentication APIServices_PreconditionNotReady
     66 authentication OAuthServerServiceEndpointAccessibleController_EndpointUnavailable
     95 authentication APIServices_Error

 

Expected results:

Authentication goes Available=False on WellKnown_NotReady if and only if immediate admin intervention is appropriate.

Description of problem:

Running yarn or yarn install on the latest master branch of Console fails on macOS.

$ cd /path/to/console/frontend
$ yarn install

https://github.com/openshift/console/pull/13706#issuecomment-2051682156

$ ./scripts/check-patternfly-modules.sh && yarn prepare-husky && yarn generate
Checking \e[0;33myarn.lock\e[0m file for PatternFly module resolutions
grep: invalid option -- P
usage: grep [-abcdDEFGHhIiJLlMmnOopqRSsUVvwXxZz] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]
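The immediate failure is that BSD grep on macOS has no -P (PCRE) flag. A portable rewrite of the check (a sketch; the pattern shown is a placeholder, not the one actually used by check-patternfly-modules.sh) would be:

# use POSIX ERE (-E), which both GNU and BSD grep support, instead of -P
grep -E '"@patternfly/react-[a-z-]+@' yarn.lock
# or fall back to perl when PCRE-only features are really needed
perl -ne 'print if /"\@patternfly\/react-[a-z-]+\@/' yarn.lock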

Description of problem:

HO uses the ICSP/IDMS from mgmt cluster to extract the OCP release metadata to be used in the HostedCluster.

But they are extracted only once in main.go:
https://github.com/jparrill/hypershift/blob/9bf1403ae09c0f262ebfe006267e3b442cc70149/hypershift-operator/main.go#L287-L293
before starting the HC and NP controllers, and they are never refreshed when the ICSP/IDMS change on the management cluster, nor when a new HostedCluster is created.

    

Version-Release number of selected component (if applicable):

    4.14 4.15 4.16

How reproducible:

100%


    

Steps to Reproduce:

    1. ensure that HO is already running
    2. create an ICSP or an IDMS on the management cluster
    3. try to create a hosted cluster
    

Actual results:

the imageRegistryOverrides setting for the new hosted cluster ignores the ICSP/IDMS created while the HO was already running.
Killing the HO pod and waiting for it to restart produces a different result.
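Until this is fixed, the workaround implied above (a sketch, assuming the default hypershift namespace and operator deployment name) is to restart the operator so it re-extracts the overrides:

# force the HyperShift operator to re-read ICSP/IDMS on startup
$ oc -n hypershift rollout restart deployment/operator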

    

Expected results:

HO consistently consumes ICSP/IDMS info at runtime without needing to be restarted.

    

Additional info:

    It affects disconnected deployments

Description of problem:

A long-lived cluster updating into 4.16.0-ec.1 was bitten by the Engineering Candidate's month-or-more-old api-int CA rotation (details on early rotation in API-1687). After manually updating /var/lib/kubelet/kubeconfig to include the new CA (which OCPBUGS-25821 is working on automating), multus pods still complained about untrusted api-int:

$ oc -n openshift-multus logs multus-pz7zp | grep api-int | tail -n5
E0119 19:33:52.983918    3194 reflector.go:148] k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dbuild0-gstfj-m-2.c.openshift-ci-build-farm.internal&resourceVersion=4723865081": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [error] Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [verbose] ADD finished CNI request ContainerID:"b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62" Netns:"/var/run/netns/36923fe0-e28d-422f-8213-233086527baa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-machine-api;K8S_POD_NAME=cluster-autoscaler-default-f8dd547c7-dg9t5;K8S_POD_INFRA_CONTAINER_ID=b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62;K8S_POD_UID=f79ff01a-71c2-4f02-b48b-8c23c9e875ce" Path:"", result: "", err: error configuring pod [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5] networking: Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [error] Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [verbose] ADD finished CNI request ContainerID:"cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb" Netns:"/var/run/netns/bc7fbf17-c049-4241-a7dc-7e27acd3c8af" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-storage-version-migrator;K8S_POD_NAME=migrator-558d4d48b9-ggjpj;K8S_POD_INFRA_CONTAINER_ID=cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb;K8S_POD_UID=769153af-350b-492b-9589-ede2574aea85" Path:"", result: "", err: error configuring pod [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj] networking: Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority

The multus pod needed a delete/replace, and after that it recovered:

$ oc --as system:admin -n openshift-multus delete pod multus-pz7zp
pod "multus-pz7zp" deleted
$ oc -n openshift-multus get -o wide pods | grep 'NAME\|build0-gstfj-m-2.c.openshift-ci-build-farm.internal'
NAME                                           READY   STATUS              RESTARTS      AGE     IP               NODE                                                              NOMINATED NODE   READINESS GATES
multus-additional-cni-plugins-wrdtt            1/1     Running             1             28h     10.0.0.3         build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
multus-admission-controller-74d794678b-9s7kl   2/2     Running             0             27h     10.129.0.36      build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
multus-hxmkz                                   1/1     Running             0             11s     10.0.0.3         build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
network-metrics-daemon-dczvs                   2/2     Running             2             28h     10.129.0.4       build0-gstfj-m-2.c.openshift-ci-build-farm.internal               <none>           <none>
$ oc -n openshift-multus logs multus-hxmkz | grep -c api-int
0

That need for multus-pod deletion should be automated, to reduce the number of things that need manual touches when the api-int CA rolls.

Version-Release number of selected component

Seen in 4.16.0-ec.1.

How reproducible:

Several multus pods on this cluster were bitten, but others were not, including some on clusters with old kubeconfigs that did not contain the new CA. I'm not clear on what the trigger is; perhaps some clients escape immediate trouble by having existing api-int connections to servers from back when the servers used the old CA? But deleting the multus pod on a cluster whose /var/lib/kubelet/kubeconfig has not yet been updated will likely reproduce the breakage, at least until OCPBUGS-25821 is fixed.

Steps to Reproduce:

Not entirely clear, but something like:

  1. Install 4.16.0-ec.1.
  2. Wait a month or more for the Kube API server operator to decide to roll the CA signing api-int.
  3. Delete a multus pod, so the replacement comes up broken on api-int trust.
  4. Manually update /var/lib/kubelet/kubeconfig.

Actual results:

Multus still fails to trust api-int until the broken pod is deleted or the container otherwise restarts to notice the updated kubeconfig.

Expected results:

Multus pod automatically pulls in the updated kubeconfig.

Additional info:

One possible implementation would be a liveness probe failing on api-int trust issues, triggering the kubelet to roll the multus container, and the replacement multus container to come up and load the fresh kubeconfig.
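A rough shape of that probe command (a sketch only, assuming the container's kubeconfig is reachable via $KUBECONFIG; the real implementation would live in the multus entrypoint and image) could be:

# exit non-zero when api-int can no longer be verified with the kubeconfig the container
# started with, so the kubelet restarts the container and it reloads the rotated CA
oc --kubeconfig "${KUBECONFIG}" get --raw /version >/dev/null 2>&1 || exit 1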

We added a carry patch to change the healthcheck behaviour in the Azure CCM: https://github.com/openshift/cloud-provider-azure/pull/72 and whilst we opened an upstream twin PR for that https://github.com/kubernetes-sigs/cloud-provider-azure/pull/3887 it got closed in favour of a different approach https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4891 .

As such, in the next rebase we need to drop the commit introduced in #72 in favour of picking up the change from #4891 through the rebase. While doing that we need to explicitly set the new probe behaviour, as the default is still the classic behaviour, which does not work with our cluster architecture setup on Azure.

For the steps on how to do this, we can follow this comment: https://github.com/openshift/cloud-provider-azure/pull/88#issuecomment-1803832076

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

E1213 07:18:34.291004 1 run.go:74] "command failed" err="error while building transformers: KMSv1 is deprecated and will only receive security updates going forward. Use KMSv2 instead. Set --feature-gates=KMSv1=true to use the deprecated KMSv1 feature."
    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/103

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During a pod deletion, the whereabouts reconciler correctly detects the pod deletion but errors out claiming that the IPPool is not found. However, when checking the audit logs, we can see no deletion and no re-creation, and we can even see successful "patch" and "get" requests to the same IPPool. This means that the IPPool was never deleted and was accessible at the time of the issue, so the reconciler appears to have made some mistake while retrieving the IPPool.
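The audit check described above can be reproduced with something like the following (a sketch, assuming the documented kube-apiserver audit log path and that the whereabouts CRD resource is ippools; the node name is a placeholder):

# list timestamp, verb and response code for every request against IPPool objects
$ oc adm node-logs <master-node> --path=kube-apiserver/audit.log \
    | jq -r 'select(.objectRef.resource == "ippools") | [.stageTimestamp, .verb, .responseStatus.code] | @tsv'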

Version-Release number of selected component (if applicable):

4.12.22

How reproducible:

Sometimes    

Steps to Reproduce:

    1.Delete pod
    2.
    3.
    

Actual results:

Error in the whereabouts reconciler. New pods using additional networks with the whereabouts IPAM plugin cannot have IPs allocated due to the incorrect cleanup.

Expected results:

    

Additional info:

    

Description of problem:

Configure the VM type as Standard_NP10s in install-config; this size only supports Hypervisor Generation V1.
--------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3

Continuing the installation, the installer failed when provisioning the bootstrap node.
--------------
ERROR                                              
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" 
ERROR                                              
ERROR   with azurerm_linux_virtual_machine.bootstrap, 
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": 
ERROR  193: resource "azurerm_linux_virtual_machine" "bootstrap" { 
ERROR                                              
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "bootstrap" stage: error applying Terraform configs: failed to apply Terraform: exit status 1 
ERROR                                              
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm" 
ERROR                                              
ERROR   with azurerm_linux_virtual_machine.bootstrap, 
ERROR   on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap": 
ERROR  193: resource "azurerm_linux_virtual_machine" "bootstrap" { 
ERROR                                              
ERROR                                              

It seems that the issue was introduced by https://github.com/openshift/installer/pull/7642/

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-09-012410

How reproducible:

Always

Steps to Reproduce:

    1. configure vm type to Standard_NP10s on control-plane in install-config.yaml
    2. install cluster
    3.
    

Actual results:

    installer failed when provisioning bootstrap node

Expected results:

    the installation succeeds

Additional info:

    

Description of problem:

We would like to include the CEL IP and CIDR validations in 4.16. They have been merged upstream and can be backported into OpenShift to improve our validation downstream.

Upstream PR: https://github.com/kubernetes/kubernetes/pull/121912

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem

When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.

Version-Release number of selected component

Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to be every release going back to at least 4.12, based on code inspection.

How reproducible

Always.

Steps to Reproduce

With a launch 4.14.10 gcp Cluster Bot cluster (logs):

$ oc adm upgrade
Cluster version is 4.14.10

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets.machine.openshift.io
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m

Pick that set with 0 nodes. They don't come with taints by default:

$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
null

So patch one in:

$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"}
]}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched

And set up autoscaling:

$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml 
clusterautoscaler.autoscaling.openshift.io/default created

I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?

$ cat machine-autoscaler.yaml 
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml 
machineautoscaler.autoscaling.openshift.io/test created

Checking the autoscaler's logs:

$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
W0122 19:18:47.246369       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:18:58.474000       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:09.703748       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:20.929617       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
...

And the MachineSet is failing to scale:

$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m

While if I remove the taint:

$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched

The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:

$ oc -n openshift-machine-api get machines.machine.openshift.io
NAME                                       PHASE     TYPE                REGION        ZONE            AGE
ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
I0122 19:23:17.284762       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:17.687036       1 legacy.go:296] No candidates for scale down
W0122 19:23:27.924167       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:28.510701       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:28.909507       1 legacy.go:296] No candidates for scale down
W0122 19:23:39.148266       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:39.737359       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:40.135580       1 legacy.go:296] No candidates for scale down
W0122 19:23:50.376616       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:50.963064       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:51.364313       1 legacy.go:296] No candidates for scale down
W0122 19:24:01.601764       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:24:02.191330       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:02.589766       1 legacy.go:296] No candidates for scale down
I0122 19:24:13.415183       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:13.815851       1 legacy.go:296] No candidates for scale down
I0122 19:24:24.641190       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:25.040894       1 legacy.go:296] No candidates for scale down
I0122 19:24:35.867194       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:36.266400       1 legacy.go:296] No candidates for scale down
I0122 19:24:47.097656       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:47.498099       1 legacy.go:296] No candidates for scale down
I0122 19:24:58.326025       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:58.726034       1 legacy.go:296] No candidates for scale down
I0122 19:25:04.927980       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:25:04.938213       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
I0122 19:25:09.552086       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:09.952094       1 legacy.go:296] No candidates for scale down
I0122 19:25:20.778317       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:21.178062       1 legacy.go:296] No candidates for scale down
I0122 19:25:32.005246       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:32.404966       1 legacy.go:296] No candidates for scale down
I0122 19:25:43.233637       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:43.633889       1 legacy.go:296] No candidates for scale down
I0122 19:25:54.462009       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:54.861513       1 legacy.go:296] No candidates for scale down
I0122 19:26:05.688410       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:06.088972       1 legacy.go:296] No candidates for scale down
I0122 19:26:16.915156       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:17.315987       1 legacy.go:296] No candidates for scale down
I0122 19:26:28.143877       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:28.543998       1 legacy.go:296] No candidates for scale down
I0122 19:26:39.369085       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:39.770386       1 legacy.go:296] No candidates for scale down
I0122 19:26:50.596923       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:50.997262       1 legacy.go:296] No candidates for scale down
I0122 19:27:01.823577       1 static_autoscaler.go:552] No unschedulable pods
I0122 19:27:02.223290       1 legacy.go:296] No candidates for scale down
I0122 19:27:04.938943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:27:04.947353       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms

Actual results

Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.

Expected results

Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
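For context, a minimal sketch of the scale-from-zero MachineAutoscaler that would reference the tainted MachineSet; the names are taken from the outputs above, and maxReplicas is an arbitrary example value:

apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: ci-ln-s48f02k-72292-5z2hn-worker-f
  namespace: openshift-machine-api
spec:
  minReplicas: 0        # scale from zero
  maxReplicas: 3        # arbitrary example value
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f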

Description of problem:

    When destroying an HCP KubeVirt cluster using the CLI with the --destroy-cloud-resources option, PVCs are not cleaned up within the guest cluster because the CLI does not properly honor the --destroy-cloud-resources option for KubeVirt.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

    100%

Steps to Reproduce:

    1. destroy an hcp kubevirt cluster using cli and --destroy-cloud-resources
    2.
    3.
    

Actual results:

The hosted cluster does not get the hypershift.openshift.io/cleanup-cloud-resources: "true" annotation, which is what triggers the hosted cluster config controller to clean up PVCs.

Expected results:

The hypershift.openshift.io/cleanup-cloud-resources: "true" annotation should be added to the hosted cluster during teardown when the --destroy-cloud-resources CLI option is used.
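For illustration, a minimal sketch of what the hosted cluster is expected to carry during teardown once the annotation is applied (name and namespace are hypothetical):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example                # hypothetical
  namespace: clusters          # hypothetical
  annotations:
    hypershift.openshift.io/cleanup-cloud-resources: "true"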

Additional info:

    

Description of problem:

Due to HTTP/2 Connection Coalescing (https://daniel.haxx.se/blog/2016/08/18/http2-connection-coalescing/), routes which use the same certificate can present unexplained 503 errors when attempting to access an HTTP/2 enabled ingress controller.

It appears that HAProxy supports the ability to force HTTP 1.1 on a route-by-route basis, but our Ingress Controller does not expose that option.

This is especially problematic for component routes because generally speaking, customers use a wildcard or SAN to deploy custom component routes (console, OAuth, downloads), but with HTTP/2, this does not work properly.

To address this issue, we're proposing the creation of an annotation haproxy.router.openshift.io/http2-disable, which will allow the disabling of HTTP/2 on a route-by-route basis, or smarter logic built into our Ingress operator to handle this situation.  
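For illustration, a sketch of how the proposed (not yet existing) annotation might be applied to a single route; the annotation name comes from this proposal, while the route name, host, and the "true" value are assumptions:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: console-custom                                   # hypothetical
  namespace: openshift-console
  annotations:
    haproxy.router.openshift.io/http2-disable: "true"    # proposed annotation
spec:
  host: console.apps.example.com                         # hypothetical
  to:
    kind: Service
    name: console
  tls:
    termination: reencrypt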

Version-Release number of selected component (if applicable):

 OpenShift 4.14

How reproducible:

Serve routes to applications in OpenShift.
Observe the routes through an HTTP/2-enabled client.
Notice that HTTP/2 client connections are broken (the second connection returns 503 when the same certificate is used across a mix of re-encrypt and passthrough routes).

Steps to Reproduce:

(see above notes)

Actual results:

503 error    

Expected results:

no error    

Additional info:

    

Description of problem:

When using an older-version "oc" client to extract a command that was newly introduced into the release payload, the error says the command does not support the operating system "linux", which is confusing.


[root@gpei-test-rhel9 0423]# ./oc version
Client Version: 4.15.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Kubernetes Version: v1.27.12+7bee54d

[root@gpei-test-rhel9 0423]# ./oc adm release extract --registry-config ~/.docker/config --command=openshift-install-fips --to ./ registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-04-23-065741
error: command "openshift-install-fips" does not support the operating system "linux"

And for the oc client extracted from the same payload, it works well.
[root@gpei-test-rhel9 fips]# ./oc version
Client Version: 4.16.0-0.ci-2024-04-23-065741
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Kubernetes Version: v1.27.12+7bee54d

[root@gpei-test-rhel9 fips]# ./oc adm release extract --registry-config ~/.docker/config --command=openshift-install-fips --to ./ registry.ci.openshift.org/ocp/release:4.16.0-0.ci-2024-04-23-065741

[root@gpei-test-rhel9 fips]# ls
oc  openshift-install-fips

The expected error would be something like: command "openshift-install-fips" is not supported by the current oc client, rather than a message saying it does not support the operating system "linux".

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Document URL: 

[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account

Section Number and Name: 

* Required EC2 permissions for installation

Description of problem:

The permission ec2:DisassociateAddress is required for OCP 4.16+ installation, but it is missing from the official doc [1]. We would like to understand why/whether this permission is necessary.

level=info msg=Destroying the bootstrap resources...
...
level=error msg=Error: disassociating EC2 EIP (eipassoc-01e8cc3f06f2c2499): UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::301721915996:user/ci-op-0xjvtwb0-4e979-minimal-perm is not authorized to perform: ec2:DisassociateAddress on resource: arn:aws:ec2:us-east-1:301721915996:elastic-ip/eipalloc-0274201623d8569af because no identity-based policy allows the ec2:DisassociateAddress action. 


    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-03-13-061822
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create OCP cluster with permissions listed in the official doc.
    2.
    3.
    

Actual results:

See description. 
    

Expected results:

Cluster is created successfully.
    

Suggestions for improvement:

Add ec2:DisassociateAddress to `Required EC2 permissions for installation` in [1]

Additional info:

This impacts the permission list in ROSA Installer-Role as well.
    

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1175

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/114

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1187

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When deploying to a Power VS workspace created after February 14th 2024, it will not be found by the installer.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily.

Steps to Reproduce:

    1. Create a Power VS Workspace
    2. Specify it in the install config
    3. Attempt to deploy
    4. Fail with "...is not a valid guid" error.
    

Actual results:

    Failure to deploy to service instance

Expected results:

    Should deploy to service instance

Additional info:

    

Description of problem:

After applying an EgressQoS on OCP, the status of the EgressQoS is empty. The ovnkube pod logs show errors like the ones below:

 

I0429 09:39:19.013461    4771 egressqos.go:460] Processing sync for EgressQoS abc/default
I0429 09:39:19.022635    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace abc : 9.174361ms
E0429 09:39:19.028426    4771 egressqos.go:368] failed to update EgressQoS object abc/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-62-24.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.031526    4771 egressqos.go:460] Processing sync for EgressQoS default/default
I0429 09:39:19.039827    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace default : 8.322774ms
E0429 09:39:19.044060    4771 egressqos.go:368] failed to update EgressQoS object default/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-70-102.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.052877    4771 egressqos.go:460] Processing sync for EgressQoS abc/default
I0429 09:39:19.055945    4771 egressqos.go:463] Finished syncing EgressQoS default on namespace abc : 3.182828ms
E0429 09:39:19.060563    4771 egressqos.go:368] failed to update EgressQoS object abc/default with status: Apply failed with 1 conflict: conflict with "ip-10-0-62-24.us-east-2.compute.internal" with subresource "status": .status.conditions
I0429 09:39:19.072238    4771 egressqos.go:460] Processing sync for EgressQoS default/default 

 

 

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

1. create egressqos in ns abc

% cat egress_qos.yaml 
kind: EgressQoS
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
  namespace: abc
spec:
  egress:
  - dscp: 46
    dstCIDR: 3.16.78.227/32
  - dscp: 30
    dstCIDR: 0.0.0.0/0 

2. check egressqos 

% oc get egressqos default -o yaml
apiVersion: k8s.ovn.org/v1
kind: EgressQoS
metadata:
  creationTimestamp: "2024-04-29T09:24:55Z"
  generation: 1
  name: default
  namespace: abc
  resourceVersion: "376134"
  uid: f9dfe380-81ee-4edd-845d-49ba2c856e81
spec:
  egress:
  - dscp: 46
    dstCIDR: 3.16.78.227/32
  - dscp: 30
    dstCIDR: 0.0.0.0/0
status: {} 

3. check crd egressqos

% oc get crd egressqoses.k8s.ovn.org -o yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.8.0
  creationTimestamp: "2024-04-29T05:23:12Z"
  generation: 1
  name: egressqoses.k8s.ovn.org
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Network
    name: cluster
    uid: 3bfac7ab-ca29-477f-a97f-27592b7e176d
  resourceVersion: "3642"
  uid: 25dabf13-611f-4c29-bf22-4a0b56e4b7f7
spec:
  conversion:
    strategy: None
  group: k8s.ovn.org
  names:
    kind: EgressQoS
    listKind: EgressQoSList
    plural: egressqoses
    singular: egressqos
  scope: Namespaced
  versions:
  - name: v1
    schema:
      openAPIV3Schema:
        description: EgressQoS is a CRD that allows the user to define a DSCP value
          for pods egress traffic on its namespace to specified CIDRs. Traffic from
          these pods will be checked against each EgressQoSRule in the namespace's
          EgressQoS, and if there is a match the traffic is marked with the relevant
          DSCP value.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            properties:
              name:
                pattern: ^default$
                type: string
            type: object
          spec:
            description: EgressQoSSpec defines the desired state of EgressQoS
            properties:
              egress:
                description: a collection of Egress QoS rule objects
                items:
                  properties:
                    dscp:
                      description: DSCP marking value for matching pods' traffic.
                      maximum: 63
                      minimum: 0
                      type: integer
                    dstCIDR:
                      description: DstCIDR specifies the destination's CIDR. Only
                        traffic heading to this CIDR will be marked with the DSCP
                        value. This field is optional, and in case it is not set the
                        rule is applied to all egress traffic regardless of the destination.
                      format: cidr
                      type: string
                    podSelector:
                      description: PodSelector applies the QoS rule only to the pods
                        in the namespace whose label matches this definition. This
                        field is optional, and in case it is not set results in the
                        rule being applied to all pods in the namespace.
                      properties:
                        matchExpressions:
                          description: matchExpressions is a list of label selector
                            requirements. The requirements are ANDed.
                          items:
                            description: A label selector requirement is a selector
                              that contains values, a key, and an operator that relates
                              the key and values.
                            properties:
                              key:
                                description: key is the label key that the selector
                                  applies to.
                                type: string
                              operator:
                                description: operator represents a key's relationship
                                  to a set of values. Valid operators are In, NotIn,
                                  Exists and DoesNotExist.
                                type: string
                              values:
                                description: values is an array of string values.
                                  If the operator is In or NotIn, the values array
                                  must be non-empty. If the operator is Exists or
                                  DoesNotExist, the values array must be empty. This
                                  array is replaced during a strategic merge patch.
                                items:
                                  type: string
                                type: array
                            required:
                            - key
                            - operator
                            type: object
                          type: array
                        matchLabels:
                          additionalProperties:
                            type: string
                          description: matchLabels is a map of {key,value} pairs.
                            A single {key,value} in the matchLabels map is equivalent
                            to an element of matchExpressions, whose key field is
                            "key", the operator is "In", and the values array contains
                            only "value". The requirements are ANDed.
                          type: object
                      type: object
                  required:
                  - dscp
                  type: object
                type: array
            required:
            - egress
            type: object
          status:
            description: EgressQoSStatus defines the observed state of EgressQoS
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}
status:
  acceptedNames:
    kind: EgressQoS
    listKind: EgressQoSList
    plural: egressqoses
    singular: egressqos
  conditions:
  - lastTransitionTime: "2024-04-29T05:23:12Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2024-04-29T05:23:12Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1 

 

Actual results:

egressqos status is not updated correctly

Expected results:

egressqos status should be updated once applied.

Additional info:

 % oc version
Client Version: 4.16.0-0.nightly-2024-04-26-145258
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.0-0.nightly-2024-04-26-145258
Kubernetes Version: v1.29.4+d1ec84a

https://github.com/openshift/origin/pull/28522 removes two tests related to HTTP/2 testing with the default certificate that are known to fail with HAProxy 2.8. We are reworking the tests as part of NE-1444 (HAProxy 2.8 bump).

This bug is a reminder that come OCP 4.16 GA we need to have reworked the tests so that they now pass with HAProxy 2.8 or, if not fixed, revert https://github.com/openshift/origin/pull/28522 which is why I'm marking this bug as a blocker. We do not want to ship 4.16 without reinstating the two tests.

The goal of removing the two tests in https://github.com/openshift/origin/pull/28522 is to allow us to make additional progress in https://github.com/openshift/router/pull/551 (which is our HAProxy 2.8 bump). With all tests passing in router#551 we can continue our assessment of HAProxy 2.8 by a) running the payload tests and b) creating an HAProxy 2.8 image that QE can use with their reliability test suite.

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

User Story:

As a HyperShift Engineer, I want to be able to:

  • measure the requests being sent by the HCCO to the management plane

so that I can achieve

  • effective quota limits on our management KAS throughput

As a HyperShift Engineer, I want to be able to:

  • identify the specific requests being sent by every component that talks to the management KAS by their GVR and verb

so that I can achieve

  • simple and effective root-causing and debugging of KAS throughput regressions
  • identification of areas to simplify and make more efficient

 

As a HyperShift Engineer, I want to be able to:

  • measure the API load on the management KAS by component and request type (GVR, verb, etc) over axes of time, release version, hyperscaler, etc

so that I can achieve

  • an understanding of trends over time, between environments, etc

Acceptance Criteria:

Description of criteria:

  • HCCO exposes metrics; management Prometheus ingests them
  • downscaled per-test, per-component API throughput metrics are exposed
  • said metrics are visualized in a UI for ease of consumption
  • said metrics can be validated by a server that can answer questions like "for this test, in this environment, on this release, is $amount of requests within reason or a regression?"

This does not require a design proposal.
This does not require a feature gate.

Failing conformance tests:

  • [sig-instrumentation][Late] OpenShift alerting rules [apigroup:image.openshift.io] should have a runbook_url annotation if the alert is critical [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
  • [sig-instrumentation][Late] OpenShift alerting rules [apigroup:image.openshift.io] should have a valid severity label [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 
  • [sig-instrumentation][Late] OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations [Skipped:Disconnected] [Suite:openshift/conformance/parallel] 

OpenShift conformance tests are flagging some alerts added by managed services as non-compliant.

 

Job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-osde2e-main-nightly-4.16-conformance-osd-aws/1770676944715649024

See comments for failure messages

Description of problem:

When using oc-mirror with the v2 format, it saves the .oc-mirror directory to the user's home directory by default, and the data is very large. There is currently no flag to specify the path for the logs; they should be saved to the working directory instead.

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202312011230.p0.ge4022d0.assembly.stream-e4022d0", GitCommit:"e4022d08586406f3a0f92bab1d3ea6cb8856b4fa", GitTreeState:"clean", BuildDate:"2023-12-01T12:48:12Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

run command : oc-mirror --from file://out docker://localhost:5000/ocptest --v2 --config config.yaml    

Actual results: 

The logs are saved to the user's home directory.

Expected results:

It would be better to have a flag to specify where to save the logs, or to use the working directory.

 

Description of the problem:

Looks like the ODF minimum disk size validation is set to 75 GB per node while it used to be ~25 GB.
The validation should only apply when ODF is enabled.

ODFMinDiskSizeGB int64 `envconfig:"ODF_MIN_DISK_SIZE_GB" default:"25"`

 

 

insufficient   ODF requirements: Insufficient resources to deploy ODF in compact mode. ODF requires a minimum of 3 hosts. Each host must have at least 1 additional disk of 75 GB minimum and an installation disk.
 

How reproducible:

always

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

While debugging a problem, I noticed some containers lack FallbackToLogsOnError.  This is important for debugging via the API.  Found via https://github.com/openshift/origin/pull/28547
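For reference, a minimal sketch of setting the termination message policy on a container so that its logs are used as the termination message when no message file is written, which is what makes failures debuggable via the API (pod name and image are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: example                                # hypothetical
spec:
  containers:
  - name: operator
    image: example.com/operator:latest         # hypothetical
    terminationMessagePolicy: FallbackToLogsOnError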

When using an autoscaling MachinePool with OpenStack, setting minReplicas=0 results in a nil pointer panic.

See HIVE-2415 for context.
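For illustration, a minimal sketch (assuming the usual Hive MachinePool shape; names and flavor are hypothetical) of the configuration that triggers the panic:

apiVersion: hive.openshift.io/v1
kind: MachinePool
metadata:
  name: example-cluster-worker         # hypothetical
  namespace: example-namespace         # hypothetical
spec:
  clusterDeploymentRef:
    name: example-cluster
  name: worker
  platform:
    openstack:
      flavor: m1.large                 # hypothetical flavor
  autoscaling:
    minReplicas: 0                     # the value that triggers the nil pointer panic
    maxReplicas: 3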

Description of problem:

 

Rule ocp4-cis-file-permissions-cni-conf returns a false negative result.
Per the CIS benchmark v1.4.0, the following command is used to check the Multus config on nodes:

 

$ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/etc/cni/net.d/*.conf"; done
600 /host/etc/cni/net.d/00-multus.conf
600 /host/etc/cni/net.d/00-multus.conf
600 /host/etc/cni/net.d/00-multus.conf
600 /host/etc/cni/net.d/00-multus.conf
600 /host/etc/cni/net.d/00-multus.conf
600 /host/etc/cni/net.d/00-multus.conf

Per the rule instructions, it checks /etc/cni/net.d/ on the node.
However, the Multus config on nodes is in /etc/kubernetes/cni/net.d/, not /etc/cni/net.d/:

 

$ oc debug node/hongli-az-8pzqq-master-0 -- chroot /host ls -ltr /etc/cni/net.d/
Starting pod/hongli-az-8pzqq-master-0-debug ...
To use host binaries, run `chroot /host`
total 8
-rw-r--r--. 1 root root 129 Nov  7 02:18 200-loopback.conflist
-rw-r--r--. 1 root root 469 Nov  7 02:18 100-crio-bridge.conflist
Removing debug pod ...
$ oc debug node/hongli-az-8pzqq-master-0 -- chroot /host ls -ltr /etc/kubernetes/cni/net.d/
Starting pod/hongli-az-8pzqq-master-0-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 2 root root  60 Nov  7 02:23 whereabouts.d
-rw-------. 1 root root 352 Nov  7 02:23 00-multus.conf
Removing debug pod ...

 

 

$  for node in `oc get node --no-headers|awk '{print $1}'`; do oc debug node/$node -- chroot /host ls -l /etc/kubernetes/cni/net.d/; done
Starting pod/hongli-az-8pzqq-master-0-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:23 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:23 whereabouts.d
Removing debug pod ...
Starting pod/hongli-az-8pzqq-master-1-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:23 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:23 whereabouts.d
Removing debug pod ...
Starting pod/hongli-az-8pzqq-master-2-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:23 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:23 whereabouts.d
Removing debug pod ...
Starting pod/hongli-az-8pzqq-worker-westus-2mx6t-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:38 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:38 whereabouts.d
Removing debug pod ...
Starting pod/hongli-az-8pzqq-worker-westus-9qhf5-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:38 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:38 whereabouts.d
Removing debug pod ...
Starting pod/hongli-az-8pzqq-worker-westus-bcdpd-debug ...
To use host binaries, run `chroot /host`
total 4
-rw-------. 1 root root 352 Nov  7 02:38 00-multus.conf
drwxr-xr-x. 2 root root  60 Nov  7 02:38 whereabouts.d
Removing debug pod ...

 

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-11-05-194730

How reproducible:

Always

Steps to Reproduce:

1. $ for i in $(oc get pods -n openshift-multus -l app=multus -oname); do oc exec -n openshift-multus $i -- /bin/bash -c "stat -c \"%a %n\" /host/etc/cni/net.d/*.conf"; done
$for node in `oc get node --no-headers|awk '{print $1}'`; do oc debug node/$node -- chroot /host ls -l /etc/kubernetes/cni/net.d/; done

Actual results:

The rule checks the wrong path and returns FAIL.

Expected results:

The rule should check the right path and return PASS

Additional info:

This is applicable to both SDN and OVN.

https://issues.redhat.com/browse/RHEL-1671 introduces a "dns-changed" event that resolv-prepender should act on. Instead of reacting to a mix of "-change", "up", and other events, we now have a single event that clearly indicates the DNS configuration has changed.

By embedding this into our logic, we will greatly reduce the number of times our scripts are called.

It is important to check when exactly this change is going to ship so that we can synchronize our change with upstream NetworkManager.

 When collecting onprem events, we want to be able to distinguish among the various onprem deployments:

  • SaaS on console.redhat.com
  • An operator packaged with ACM
  • An operator deployed with MCE
  • Deployed via agent-based install (ABI)
  • Deployed via Podman *(unsupported)
  • Deployed as a stand-alone operator on Openshift *(unsupported)

We should also make sure to forward this info when collecting events.

We should also define a human-friendly version for each deployment type.

 

Slack thread about the supported deployment types

https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1706209886659329

primary_ipv4_address is deprecated in favor of primary_ip[*].address. Replace it with the new attribute.

Request for sending data via telemetry

The goal is to collect metrics about OpenShift Lightspeed because we want to understand how users are making use of the product (configuration options they enable) as well as the experience they are having when using it (e.g. response times).

selected_model_info

Represents the llm provider+model the customer is currently using

Labels

  • <provider> one of the supported provider names (e.g. openai, watsonx, azureopenai)
  • <model> one of the supported llm model names (e.g. gpt-3.5-turbo, granite-13b-chat-v2)

The cardinality of the metric is around 6 currently, may grow somewhat as we add supported providers+models in the future (not all provider + model combinations are valid, so it's not cardinality of models*providers)

model_enabled

Represents all the provider/model combinations the customer has configured in ols (but are not necessarily currently using)

Labels

  • <model>, one of the supported llm model names (e.g. gpt-3.5-turbo, granite-13b-chat-v2)
  • <provider>, one of our 3 supported providers (openai, watsonx, azureopenai)

The cardinality of the metric is around 4 currently since not all provider/model combinations are valid. May grow somewhat as we add supported models in the future.

rest_api_calls_total

number of api calls with path + response code

Labels

  • <status_code>, the http response code returned (e.g. 200, 401, 403, 500)
  • <path>, one of our request paths, e.g. /v1/query, /v1/feedback

cardinality is around 12 (paths times number of likely response codes)

Because the older path is referenced in the Prow workflow, we saw consistent failures for the /test versions job.

Description of problem:

The installer supports pre-rendering of the PerformanceProfile related manifests. However the MCO render is executed after the PerfProfile render and so the master and worker MachineConfigPools are created too late.

This causes the installation process to fail with:

Oct 18 18:05:25 localhost.localdomain bootkube.sh[537963]: I1018 18:05:25.968719       1 render.go:73] Rendering files into: /assets/node-tuning-bootstrap
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.008421       1 render.go:133] skipping "/assets/manifests/99_feature-gate.yaml" [1] manifest because of unhandled *v1.FeatureGate
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.013043       1 render.go:133] skipping "/assets/manifests/cluster-dns-02-config.yml" [1] manifest because of unhandled *v1.DNS
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.021978       1 render.go:133] skipping "/assets/manifests/cluster-ingress-02-config.yml" [1] manifest because of unhandled *v1.Ingress
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023016       1 render.go:133] skipping "/assets/manifests/cluster-network-02-config.yml" [1] manifest because of unhandled *v1.Network
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023160       1 render.go:133] skipping "/assets/manifests/cluster-proxy-01-config.yaml" [1] manifest because of unhandled *v1.Proxy
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023445       1 render.go:133] skipping "/assets/manifests/cluster-scheduler-02-config.yml" [1] manifest because of unhandled *v1.Scheduler
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.024475       1 render.go:133] skipping "/assets/manifests/cvo-overrides.yaml" [1] manifest because of unhandled *v1.ClusterVersion
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: F1018 18:05:26.037467       1 cmd.go:53] no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="

Version-Release number of selected component (if applicable):

4.14.0-rc.6

How reproducible:

Always

Steps to Reproduce:

1. Add an SNO PerformanceProfile to extra manifest in the installer. Node selector should be: "node-role.kubernetes.io/master="
2.
3.

Actual results:

no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="

Expected results:

Installation completes

Additional info:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
 name: openshift-node-workload-partitioning-sno
spec:
 cpu:
   isolated: 4-X <- must match the topology of the node
   reserved: 0-3
 nodeSelector:
   node-role.kubernetes.io/master: ""

Description of problem: missing `useEffect` hook dependency warning errors in the console UI dev env

    Version-Release number of selected component (if applicable): 4.15.0-0.ci.test-2024-02-28-133709-ci-ln-svyfg32-latest

    

How reproducible:


To reproduce:

Run `yarn lint --fix`

Description of problem:

tlsSecurityProfile definitions do not align with documentation.

When using `oc explain` the field descriptions note that certain values are unsupported, but the same values are supported in the OpenShift Documentation. 

This needs to be clarified and the spacing should be fixed in the descriptions as they are hard to understand.

Version-Release number of selected component (if applicable):

    4.14.1

How reproducible:

⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern   

Steps to Reproduce:

    1. Check the `oc explain` output

Actual results:

    ⇒ oc explain ingresscontroller.spec.tlsSecurityProfile.modern KIND:     IngressController VERSION:  operator.openshift.io/v1DESCRIPTION:      modern is a TLS security profile based on:      https://wiki.mozilla.org/Security/Server_Side_TLS#Modern_compatibility and      looks like this (yaml):      ciphers: - TLS_AES_128_GCM_SHA256 - TLS_AES_256_GCM_SHA384 -      TLS_CHACHA20_POLY1305_SHA256 minTLSVersion: TLSv1.3 NOTE: Currently      unsupported.   

Expected results:

    An output that aligns with the documentation regarding supported/unsupported TLS versions. Additionally, fixing the output format would be useful, as it is very hard to read in its current form.

Here in the 4.14 Documentation, it states:
```
The HAProxy Ingress Controller image supports TLS 1.3 and the Modern profile.
```

Additional info:

The `apiserver` CR should also be checked for the same thing.    
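For reference, a minimal sketch of selecting the Modern profile on the default IngressController, i.e. the setting whose description text is in question (whether this is actually supported is exactly what needs clarification):

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  tlsSecurityProfile:
    type: Modern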

What

Add tests for the hardcoded authorizer.

Why

This feature is specific to OpenShift and not part of the upstream project. Therefore it would be good to have an actual E2E test protecting this feature from being broken by an upstream bump.

Description of problem:

When deploying a compact 3-node cluster on GCP by setting mastersSchedulable to true and removing the worker machineset YAMLs, the installer panics.

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install 4.13.0-0.nightly-2022-12-04-194803
built from commit cc689a21044a76020b82902056c55d2002e454bd
release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. create manifests
2. set 'spec.mastersSchedulable' as 'true', in <installation dir>/manifests/cluster-scheduler-02-config.yml
3. remove the worker machineset YAML file from <installation dir>/openshift directory
4. create cluster 

Actual results:

Got "panic: runtime error: index out of range [0] with length 0".

Expected results:

The installation should succeed, or give a clear error message.

Additional info:

$ openshift-install version
openshift-install 4.13.0-0.nightly-2022-12-04-194803
built from commit cc689a21044a76020b82902056c55d2002e454bd
release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea
release architecture amd64
$ 
$ openshift-install create manifests --dir test1
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-1205a
? Pull Secret [? for help] ******
INFO Manifests created in: test1/manifests and test1/openshift 
$ 
$ vim test1/manifests/cluster-scheduler-02-config.yml
$ yq-3.3.0 r test1/manifests/cluster-scheduler-02-config.yml spec.mastersSchedulable
true
$ 
$ rm -f test1/openshift/99_openshift-cluster-api_worker-machineset-?.yaml
$ 
$ tree test1
test1
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 26 files
$ 
$ openshift-install create cluster --dir test1
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Common Manifests from target directory 
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/gcp.TFVars({{{0xc000cf6a40, 0xc}, {0x0, 0x0}, {0xc0011d4a80, 0x91d}}, 0x1, 0x1, {0xc0010abda0, 0x58}, ...})
        /go/src/github.com/openshift/installer/pkg/tfvars/gcp/gcp.go:70 +0x66f
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1daff070, 0xc000cef530?)
        /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:479 +0x6bf8
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c78870, {0x1a777f40, 0x1daff070}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffc4c21413b?, {0x1a777f40, 0x1daff070}, {0x1dadc7e0, 0x8, 0x8})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffc4c21413b, 0x5})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:259 +0x125
main.runTargetCmd.func2(0x1dae27a0?, {0xc000c702c0?, 0x2?, 0x2?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:289 +0xe7
github.com/spf13/cobra.(*Command).execute(0x1dae27a0, {0xc000c70280, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3a500)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
$ 

 

 

    BMH is the custom resource that is used to add a new host. Agent on the
    other hand is created automatically when a host registers. Since there
    is a need to control agent labels the following agent label support was
    added:
    
    In order to add an entry that controls agent label, a new BMH annotation
    needs to be added.
    The annotation key is prefixed with the string
    'bmac.agent-install.openshift.io.agent-label.'.  The remainder of the
    annotation is considered the label key.
    The value of the annotation is a JSON dictionary with 2 possible keys.
    The key 'operation' can contain one of the values ["add","delete"], which
    mean that the label is either added or deleted.
    The dictionary key 'value' contains the label value.
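For illustration, a sketch of what such an annotation could look like on a BMH; the label key 'environment', its value, and the host name and namespace are hypothetical:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                       # hypothetical
  namespace: example-namespace         # hypothetical
  annotations:
    bmac.agent-install.openshift.io.agent-label.environment: '{"operation":"add","value":"production"}'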

Description of problem:

It was noticed that the openshift-hyperkube RPM, which is primarily (perhaps exclusively) used to install the kubelet in RHCOS or other environments, includes the kube-apiserver, kube-controller-manager, and kube-scheduler binaries. Those binaries are all built and used via container images, which as far as I can tell don't make use of the RPM.

Version-Release number of selected component (if applicable):

4.12 - 4.16

How reproducible:

100%

Steps to Reproduce:

    1. rpm -ql openshift-hyperkube on any node
    2.
    3.
    

Actual results:

# rpm -ql openshift-hyperkube
/usr/bin/hyperkube
/usr/bin/kube-apiserver
/usr/bin/kube-controller-manager
/usr/bin/kube-scheduler
/usr/bin/kubelet
/usr/bin/kubensenter

# ls -lah /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/hyperkube /usr/bin/kubensenter /usr/bin/kubelet
-rwxr-xr-x. 2 root root  945 Jan  1  1970 /usr/bin/hyperkube
-rwxr-xr-x. 2 root root 129M Jan  1  1970 /usr/bin/kube-apiserver
-rwxr-xr-x. 2 root root 114M Jan  1  1970 /usr/bin/kube-controller-manager
-rwxr-xr-x. 2 root root  54M Jan  1  1970 /usr/bin/kube-scheduler
-rwxr-xr-x. 2 root root 105M Jan  1  1970 /usr/bin/kubelet
-rwxr-xr-x. 2 root root 3.5K Jan  1  1970 /usr/bin/kubensenter

Expected results:

Just the kubelet and deps on the host OS, that's all that's necessary

Additional info:

My proposed change would be for people who care about keeping this slim to install `openshift-hyperkube-kubelet` instead.

Component Readiness has found a potential regression in [Jira:"Networking / router"] monitor test service-type-load-balancer-availability cleanup.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.16
Start Time: 2024-04-02T00:00:00Z
End Time: 2024-04-08T23:59:59Z
Success Rate: 94.67%
Successes: 213
Failures: 12
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 751
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Networking%20%2F%20router&confidence=95&environment=sdn%20upgrade-minor%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=sdn&network=sdn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-04-08%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-04-02%2000%3A00%3A00&testId=openshift-tests-upgrade%3A9bc4661b05ba13ed49d4c91f63899776&testName=%5BJira%3A%22Networking%20%2F%20router%22%5D%20monitor%20test%20service-type-load-balancer-availability%20cleanup&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=standard&variant=standard

The failure message that we're after here is

{  failed during cleanup
Get "https://api.ci-op-tgk1b3if-9d969.ci2.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-service-lb-test-xqptd": http2: client connection lost}

Looking at the sample runs, the failure is in monitortest e2e xml junit files, and it appears this one always happens after upgrade, but before conformance. Unfortunately that means we may not have reliable intervals during the time this occurs. It also means there's no excuse for a connection lost to the apiserver.

Example: this junit xml from this job run

 

The problem actually dates back to March 3, see attachment for the full list of job runs affected. Almost entirely Azure, entirely 4.16 (never happened prior as far as we can see back).

It occurs in a poll loop checking if a namespace exists after being deleted. Failure rate seems to be around 5% of the time on this specific job.

Description of problem:

With the implementation of bug https://issues.redhat.com/browse/MGMT-14527, we see that the vSphere plugin is degraded and it shows a pop-up asking to fill in the vCenter configuration details, which are then stored in cloud-provider-config. There are requirements on how those details should be entered in the pop-up, but there are no details about the format in which the customer should enter them.

Requirements from the bug

1. The UI should display the format in which the data is to be entered
2. A warning that if the configuration is saved, a new MachineConfig will be rolled out, which will lead to node reboots

Version-Release number of selected component (if applicable):

 4.13+

How reproducible:

    Steps to reproduce unavailable

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The vSphere plugin is degraded

Expected results:

Plugin should not be degraded

Additional info:

This would typically be seen in situations where clusters are upgraded from a lower version to a higher version.

The failure is fairly rare globally but some platforms seem to see it more often. Last night we happened to see it twice in 10 azure runs and aggregation failed on it. It appears to be a longstanding issue however.

The following test catches the problem

[sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator

And the error will show something similar to:

{  1 events happened too frequently

event happened 70 times, something is wrong: namespace/openshift-authentication-operator deployment/authentication-operator hmsg/16eeb8c913 - reason/OpenShiftAPICheckFailed "oauth.openshift.io.v1" failed with an attempt failed with statusCode = 503, err = the server is currently unable to handle the request From: 15:46:39Z To: 15:46:40Z result=reject }

This is quite severe for just 1 second. The intervals database shows occurrences of over 100.

Sippy's test page provides insight into what platforms see the problem more, and can be used to find job runs where this happens, but the runs from yesterday were:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1729512594592501760

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1729512598153465856

Description of problem:

When creating an ImageDigestMirrorSet with a conflicting mirrorSourcePolicy, no error is reported.

Version-Release number of selected component (if applicable):

% oc get clusterversion 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-14-100410   True        False         27m     Cluster version is 4.15.0-0.nightly-2024-01-14-100410

How reproducible:

always

Steps to Reproduce:

1. create an ImageContentSourcePolicy 

ImageContentSourcePolicy.yaml:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: ubi8repo
spec:
  repositoryDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi6/ubi-minimal
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example

2.After the mcp finish updating, check the /etc/containers/registries.conf update as expected

3.create an ImageDigestMirrorSet with conflicting mirrorSourcePolicy for the same source "registry.example.com/example"

ImageDigestMirrorSet-conflict.yaml: 
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: digest-mirror
spec:
  imageDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi8/ubi-minimal
    mirrorSourcePolicy: AllowContactingSource
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example
    mirrorSourcePolicy: NeverContactSource
   

Actual results:

3. The creation succeeds, but the MCP does not get updated and no relevant MachineConfig is generated.

The machine-config-controller log showed:
I0116 02:34:03.897335       1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: conflicting mirrorSourcePolicy is set for the same source "registry.example.com/example" in imagedigestmirrorsets and/or imagetagmirrorsets

Expected results:

3. It should report something like: there exists a conflicting mirrorSourcePolicy for the same source "registry.example.com/example" in the ICSP

Additional info:

    

Description of problem:

When running an agent-based installation with the arm64 or multi payload, after booting the ISO file, assisted-service raises the following error and the installation fails to start:

Openshift version 4.16.0-0.nightly-arm64-2024-04-02-182838 for CPU architecture arm64 is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-arm64-2024-04-02-182838' and CPU architecture 'arm64'" go-id=419 pkg=Inventory request_id=5817b856-ca79-43c0-84f1-b38f733c192f 

The same error appears when running the installation with the multi-arch build, as seen in assisted-service.log:

Openshift version 4.16.0-0.nightly-multi-2024-04-01-135550 for CPU architecture multi is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-multi-2024-04-01-135550' and CPU architecture 'multi'" go-id=306 pkg=Inventory request_id=21a47a40-1de9-4ee3-9906-a2dd90b14ec8 

Amd64 build works fine for now.

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

1. Create agent iso file with openshift-install binary: openshift-install agent create image with arm64/multi payload
2. Booting the iso file 
3. Track the "openshift-install agent wait-for bootstrap-complete" output and assisted-service log
    

Actual results:

 The installation can't start with error

Expected results:

 The installation is working fine

Additional info:

assisted-service log: https://docs.google.com/spreadsheets/d/1Jm-eZDrVz5so4BxsWpUOlr3l_90VmJ8FVEvqUwG8ltg/edit#gid=0

Job fail url: 
multi payload: 
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14/1774134780246364160

arm64 payload:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-arm64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1773354788239446016

Description of problem:

Checked with 4.15.0-0.nightly-2023-12-11-033133: the server does not have PodMetrics/NodeMetrics.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2023-12-11-033133   True        False         122m    Cluster version is 4.15.0-0.nightly-2023-12-11-033133

$ oc api-resources | grep -i metrics
nodes                                                                                                                        metrics.k8s.io/v1beta1                        false        NodeMetrics
pods                                                                                                                         metrics.k8s.io/v1beta1                        true         PodMetrics

$ oc explain PodMetrics
the server doesn't have a resource type "PodMetrics"
$ oc explain NodeMetrics
the server doesn't have a resource type "NodeMetrics"

$ oc get NodeMetrics
error: the server doesn't have a resource type "NodeMetrics"
$ oc get PodMetrics -A
error: the server doesn't have a resource type "PodMetrics"

no issue with 4.14.0-0.nightly-2023-12-11-135902

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-12-11-135902   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-12-11-135902

$ oc api-resources | grep -i metrics
nodes                                                                                                                        metrics.k8s.io/v1beta1                        false        NodeMetrics
pods                                                                                                                         metrics.k8s.io/v1beta1                        true         PodMetrics

$ oc explain PodMetrics
GROUP:      metrics.k8s.io
KIND:       PodMetrics
VERSION:    v1beta1

DESCRIPTION:
    PodMetrics sets resource usage metrics of a pod.
...

$ oc explain NodeMetrics
GROUP:      metrics.k8s.io
KIND:       NodeMetrics
VERSION:    v1beta1

DESCRIPTION:
    NodeMetrics sets resource usage metrics of a node.
...

$ oc get PodMetrics -A
NAMESPACE                                          NAME                                                                       CPU    MEMORY      WINDOW
openshift-apiserver                                apiserver-65f777466-4m8nj                                                  9m     297512Ki    5m0s
openshift-apiserver                                apiserver-65f777466-g7n72                                                  10m    313308Ki    5m0s
openshift-apiserver                                apiserver-65f777466-xzd8l                                                  12m    293008Ki    5m0s
openshift-apiserver-operator                       openshift-apiserver-operator-54945b8bbd-bxkcj                              3m     119264Ki    5m0s
...

$ oc get NodeMetrics
NAME                                        CPU     MEMORY      WINDOW
ip-10-0-20-163.us-east-2.compute.internal   765m    8349848Ki   5m0s
ip-10-0-22-189.us-east-2.compute.internal   388m    5363132Ki   5m0s
ip-10-0-41-231.us-east-2.compute.internal   1274m   7243548Ki   5m0s
... 

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2023-12-11-033133

How reproducible:

always

Steps to Reproduce:

1. see the description

Actual results:

4.15 server does not have PodMetrics/NodeMetrics

Expected results:

The 4.15 server should have the PodMetrics/NodeMetrics resources

David mentioned this issue here: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702312628947029

 

 duplicated_event_patterns: I think it creates a blackout range (events are allowed during time X) and then checks the time range itself, but it does not appear to exclude the change in counts?
 

For the pathological event test calculation, the count of the last event within the allowed range should be subtracted from the count of the first event that is outside of the allowed time range.

 

David has a demonstration of the count here: https://github.com/openshift/origin/pull/28456, but to fix it you have to invert testDuplicatedEvents to iterate through the event registry, not the events (see the sketch below).
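
A minimal sketch of that count adjustment follows; observedEvent and adjustedCount are hypothetical stand-ins for entries in the event registry, not the actual origin test code:

package main

import (
	"fmt"
	"time"
)

// observedEvent is a hypothetical stand-in for an entry in the event registry:
// the same event message seen repeatedly, with a running count at each sighting.
type observedEvent struct {
	at    time.Time
	count int
}

// adjustedCount subtracts the count of the last sighting inside the allowed
// (blackout) range from the count of the first sighting outside of it, so that
// only repeats that happened outside the allowed window count as pathological.
func adjustedCount(events []observedEvent, allowedFrom, allowedTo time.Time) int {
	lastAllowed := 0
	firstOutside := -1
	for _, e := range events {
		switch {
		case !e.at.Before(allowedFrom) && !e.at.After(allowedTo):
			lastAllowed = e.count
		case e.at.After(allowedTo) && firstOutside < 0:
			firstOutside = e.count
		}
	}
	if firstOutside < 0 {
		return 0 // every sighting happened inside the allowed range
	}
	return firstOutside - lastAllowed
}

func main() {
	start := time.Now()
	events := []observedEvent{
		{at: start.Add(1 * time.Minute), count: 5},
		{at: start.Add(2 * time.Minute), count: 12}, // last sighting inside the allowed range
		{at: start.Add(4 * time.Minute), count: 15}, // first sighting outside of it
	}
	fmt.Println(adjustedCount(events, start, start.Add(3*time.Minute))) // prints 3
}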

 

 

For ingress controllers that are exposed via load balancers, there are considerations for external and internal publishing scope. This requests the ability for providers to specify the LB scope on the HostedCluster at initial create time.

Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/644

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

Installing a cluster: during installation we hit timeout warnings and the installation got stuck in the "installing" state.

No more events appear in the event log and the installation is still in the same state after ~48 hours.
It looks like it is stuck forever.
test-infra-cluster-cfb47d07_608f175e-aa23-493d-8a5c-d5bcaf15468f(1).tar

Screencast from 2024-03-15 21-05-34.webm

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

 

As the team membership has changed, this is a placeholder issue used to track OWNERS file changes for the different projects, as required by the CI.

Description of problem:

1. The vSphere connection configuration modal stays unresponsive for a long time after the user updates the 'Virtual Machine Folder' value
2. The user is not able to update the configuration again while the changes are being applied

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-22-021321

How reproducible:

Always    

Steps to Reproduce:

1. Set up a cluster on vSphere
2. Try to update the 'Virtual Machine Folder' value in the vSphere connection configuration modal
3. Click 'Save'
    

Actual results:

3. The vSphere connection configuration modal stays in the Saving state for a very long time; the user can neither Close nor Cancel the changes.

When we check the backend, the changes are already in place in cm/cloud-provider-config.

Also, the user is not able to update the configuration again while the changes are being applied.

Expected results:

3. The user should be able to continue updating the values or to close the modal, since the modal is only exposed so the user can update the value; the user should not need to wait until everything has finished, especially since the changes already take effect in the backend.

Additional info:

 

When installation fails, status_info reports an incorrect status.

Most likely it is one of these two scenarios:

  • we are not notifying the status change right before the failure
  • we are changing the status after sending the event. This was OK before, as the scraper would likely run after the event, but now the order matters (see the sketch below).
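
A minimal sketch of the second scenario and of the fix direction; updateStatusInfo and emitEvent are hypothetical stand-ins rather than the real assisted-service code. The point is that the status must be persisted before the event is emitted, so the scraper sees the new value:

package main

import "fmt"

// updateStatusInfo and emitEvent are hypothetical stand-ins for the real
// status update and event notification in assisted-service.
func updateStatusInfo(status string) { fmt.Println("status_info =", status) }
func emitEvent(msg string)           { fmt.Println("event:", msg) }

func failInstallation(reason string) {
	// Buggy ordering: the event goes out first, so a scraper that reacts to
	// the event can still read the old status_info.
	//   emitEvent("installation failed: " + reason)
	//   updateStatusInfo("error: " + reason)

	// Safe ordering: persist the new status first, then notify.
	updateStatusInfo("error: " + reason)
	emitEvent("installation failed: " + reason)
}

func main() {
	failInstallation("timed out waiting for hosts")
}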

Description of problem:

Hypershift management clusters are using a network-load-balancer to route to their own openshift-ingress router pods for cluster ingress.
These NLBs are provisioned by the https://github.com/openshift/cloud-provider-aws.

The cloud-provider-aws uses the cluster-tag on the subnets to select the correct subnets for the NLB and the SecurityGroup adjustments.

On management clusters *all* subnets are tagged with the MC's cluster-id.
This can lead to the cloud-provider-aws possibly selecting the incorrect subnet, because ties between multiple subnets in an AZ are broken using a lexicographical comparison: https://github.com/openshift/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L3626

This can lead to a situation where a SecurityGroup only allows ingress from a subnet that is not actually part of the NLB; in this case the TargetGroup will not be able to correctly perform a health check in that AZ.

In certain cases this can lead to all targets reporting unhealthy, because the nodes hosting the ingress pods have the incorrect SecurityGroup rules.

In that case, routing to nodes that are part of the target group can select nodes that should not be chosen because they are not ready yet (or no longer ready), leading to problems when attempting to access management cluster services (e.g. the console).
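
A simplified sketch of the tie-breaking behaviour described above; subnet and pickOnePerAZ are hypothetical stand-ins, and the real cloud-provider-aws logic also considers subnet roles and tags:

package main

import (
	"fmt"
	"sort"
)

// subnet is a hypothetical, simplified view of an EC2 subnet that carries the
// management cluster's cluster tag.
type subnet struct {
	ID string
	AZ string
}

// pickOnePerAZ keeps a single tagged subnet per availability zone. When
// several tagged subnets share an AZ, the tie is broken lexicographically by
// subnet ID, so an unrelated (e.g. HCP) subnet with a "smaller" ID can win
// over the MC's own default subnet.
func pickOnePerAZ(subnets []subnet) map[string]subnet {
	chosen := map[string]subnet{}
	for _, s := range subnets {
		if cur, ok := chosen[s.AZ]; !ok || s.ID < cur.ID {
			chosen[s.AZ] = s
		}
	}
	return chosen
}

func main() {
	picked := pickOnePerAZ([]subnet{
		{ID: "subnet-0mc-default", AZ: "us-east-1a"},
		{ID: "subnet-0aaa-hcp", AZ: "us-east-1a"}, // lexicographically smaller, wins the tie
	})
	azs := make([]string, 0, len(picked))
	for az := range picked {
		azs = append(azs, az)
	}
	sort.Strings(azs)
	for _, az := range azs {
		fmt.Println(az, "->", picked[az].ID)
	}
}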

Version-Release number of selected component (if applicable):

4.14.z & 4.15.z

How reproducible:

Most MCs that are using NLBs will have some of the SecurityGroups misconfigured.

Steps to Reproduce:

1. Have the cloud-provider-aws update the NLB while there are subnets in an AZ with lexicographically smaller names than the MC's default subnets - this can lead to the other subnets being chosen instead.
2. Check the security groups to see whether the source CIDRs are incorrect.

Actual results:

SecurityGroups can have incorrect source CIDRs for the MC's own NLB.

Expected results:

The MC should only tag its own subnets with the MC's cluster-id, so that the subnet selection of the cloud-provider-aws is not affected by the HCP subnets in the same availability zones.

Additional info:

Related OHSS ticket from SREP: https://issues.redhat.com/browse/OSD-20289
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create bootstrap: unsupported configuration: No emulator found for arch 'x86_64'
               

Description of problem:

The bootstrap process failed because API_URL and API_INT_URL are not resolvable:

Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up

install logs:
...
time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
...


    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165

    

How reproducible:


Always.
    

Steps to Reproduce:

    1. Enable custom DNS on GCP: set platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade in the install-config
    2. Create cluster
    3.
    

Actual results:

Failed to complete bootstrap process.
    

Expected results:

See description.

    

Additional info:

I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969

This looks very much like a 'downstream a thing' process, but it only modifies an existing one.

Currently, the operator-framework-olm monorepo generates a self-hosting catalog from operator-registry.Dockerfile.  This image also contains cross-compiled opm binaries for Windows and macOS, and joins the payload as ose-operator-registry.

To separate concerns, this introduces a new operator-framework-cli image which will be based on scratch, not self-hosting in any way, and is just a container to convey repeatably produced operator-framework CLIs.  Right now, this will focus on opm for OLM v0 only, but others can be added in the future.

 

Slack discussion here: https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1702394712492839

  Repo: openshift/kubernetes/pkg/controller/podautoscaler
    MISSING: jkyros, joelsmith
  Repo: openshift/kubernetes-autoscaler/vertical-pod-autoscaler
    MISSING: jkyros
  Repo: openshift/vertical-pod-autoscaler-operator/
    MISSING: jkyros

The openshift/kubernetes one was the only really unusual one where there might not be a precedent. Looking at the kubernetes repo, it looks like the convention is to add a DOWNSTREAM_APPROVERS file as a carry patch for downstream approvers?

 

Description of the problem:
OCI external platform should be shown as Tech Preview when OCP 4.14 is selected.
 

https://redhat-internal.slack.com/archives/C04RBMZCBGW/p1711029226861489

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

After fixing https://issues.redhat.com/browse/OCPBUGS-29919 by merging https://github.com/openshift/baremetal-runtimecfg/pull/301, we have lost the ability to properly debug the Node IP selection logic used in runtimecfg.

In order to preserve the debuggability of this component, it should be possible to selectively enable verbose logs.
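
A minimal sketch of what selectively enabling verbose logs could look like, assuming a --verbose flag and sirupsen/logrus; this is an illustration only and not necessarily how runtimecfg actually wires its logging:

package main

import (
	"flag"

	log "github.com/sirupsen/logrus"
)

func main() {
	verbose := flag.Bool("verbose", false, "enable verbose (debug) logging")
	flag.Parse()

	if *verbose {
		log.SetLevel(log.DebugLevel)
	}

	// Hypothetical example of the kind of detail that is only interesting
	// while debugging the Node IP selection logic.
	candidate := "192.0.2.10"
	log.Debugf("considering candidate node IP %s", candidate)
	log.Infof("selected node IP %s", candidate)
}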