Back to index

4.19.0-0.nightly-2025-03-15-150336

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.18.5

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

The admin console's alert rule details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

 

Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-257 are completed.

Background

Technical debt led to a majority of the alerting components and all routing components being placed in a single file. This made the file very large, difficult to work with, and confusing for anyone trying to change the routing logic.

Outcomes

Alerting components have all been moved into their own separate files, and the routing for the monitoring plugin has been moved out of alerting.tsx into a routing-specific file.

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.

Description of problem:

  The operatorLogLevel reverts back to `Normal` automatically for the insightsoperator object

Version-Release number of selected component (if applicable):

    4.14,4.15,4.16,4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Edit the insightsoperator/cluster object:
$ oc edit insightsoperator/cluster
...
...
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal ====> Change it to Debug/Trace/TraceAll

    2. After 1 minute, check whether the change persists:
$ oc get insightsoperator/cluster -o yaml
...
...
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal   

  3. The oc CLI documentation states that the following logLevels are supported:

$ oc explain insightsoperator.spec.operatorLogLevel
GROUP:      operator.openshift.io
KIND:       InsightsOperator
VERSION:    v1

FIELD: operatorLogLevel <string>
ENUM:
    ""
    Normal
    Debug
    Trace
    TraceAll
DESCRIPTION:
    operatorLogLevel is an intent based logging for the operator itself.  It
    does not give fine grained control, but it is a simple way to manage coarse
    grained logging choices that operators have to interpret for themselves.
     Valid values are: "Normal", "Debug", "Trace", "TraceAll". Defaults to
    "Normal".     

Actual results:

    The logLevel is set back to Normal automatically.

Expected results:

    The logLevel should not be changed to Normal until manually modified.

Additional info:

 The same issue is observed with insightsoperator.spec.logLevel, where any logLevel other than Normal gets reverted.

Goal:
Track Insights Operator Data Enhancements epic in 2024


Description of problem:

    Context
OpenShift Logging is migrating from Elasticsearch to Loki. While the option to use Loki has existed for quite a while, the information about the end of Elasticsearch support has not been available until recently. With that information available now, we can expect more and more customers to migrate and hit the issue described in INSIGHTOCP-1927.
P.S. Note the bar chart in INSIGHTOCP-1927, which shows how frequently the related KCS article is linked in customer cases.
Data to gather
LokiStack custom resources (any name, any namespace)
Backports
The option to use Loki has been available since Logging 5.5, whose compatibility started at OCP 4.9. Considering the OCP life cycle, backports up to OCP 4.14 would be nice.
Unknowns
Since Logging 5.7, Logging supports installation of multiple instances in customer namespaces. The Insights Operator would have to look for the CRs in all namespaces, which poses the following questions:

What is the expected number of the LokiStack CRs in a cluster?
Should the Insights operator look for the resource in all namespaces? Is there a way to narrow down the scope?

The CR will contain the name of a customer namespace, which is sensitive information.
What is the API group of the CR? Is there a risk of LokiStack CRs in customer namespaces that would NOT be related to OpenShift Logging?



SME
Oscar Arribas Arribas
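
For reference, the LokiStack CR shipped by the Loki Operator is served by the loki.grafana.com API group, so a cluster-wide survey could look like this (an illustrative command, not the gatherer implementation):

$ oc get lokistacks.loki.grafana.com -A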

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

N/A    

Actual results:

    

Expected results:

    

Additional info:

    

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot

This feature tracks the Tech Preview implementation behind feature gate.

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as Tech Preview: provide docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.

Requirements (aka. Acceptance Criteria):

The feature should graduate to beta upstream before becoming TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

 

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs.

Out of Scope

CSI drivers development/support of this feature.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

This allows backup vendors to implement advanced features by taking snapshots of multiple volumes at the same time, a common use case in virtualisation.

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.
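
For illustration, a minimal sketch of a VolumeGroupSnapshot request using the upstream API linked above (the class name and label are hypothetical placeholders):

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: app-group-snapshot
  namespace: my-app
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass   # must reference a CSI driver that supports the feature
  source:
    selector:
      matchLabels:
        app: my-app   # all PVCs in the namespace carrying this label are snapshotted together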

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Add Volume Group Snapshots as Tech Preview. This is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

We will rely on the newly beta promoted feature. This feature is driver dependent.

This will need a new external-snapshotter rebase plus removal of the feature gate check in csi-snapshot-controller-operator. Clusters that are freshly installed or upgraded from an older release will have the group snapshot v1beta1 API enabled, plus support for it enabled in the snapshot-controller (and ship the corresponding external-snapshotter sidecar).

No opt-in, no opt-out.

OCP itself will not ship any CSI driver that supports it.

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

This is also a key requirement for backup and DR solutions, especially for OCP virt.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

External snapshotter rebase to the upstream version that includes the beta API.

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR / ODF
  • Documentation - STOR
  • QE - STOR / ODF
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Since we don't ship any driver with OCP that supports the feature, we need to have testing with ODF.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

No risk, behind feature gate

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

The benefits of crun are covered here: https://github.com/containers/crun

 

FAQ.:  https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

***Note -> making crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do that.
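
For reference, a sketch of how an admin can already select crun per MachineConfigPool today via a ContainerRuntimeConfig (the pool selector shown is illustrative); once crun is the default, the same mechanism can be used to opt a pool back to runc:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun-worker
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""   # targets the worker pool
  containerRuntimeConfig:
    defaultRuntime: crun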

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

iSCSI boot is supported in RHEL and since the implementation of OCPSTRAT-749 it's also available in RHCOS.

Customers require using this feature in different bare metal environments on-prem and cloud-based.

Assisted Installer implements support for it in Oracle Cloud Infrastructure (MGMT-16167) to support their bare metal standard "shapes".

This feature extends this support to make it generic and supported in the Agent-Based Installer, the Assisted Installer and in ACM/MCE.

Goals

Support iSCSI boot in bare metal nodes, including platform baremetal and platform "none".

Requirements

Assisted installer can boot and install OpenShift on nodes with iSCSI disks.

Agent-Based Installer can boot and install OpenShift on nodes with iSCSI disks.

MCE/ACM can boot and install OpenShift on nodes with iSCSI disks.

The installation can be done on clusters with platform baremetal and clusters with platform "none".

Epic Goal

Support booting from iSCSI using ABI starting with OCP 4.16.

 

The following PRs are the gaps between release-4.17 branch and master that are needed to make the integration work on 4.17.

https://github.com/openshift/assisted-service/pull/6665

https://github.com/openshift/assisted-service/pull/6603

https://github.com/openshift/assisted-service/pull/6661

 

The feature has to be backported to 4.16 as well. TBD - list all the PRs that have to be backported.

 

Instructions to test the AI feature with local env - https://docs.google.com/document/d/1RnRhJN-fgofnVSBTA6mIKcK2_UW7ihbZDLGAVHSdpzc/edit#heading=h.bf4zg53460gu

Why is this important?

  • Oracle has a client with disconnected env waiting for it - slack discussion

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • iSCSI boot is enabled on ocp >= 4.16

Dependencies (internal and external)

  1.  https://issues.redhat.com/browse/MGMT-16167 - AI support for iSCSI boot on OCI
  2. https://issues.redhat.com/browse/MGMT-17556 - AI generic support for iSCSI boot

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When testing the agent-based installer using iSCSI with dev-scripts https://github.com/openshift-metal3/dev-scripts/pull/1727 it was found that the installer was not able to complete the installation when using multiple hosts. This same problem did not appear when using SNO.

The iSCSI sessions from all the hosts to their targets work fine until coreos-installer is run, at which time (before reboot) the connection to the target is lost and coreos-installer fails:

Jan 09 16:12:23 master-1 kernel:  session1: session recovery timed out after 120 secs
Jan 09 16:12:23 master-1 kernel: sd 7:0:0:0: rejecting I/O to offline device
Jan 09 16:12:23 master-1 kernel: I/O error, dev sdb, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
Jan 09 16:12:23 master-1 installer[2937]: time="2025-01-09T16:12:23Z" level=info msg="\nError: syncing data to disk\n\nCaused by:\n    Input/output error (os error 5)\n\nResetting partition table\n"
Jan 09 16:12:23 master-1 installer[2937]: time="2025-01-09T16:12:23Z" level=warning msg="Retrying after error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/install-dir/master-f3c24588-2129-483f-9dfb-8a8fe332a4bf.ign --append-karg rd.iscsi.firmware=1 --append-karg ip=enp6s0:dhcp --copy-network /dev/sdb], Error exit status 1, LastOutput \"Error: syncing data to disk\n\nCaused by:\n    Input/output error (os error 5)\n\nResetting partition table\nError: syncing partition table to disk\n\nCaused by:\n    Input/output error (os error 5)\""

On the host it can be seen that the session shows as logged out:

Iface Name: default
		Iface Transport: tcp
		Iface Initiatorname: iqn.2023-01.com.example:master-1
		Iface IPaddress: [default]
		Iface HWaddress: default
		Iface Netdev: default
		SID: 1
		iSCSI Connection State: Unknown
		iSCSI Session State: FREE
		Internal iscsid Session State: Unknown

The problem occurs because the iscsid service is not running. If it is started by iscsiadm, then coreos-installer can successfully write the image to disk.
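
A minimal sketch of the workaround on an affected host, assuming systemd manages iscsid in the live environment (unit availability and session numbering may differ):

# start the iSCSI daemon so the session can recover before coreos-installer writes the image
$ sudo systemctl start iscsid
# confirm the session is logged in again
$ sudo iscsiadm -m session -P 1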

Epic Goal*

Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.18 cannot be released without Kubernetes 1.31

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)

Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web

Feature description

oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:

  • Manage complex air-gapped scenarios, providing support for the enclaves feature
  • Faster and more robust: introduces caching and doesn't rebuild catalogs from scratch
  • Improves code maintainability, making it more reliable and easier to add features and fixes, and includes a feature plugin interface

 

There was a selected-bundle feature in v2 that needs to be removed in 4.18 because of its risk.

An alternative solution is required to unblock one of our customers.

Feature Overview (aka. Goal Summary)  

Goals (aka. expected user outcomes)

Simplify debugging when a cluster fails to update to a new target release image, when that release image is unsigned or otherwise fails to pull.

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>
– Kubelet/CRI-O to verify RH images & release payload sigstore signatures
– ART will add sigstore signatures to core OCP images

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

These acceptance criteria are for all deployment flavors of OpenShift.

Deployment considerations (list applicable specific needs; N/A = not applicable):
Self-managed, managed, or both: both
Classic (standalone cluster): yes
Hosted control planes: no
Multi node, Compact (three node), or Single node (SNO), or all:
Connected / Restricted Network:
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x):
Operator compatibility:
Backport needed (list applicable versions): Not Applicable
UI need (e.g. OpenShift Console, dynamic plugin, OCM): none
Other (please specify):

 

Documentation Considerations

Add documentation for sigstore verification and gpg verification

Interoperability Considerations

For folks mirroring release images (e.g. disconnected/restricted-network):

  • oc-mirror needs to support sigstore mirroring (OCPSTRAT-1417).
  • Customers using BYO image registries need to support hosting sigstore signatures.
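
For orientation, sigstore verification of release images is configured with a ClusterImagePolicy (referenced below via OTA-1304); a minimal sketch of its shape, with the scope and key material as placeholders and field names following the config.openshift.io/v1alpha1 API:

apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: openshift-release-sigstore          # hypothetical name
spec:
  scopes:
    - quay.io/openshift-release-dev/ocp-release
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: <base64-encoded sigstore public key>   # placeholder
    signedIdentity:
      matchPolicy: MatchRepoDigestOrExact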

Epic Goal

Currently the CVO launches a Job and waits for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so the goal of this Epic is to drop it and create and monitor the Pod ourselves, so we can deliver better reporting of version-Pod state.

Why is this important?

When the version Pod fails to run, the cluster admin will likely need to take some action (clearing the update request, fixing a mirror registry, etc.). The more clearly we share the issues that the Pod is having with the cluster admin, the easier it will be for them to figure out their next steps.

Scenarios

oc adm upgrade and other ClusterVersion status UIs will be able to display Init:SignatureValidationFailed and other version-Pod failure modes directly. We don't expect to be able to give ClusterVersion consumers more detailed next-step advice, but hopefully the easier access to failure-mode context makes it easier for them to figure out next-steps on their own.

Dependencies

This change is purely an updates-team/OTA CVO pull request. No other dependencies.

Contributing Teams

  • Development - OTA
  • Documentation - OTA
  • QE - OTA

Acceptance Criteria

Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures (with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current Job was active longer than specified deadline.

Drawbacks or Risk

Limited audience, and failures like Init:SignatureValidationFailed are generic, while CVO version-Pod handling is pretty narrow. This may be redundant work if we end up getting nice generic init-Pod-issue handling like RFE-5627. But even if the work ends up being redundant, thinning the CVO stack by removing the Job controller is kind of nice.

Done - Checklist

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Currently the CVO launches a Job and waits for it to complete to get manifests for an incoming release payload.  But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod.  The Job controller doesn't seem like it's adding much value here, so we probably want to drop it and create and monitor the Pod ourselves.

Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures (with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current Job was active longer than specified deadline.

Not clear to me what we want to do with reason, which is currently DeadlineExceeded. Keep that? Split out some subsets like SignatureValidationFailed and whatever we get for image-pull-failures? Other?
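
For context, the version Pod can already be inspected directly today; something like the following illustrative commands is where states such as Init:SignatureValidationFailed surface (the pod name varies per payload):

$ oc -n openshift-cluster-version get pods                      # the version-* retrieval pod appears here
$ oc -n openshift-cluster-version describe pod <version-pod>    # init container status carries the failure reason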

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

See the UDN Sync Meeting notes: https://docs.google.com/document/d/1wjT8lSDfG6Mj-b45_p9SlbpCVG0af-pjDlmAUnM5mOI/edit#bookmark=id.zemsagq71ou2

In our current UDN API, the subnets field is always mandatory for the primary role and optional for the secondary role. This is because users are allowed to have a pure L2 network without subnets for secondary networks. However, if we want to add egress support on secondary networks in the future, we might need subnets...

CNV has many different use cases:

  1. For UDPNs, we always need subnets for L2 and L3. Why not make them optional and let users get default values? The drawback is losing visibility: the pod subnet could then conflict with other internal subnets, and customers, unaware of the default, hit the "oopsy" stage. We have seen this plenty with joinSubnets already.
  2. For UDSNs, we may or may not need IPAM. Today this subnets field is optional, but when we do need subnets we cannot set default values here, so it's awkward.

This card tracks the design changes to the API and the code changes needed to implement this. See https://github.com/openshift/enhancements/pull/1708 for details.
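
For illustration, a primary Layer2 UserDefinedNetwork with the currently mandatory subnets field looks roughly like this sketch (shape per the k8s.ovn.org/v1 CRD; names and CIDR are placeholders):

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: primary-udn
  namespace: my-project
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
      - 10.100.0.0/16   # mandatory today for the primary role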

Feature Overview

Enable sharing ConfigMaps and Secrets across namespaces

Requirements

Requirement (isMvp? YES): Secrets and ConfigMaps can get shared across namespaces

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admins to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
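
As later implemented by the Shared Resource CSI driver (see the epic below), sharing the entitlement secret is expressed with a cluster-scoped SharedSecret object, roughly like this sketch (names are illustrative):

apiVersion: sharedresource.openshift.io/v1alpha1
kind: SharedSecret
metadata:
  name: shared-etc-pki-entitlement
spec:
  secretRef:
    name: etc-pki-entitlement            # entitlement secret synced by the Insights Operator
    namespace: openshift-config-managed  # namespace where that secret lives

Consumers then mount it through a CSI volume using the csi.sharedresource.openshift.io driver, subject to RBAC on the SharedSecret.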

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal*

Remove the Shared Resource CSI Driver as a tech preview feature.
 
Why is this important? (mandatory)

Shared Resources was originally introduced as a tech preview feature in OpenShift Container Platform. After extensive review, we have decided to GA this component through the Builds for OpenShift layered product.

Expected GA will be alongside OpenShift 4.16. Therefore it is safe to remove in OpenShift 4.17

 
Scenarios (mandatory)

  1. Accessing RHEL content in builds/workloads
  2. Sharing other information across namespaces in the cluster (ex: OpenShift pull secret) 

 
Dependencies (internal and external) (mandatory)

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - OpenShift Storage, OpenShift Builds (#forum-openshift-builds)
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • Shared Resource CSI driver cannot be installed using OCP feature gates/tech preview feature set.

Drawbacks or Risk (optional)

  • Using Shared Resources requires installation of a layered product, not part of OCP core.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

We need to do a lot of R&D and fix some known issues (e.g., see linked BZs). 

 

R&D targeted at 4.16 and productisation of this feature in 4.17.

 

Goal
To make the current implementation of the HAProxy config manager the default configuration.

Objectives

  • Disable pre-allocation route blueprints
  • Limit dynamic server allocation
  • Provide customer opt-out
    • Offer customers a handler to opt out of the default config manager implementation.

 

The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.

Since multiple gaps were discovered, a feature gate will be needed to ensure the stability of OCP before the feature can be enabled by default.

Feature Overview (aka. Goal Summary)  

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

This builds on Phase 1, which incorporated the assets from different repositories to simplify asset management.

Background, and strategic fit

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has moved on, and CAPI is a better fit now, with better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To add support for generating Cluster and Infrastructure Cluster resources on Cluster API based clusters

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

The CAPI operator should ensure that, for clusters that are upgraded into a version of OpenShift supporting CAPI, a Cluster object exists in the openshift-cluster-api namespace with a name matching the infrastructure ID of the cluster.

The cluster spec should be populated with the reference to the infrastructure object and the status should be updated to reflect that the control plane is initialized.
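
A sketch of the Cluster object this controller would ensure exists, using AWS as an example platform (the infrastructure ID and provider API version are illustrative):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: <infrastructure-id>
  namespace: openshift-cluster-api
spec:
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: <infrastructure-id>
    namespace: openshift-cluster-api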

Steps

  • Extend the existing cluster controller to manage the Cluster resource within CAPI operator
  • Ensure that on supported platforms it populates a Cluster object for the cluster
  • Add documentation to the CAPI operator to describe the controller and its operation
  • Add testing to track the operation of the controller
  • Ensure the controller does not interfere with Cluster resources that were not created by it

Stakeholders

  • Cluster Infra

Definition of Done

  • When I install a tech preview cluster, I should be able to `oc get cluster -n openshift-cluster-api` and have a result returned without any action on my part
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in customers' accounts (system components) would be scoped with Azure workload identities.

 

Requirements (aka. Acceptance criteria)

  • OpenShift hosted clusters use Microsoft's Managed and Workload identities

Problem

Today Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, and it risks access to resources by adversaries due to the lack of credential rotation.

Goal

Employ Azure managed credentials, which drastically reduce the steps required to just managed identity creation, permission granting, and resource deletion.

Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.  

Goal

  • Each controller within HyperShift must define the permissions required in Azure similar to what is defined for AWS 

Why is this important?

  • The ARO HCP service will have Azure built-in roles; these are the equivalent of AWS managed policies.

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • Each HyperShift controller has a defined set of Azure permissions

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently, we are using the Contributor role for all control plane identities on ARO HCP. We should go ahead and restrict the identities we already know there are existing ARO roles for.

Determine the Azure permissions needed to operate hosted clusters (control plane side). Specifically:

  • Control Plane Operator
  • CAPZ
  • KMS

Epic Goal

The Cluster Ingress Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the operator's deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Ingress Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

NE-1504

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

The Cluster Network Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the operator's deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Network Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

SDN-4450

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
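
For the two epics above, a minimal sketch of the kind of SecretProviderClass the Secrets Store CSI driver consumes to mount a Key Vault certificate (the Azure provider is assumed; vault, tenant, and object names are placeholders):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: operator-sp-cert
  namespace: <hosted-control-plane-namespace>
spec:
  provider: azure
  parameters:
    keyvaultName: example-keyvault            # hypothetical key vault
    tenantId: <azure-tenant-id>
    objects: |
      array:
        - |
          objectName: operator-sp-certificate  # certificate backing the service principal
          objectType: secret                   # fetch the cert+key bundle as a secret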

Feature Overview

  • An assistant to help developers in ODC edit configuration YML files

Goals

  • Perform an architectural spike to better assess feasibility and value of pursuing this further

Requirements

Requirements (isMvp? YES for both):
CI - MUST be running successfully with test automation. (This is a requirement for ALL features.)
Release Technical Enablement - Provide necessary release enablement details and documents.

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • Is there overlap with what other teams at RH are already planning? 

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • More details in the outcome parent RHDP-985

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem: As a user of OpenShift Lightspeed, I would like to import a YAML generated in the Lightspeed window into the OpenShift console's YAML editor.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  • Add an Import in YAML Editor action inside the OLS chat popup.
  • Along with the copy button we can add another button that imports the generated YAML data inside the YAML editor.
  • This action should also be able to redirect users to the YAML editor and then paste the generated YAML inside the editor.
  • We will need to create a new extension point that can help trigger the action from the OLS chat popup and a way to listen to any such triggers inside the YAML editor.
  • We also need to consider certain edge cases like - 
  • What happens if the user has already added something in the editor and then triggers the import action from OLS?
  • What happens when the user has imported a YAML from OLS and then regenerates it to modify something?

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.

Goal

Provide a network solution working out of the box, meeting expectations of a typical VM workload.

User Stories

  • As an owner of a VM that is connected only to a secondary overlay network, I want to fetch resources from outside networks (internet).
  • As a developer migrating my VMs to OCP, I do not want to change my application to support multiple NICs.
  • My application needs access to a flat network connecting it to other VMs and Pods.
  • I want to expose my selected applications over the network to users outside the cluster.
  • I'm limited by public cloud networking restrictions and I rely on their LoadBalancer to route traffic to my applications.
  • As a developer who defined a custom primary network in their project,
    I want to connect my VM to this new primary network, so it can utilize it for east/west/north/south, while still being able to connect to KAPI.

Non-Requirements

  • Service mesh integration is not a part of this
  • Seamless live-migration is not a must
  • UI integration is tracked in CNV-46603

Notes

Ensure the feature can be used on non dev|tech preview clusters.

Involves a PR to the OpenShift API - need to involve the API team.

This task requires periodic e2e tests in openshift/origin and openshift/ovn-kubernetes asserting the correct behavior of the gated feature. Currently focused on that (must add a virt-aware TP lane).

This is the script that decides if the FG can be GA or not:

https://github.com/openshift/api/blob/master/tools/codegen/cmd/featuregate-test-analyzer.go

Interesting Slack thread: https://redhat-internal.slack.com/archives/CE4L0F143/p1729612519105239?thread_ts=1729607662.848259&cid=CE4L0F143

Goal

Primary user-defined networks can be managed from the UI and the user flow is seamless.

User Stories

  • As a cluster admin,
    I want to use the UI to define a ClusterUserDefinedNetwork, assigned with a namespace selector.
  • As a project admin,
    I want to use the UI to define a UserDefinedNetwork in my namespace.
  • As a project admin,
    I want to be prompted to create a UserDefinedNetwork before I create any Pods/VMs in my new project.
  • As a project admin running VMs in a namespace with UDN defined,
    I expect the "pod network" to be called "user-defined primary network",
    and I expect that when using it, the proper network binding is used.
  • As a project admin,
    I want to use the UI to request a specific IP for my VM connected to UDN.

UX doc

https://docs.google.com/document/d/1WqkTPvpWMNEGlUIETiqPIt6ZEXnfWKRElBsmAs9OVE0/edit?tab=t.0#heading=h.yn2cvj2pci1l

Non-Requirements

  • <List of things not included in this epic, to alleviate any doubt raised during the grooming process.>

Notes

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

The purpose of this epic is to hold all the work required to integrate CAPI & CAPI providers (CAPA & CAPZ) into MCE.

Why is this important?

It is important for end users to be able to have lifecycle management for ROSA-HCP and ARO-HCP clusters.

Scenarios

...

Acceptance Criteria

...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Doc issue opened with a completed template. Separate doc issue
    opened for any deprecation, removal, or any current known
    issue/troubleshooting removal from the doc, if applicable.

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.

The following Insights APIs use duration attributes:

  • insightsoperator.operator.openshift.io
  • datagathers.insights.openshift.io

The kubebuilder validation patterns are defined as

^0|([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

and

^([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

Unfortunately this is not enough, and it fails when updating the resource with, for example, the value "2m0s".

The validation pattern must allow these values.
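
One possible relaxed pattern, as a sketch rather than the final validation: it anchors the whole alternation and allows zero-valued components such as "0s", so values round-tripped by the API server like "2m0s" pass:

^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$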

target

This task aims to upgrade the Insights Operator to Golang 1.23 (currently at 1.22).

caveats

Some libraries have introduced changes that are not compatible with the current code, and some fixes are required to fully update the dependencies.

It looks like the insights-operator doesn't work with IPv6; there are log errors like this:

E1209 12:20:27.648684   37952 run.go:72] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp: address fd01:0:0:5::6:8000: too many colons in address" 

It's showing up in metal techpreview jobs.

The URL isn't being constructed correctly; use net.JoinHostPort instead of fmt.Sprintf. Some more details here: https://github.com/stbenjam/no-sprintf-host-port. There's a non-default linter in golangci-lint for this.
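
A minimal sketch of the fix pattern (the variable values are illustrative, not the operator's actual code):

package main

import (
	"fmt"
	"net"
)

func main() {
	host := "fd01:0:0:5::6" // IPv6 address taken from configuration
	port := "8000"

	// Broken: yields "fd01:0:0:5::6:8000", which net.Listen rejects ("too many colons in address").
	bad := fmt.Sprintf("%s:%s", host, port)

	// Correct: brackets IPv6 literals, yielding "[fd01:0:0:5::6]:8000".
	good := net.JoinHostPort(host, port)

	fmt.Println(bad)
	fmt.Println(good)
}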

 

Component Readiness has found a potential regression in the following test:

[sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times

Test has a 56.36% pass rate, but 95.00% is required.

Sample (being evaluated) Release: 4.18
Start Time: 2024-12-02T00:00:00Z
End Time: 2024-12-09T16:00:00Z
Success Rate: 56.36%
Successes: 31
Failures: 24
Flakes: 0

View the test details report for additional context.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“In order to allow internal teams to define their dashboards in Perses, we as the Observability UI Team need to add support in the console to display Perses dashboards.”

Goals & Outcomes

Product Requirements:

  • The console dashboards plugin is able to render dashboards coming from Perses

 

Background

In order for customers or internal teams to troubleshoot better, they need to be able to see the dashboards created using Perses inside the OpenShift console. We will use the monitoring plugin, which already supports console dashboards coming from Grafana, to provide the Perses dashboard functionality.

Outcomes

Create a component in the monitoring plugin that can render a Perses dashboard based on the dashboard schema returned by the Perses API.

There are 2 dropdowns, one for selecting namespaces and another for selecting the dashboards in the selected namespace.

 

Steps

  1. Use the same dashboards route for the Perses dashboards component
  2. Use the monitoring-console-plugin proxy to call the Perses API, this was covered by https://issues.redhat.com/browse/OU-432
  3. Choose the component being rendered: Dashboards or Perses Dashboards if Perses is detected in the cluster
    1. Perses dashboards
      1. Add the namespace dropdown, visible only if Perses is detected on the cluster using an API call to the proxy
      2. Create a Perses: hello world component to be rendered in the dashboard page
      3. When selecting a Perses project on the dropdown, render the Perses hello world component. 
      4. When selecting the legacy namespace, render the current Dashboard component
      5. The implementation of the Perses component will be done in https://issues.redhat.com/browse/OU-618
    2. Dashboards. Keep the page as it is

 

Previous work

https://docs.google.com/presentation/d/1h7aRZkl5Kr9laXaBodxQv5IsLBZF06g0gigXbCqv9H4/edit#slide=id.g1dd06ee962a_0_4384

 

Create dashboard flow chart

Figma design

Background

In order to allow customers and internal teams to see dashboards created using Perses, we must add them as new elements on the current dashboard list

Outcomes

  • When navigating to Monitoring / Dashboards, Perses dashboards are listed with the current console dashboards. The extension point is backported to 4.14

Steps

  • COO (monitoring-console-plugin)
    • Add the Perses dashboards feature called "perses-dashboards" in the monitoring plugin.
    • Create a function to fetch dashboards from the Perses API
  • CMO (monitoring-plugin)
    • An extension point is added to inject the function to fetch dashboards from Perses API and merge the results with the current console dashboards

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

In order for customers or internal teams to troubleshoot better, they need to be able to see the dashboards created using Perses inside the OpenShift console. We will use the monitoring plugin which already supports console dashboards coming from Grafana, to provide the Perses dashboard functionality

Outcomes

Create a component in the monitoring plugin that can render a Perses dashboard based on the dashboard schema returned by the Perses API.

There are 2 dropdowns, one for selecting namespaces and another for selecting the dashboards in the selected namespace.

 

Steps

  1. From the Perses Hello world component created in OU-433
  2. Copy the dashboard selector, variables, time range and auto refresh from the current Dashboard component to Perses component
  3. Use the DashboardView component from Perses to render the perses dashboard, based on the Perses dashboard definition.
  4. Create a single PersesWrapper and include all the Perses providers like in the Distributed tracing plugin. Wrap also the variables in this provider.
  5. Use the Patternfly theme for Perses (ECharts) charts
  6. Adjust the Grid to use Patternfly grid instead of material UI Grid

 

Previous work

https://docs.google.com/presentation/d/1h7aRZkl5Kr9laXaBodxQv5IsLBZF06g0gigXbCqv9H4/edit#slide=id.g1dd06ee962a_0_4384

 

Create dashboard flow chart

Figma design

1. Proposed title of this feature request:

Insights Operator Entitlements for Multi-Arch Clusters

2. What is the nature and description of the request?

When the Insights Operator syncs the cluster's Simple Content Access (SCA) certificate, it should survey the architectures of the worker nodes on the cluster and enable repositories for the appropriate architectures.

3. Why does the customer need this? (List the business requirements here)

Konflux and other customers need this to produce multi-arch container images that install RHEL content via yum/dnf.

4. List any affected packages or components.

  • Insights Operator
  • Candlepin

Insights Operator periodically pulls down the entitlement certificates into the cluster. Technically it means there is an HTTP POST request to https://api.openshift.com/api/accounts_mgmt/v1/certificates. The payload of the POST request looks like:

`{"type": "sca","arch": "x86_64"}` 

And this is currently the limitation: it prevents builds from using the entitlement certificates on other architectures such as s390x, ppc64le, and arm64.

 

The https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-arch label can be used.
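
An illustrative way to survey the architectures present on a cluster's nodes using that label (not the operator's implementation):

$ oc get nodes -L kubernetes.io/arch
$ oc get nodes -o jsonpath='{.items[*].metadata.labels.kubernetes\.io/arch}' | tr ' ' '\n' | sort -u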

1. Proposed title of this feature request:

Insights Operator Entitlements for Multi-Arch Clusters

2. What is the nature and description of the request?

When the Insights Operator syncs the cluster's Simple Content Access (SCA) certificate, it should survey the architectures of the worker nodes on the cluster and enable repositories for the appropriate architectures.

3. Why does the customer need this? (List the business requirements here)

Konflux and other customers need this to produce multi-arch container images that install RHEL content via yum/dnf.

4. List any affected packages or components.

  • Insights Operator
  • Candlepin

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The IBM Cloud VPC IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing IBM Cloud VPC Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Replace Terraform infrastructure and machine (bootstrap, control plane) provisioning with CAPI-based approach.

Feature Overview

As a cluster-admin, I want to run update in discrete steps. Update control plane and worker nodes independently.
I also want to be able to back up and restore in case of a problematic upgrade.

 

Background:

This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.

  1. OTA-700 Reduce False Positives (such as Degraded) 
  2. OTA-922 - Better able to show the progress made in each discrete step 
  3. [Covered by the status command] Better visibility into any errors during upgrades, and documentation of what the errors mean and how to recover. 

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload-enabling upgrade (i.e., Router, other components) or infra nodes
  2. A user experience around an end-to-end back-up and restore after a failed upgrade 
  3. MCO-530 - Support in Telemetry for the discrete steps of upgrades 

References

Epic Goal

  • Eliminate the gap between measured availability and Available=true

Why is this important?

  • Today it's not uncommon, even for CI jobs, to have multiple operators which blip through either Degraded=True or Available=False conditions
  • We should assume that if our CI jobs do this then when operating in customer environments with higher levels of chaos things will be even worse
  • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
  • Since our product is self-hosted, we can reasonably expect that the instability that we experience on our platform workloads (kube-apiserver, console, authentication, service availability), will also impact customer workloads that run exactly the same way: we're just better at detecting it.

Scenarios

  1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
  2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of upgrade
  3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
  4. Address all identified issues

Acceptance Criteria

  • openshift/enhancements CONVENTIONS outlines these requirements
  • CI - Release blocking jobs include these new/updated tests
  • Release Technical Enablement - N/A; if we do this, we should need no docs
  • No outstanding identified issues

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in the query below; teams will be asked to address everything on it as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which items may be deferred to 4.10
    https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Tests in place
  • DEV - No outstanding failing tests
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".

Definition of done:

  • Same as OTA-362
  • File bugs or use the existing issues
  • If a bug exists, then add the tests to the exception list.
  • Unless tests are in the exception list, they should fail if we see Degraded != False.

Feature Overview

This feature aims to enable customers of OCP to integrate 3rd party KMS solutions for encrypting etcd values at rest in accordance with:

https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/
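
For orientation, the upstream mechanism referenced above configures the kube-apiserver with an EncryptionConfiguration that points at a KMS v2 plugin over a Unix socket. A minimal upstream-style sketch (the plugin name and socket path are placeholders, not an OpenShift API):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: example-kms-plugin                    # placeholder plugin name
          endpoint: unix:///var/run/kms/example.sock  # placeholder socket path
          timeout: 3s
      - identity: {}                                  # fallback so existing unencrypted data stays readable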

Goals

  • Bring KMS v2 API to beta|stable level
  • Create/expose mechanisms for customers to plug in containers/operators which can serve the API server's needs (can it be an operator, something provided via CoreOS layering, vanilla container spec provided to API server operator?)
  • Provide similar UX experience for all of self-hosted, hypershift, SNO scenarios
  • Provide example container/operator for the mechanism

General Prioritization for the Feature

  1. Approved design for detection & actuation for stand-alone OCP clusters.
    1. How to detect a problem like an expired/lost key and no contact with the KMS provider?
    2. How to inform/notify about the situation, even at the node level
  2. Tech Preview (Feature gated) enabling Kube-KMS v2 for partners to start working on KMS plugin provider integrations:
    1. Cloud: (priority Azure > AWS > Google)
      1. Azure KMS
      2. Azure Dedicated HSM
      3. AWS KMS
      4. AWS CloudHSM
      5. Google Cloud HSM
    2. On-premise:
      1. HashiCorp Vault
      2. EU FSI & EU Telco KMS/HSM top-2 providers
  3. GA after at least one stable KMS plugin provider

Scenario:

For an OCP cluster with external KMS enabled:

  • The customer loses the key to the external KMS 
  • The external KMS service is degraded or unavailable

How do the above scenario(s) impact the cluster? The API may become unavailable.

 

Goal:

  • Detection: The ability to detect these failure condition(s) and make it visible to the cluster admin.
  • Actuation: To what extent can we restore the cluster? (API availability, control plane operators). Recovering customer data is outside the scope.

 

Investigation Steps:

Detection:

  • How do we detect issues with the external KMS?
  • How do we detect issues with the KMS plugins?
  • How do we surface the information that an issue happened with KMS?
    • Metrics / Alerts? Will not work with SNO
    • ClusterOperatorStatus?

Actuation:

  • Is the control-plane self-recovering?
  • What actions are required for the user to recover the cluster partially/completely?
  • Complete: kube-apiserver? KMS plugin?
  • Partial: kube-apiserver? Etcd? KMS plugin?

User stories that might result in KCS:

  • KMS / KMS plugin unavailable
    • Is there any degradation? (most likely not with kms v2)
  • KMS unavailable and DEK not in cache anymore
    • Degradation will most likely occur, but what happens when the KMS becomes available again? Is the cluster self-recovering?
  • Key has been deleted and later recovered
    • Is the cluster self-recovering?
  • KMS / KMS plugin misconfigured
    • Is the apiserver rolled-back to the previous healthy revision?
    • Is the misconfiguration properly surfaced?

Plugins research:

  • What are the pros and cons of managing the plugins ourselves vs leaving that responsibility to the customer?
  • What is the list of KMS we need to support?
  • Do all the KMS plugins we need to use support KMS v2? If not reach out to the provider
  • HSMs?

POCs:

Acceptance Criteria:

  • Document the detection and actuation process in a KEP.
  • Generate new Jira work items based on the new findings.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Provide a long term solution to SELinux context labeling in OCP. Continue the implementation with RWO/RWX PVs which are the most expected from the field. Start with a TechPreview support grade.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

As of today, when SELinux is enabled, the PV's files are relabeled when the PV is attached to the pod. This can cause timeouts when the PV contains a lot of files, as well as overload the storage backend.

https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately, these workarounds are not perfect, and we need a long-term, seamless, optimized solution.

This feature tracks the long-term solution where the PV filesystem is mounted with the right SELinux context, thus avoiding relabeling every file. This covers RWO/RWX PVs; RWOP is already being implemented and should GA in 4.17.
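
As a minimal sketch of how a workload states its SELinux context today (with mount-time labeling the CSI volume can then be mounted with a matching -o context=... option instead of recursively relabeling files; all names below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: selinux-context-demo            # illustrative
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"             # with mount-time labeling, the volume inherits this context at mount
  containers:
    - name: app
      image: registry.access.redhat.com/ubi9/ubi-minimal
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-rwx-pvc         # illustrative RWX PVC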

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Should pass all regular regression CI. All the drivers we ship should have it enabled and partners may enable it if they want it to be consumed.

 

Performance should drastically improve, and security should remain the same as with the legacy chcon approach.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations: list applicable specific needs (N/A = not applicable)
Self-managed, managed, or both: both
Classic (standalone cluster): Y
Hosted control planes: Y
Multi node, Compact (three node), or Single node (SNO), or all: all
Connected / Restricted Network: both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
Operator compatibility: AWS EBS, Azure Disk, GCP PD, IBM VPC block, OSP cinder, vSphere
Backport needed (list applicable versions): no
UI need (e.g. OpenShift Console, dynamic plugin, OCM): no need
Other (please specify): 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. Apply new context when there is none
  2. Change context of all files/folders when changing context
  3. RWO & RWX PVs

As we are relying on the mount context, there should not be any relabeling (chcon), because all files/folders will inherit the context from the mount context.

More on design & scenarios in the KEP

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

RWOP PVs

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Lots of support cases are due to pods taking too long to start because of SELinux relabeling with chcon. This epic covers the most "unpopular" RWX case, especially for PVs with lots of files and backends that are slow at updating metadata.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Most cases/concerns are about RWX. RWOP was the first step and has limited customer impact, but it was easier to implement first and gather feedback/metrics. https://access.redhat.com/solutions/6221251

This feature tracks Tech Preview for RWO/RWX PVs.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Relnotes + table of drivers supporting it.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Partners may want to enable the feature.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Tracking upstream beta promotion of the SELinux context mounts for RWX/RWO PVs

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:
Seeing this error occasionally when using the agent-based installer in 4.19.0-ec.2. This causes the installation to fail.

Feb 24 19:56:00 master-0 node-image-pull.sh[5150]: Currently on CoreOS version 9.6.20250121-0
Feb 24 19:56:00 master-0 node-image-pull.sh[5150]: Target node image is quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3b38df9ee395756ff57e9fdedd5c366ac9b129330aea9d3b07f1431fe9bcbab1
Feb 24 19:56:00 master-0 ostree-containe[5317]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3b38df9ee395756ff57e9fdedd5c366ac9b129330aea9d3b07f1431fe9bcbab1
Feb 24 19:56:01 master-0 node-image-pull.sh[5317]: layers already present: 11; layers needed: 40 (691.0 MB)
Feb 24 19:56:01 master-0 ostree-containe[5317]: layers already present: 11; layers needed: 40 (691.0 MB)
Feb 24 19:56:01 master-0 node-image-pull.sh[5317]: error: Importing: Unencapsulating base: Layer sha256:2cda1fda6bb7b2171d8d65eb2f66101e513f33808a5e5ce0bf31fa0314e442ed: mkdirat: Read-only file system
Feb 24 19:56:01 master-0 node-image-pull.sh[5150]: Failed to fetch release image; retrying...

Version-Release number of selected component (if applicable):
openshift 4.19.0-ec.2

How reproducible: Happens occasionally and inconsistently.

Steps to Reproduce:
1. Create the agent-installer iso via 'openshift-install agent create image'
2. Boot the ISO on 3 hosts - compact cluster
3.

Actual results:
Two of the hosts install correctly; on the third host, which runs the bootstrap, we see the error above.

Expected results:
Installation succeeds.

We want to ship OpenShift 4.19 with CAPO synced to upstream 0.12.x.

Two tasks:
1. In openshift/release, update merge-bot so that CAPO main (current 4.19) syncs with upstream release-0.12 (instead of release-0.11)
2. In openshift/cluster-api-provider-openstack, wait for the bot to submit a sync PR, then work to get the PR passing CI and merged.

This depends on OSASINFRA-3717. Per the discussion here, once that is complete and ORC is built by ART, we will need to update the PR for 2 to override the image.

Condition of satisfaction: ART has produced an image of CAPO synced to upstream 0.12.x.

Feature Overview (aka. Goal Summary)  

  • Initial Tech Preview of Next-Gen OLM UX: This ticket introduces the initial Tech Preview release of the next-generation OLM (OLM v1) user experience in the console.
  • Unified Catalog UX: This ticket focuses on enabling a unified catalog UX in the console. This will allow customers to manage layered capabilities delivered through operators and partners' workloads, including OpenShift certified Helm charts, using the next-generation OLM (OLM v1) in the OpenShift console.
  • In-Cluster Catalog Content Service: This is enabled through the novel in-cluster efficient catalog content service designed in OCPSTRAT-1655 and delivered in the 4.18 timeframe.

Goals (aka. expected user outcomes)

In essence, customers can: 

  • discover collections of k8s extension/operator content released in the FBC format, with richer visibility into release channels, versions, update graphs, and deprecation information (if any), to make informed decisions about installing and/or updating them.
  • install a k8s extension/operator declaratively and potentially automate with GitOps to ensure predictable and reliable deployments.
  • update a k8s extension/operator to a desired target version or keep it updated within a specific version range for security fixes without breaking changes.
  • remove a k8s extension/operator declaratively and entirely including cleaning up its CRDs and other relevant on-cluster resources (with a way to opt out of this coming up in a later release). 

Requirements (aka. Acceptance Criteria):

1) Pre-installation:

  • Both cluster-admins or non-privileged end-users can explore and discover the layered capabilities or workloads delivered by k8s extensions/operators or plain helm charts from a unified ecosystem catalog UI in the ‘Administrator Perspective’ in the console.
  • Users can filter the available offerings based on the delivery mechanism/source type (i.e., operator-backed or plain helm charts), providers (i.e., from Red Hat or ISVs), valid subscriptions, infrastructure features, etc. 
  • Users can discover all versions in all channels that an offering/package defines in a catalog, select a version from a channel, and see its detailed description, provided APIs, and other metadata before the installation.

2) Installation

  • Users (who have access to OLM v1’s user-facing ‘ClusterExtension’ API) using a ServiceAccount with sufficient permissions can install a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes (see the sketch after this list).
  • Users can see the recommended installation namespace if provided by the package authors for installation.
  • Users get notified through error messages from the OLM API whenever two conflicting k8s extensions/operators own (or would own) the same API objects, i.e., conflicting ownership is rejected after the installation is triggered.
  • During the installation, users can see the installation progress reported from the ‘ClusterExtension’ API object.
  • After installation, users (who have access to OLM v1’s user-facing ‘ClusterExtension’ API) can access the metadata of the installed k8s extension/operator to see essential information such as its provided APIs, example YAMLs of its provided APIs, descriptions, infrastructure features, valid subscriptions, etc.
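
A hedged sketch of the declarative install flow described above, using OLM v1's ClusterExtension API (the package name, namespace, ServiceAccount, and version range are illustrative, and field names may differ slightly between OLM v1 releases):

apiVersion: olm.operatorframework.io/v1
kind: ClusterExtension
metadata:
  name: example-operator                   # illustrative
spec:
  namespace: example-operator-ns           # installation namespace
  serviceAccount:
    name: example-operator-installer       # ServiceAccount with sufficient permissions
  source:
    sourceType: Catalog
    catalog:
      packageName: example-operator
      version: ">=1.2.0 <2.0.0"            # stay within a range to pick up security fixes without breaking changes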

3) Update:

  • Users (who have access to OLM v1’s user-facing ‘ClusterExtension’ API) can see what updates are available for their k8s extensions/operators in the form of immediate target versions and the associated update channels.
  • Users can trigger the update of a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes.
  • Users get notified through error messages whenever a k8s extension/operator is prevented from updating to a newer version that has a backward incompatible CustomResourceDefinition (CRD) that will cause workload or k8s extension/operator breakage.
  • During an OpenShift cluster update, users get informed when installed k8s extensions/operators do not support the next OpenShift version (when annotated by the package author/provider). Customers must update those k8s extensions/operators to a newer/compatible version before OLM unblocks the OpenShift cluster update.
  • During the update, users can see the progress reported from the ‘ClusterExtension’ API object.

4) Uninstallation/Deletion:

  • Users are made aware that OLM v1, by default, cleanly removes an installed k8s extension/operator, including deleting CustomResourceDefinitions (CRDs), custom resource objects (CRs) of those CRDs, and other relevant resources, to revert the cluster to its original state before the installation.
  • Users can see a list of resources that are relevant to the installed k8s extension/operator they are about to remove and then explicitly confirm the deletion.

Questions to Answer (Optional):

  • What impact will the console's "perspective consolidation" initiative have on this?

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Our customers will experience a streamlined approach to managing layered capabilities and workloads delivered through operators, operators packaged in Helm charts, or even plain Helm charts.  The next generation OLM will power this central distribution mechanism within the OpenShift in the future. 

Customers will be able to explore and discover the layered capabilities or workloads, and then install those offerings and make them available on their OpenShift clusters.  Similar to the experience with the current OperatorHub, customers will be able to sort and filter the available offerings based on the delivery mechanism (i.e., operator-backed or plain helm charts), source type (i.e., from Red Hat or ISVs), valid subscriptions, infrastructure features, etc.  Once they click on a specific offering, they see the details, which include the description, usage, and requirements of the offering, the provided services and APIs, and the rest of the relevant metadata needed to make a decision.

The next-gen OLM aims to unify workload management.  This includes operators packaged for current OLM, operators packaged in Helm charts, and even plain Helm charts for workloads.  We want to build on the current support for managing plain Helm charts within OpenShift and the console to leverage our investment over the years.

Documentation Considerations

Refer to the “Documentation Considerations” section of the OLM v1 GA feature.

Relevant documents

 

The RFC written for https://github.com/operator-framework/catalogd/issues/426 identified a desire to formalize the catalogd web API, and divided work into a set of v1.0-blocking changes to enable versioned web interfaces ([phase 1](https://github.com/operator-framework/catalogd/issues/427)) and non-blocking changes to express and extend a formalized API specification (phase 2).

This epic is to track the design and implementation work associated with phase 2. During phase 1 RFC review we identified that we needed more work to capture the extensibility design but didn't want to slow progress on the v1.0 blocking changes so the first step should be an RFC to capture the design goals for phase 2, and then any implementation trackers we feel are necessary.

Work here will be behind a feature gate.

Downstreaming this feature

We need to follow this guide to downstream this feature: https://docs.google.com/document/d/1krN-4vwaE47aLRW9QjwD374-0Sh80taa_7goqVNi_-s/edit?tab=t.0#heading=h.9l63b4k04g9e 

Goal

The goals of this feature are:

  • As part of a Microsoft guideline/requirement for implementing ARO HCP, we need to design a shared ingress to the kube-apiserver, because MSFT has internal restrictions on IPv4 usage.

Background

Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.

 

Interoperability Considerations

  • Impact: Which versions will be impacted by the changes?
  • Test Scenarios: Must test across various network and deployment scenarios to ensure compatibility and scale (perf/scale)

There are currently multiple ingress strategies we support for hosted cluster service endpoints (KAS, NodePort, Router...).
In a context of uncertainty about which use cases would be more critical to support, we initially exposed this in a flexible API that allows potentially choosing any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above and to be more cost-effective in terms of infrastructure, we want a common shared-ingress solution for the whole fleet of hosted clusters.

Goal:
Graduate to GA (full support) Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.

Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The plug-able nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.

The team agrees that we should be running the upstream GWAPI conformance tests, as they are readily available and we are an integration product with GWAPI.

We need to answer these questions asked at the March 23, 2023 GWAPI team meeting:

Would it make sense to do it as an optional job in the cluster-ingress-operator?
Is OSSM running the Gateway Conformance test in their CI?
Review what other implementers do with conformance tests to understand what we should do (Do we fork the repo? Clone it? Make a new repo?)

Overview

Gateway API is the next generation of the Ingress API in upstream Kubernetes.

OpenShift Service Mesh (OSSM) and several other offerings of ours, like Kuadrant, MicroShift, and OpenShift AI, all have critical dependencies on Gateway API's API resources. However, even though Gateway API is an official Kubernetes project, its API resources are not available in the core API (like Ingress) and instead require the installation of Custom Resource Definitions (CRDs).

OCP will be fully in charge of managing the life-cycle of the Gateway API CRDs going forward. This will make the Gateway API a "core-like" API on OCP. If the CRDs are already present on a cluster when it upgrades to the new version where they are managed, the cluster admin is responsible for the safety of existing Gateway API implementations. The Cluster Ingress Operator (CIO)  enacts a process called "CRD Management Succession" to ensure the transfer of control occurs safely, which includes multiple pre-upgrade checks and CIO startup checks.

Acceptance Criteria

  • If not present the Gateway API CRDs should be deployed at the install-time of a cluster, and management thereafter handled by the platform
  • Any existing CRDs not managed by the platform should be removed, or management and control transferred to the platform
  • Only the platform can manage or make any changes to the Gateway API CRDs, others will be blocked
  • Documentation about these APIs, and the process to upgrade to a version where they are being managed needs to be provided

Cross-Team Coordination

The organization as a whole needs to be made aware of this as new projects will continue to pop up with Gateway API support over the years. This includes (but is not limited to)

  • OSSM Team (Istio)
  • Connectivity Link Team (Kuadrant)
  • MicroShift Team
  • OpenShift AI Team (KServe)

Importantly our cluster infrastructure work with Cluster API (CAPI) is working through similar dilemmas for CAPI CRDs, and so we need to make sure to work directly with them as they've already broken a lot of ground here. Here are the relevant docs with the work they've done so far:

What?

The purpose of this task is to provide API validation on OCP that blocks upgrades to Gateway API CRDs from all entities except the platform itself.

Why?

See the description of NE-1898.

How?

We will use a Validating Admission Policy (VAP) to block ALL actions on the Gateway API CRDs from ALL entities besides the Cluster Ingress Operator (CIO).

Blocking in the VAP should occur at the group level, meaning only the CIO is capable of creating or changing any CRDs across the entire group at any version. As such this VAP will block access to ALL Gateway API CRDs, not just the ones we use (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Note that this means experimental APIs (e.g. TCPRoute, UDPRoute, TLSRoute) and older versions of APIs (e.g. v1beta1.HTTPRoute) are restricted as well from creation/modification. The effect should be that only the standard versions of GatewayClass, Gateway, HTTPRoute, GRPCRoute and ReferenceGrant (at the time of writing, these fully represent the standard APIs) are present and nobody can modify those, or deploy any others.

This VAP should be deployed alongside the CIO manifests, such that it is deployed along with the CIO itself.
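
A minimal sketch of the shape such a policy could take (the policy name, CEL expressions, and service account shown are illustrative, not the shipped manifest; a matching ValidatingAdmissionPolicyBinding is also required):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: gateway-api-crd-guard                       # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apiextensions.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["customresourcedefinitions"]
  matchConditions:
    # Only evaluate CRDs in the Gateway API group; object is null on DELETE, so fall back to oldObject.
    - name: is-gateway-api-crd
      expression: "(object != null ? object : oldObject).spec.group == 'gateway.networking.k8s.io'"
  validations:
    - expression: "request.userInfo.username == 'system:serviceaccount:openshift-ingress-operator:ingress-operator'"
      message: "Gateway API CRDs may only be managed by the Cluster Ingress Operator."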

Prior Art

Example of a VAP restricting actions to a single entity: https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/pkg/cloud/azure/assets/validating-admission-policy.yaml

Helpful Links

Here's where the current operator manifests can be found: https://github.com/openshift/cluster-ingress-operator/tree/edf5e71e8b08ef23a4d8f0b3fee5630c66625967/manifests

What?

Add a new featuregate for OSSM installation, and move OSSM installation from the existing GatewayAPI feature gate to the new separate featuregate, so we have one featuregate only for CRDs and one featuregate only for installing OSSM. This will help us with staging component releases.

As a developer, I need a featuregate to develop behind so that the Gateway API work does not impact other development teams until tests pass.

The featuregate is currently in the DevPreviewNoUpgrade featureset. We need to graduate it to the TechPreviewNoUpgrade featureset to give us more CI signal and testing. Ultimately the featuregate needs to graduate to GA (default on) once tests pass so that the feature can GA.

See also:

 

Update the existing feature gate to enable Gateway API in clusters with either the DevPreviewNoUpgrade or TechPreviewNoUpgrade feature set.

Feature Overview (aka. Goal Summary)  

BYOPKI for image verification in OCP

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Support BYOPKI for image verification in OCP

Why is this important?

  • As an administrator of an independent org, I would like to verify our container images using our own CA. 

Scenarios

  1. Verify container images using own CA

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OpenShift developer, I want to extend the fields of the ClusterImagePolicy CRD for BYOPKI verification, so the containerruntimeconfig controller can roll out the configuration to policy.json for verification.
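
A hedged sketch of what such a policy could look like once a PKI root of trust is supported (the PKI policy type and its fields below are assumptions for illustration; the existing tech-preview API offers PublicKey and FulcioCAWithRekor roots of trust):

apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: example-byopki-policy                 # illustrative
spec:
  scopes:
    - registry.example.com/myorg              # illustrative registry scope
  policy:
    rootOfTrust:
      policyType: PKI                         # hypothetical new policy type for BYOPKI
      pki:                                    # hypothetical field; the shape is an assumption
        caRootsData: <base64-encoded CA bundle>
    signedIdentity:
      matchPolicy: MatchRepository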

Feature Overview (aka. Goal Summary)  

OVN Kubernetes BGP support as a routing protocol for User Defined Network (Segmentation) pod and VM addressability.

Goals (aka. expected user outcomes)

OVN-Kubernetes BGP support enables dynamically exposing cluster-scoped network entities into a provider’s network, as well as programming BGP-learned routes from the provider’s network into OVN.

OVN-Kubernetes currently has no native routing protocol integration, and relies on a Geneve overlay for east/west traffic, as well as third party operators to handle external network integration into the cluster. This enhancement adds support for BGP as a supported routing protocol with OVN-Kubernetes. The extent of this support will allow OVN-Kubernetes to integrate into different BGP user environments, enabling it to dynamically expose cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN. In a follow-on release, this enhancement will provide support for EVPN, which is a common data center networking fabric that relies on BGP.

Requirements (aka. Acceptance Criteria):

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • Integration with 3rdparty load balancers that send packets directly to OpenShift nodes with the destination IP address of a targeted pod, without needing custom operators to detect which node a pod is scheduled to and then add routes into the load balancer to send the packet to the right node.

Questions to Answer (Optional):

Out of Scope

  • Support of any other routing protocol
  • Running separate BGP instances per VRF network
  • Support for any other type of L3VPN with BGP, including MPLS
  • Providing any type of API or operator to automatically connect two Kubernetes clusters via L3VPN
  • Replacing the support that MetalLB provides today for advertising service IPs
  • Asymmetric Integrated Routing and Bridging (IRB) with EVPN

Background

BGP

Importing Routes from the Provider Network
Today in OpenShift there is no API for a user to configure routes into OVN. In order for a user to change how egress cluster traffic is routed, the user leverages local gateway mode, which forces egress traffic to hop through the Linux host's networking stack, where a user can configure routes inside the host via NMState. This manual configuration would need to be performed and maintained across nodes and VRFs within each node.

Additionally, if a user chooses to not manage routes within the host and use local gateway mode, then by default traffic is always sent to the default gateway. The only other way to affect egress routing is by using the Multiple External Gateways (MEG) feature. With this feature the user may choose to have multiple different egress gateways per namespace to send traffic to.

As an alternative, configuring BGP peers and which route-targets to import would eliminate the need to manually configure routes in the host, and would allow dynamic routing updates based on changes in the provider’s network.

Exporting Routes into the Provider Network
There exists a need for provider networks to learn routes directly to services and pods in Kubernetes today. MetalLB is already one solution whereby load balancer IPs are advertised via BGP to provider networks, and this feature development does not intend to duplicate or replace the function of MetalLB. MetalLB should be able to interoperate with OVN-Kubernetes, and be responsible for advertising services to a provider’s network.

However, there is an alternative need to advertise pod IPs on the provider network. One use case is integration with 3rd party load balancers, where they terminate a load balancer and then send packets directly to OCP nodes with the destination IP address being the pod IP itself. Today these load balancers rely on custom operators to detect which node a pod is scheduled to and then add routes into its load balancer to send the packet to the right node.

By integrating BGP and advertising the pod subnets/addresses directly on the provider network, load balancers and other entities on the network would be able to reach the pod IPs directly.

EVPN

Extending OVN-Kubernetes VRFs into the Provider Network
This is the most powerful motivation for bringing support of EVPN into OVN-Kubernetes. A previous development effort enabled the ability to create a network per namespace (VRF) in OVN-Kubernetes, allowing users to create multiple isolated networks for namespaces of pods. However, the VRFs terminate at node egress, and routes are leaked from the default VRF so that traffic is able to route out of the OCP node. With EVPN, we can now extend the VRFs into the provider network using a VPN. This unlocks the ability to have L3VPNs that extend across the provider networks.

Utilizing the EVPN Fabric as the Overlay for OVN-Kubernetes
In addition to extending VRFs to the outside world for ingress and egress, we can also leverage EVPN to handle extending VRFs into the fabric for east/west traffic. This is useful in EVPN DC deployments where EVPN is already being used in the TOR network, and there is no need to use a Geneve overlay. In this use case, both layer 2 (MAC-VRFs) and layer 3 (IP-VRFs) can be advertised directly to the EVPN fabric. One advantage of doing this is that with Layer 2 networks, broadcast, unknown-unicast and multicast (BUM) traffic is suppressed across the EVPN fabric. Therefore the flooding domain in L2 networks for this type of traffic is limited to the node.

Multi-homing, Link Redundancy, Fast Convergence
Extending the EVPN fabric to OCP nodes brings other added benefits that are not present in OCP natively today. In this design there are at least 2 physical NICs and links leaving the OCP node to the EVPN leaves. This provides link redundancy, and when coupled with BFD and mass withdrawal, it can also provide fast failover. Additionally, the links can be used by the EVPN fabric to utilize ECMP routing.

Customer Considerations

  • For customers using MetalLB, it will continue to function correctly regardless of this development.

Documentation Considerations

Interoperability Considerations

  • Multiple External Gateways (MEG)
  • Egress IP
  • Services
  • Egress Service
  • Egress Firewall
  • Egress QoS

 

Epic Goal

OVN Kubernetes support for BGP as a routing protocol.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CNO should deploy the new RouteAdvertisements OVN-K CRD.

When the OCP API flag to enable BGP support in the cluster is set, CNO should enable support on OVN-K through a CLI arg.
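
A hedged sketch of what that could look like on the cluster Network operator configuration (field names are based on the tech-preview route advertisements API and should be treated as assumptions here):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalRoutingCapabilities:
    providers:
      - FRR                        # assumption: FRR-based routing capability provider
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      routeAdvertisements: Enabled # assumption: the flag CNO translates into the OVN-K CLI arg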

Feature Overview (aka. Goal Summary)  

Customers in highly regulated environments are required to adopt strong ciphers. For control-plane components, this means all components must support the modern TLSProfile with TLS 1.3.

Note: Post-Quantum Cryptography support in TLS will ONLY be available with TLS 1.3, thus support is required. For more information about PQC support, see https://issues.redhat.com/browse/OCPSTRAT-1858

During internal discussions [1] for RFE-4992 and follow-up conversations, it became clear we need a dedicated CI job to track and validate that OpenShift control-plane components are aligned and functional with the TLSProfiles configured on the system.

[1] https://redhat-internal.slack.com/archives/CB48XQ4KZ/p1713288937307359 

Goals (aka. expected user outcomes)

  1. The resulting user outcome from this internal CI validation will be the support of TLS 1.3 in the OpenShift control-plane components.

Requirements (aka. Acceptance Criteria):

  • Create a CI job that would test that the cluster stays fully operational when enabling the Modern TLS profile
  • This requires updates to our docs

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations: list applicable specific needs (N/A = not applicable)
Self-managed, managed, or both: both
Classic (standalone cluster): yes
Hosted control planes: yes
Multi node, Compact (three node), or Single node (SNO), or all: all
Connected / Restricted Network: all
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
Operator compatibility: ??
Backport needed (list applicable versions): n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM): n/a
Other (please specify): unknown

 

Epic Goals

Support for TLS v1.3 by default.

CI validation will be the support of TLS 1.3 in the OpenShift control-plane components.

Requirements

  • Create a CI job that would test that the cluster stays fully operational when enabling the Modern TLS profile
  • This requires updates to our docs

If we try to enable a Modern TLS profile:

EnvVarControllerDegraded: no supported cipherSuites not found in observedConfig 

Also, if we do manage to pass along the Modern TLS profile cipher suites, we see:

http2: TLSConfig.CipherSuites is missing an HTTP/2-required AES_128_GCM_SHA256 cipher (need at least one of TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 or TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256)
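
For reference, the Modern profile is requested through the cluster APIServer configuration; this is the setting such a CI job would exercise and that currently produces the errors above:

apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    type: Modern      # TLS 1.3-only profile; currently rejected/degraded as shown above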

Feature Overview

This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

As a follow-up to https://issues.redhat.com/browse/MCO-1284, the one field we identified that is best updated pre-GA is to make baseImagePullSecret optional. The builder pod should have the logic to fetch a base image pull secret itself if the user does not specify one via the MachineOSConfig object.
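
A hedged sketch of a MachineOSConfig that omits baseImagePullSecret and relies on the builder's fallback described above (field names are approximate and may differ in the final v1 API; registry and secret names are illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: worker-layered                       # illustrative
spec:
  machineConfigPool:
    name: worker
  # baseImagePullSecret intentionally omitted; the builder should fall back to a cluster default
  renderedImagePushSecret:
    name: push-secret                        # illustrative secret name
  renderedImagePushSpec: registry.example.com/ocl/worker-os:latest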

To make OCL ready for GA, the first step would be graduating the MCO's APIs from v1alpha1 to v1. This requires changes in the openshift/api repo.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

The original scope of this task is represented across this story & MCO-1494.

With OCL GA'ing soon, we'll need a blocking path within our e2e test suite that must pass before a PR can be merged. This story represents the first stage in creating the blocking path:

  1. Migrate the tests from e2e-gcp-op-techpreview into a new test suite called e2e-ocl. This can be done by moving the tests in the MCO repo from the test/e2e-techpreview folder to a new test/e2e-ocl folder. There might be some minor cleanups such as fixing duplicate function names, etc. but it should be fairly straightforward to do.
  2. Make a new e2e-gcp-op-ocl job to call the newly created e2e-ocl test suite. This test should first be added as optional for 4.19 so it can be stability tested before it is made blocking for 4.18 & 4.19. This will require a PR to the openshift/release repo to call the new test for 4.19 & master. This should be a pretty straightforward config change.

Description of problem:

Intermittently, when OCL is disabled and the nodes are updated with the new non-OCL image, the process does not honor the maxUnavailable value in the MCP.

    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Intermittent
    

Steps to Reproduce:

  
In the second must-gather file the issue happened when we did:

   1. Enable OCL in the worker pool
   2. Remove the MOSC resource to disable OCL

The result was that the worker pool had 2 nodes being drained at the same time.

Actual results:

The worker pool has 2 nodes being drained at the same time.
    

Expected results:

Since MCP has no maxUnavailable value defined, the default value is 1. Hence, there should be only 1 node being drained at a time.

Moreover, they should be updated in the right order.
    

Additional info:

    

 

[REGRESSION] We need to reinvent the wheel for the rebuild-triggering functionality and the rebuild mechanism, as pool labeling and annotation is no longer a favorable way to interact with layered pools

 

There are a few situations in which a cluster admin might want to trigger a rebuild of their OS image in addition to situations where cluster state may dictate that we should perform a rebuild. For example, if the custom Dockerfile changes or the machine-config-osimageurl changes, it would be desirable to perform a rebuild in that case. To that end, this particular story covers adding the foundation for a rebuild mechanism in the form of an annotation that can be applied to the target MachineConfigPool. What is out of scope for this story is applying this annotation in response to a change in cluster state (e.g., custom Dockerfile change).

 

Done When:

  • BuildController is aware of and recognizes a special annotation on layered MachineConfigPools (e.g., machineconfiguration.openshift.io/rebuildImage).
  • Upon recognizing that a MachineConfigPool has this annotation, BuildController will clear any failed build attempts, delete any failed builds and their related ephemeral objects (e.g., rendered Dockerfile / MachineConfig ConfigMaps), and schedule a new build to be performed.
  • This annotation should be removed when the build process completes, regardless of outcome. In other words, should the build succeed or fail, the annotation should be removed.
  •  [optional] BuildController keeps track of the number of retries for a given MachineConfigPool. This can occur via another annotation, e.g., machineconfiguration.openshift.io/buildRetries=1 . For now, this can be a hard-coded value (e.g., 5), but in the future, this could be wired up to an end-user facing knob. This annotation should be cleared upon a successful rebuild. If the retry limit is reached, then we should degrade.
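
For illustration, applying the rebuild annotation described above to a layered pool could look like the following minimal sketch (the pool name is hypothetical, and whether the annotation needs a value is an open design detail):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: layered-worker                                    # hypothetical pool name
  annotations:
    machineconfiguration.openshift.io/rebuildImage: ""    # presence of the annotation is what triggers the rebuild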

Feature Overview (aka. Goal Summary)  

Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.

 

Through a combination of TopologySpreadConstraints and the De-Scheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools.

 

Goals (aka. expected user outcomes)

VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.
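
As a rough sketch of the mechanism, the VM (virt-launcher) pods of a nodepool could carry a topology spread constraint like the one below, and the descheduler would evict pods that violate it so they are live-migrated onto less crowded nodes; the label key and value are assumptions for illustration:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway        # soft constraint; the descheduler handles rebalancing
  labelSelector:
    matchLabels:
      hypershift.openshift.io/nodePool: my-nodepool   # assumed label identifying the nodepool's VM pods

On the descheduler side, a profile that evicts pods violating topology spread constraints would then cause KubeVirt to live-migrate the affected VMs.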

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview (aka. Goal Summary)  

Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.

 

Through a combination of TopologySpreadConstraints and the De-Scheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools.

 

Goals (aka. expected user outcomes)

VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.

In the hypershift upstream documentation, outline how the de-scheduler can be used to continually redistribute VMs in a nodepool when clumping of VMs occur after live migration.

Feature Overview

As a cluster admin for standalone OpenShift, I want to customize the prefix of the machine names created by CPMS due to company policies related to nomenclature. Implement the Control Plane Machine Set (CPMS) feature in OpenShift to support machine names where users can set custom name prefixes. Note the prefix will always be suffixed by "<5-chars>-<index>" as this is part of the CPMS internal design.

Acceptance Criteria

A new field called machineNamePrefix has been added to CPMS CR.
This field would allow the customer to specify a custom prefix for the machine names.
The machine names would then be generated using the format: <machineNamePrefix>-<5-chars>-<index>
Where:
<machineNamePrefix> is the custom prefix provided by the customer
<5-chars> is a random 5 character string (this is required and cannot be changed)
<index> represents the index of the machine (0, 1, 2, etc.)
Ensure that if the machineNamePrefix is changed, the operator reconciles and succeeds in rolling out the changes.
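
A minimal sketch of the new field in the CPMS resource (the prefix value is hypothetical):

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  machineNamePrefix: acme-prod-ctrl   # hypothetical prefix; resulting names: acme-prod-ctrl-<5-chars>-<index>
  replicas: 3
  # remainder of the CPMS spec is unchanged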

Epic Goal

  • Provide a new field to the CPMS that allows defining a Machine name prefix
  • This prefix will supersede the current usage of the control plane label and role combination we use today
  • The names must still continue to be suffixed with <chars>-<idx> as this is important to the operation of CPMS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Downstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Utilize the new field added in openshift/api and add the implementation so that custom name (prefix) formats can be assigned  to Control Plane Machines via CPMS.
  • All the changes should be behind the feature gate.
  • This prefix will supersede the current usage of the control plane label and role combination we use today
  • The names must still continue to be suffixed with <chars>-<idx> as this is important to the operation of CPMS
  •  Add unit tests and E2E tests. 

Define a new feature gate in openshift/api for this feature so that all the implementation can be safe guarded behind this gate. 

To be able to gather test data for this feature, we will need to introduce tech preview periodics, so we need to duplicate each of https://github.com/openshift/release/blob/8f8c7c981c3d81c943363a9435b6c48005eec6e3[…]control-plane-machine-set-operator-release-4.19__periodics.yaml and add the techpreview configuration.
 
It's configured as an env var, so copy each job, add the env var, and change the name to include -techpreview as a suffix:
env: FEATURE_SET: TechPreviewNoUpgrade
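
For illustration, a copied periodic could look roughly like this in the openshift/release config; the job and workflow names are illustrative rather than the actual entries in that file:

tests:
- as: e2e-aws-ovn-techpreview              # copied job, name suffixed with -techpreview
  cron: 0 6 * * 1
  steps:
    cluster_profile: aws
    env:
      FEATURE_SET: TechPreviewNoUpgrade    # enables the TechPreviewNoUpgrade feature set in the test cluster
    workflow: openshift-e2e-aws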
 

Description of problem:

  Consecutive dots are not allowed in machineNamePrefix, but from the prompt "spec.machineNamePrefix: Invalid value: "string": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.", consecutive dots should be allowed.
And I can create a machine with consecutive dots in its name on the provider's console https://drive.google.com/file/d/1p5QLhkL4VI3tt3viO98zYG8uqb1ePTnB/view?usp=sharing

Version-Release number of selected component (if applicable):

    4.19.0-0.nightly-2025-01-08-165032

How reproducible:

    Always

Steps to Reproduce:

    1. Update machineNamePrefix to contain consecutive dots in the controlplanemachineset, like

  machineNamePrefix: huliu..azure

    2. It cannot be saved; the following prompt is shown:
# * spec.machineNamePrefix: Invalid value: "string": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.

    3. If changed to

  machineNamePrefix: huliu.azure

then it can be saved successfully, and the rollout succeeds.
  

Actual results:

    Cannot save, get prompt

Expected results:

    Can save successfully

Additional info:

    New feature testing for https://issues.redhat.com/browse/OAPE-22


Epic Goal

  • Placeholder to track GA activities for OAPE-16 feature.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Moving the feature to default feature set.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Enable OpenShift to be deployed on Confidential VMs on GCP using Intel TDX technology

Goals (aka. expected user outcomes)

Users deploying OpenShift on GCP can choose to deploy Confidential VMs using Intel TDX technology to rely on confidential computing to secure the data in use

Requirements (aka. Acceptance Criteria):

As a user, I can choose OpenShift Nodes to be deployed with the Confidential VM capability on GCP using Intel TDX technology at install time

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

This is a piece of a higher-level effort to secure data in use with OpenShift on every platform

Documentation Considerations

Documentation on how to use this new option must be added as usual

Epic Goal

  • Add support to deploy Confidential VMs on GCP using Intel TDX technology

Why is this important?

  • As part of the Zero Trust initiative we want to enable OpenShift to support data in use protection using confidential computing technologies

Scenarios

  1. As a user I want all my OpenShift Nodes to be deployed as Confidential VMs on Google Cloud using Intel TDX technology

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. We enabled Confidential VMs for GCP using SEV technology already - OCPSTRAT-690

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Edge customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. Only two nodes at the edge, because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, cheap arbiter node that can optionally be remote/virtual to reduce on-site HW cost.

Goals (aka. expected user outcomes)

Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workload and control plane, and a third small “arbiter” node which ensures quorum. See requirements for more details.
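
A hypothetical install-config.yaml fragment for such a 2+1 cluster is sketched below; the top-level arbiter pool and the featureSet requirement are assumptions for illustration, not the finalized installer API:

controlPlane:
  name: master
  replicas: 2
arbiter:                            # hypothetical top-level pool; the exact field name/shape may differ
  name: arbiter
  replicas: 1
featureSet: TechPreviewNoUpgrade    # assumed to be required while the topology is tech preview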

Requirements (aka. Acceptance Criteria):

  1. Co-located arbiter node - 3rd node in the same network/location with low-latency network access, but the arbiter node is much smaller compared to the two main nodes. Initial resource requirements for the arbiter node documented as with Compact Clusters: 4 cores / 8 vcpu, 16G RAM, 120G disk (non-spinning), 1x1 GbE network ports, no BMC. To be refined and hopefully reduced for GA. 
  2. OCP Virt fully functional, incl. live migration of VMs (assuming an RWX CSI driver is available) - Could slip into GA release
  3. Single control plane node outage is handled seamlessly the same way as in a compact cluster (cluster retains quorum, workload can be scheduled). If one of the worker nodes is down, scheduling failures might occur if the cluster is over-provisioned. 
  4. In case the arbiter node is down, a reboot/restart of the two remaining nodes has to work, i.e. the two remaining nodes re-gain quorum (in a degraded state because the arbiter is still down) and spin up the workload. Or differently formulated: total power outage, all nodes down. Then node 1 and node 2 are restarted, but not the arbiter. The expectation is that the nodes gain quorum and start the workload in this scenario. 
  5. Scale-out of the cluster by adding additional worker nodes should be possible (Deferred to GA)
  6. Transition of the cluster into a regular 3-node compact cluster, e.g. by adding a new node as a control plane node and then removing the arbiter node, should be possible. Deferred to GA or even post-GA, as OpenShift currently does not support any control plane topology changes.
  7. Regular workload should not be scheduled to the arbiter node (e.g. by making it unschedulable, or introducing a new node role “arbiter”). Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (i.e. router, registry, console, monitoring etc.) should also not be scheduled to the arbiter node.
  8. It must be possible to explicitly schedule additional workload to the arbiter node. That is important for 3rd-party solutions (e.g. storage providers) which also have quorum-based mechanisms.
  9. Must seamlessly integrate into existing installation/update mechanisms, esp. zero touch provisioning etc.
    1. OpenShift Installer UPI
    2. OpenShift Installer IPI
    3. Assisted Installer
    4. Agent Based Installer --> Deferred to GA
    5. ACM Cluster Lifecycle Manager --> Deferred to GA
  10. Added: ability to track TNA usage in the fleet of connected clusters via OCP telemetry data (e.g. number of clusters with TNA topology) - This is a stretch goal for TP.

 

 

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both self-managed
Classic (standalone cluster) yes
Hosted control planes no
Multi node, Compact (three node), or Single node (SNO), or all Multi node and Compact (three node)
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64 and ARM
Operator compatibility full
Backport needed (list applicable versions) no
UI need (e.g. OpenShift Console, dynamic plugin, OCM) no
Other (please specify) n/a

 

Questions to Answer (Optional):

  1. How to implement the scheduling restrictions to the arbiter node? New node role “arbiter”?
  2. Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA? --> Yes, multiple releases needed, see linked feature for GA tracking

Out of Scope

  1. Storage driver providing RWX shared storage

 

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

  • Two node support is in high demand by telco, industrial and retail customers.
  • VMWare supports a two node VSan solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
  • Example edge hardware frequently used for edge deployments with a co-located small arbiter node: Dell PowerEdge XR4000z Server is an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads. 

 

Customer Considerations

See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.

 

Documentation Considerations

  1. Topology needs to be documented, esp. The requirements of the arbiter node.

 

Interoperability Considerations

  1. OCP Virt needs to be explicitly tested on this scenario to support VM HA (live migration, restart on other node)

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to update the installer to support the machines for the arbiter node during install.

This will be fleshed out with explicit direction; we are currently discussing next steps in EP comments.

We will need to make sure the MCO contains the bootstrap files for the arbiter node similar to the files it contains for master nodes.

Once the HighlyAvailableArbiter has been added to the ocp/api, we need to update the cluster-config-operator dependencies to reference the new change, so that it propagates to cluster installs in our payloads.

We need to update CEO (cluster etcd operator) to understand what an arbiter/witness node is so it can properly assign an etcd member on our less powerful node.

Update the dependencies for CEO for library-go and ocp/api to support the Arbiter additions, doing this in a separate PR to keep things clean and easier to test.

We need to make sure kubelet respects the new node role type for arbiter. This will be a simple update on the well_known_openshift_labels.go list to include the new node role.
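
For illustration, a 3rd-party workload that must be explicitly scheduled onto the arbiter node could target the new role roughly as follows; the exact label key and the presence of a taint are assumptions:

spec:
  nodeSelector:
    node-role.kubernetes.io/arbiter: ""        # assumed arbiter node-role label
  tolerations:
  - key: node-role.kubernetes.io/arbiter       # assumed taint key, if the arbiter node is tainted
    operator: Exists
    effect: NoSchedule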

Make sure pacemaker configured in TNF controls etcd.

Verify the following scenarios:

  • Etcd agent created: start/stop/promote/demote
  • Cold boot
  • Graceful shutdown

For TNF we need to replace our currently running etcd after the installation with the one that is managed by pacemaker.

This allows us to keep the following benefits:

  • installation can stay the way it is
  • upgrade paths are not blocked
  • we can keep existing operator functionality around (cert rotation, defrags etc)
  • pacemaker can control etcd without any other interference

AC:

  • on a (later specified) signal, we roll out a static pod revision without the etcd container

As a developer of TNF, I need:

  • Pacemaker and Corosync to be available during cluster deployment

Acceptance Criteria

  • Pacemaker and corosync are made available in CoreOS via MCO extension

TNF Enhancement Proposal

As a developer of 2NO, I need:

  • To be able to configure the installer to update CoreOS to have pacemaker and corosync installed by default

Acceptance Criteria

  • Implementation PR is merged

This epic is used to track tasks/stories delivered from the OCP core engineering control plane group.

As a developer of TNF, I need:

  • To include a new scaling strategy for TNF that allows 1 etcd member at bootstrap but 2 when running (assisted/ABI flow)
  • To include a new scaling strategy for TNF that requires 2 etcd nodes at bootstrap and when running (core installer flow)
  • For CEO to propagate updates to pacemaker controlled etcd pod (e.g. version changes, cert updates)
  • For cert rotations to be disallowed during pacemaker etcd updates

Acceptance Criteria

  • Pacemaker is initialized by CEO when it starts and detects that it's running in 2-node topology
  • etcd is running under pacemaker post transition and this transition keeps CEO in a supported state
  • CEO now has scaling strategies to handle TNF installs in a supported way for both 3-node (core installer) and 2-node (assisted installer, ABI) based deployments
  • CEO can now notify pacemaker of etcd updates & restarting the etcd pod
  • Tests are added to cover new CEO functionality
  • Cert rotation tests are run on TNF topology to ensure this can never overlap with etcd-pod update events

TNF Enhancement Proposal

As a developer of TNF, I need:

  • To update etcd to use a pair of new scaling strategies for TNF: one to handle core-installer, and the other for assisted installer and ABI

Acceptance Criteria

  • PR is merged to openshift/cluster-etcd-operator
  • Unit tests are added for new scaling strategies

As a developer of TNF, I need:

  • To have a feature gate ensuring clusters are not upgradable while in dev or tech preview
  • To have a new control plane topology for TNF set in the installer
  • To fix any operator logic that is sensitive to topology declarations

Acceptance Criteria

  • Feature gate is added for TNF
  • New control plane topology is added to the infrastructure spec
  • Topology-sensitive operators are updated with TNF specific logic
  • Installer is updated to set the new topology in the infra config

TNF Enhancement Proposal

We need to bump the go.mod for the cluster-config-operator to make sure the DualReplica CRD is applied at bootstrap after we merge in the topology change to the ocp/api repo.

As a developer of 2NO, I need:

  • To have a way for the cluster to know it cannot upgrade when deployed in 2NO topology
  • To have an indication that dev and tech preview releases of 2NO are not supported

Acceptance Criteria

  • PR is merged in relevant openshift repo
  • Unit test is added (as required)

Feature Overview (aka. Goal Summary)  

As an OpenShift Container Platform (OCP) user, I will be able to install the GA version of OpenShift Lightspeed (OLS) from the Operator Hub and access information about OLS in the OCP documentation.

1) disconnected

2) FIPS 

3) Multi arch -> Arm for GA

4) HCP

5) Konflux

Description:

As an OLS developer, I want to verify that OLS can be configured in a custom namespace on 2 different HCP clusters provisioned from the same Management Cluster, so that the SD team can run it on their HCP clusters.

Acceptance Criteria :

  • Verify that OLS can run in a custom namespace on 2 different HCP clusters provisioned from the same Management Cluster

Feature Overview

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

Epic Goal

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

USER STORY:

As someone that installs OpenShift on vSphere, I want to install zonal via host and VM groups so that I can use a stretched physical cluster or use a cluster as a region and hosts as zones.

DESCRIPTION:

Required:

Nice to have:

ACCEPTANCE CRITERIA:

  • start validating tag naming
  • validate tags exist
  • validate host group exists
  • update platform spec
  • unit tests
  • create capv deployment and failure domain manifests
  • per failure domain, create vm group and vm host rule

ENGINEERING DETAILS:

Configuration steps:

  • Create tag and tag categories
  • Attach zonal tags to ESXi hosts - CAPV will complain extensively if this is not done
  • Host groups MUST be created and populated prior to installation (maybe, could we get the hosts that are attached, or vice versa, the hosts in a host group?),
    one per zone
  • VM groups will be created by the installer, one per zone
  • VM / host rules will be created by the installer, one per zone

https://github.com/openshift/installer/compare/master...jcpowermac:installer:test-vm-host-group
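
A rough sketch of what a host-group-backed failure domain could look like in install-config.yaml follows; the standard failureDomains fields exist today, but the host-group-specific field is an assumption for illustration:

platform:
  vsphere:
    failureDomains:
    - name: zone-a
      region: dc-east
      zone: zone-a
      server: vcenter.example.com
      topology:
        datacenter: dc1
        computeCluster: /dc1/host/stretched-cluster
        datastore: /dc1/datastore/ds1
        networks:
        - segment-a
        hostGroup: zone-a-hosts        # hypothetical field mapping this zone to a vSphere host group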

As an OpenShift engineer, I want to enable host/VM group zonal support in the machine API operator (MAO) so that compute nodes are properly deployed.

Acceptance Criteria:

  • Modify the workspace to include the VM group
  • Properly configure the vSphere cluster to add VMs into the VM group

As an OpenShift engineer, I want to enable host/VM group zonal support in CPMS so that control plane nodes are properly redeployed.

Acceptance Criteria:

  • Control plane nodes properly roll out when required
  • Control plane nodes do not roll out when not needed

As an OpenShift engineer, I want to update the infrastructure and Machine API objects so they can support host/VM group zonal.

Acceptance criteria

  • Add the appropriate fields for host vm group zonal
  • Add the appropriate documentation for the above fields
  • Add a new feature gate for host vm zonal

Feature Overview (aka. Goal Summary)  

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

 

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

For Phase 1, this also includes incorporating the assets from different repositories to simplify asset management.

 

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has moved on, and CAPI is now a better fit with better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect to customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open

Epic Goal

  • To prepare the openshift cluster-api project for general availability
  • Ensure proper testing of the CAPI installer operator

Why is this important?

  • We need to be able to install and lifecycle the Cluster API ecosystem within standalone OpenShift
  • We need to make sure that we can update the components via an operator
  • We need to make sure that we can lifecycle the APIs via an operator

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update: We did this for the corecluster controller; we will create separate cards to do it for the other cluster-capi-operator controllers too.

The current ClusterOperator status conditions—Available, Progressing, Upgradeable, and Degraded—are set by the corecluster controller independently of the status of other controllers.

This approach does not align with the intended purpose of these conditions, which are meant to reflect the overall status of the operator, considering all the controllers it manages.

To address this, we should introduce controller-level conditions similar to the top-level ones. These conditions would influence an aggregated top-level status, which a new controller (the ClusterOperator controller) would then consolidate into the Available, Progressing, Upgradeable, and Degraded conditions.
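
As an illustration of the aggregation idea, the ClusterOperator status could carry per-controller conditions alongside the aggregated top-level ones; the controller-scoped condition type names below are hypothetical:

status:
  conditions:
  - type: CoreClusterControllerAvailable     # hypothetical controller-level condition
    status: "True"
  - type: MachineSyncControllerDegraded      # hypothetical controller-level condition
    status: "False"
  - type: Available                          # aggregated by the new ClusterOperator controller
    status: "True"
  - type: Progressing
    status: "False"
  - type: Degraded
    status: "False"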

 

Moreover, when running `oc get co` against the cluster-capi-operator, only the name and version are returned. The status is not rolled up into the additional columns as expected.

Before GA, this state information should be visible from `oc get co`

Feature Overview (aka. Goal Summary)  

OpenShift is traditionally a single-disk system, meaning the OpenShift installer is designed to install the OpenShift filesystem on one disk. However, new customer use cases have highlighted the need for the ability to configure additional disks during installation. Here are the specific use cases:

  • Dedicated Disk for etcd:
    • As a user, I would like to install a cluster with a dedicated disk for etcd.
    • Our recommended practices for etcd suggest using a dedicated disk for optimal performance.
    • Managing disk mounting through MCO can be challenging and may introduce additional issues.
    • Cluster-API supports running etcd on a dedicated disk.
  • Dedicated Disk for Swap Partitions:
    • As a user, I would like to install a cluster with swap enabled on each node and utilize a dedicated disk for swap partitions.
    • A dedicated disk for swap would help prevent swap activity from impacting node performance.
  • Dedicated Disk for Container Runtime:
    • As a user, I would like to install a cluster and assign a separate disk for the container runtime.
  • Dedicated Disk for Image Storage:
    • As a user, I would like to install a cluster with images stored on a dedicated disk, while keeping ephemeral storage on the node's main filesystem.

User Story:
As an OpenShift administrator, I need to be able to configure my OpenShift cluster to have additional disks on each vSphere VM so that I can use the new data disks for various OS needs.

 

Description: 
The goal of this epic is to allow the cluster administrator to install, and to configure after install, new machines with additional disks attached to each virtual machine for various OS needs.

 

Required:

  • Installer allows configuring additional disks for control plane and compute virtual machines
  • Control Plane Machine Sets (CPMS) allows configuring control plane virtual machines with additional disks
  • Machine API (MAPI) allows for configuring Machines and MachineSets with additional disks
  • Cluster API (CAPI) allows for configuring Machines and MachineSets with additional disks

 

Nice to Have:

 

Acceptance Criteria:

 

Notes:

User Story:
As an OpenShift Engineer I need to create a new feature gate and CRD changes for vSphere multi disk so that we can gate the new function until all bugs are ironed out.

Description: 
This task is to create the new feature gate to be used by all logical changes around multi-disk support for vSphere.  We also need to update the types for the vSphere machine spec to include a new array field that contains data disk definitions.

Acceptance Criteria:
   - New feature gate exists for components to use.
   - Initial changes to the CRD for data disks are present for components to start using.

 

User Story:
As an OpenShift Engineer I need to ensure the MAPI Operator handles the new data disk definitions correctly.

Description: 
This task is to verify whether any changes are needed in the MAPI Operator to handle data disk definitions in Machine and MachineSet specs.

Acceptance Criteria:
   - Adding a disk to MachineSet does not result in new machines being rolled out.
   - Removing a disk from MachineSet does not result in new machines being rolled out.
   - After making changes to a MachineSet related to data disks, when MachineSet is scaled down and then up, new machines contain the new data disk configurations.
   - All attempts to modify existing data disk definitions in an existing Machine definition are blocked by the webhook.

 

Notes:
The behaviors for the data disk field should be the same as all other provider spec level fields.  We want to make sure that the new fields are no different than the others.  This field is not hot swap capable for running machines.  A new VM must be created for this feature to work.

User Story:
As an OpenShift Engineer I need to enhance the OpenShift installer to support creating a cluster with additional disks added to control plane and compute nodes so that I can use the new data disks for various OS needs.

Description: 
This task is to enhance the installer to allow configuring data disks in the install-config.yaml. This will also require setting the necessary fields in MachineSet and Machine definitions. The important one is for CAPV to do the initial creation of disks for the configured masters.
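
A rough sketch of what this could look like in install-config.yaml follows; the dataDisks field name and its sub-fields are assumptions for illustration, not the finalized API:

controlPlane:
  name: master
  replicas: 3
  platform:
    vsphere:
      dataDisks:                 # hypothetical field name for the new data disk array
      - name: etcd               # e.g. a dedicated disk for etcd
        sizeGiB: 100
      - name: swap               # e.g. a dedicated disk for swap
        sizeGiB: 16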

Acceptance Criteria:
   - install-config.yaml supports configuring data disks in all machinepools.
   - CAPV has been updated with new multi disk support.
   - CAPV machines are created that result in control plane nodes with data disks.
   - MachineSet definitions for compute nodes are created correctly with data disk values from compute pool.
   - CPMS definition for masters has the data disks configured correctly.

 

Notes:
We need to be sure that after installing a cluster, the cluster remains stable and has all correct configurations.

User Story:
As an OpenShift Engineer I need to ensure the CPMS Operator correctly detects any changes needed when data disks are added to the CPMS definition.

Description: 
This task is to verify whether any changes are needed in the CPMS Operator to handle changed data disk definitions in the CPMS.

Acceptance Criteria:
   - CPMS does not roll out changes when initial install is performed.
   - Adding a disk to CPMS results in control plane roll out.
   - Removing a disk from CPMS  results in control plane roll out.
   - No errors logged as a result of data disks being present in the CPMS definition.

 

Notes:
Ideally we just need to make sure the operator is updated to pull in the new CRD object definitions that contain the new data disk field.

Feature Overview (aka. Goal Summary)  

GA User Namespaces in OpenShift 4.19

continue work from https://issues.redhat.com/browse/OCPSTRAT-207 

Epic Goal

  • Prepare user namespaces for GA by enhancing SCC support and testing

Why is this important?

  • Enable nested containers use cases and enhance security

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

As part of being a first party Azure offering, ARO HCP needs to adhere to Microsoft secure supply chain software requirements. In order to do this, we require setting a label on all pods that run in the hosted cluster namespace.

Goal

Implement Mechanism for Labeling Hosted Cluster Control Plane Pods

Use-cases

  • Adherence to Microsoft 1p Resource Provider Requirements

Components

  • Any pods that HyperShift deploys or that run in the hosted cluster namespace.

Goal

  • Hypershift has a mechanism for Labeling Control Plane Pods
  • Cluster service should be able to set the label for a given hosted cluster

Why is this important?

  • As part of being a first party Azure offering, ARO HCP needs to adhere to Microsoft secure supply chain software requirements. In order to do this, we require setting a label on all pods that run in the hosted cluster namespace.
    See Documentation: https://eng.ms/docs/more/containers-secure-supply-chain/other

Scenarios

  1. Given a subscriptionID of "1d3378d3-5a3f-4712-85a1-2485495dfc4b", there needs to be the following label on all pods hosted on behalf of the customer:
    kubernetes.azure.com/managedby: sub_1d3378d3-5a3f-4712-85a1-2485495dfc4b
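
For illustration, the resulting metadata on each hosted control plane pod would include that label (the pod name and namespace here are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver-abc123                     # hypothetical pod name
  namespace: clusters-example-hosted-cluster      # hypothetical hosted control plane namespace
  labels:
    kubernetes.azure.com/managedby: sub_1d3378d3-5a3f-4712-85a1-2485495dfc4b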

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Review, refine and harden the CAPI-based Installer implementation introduced in 4.16

Goals (aka. expected user outcomes)

From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.

Requirements (aka. Acceptance Criteria):

Review existing implementation, refine as required and harden as possible to remove all the existing technical debt

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Documentation Considerations

There should not be any user-facing documentation required for this work

Epic Goal

  • This epic includes tasks the team would like to tackle to improve our process, QOL, CI. It may include tasks like updating the RHEL base image and vendored assisted-service.

Why is this important?

 

We need a place to add tasks that are not feature oriented.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal to increase customer satisfaction by increasing speed to market and saving engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.

User Story:

As a (user persona), I want to be able to:

  • As an external dev I want to be able to add new components to the CPO easily
  • As a core dev I want to feel safe when adding new components to the CPO
  • As a core dev I want to add new components to the CPO without copy/pasting big chunks of code

 

https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. However, when a component's or a sub-resource's predicate changes to false, the resources are not removed from the cluster. All such resources should be deleted from the cluster.
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md

User Story:

As a (user persona), I want to be able to:

  • As an external dev I want to be able to add new components to the CPO easily
  • As a core dev I want to feel safe when adding new components to the CPO
  • As a core dev I want to add new components to the CPO without copy/pasting big chunks of code

 

https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction. 

Acceptance Criteria:

Description of criteria:

All ControlPlane Components are refactored:

  • HCCO
  • kube-apiserver (Mulham)
  • kube-controller-manager (Mulham)
  • ocm (Mulham)
  • etcd (Mulham)
  • oapi (Mulham)
  • scheduler (Mulham)
  • clusterpolicy (Mulham)
  • CVO (Mulham)
  • oauth (Mulham)
  • hcp-router (Mulham)
  • storage (Mulham)
  • CCO (Mulham)
  • CNO (Jparrill)
  • CSI (Jparrill)
  • dnsoperator
  • ignition (Ahmed)
  • ingressoperator 
  • machineapprover
  • nto
  • olm
  • pkioperator
  • registryoperator 
  • snapshotcontroller

 

Example PR to refactor cloud-credential-operator : https://github.com/openshift/hypershift/pull/5203
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md

Goal

Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Allow customers to migrate CNS volumes (i.e vsphere CSI volumes) from one datastore to another.

This operator relies on a new VMware CNS API and requires vSphere 8.0.2 or 7.0 Update 3o as minimum versions

https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-vcenter-server-802-release-notes/index.html

In 4.17 we shipped a devpreview CLI tool (OCPSTRAT-1619) to cover existing urgent requests. This CLI tool will be removed as soon as this feature is available in OCP.

Goals (aka. expected user outcomes)

Often our customers are looking to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant datastore. Previously this was almost impossible or required modifying PV specs by hand to accomplish this. It was also very error prone.

 

As a first version, we developed a CLI tool that is shipped as part of the vSphere CSI operator. We keep this tooling internal for now; support can guide customers on a per-request basis. This is to manage current urgent customer requests: a CLI tool is easier and faster to develop, and it can also easily be used in previous OCP releases.

After multiple discussions with VMware we now have confidence that we can rely on their built-in vSphere UI tool to migrate CNS volumes from one datastore to another. This includes attached and detached volumes. VMware confirmed they have confidence in this scenario and they fully support this operation for attached volumes.

Requirements (aka. Acceptance Criteria):

Since the feature is external to OCP, it is mostly a matter of testing that it works as expected with OCP, but customers will be redirected to VMware documentation as all the steps are done through the vSphere UI.

Perform testing for attached and detached volumes + special cases such as RWX, zonal, encrypted.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) Yes
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all Yes
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86
Operator compatibility vsphere CSI operator but feature is external to OCP
Backport needed (list applicable versions) no
UI need (e.g. OpenShift Console, dynamic plugin, OCM) no done through vsphere UI
Other (please specify) OCP on vsphere only

Use Cases (Optional):

As an admin, I want to migrate all my PVs, or optionally PVCs belonging to a certain namespace, to a different datastore within the cluster without potentially requiring extended downtime.

  1. I want to move volumes to another datastore that has better performance
  2. I want to move volumes to another datastore because the current one is getting full
  3. I want to move all volumes to another datastore because the current one is being decommissioned.

Questions to Answer (Optional):

Get full support confirmation from VMware that their CNS volume migration feature:

  • Can be supported for OCP - YES
  • Is supported with attached volumes - YES
  • Should detect if a volume is not migratable - YES

Out of Scope

Limited to what VMware supports. At the moment only one volume can be migrated at a time.

Background

We had a lot of requests to migrate volumes between datastore for multiple reason. Up until now it was not natively supported by VMware. In 8.0.2 they added a CNS API and a vsphere UI feature to perform volume migration.

In 4.17 we shipped a devpreview CLI tool (OCPSTRAT-1619) to cover existing urgent requests. This CLI tool will be removed as soon as this feature is available in OCP.

This feature also includes the work needed to remove the CLI tool

Customer Considerations

Need to be explicit on requirements and limitations.

Documentation Considerations

Documented as part of the vSphere CSI OCP documentation.

Specify the minimum vSphere version. Document any limitations found during testing.

Redirect to vmware documentation.

Announce removal of the CLI tool + update KB.

Interoperability Considerations

OCP on vSphere only

Epic Goal*

We need to document and support CNS volume migration using the native vCenter UI, so that customers can migrate volumes between datastores.
 
Why is this important? (mandatory)

Often our customers are looking to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant one. Previously this was almost impossible, or required modifying PV specs by hand, which was also very error prone.

 Scenarios (mandatory) 

As a vCenter/OpenShift admin, I want to migrate CNS volumes between datastores for existing vSphere CSI persistent volumes (PVs).

This should cover attached and detached volumes. Special cases such as RWX, zonal, or encrypted volumes should also be tested to confirm whether there are any limitations we should document.

 Dependencies (internal and external) (mandatory)

This feature depends on VMware vCenter Server 7.0 Update 3o or vCenter Server 8.0 Update 2.

https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/release-notes/vcenter-server-update-and-patch-releases/vsphere-vcenter-server-70u3o-release-notes.html

https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/release-notes/vcenter-server-update-and-patch-release-notes/vsphere-vcenter-server-802-release-notes.html

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - 
  • Others -

Acceptance Criteria (optional)

This is mostly a testing / documentation epic, which will change current wording about unsupported CNS volume migration using vCenter UI.

As part of this epic, we also want to remove the CLI tool we developed for https://github.com/openshift/vmware-vsphere-csi-driver-operator/blob/master/docs/cns-migration.md from the payload.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Upgrade the OCP console to PatternFly 6.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

The core OCP Console should be upgraded to PF 6 and the Dynamic Plugin Framework should add support for PF6 and deprecate PF4.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Console, Dynamic Plugin Framework, Dynamic Plugin Template, and Examples all should be upgraded to PF6 and all PF4 code should be removed.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

As a company, we have all agreed to make our products look and feel the same. The current level is PF6.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Problem:

Console is adopting PF6 and removing PF4 support. This creates lots of UI issues in the Developer Console which we need to fix.

Goal:

Fix all the UI issues in the ODC related to PF6 upgrade

Why is it important?

Acceptance criteria:

  1. Fix all the ODC issues https://docs.google.com/spreadsheets/d/1J7udCkoCks7Pc_jIRdDDBDtbY4U5OOZu_kG4aqX1GlU/edit?gid=0#gid=0

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:

    Label/tag on the software catalog card is left-aligned but it should be right-aligned.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to software catalog
    2.
    3.
    

Actual results:

    Label/tag on the software catalog card is left-aligned

Expected results:

    Label/tag on the software catalog card should be right-aligned

Additional info:

    

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Per slack discussion, the following files contain invalid tokens that need to be fixed/updated/removed:

knative-plugin/src/components/functions/GettingStartedSection.scss

pipelines-plugin/src/components/pipelineruns/PipelineRunDetailsPage.scss

pipelines-plugin/src/components/pipelineruns/list-page/PipelineRunList.scss

pipelines-plugin/src/components/pipelines/detail-page-tabs/pipeline-details/PipelineVisualizationStepList.scss

pipelines-plugin/src/components/taskruns/TaskRunDetailsPage.scss

pipelines-plugin/src/components/taskruns/list-page/TaskRunsRow.scss

topology/src/components/graph-view/components/GraphComponent.scss

topology/src/components/graph-view/components/edges/ServiceBinding.scss

topology/src/components/graph-view/components/groups/GroupNode.scss

topology/src/components/page/TopologyView.scss

The layout and functionality of the Edit upstream configuration modal could be improved.

  • Add space between the two radio options
  • Use PatternFly components instead of custom ones
  • Focus text input when "Custom" is selected
  • there is a bullet for each step of the quickstart
  • the steps are indented but shouldn't be
  • ordered list items are missing their numbers

Before we can adopt PatternFly 6, we need to drop PatternFly 4.  We should drop 4 first so we can understand what impact if any that will have on plugins.

 

AC:

  • Remove PF4 package. Console should not load any PF4 assets during the runtime.
  • Remove PF4 support for DynamicPlugins - SharedModules + webpack configuration

This component was never finished and should be removed as it includes a reference to `@patternfly/react-core/deprecated`, which blocks the removal of PF4 and the adoption of PF6.

AC:

 

Check https://www.patternfly.org/get-started/upgrade/#potential-test-failures 

The API Explorer > Resource details > Access review page utilizes a custom Checkbox filter.  This custom component is unnecessary as PatternFly offers comparable functionality.  We should replace this custom component with a PatternFly one.

 

AC: Replace the CheckBox component with the Switch in the API Explorer

Instances of LogViewer are hard coded to use the dark theme. We should make that responsive to the user's choice, the same way we are with the CodeEditor.

  1. Visit `k8s/all-namespaces/core~v1~Secret` with browser width 768px or greater.
  2. Click the `Create` button in the upper right of the page.
  3. Note the dropdown is bleeding off the page as it is aligned left and not right.

 

The checkboxes at the top of the ResourceLog component should be changed to Switches as that same change is being made for the YAMLEditor.

 

AC: Replace CheckBox component in the ResourceLog component with Switch from PF6

Throughout the code base, there are many instances of `<h1>` - `<h6>`.  As a result, we have to manually style these elements to align with PatternFly.  By replacing the html elements with a PatternFly component, we get the correct styling for free.

PopupKebabMenu is orphaned and contains a reference to `@patternfly/react-core/deprecated`.  It and related code should be removed so we can drop PF4 and adopt PF6.

The dynamic-demo-plugin in the openshift/console repository currently relies on PatternFly 5. To align with the latest design standards and to ensure consistency with other parts of the platform, the plugin must be upgraded to use PatternFly 6.

This involves updating dependencies, refactoring code where necessary to comply with PatternFly 6 components and APIs, and testing to ensure that functionality and UI are not disrupted by the changes.

 

AC:

  • PatternFly 5 is replaced with PatternFly 6 in dynamic-demo-plugin. Follow the PF6 upgrade guide. 
  • All affected components and styles are updated to comply with PatternFly 6 standards.
  • Update integration tests, if necessary
  • Plugin functionality is tested to ensure no regressions.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Based on user analytics, many customers switch back and forth between perspectives, an average of 15 times per session.
  • The following steps will be needed:
    • Surface all Dev specific Nav items in the Admin Console
    • Disable the Dev perspective by default but allow admins to enable via console setting
    • All quickstarts need to be updated to reflect the removal of the dev perspective
    • Guided tour to show updated nav for the merged perspective

Why is this important?

  • We need to alleviate this pain point and improve the overall user experience for our users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description

As a user, I want to use Topology components in the dynamic plugin

Acceptance Criteria

  1. Should expose the below Topology components and utils to dynamic-plugin-sdk

getModifyApplicationAction
baseDataModelGetter
getWorkloadResources
contextMenuActions
CreateConnector
createConnectorCallback (e

 

 

 

Description:

Update ODC e2e automation tests

 

Acceptance Criteria

  1. Should add/update the test case to test default behaviour in console i.e only admin perspective is enabled and no perspective switcher is there (Should add tests for ODC component in admin perspective which are visible in navigation.)
  2. Should add/update the tests for the scenario where dev perspective is enabled by user - Test dev persona

Description

As a user, I want to favorite pages in the Console admin perspective

Acceptance Criteria

  1. Should favorite the pages in the console
  2. Favorite pages are accessible from the left navigation

Additional Details:

Design https://www.figma.com/design/CMGLcRam4M523WVDqdugWz/Favoriting?node-id=0-1&p=f&t=y07SBX01YxDa6pTv-0

Description of problem:

Namespace does not persist when switching to the developer view from the topology page of the admin perspective. This is based on the changes from PR https://github.com/openshift/console/pull/14588

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. Go to the topology page in admin view after login
    2. Select/Create a namespace 
    3. Switch to developer perspective
    

Actual results:

Namespace does not persist
    

Expected results:

Namespace should persist
    

Additional info:


    

https://drive.google.com/file/d/1VYg-pWt4ZCYKtmPx4bIK6L2Lkt3WpOPv/view?usp=sharing

Description

As a user, I do not want a perspective preferences option if only one perspective is enabled.

Acceptance Criteria

  1. Should not show perspective preference option if only one perspective is enabled.

Additional Details:

Description

As a user, I want to see actions regarding the merge perspective in getting started resources section

Acceptance Criteria

  1. Update getting started resources action on the cluster overview page
  2. Update getting started resources action on the project overview page in admin perspective
  3. Content for the cluster overview page and the project overview page will be the same

Additional Details:

Description

As a user, I want to know how I can enable the developer perspective in the Web console

Acceptance Criteria

  1. Add a quick start to let the user know about the steps to enable the Developer perspective in the web console

Additional Details:

Steps to enable dev perspective through UI

  1. search for console (console.operator.openshift.io/cluster) on search page
  2. open cluster details page
  3. click on the action menu and select the Customize option. It will open the Cluster configuration page
  4. under the General tab there is a Perspectives option to enable and disable each perspective

UXD: https://www.figma.com/proto/gDIRyooHfJnQF71DMp9UXX/Onboarding?node-id=82-2398&t=Uvrnk3X3czn6jKxm-0&scaling=min-zoom&content-scaling=fixed&page-id=51%3A1311&starting-point-node-id=51%3A1312
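
For reference, the same toggle can also be sketched as a direct edit of the console operator configuration instead of the UI steps above (a sketch only, assuming the spec.customization.perspectives API of console.operator.openshift.io; verify field names against the cluster's console operator version):

# Illustrative only: enable the Developer perspective via the console operator config
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  customization:
    perspectives:
    - id: dev
      visibility:
        state: Enabled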

 

 

Description of problem:

    Disable the Guided tour in the admin perspective for now, as e2e tests are failing because of it; it will be re-enabled once we fix the e2e tests.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description

As a user, I want to know about the nav option added to the Admin perspective from the Developer perspective.

Acceptance Criteria

  1. Add guided tour to the Admin perspective to let the user know about the unified perspective

Additional Details:

https://www.figma.com/proto/xSf03qhQdjL1JP0VXegckQ/Navigation?node-id=262-2537

updated design https://www.figma.com/proto/gDIRyooHfJnQF71DMp9UXX/Onboarding?node-id=51-1312&t=OUVpHcobilnH09yE-0&scaling=min-zoom&content-scaling=fixed&page-id=51%3A1311&starting-point-node-id=51%3A1312

 

 

Description

As a user, I want to use Topology components in the dynamic plugin

Acceptance Criteria

  1. Should expose the Topology components and utils to dynamic-plugin-sdk

Additional Details:

Utils and components that need to be exposed:

https://docs.google.com/spreadsheets/d/1B0TLMtRY9uUR-ma0po3ma0rwgEE7T5w-v-k6f0Ak_tk/edit?gid=0#gid=0

 

Description

As a user, I want access to all the pages present in the developer perspective from the admin perspective.

Acceptance Criteria

  1. Add Topology, Helm, Serverless function, and Developer catalog nav items to Admin perspective as per the UX design.

Additional Details:

UX design https://www.figma.com/design/xSf03qhQdjL1JP0VXegckQ/Navigation?node-id=0-1&node-type=canvas&t=vfZgzTBeBFLhpuL0-0

 

Feature Overview (aka. Goal Summary)  

K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.

VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular is changing the QoS values. The parameters that can be changed depend on the driver.

Goals (aka. expected user outcomes)

Productise VolumeAttributesClass as TP in anticipation of GA. Customers can start testing VolumeAttributesClass.

Requirements (aka. Acceptance Criteria):

  • Disabled by default
  • put it under TechPreviewNoUpgrade
  • make sure the VolumeAttributesClass object is available in the beta APIs
  • enable the feature in external-provisioner and external-resizer at least in AWS EBS CSI driver, check the other drivers.
    • Add RBAC rules for these objects
  • make sure we run its tests in one of TechPreviewNoUpgrade CI jobs (with hostpath CSI driver)
  • reuse / add a job with AWS EBS CSI driver + tech preview.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility N/A core storage
Backport needed (list applicable versions) None
UI need (e.g. OpenShift Console, dynamic plugin, OCM) TBD for TP
Other (please specify) n/A

Use Cases (Optional):

As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.
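
As a rough sketch of what this could look like once the feature is available (assumptions: the TechPreviewNoUpgrade feature set from the requirements above, the beta storage.k8s.io/v1beta1 VolumeAttributesClass API, and AWS EBS CSI parameters, which are driver-specific and illustrative only):

# Enable the tech preview feature set (cluster-wide, not upgradable)
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
---
# Define a class describing the desired volume attributes
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttributesClass
metadata:
  name: fast-gp3
driverName: ebs.csi.aws.com
parameters:
  iops: "4000"
  throughput: "250"
---
# Point an existing PVC at the class; the external-resizer reconciles the
# attributes in place, without detaching the volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: fast-gp3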

Questions to Answer (Optional):

Get list of drivers that supports it (from the ones we ship)

Out of Scope

No UI for TP

Background

There have been some limitations and complaints about the fact that PVC attributes are sealed after creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.

Customer Considerations

Customers should not use it in production at the moment. The driver used by customers must support this feature.

Documentation Considerations

Document VolumeAttributesClass creation and how to update PVC. Mention any limitation. Mention it's tech preview no upgrade. Add drivers support if needed.

Interoperability Considerations

Check which drivers support it for which parameters.

Epic Goal

Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Red Hat Product Security recommends that pods be deployed with readOnlyRootFilesystem set to true in the SecurityContext, but does not require it because a successful attack can only be carried out with a combination of weaknesses and OpenShift runs with a variety of mitigating controls. 

However, customers are increasingly asking questions about why pods from Red Hat, and deployed as part of OpenShift, do not follow common hardening recommendations. 

Note that setting readOnlyRootFilesystem to true ensures that the container's root filesystem is mounted as read-only. This setting has nothing to do with host access. 

For more information, see 
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/

Setting the readOnlyRootFilesystem flag to true reduces the attack surface of your containers, preventing an attacker from manipulating the contents of your container and its root file system.

If your container needs to write temporary files, you can specify the ability to mount an emptyDir in the Security Context for your pod as described here. https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-pod 
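
A minimal sketch of that pattern (pod name, container name, and image are illustrative only): the container root filesystem is mounted read-only and an emptyDir covers the temporary-file case.

apiVersion: v1
kind: Pod
metadata:
  name: example-operator
spec:
  containers:
  - name: operator
    image: example.com/operator:latest   # illustrative image
    securityContext:
      readOnlyRootFilesystem: true       # root filesystem becomes read-only
    volumeMounts:
    - name: tmp
      mountPath: /tmp                    # writable scratch space
  volumes:
  - name: tmp
    emptyDir: {}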

The following containers have been identified by customer scans as needing remediation. If your pod will not function with readOnlyRootFilesystem set to true, please document why so that we can document the reason for the exception. 

  • Service Mesh operator with sidecar-injector (this needs some additional investigation as we no longer ship the sidecar-injector with Service Mesh)
  • S2I and Build operators: webhook
  • tekton-pipelines-controller 
  • tekton-chains-controller 
  • openshift-pipelines-operator-cluster-operations 
  • tekton-operator-webhook 
  • openshift-pipelines-operator-lifecycle-event-listener 
  • Pac-webhook (part of Pipelines)
  • Cluster ingress operator: serve-healthcheck-canary 
  • Node tuning operator: Tuned
  • Machine Config Operator: Machine-config-daemon
  • ACM Operator: Klusterlet-manifestwork-agent. This was fixed in ACM 2.10. https://github.com/stolostron/ocm/blob/backplane-2.5/manifests/klusterlet/management/klusterlet-work-deployment.yaml

1. Proposed title of this feature request
[openshift-cloud-credential-operator] - readOnlyRootFilesystem should be explicitly set to true, or to false if required, for security reasons

2. What is the nature and description of the request?
According to security best practice, it's recommended to set readOnlyRootFilesystem: true for all containers running on Kubernetes. Given that openshift-cloud-credential-operator does not set that explicitly, it's requested that this be evaluated and, if possible, set to readOnlyRootFilesystem: true, or otherwise to readOnlyRootFilesystem: false with an explanation of why the filesystem needs to be writable.

3. Why does the customer need this? (List the business requirements here)
Extensive security audits are run on OpenShift Container Platform 4 and are highlighting that many vendor-specific containers either fail to set readOnlyRootFilesystem: true or fail to justify why readOnlyRootFilesystem: false is set.

4. List any affected packages or components.
openshift-cloud-credential-operator

Enable readOnlyRootFilesystem on all of the cloud-credential-operator pods. This will require reverting prior changes that caused the tls-ca-bundler.pem to be mounted in a temporary location and then moved to the default location as part of the cloud-credential-operator pod's command.

Feature Overview (aka. Goal Summary)  

A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:

  •  Have a hotfix process that is customer/support-exception targeted rather than fleet targeted
  • Can take weeks to be available for Managed OpenShift

This feature seeks to provide mechanisms that bring the upper time boundary for delivering such fixes in line with the current HyperShift Operator <24h expectation.

Goals (aka. expected user outcomes)

  • Hosted Control Plane fixes are delivered through Konflux builds
  • No additional upgrade edges
  • Release specific
  • Adequate, fleet representative, automated testing coverage
  • Reduced human interaction

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Overriding Hosted Control Plane components can be done automatically once the PRs are ready and the affected versions have been properly identified
  • Managed OpenShift Hosted Clusters have their Control Plane fixes applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident being solved
  • Fix can be promoted through integration, stage and production canary with a good degree of observability

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed (ROSA and ARO)
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported ROSA/HCP topologies
Connected / Restricted Network All supported ROSA/HCP topologies
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported ROSA/HCP topologies
Operator compatibility CPO and Operators depending on it
Backport needed (list applicable versions) TBD
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) No

Use Cases (Optional):

  • Incident response when the engineering solution is partially or completely in the Hosted Control Plane side rather than in the HyperShift Operator

Out of Scope

  • HyperShift Operator binary bundling

Background

Discussed previously during incident calls. Design discussion document

Customer Considerations

  • Because the Managed Control Plane version does not change but it is overridden, customer visibility and impact should be limited as much as possible.

Documentation Considerations

SOP needs to be defined for:

  • Requesting and approving the fleet wide fixes described above
  • Building and delivering them
  • Identifying clusters with deployed fleet wide fixes

Goal

  • Have a Konflux build for every supported branch on every pull request / merge that modifies the Control Plane Operator

Why is this important?

  • In order to build the Control Plane Operator images to be used for management cluster wide overrides.
  • To be able to deliver managed Hosted Control Plane fixes to managed OpenShift with a similar SLO as the fixes for the HyperShift Operator.

Scenarios

  1. A PR that modifies the control plane in a supported branch is posted for a fix affecting managed OpenShift

Acceptance Criteria

  • Dev - Konflux application and component per supported release
  • Dev - SOPs for managing/troubleshooting the Konflux Application
  • Dev - Release Plan that delivers to the appropriate AppSre production registry
  • QE - HyperShift Operator versions that encode an override must be tested with the CPO Konflux builds that they make

Dependencies (internal and external)

  1. Konflux

Previous Work (Optional):

  1. HOSTEDCP-2027

Open questions:

  1. Antoni Segura Puimedon  How long or how many times should the CPO override be tested?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Konflux App link: <link to Konflux App for CPO>
  • DEV - SOP: <link to meaningful PR or GitHub Issue>
  • QE - Test plan in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Rolling out new versions of HyperShift Operator or Hosted Control Plane components such as HyperShift's Control Plane Operator will no longer carry the possibility of triggering a Node rollout that can affect customer workloads running on those nodes

Goals (aka. expected user outcomes)

The exhaustive list of causes for customer Nodepool rollouts will be:

  • Due to customer direct scaling up/down of the Nodepool
  • Due to a customer change of Hosted Cluster or Nodepool configuration that is documented to incur a rollout

Customers will have visibility on rollouts that are pending so that they can effect a rollout of their affected nodepools at their earliest convenience

Requirements (aka. Acceptance Criteria):

  • Observability:
    • It must be possible to account for all Nodepools with pending rollouts
    • It must be possible to identify all the Hosted Clusters with Nodepools with pending rollouts
    • It must be possible for a customer to see that a Nodepool has pending rollouts
  • Kubernetes expectations on resource reconciliation must be upheld
  • Queued rollouts must survive HyperShift restarts

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Managed (ROSA and ARO)
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported Managed Hosted Control Plane topologies and configurations
Connected / Restricted Network All supported Managed Hosted Control Plane topologies and configurations
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported Managed Hosted Control Plane topologies and configurations
Operator compatibility N/A
Backport needed (list applicable versions) All supported Managed Hosted Control Plane releases
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Yes. Console representation of Nodepools/Machinepools should indicate pending rollouts and allow them to be triggered
Other (please specify)  

Use Cases (Optional):

  • Managed OpenShift fix requires a Node rollout to be applied

Questions to Answer (Optional):

  • Bryan Cox
    • Should we explicitly call out the OCP versions in Backport needed (list applicable versions)?
      • Antoni Segura Puimedon: Depends on what the supported OCP versions are in Managed OpenShift by the time this feature is delivered
    • Is there a missing Goal of this feature, to force a rollout at a particular date? My line of thinking is what about CVE issues on the NodePool OCP version - are we giving them a warning like "hey you have a pending rollout because of a CVE; if you don't update the nodes yourself, we will on such & such date"?
  • Juan Manuel Parrilla Madrid 
    • What are the expectations for a regular customer NodePool upgrade? Will the change be applied directly or queued following the last requested change?
    • This only applies to NodePool changes or also would affect CP upgrades (thinking of OVN changes that could also affect the data plane)?
      • Antoni Segura Puimedon: CP upgrades that would trigger Nodepool rollouts are in scope. OVN changes should only apply if CNO or its daemonsets are going to cause reboots
    • How the customer will trigger the pending rollouts? An alert will trigger in the hosted cluster console?
      • Antoni Segura Puimedon: I guess there are multiple options like scaling down and up and also adding some API to Nodepool
    • I assume we will use a new status condition to reflect the current queue of pending rollouts, it’s that the case?.
      • Antoni Segura Puimedon: That's a good guess. Hopefully we can represent all we want with it or we constrain ourselves to what it can express
    • With "Queued rollouts must survive HyperShift restarts"... What kind of details we wanna store there (“there” should be the place to persist the changes queued), the order, the number of rollouts, the destination Hashes, more info…?
      • Antoni Segura Puimedon: I'll leave that as an open question to refine
        If there is more than one change pending, do we assume there will be more than one reboot?

Out of Scope

  • Maintenance windows
  • Queuing of rollouts on user actions (as that does not meet the Kubernetes reconciliation expectations and is better addressed at either the Cluster Service API level or better yet, at the customer automation side).
  • Forced rollouts of pending updates on a certain date. That is something that should be handled at the Cluster Service level if there is desire to provide it.

Background

Past incidents with fixes to ignition generation resulting in rollout unexpected by the customer with workload impact

Customer Considerations

There should be an easy way to see, understand the content and trigger queued updates

Documentation Considerations

SOPs for the observability above

ROSA documentation for queued updates

Interoperability Considerations

ROSA/HCP and ARO/HCP

Goal

  • Nodepools can hold off on performing a rollout of their nodes until said roll-out is triggered

Why is this important?

  • Allows Managed Service to implement their policies for Node rollouts, whether that is:
  • Node maintenance windows
  • Explicit user confirmation

Scenarios

  • HyperShift upgrade fixes/adds reconciliation of some NodePool spec fields that result in ignition changes. Said upgrade should respect the Nodepool contract of not triggering node replacement/reboot without user/service intervention
  • An important fix for the Nodes (potentially a CVE) comes up and the user wants to trigger a rollout at any given point in time to receive the benefits of the fix.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • SD QE - covered in SD promotion testing
  • Release Technical Enablement - Must have TE slides

Open questions:

  1. What is the right UX to trigger the roll-out in these cases? Should Cluster Service just scale down and up, or can we offer a more powerful/convenient UX to the service operators?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Documentation and SRE enablement
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments. 

Goals

The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context.  As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.  

Key enhancements include observability, and blocking traffic across paths if IPsec encryption is not functioning properly.   

Requirements

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Questions to answer…

  •  

Out of Scope

  • Configuration of external-to-cluster IPsec endpoints for N-S IPsec. 

Background, and strategic fit

The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default.  This encryption must scale to the largest of deployments. 
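
For context, IPsec is driven by the cluster Network operator configuration; a minimal sketch of enabling it for East-West traffic (and the machinery for North-South) uses the same ipsecConfig.mode field that appears in the operator config dump later in this section (valid modes are Disabled, External, and Full):

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      ipsecConfig:
        mode: Full   # encrypt East-West traffic; External only prepares for externally managed North-South IPsec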

Assumptions

  •  

Customer Considerations

  • Customers require the option to use their own certificates or CA for IPsec. 
  • Customers require observability of configuration (e.g. is the IPsec tunnel up and passing traffic)
  • If the IPsec tunnel is not up or otherwise functioning, traffic across the intended-to-be-encrypted network path should be blocked. 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

While running IPsec e2e tests in CI, the data plane traffic is not flowing with the desired traffic type (esp or udp). For example, in ipsec mode External, the traffic type is seen as esp for EW traffic, but it is supposed to be geneve (udp) traffic.

Example CI run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50687/rehearse-50687-pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-ipsec-serial/1789527351734833152

This issue was reproducible on a local cluster after many attempts, and we noticed that ipsec states are not cleaned up on the node, which is residue from a previous test run with ipsec Full mode.
 
[peri@sdn-09 origin]$ kubectl get networks.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-05-13T18:55:57Z"
  generation: 1362
  name: cluster
  resourceVersion: "593827"
  uid: 10f804c9-da46-41ee-91d5-37aff920bee4
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: External
      mtu: 1400
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OVNKubernetes
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false
status:
  conditions:
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2024-05-14T10:13:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2024-05-14T11:50:26Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-05-13T18:57:13Z"
    status: "True"
    type: Available
  readyReplicas: 0
  version: 4.16.0-0.nightly-2024-05-08-222442
[peri@sdn-09 origin]$ oc debug node/worker-0
Starting pod/worker-0-debug-k6nlm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# toolbox
Checking if there is a newer version of registry.redhat.io/rhel9/support-tools available...
Container 'toolbox-root' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-root')
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@worker-0 /]# tcpdump -i enp2s0 -c 1 -v --direction=out esp and src 192.168.111.23 and dst 192.168.111.24
dropped privs to tcpdump
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:07:01.854214 IP (tos 0x0, ttl 64, id 20451, offset 0, flags [DF], proto ESP (50), length 152)
    worker-0 > worker-1: ESP(spi=0x52cc9c8d,seq=0xe1c5c), length 132
1 packet captured
6 packets received by filter
0 packets dropped by kernel
[root@worker-0 /]# exit
exit
 
sh-5.1# ipsec whack --trafficstatus
006 #20: "ovn-1184d9-0-in-1", type=ESP, add_time=1715687134, inBytes=206148172, outBytes=0, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #19: "ovn-1184d9-0-out-1", type=ESP, add_time=1715687112, inBytes=0, outBytes=40269835, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #27: "ovn-185198-0-in-1", type=ESP, add_time=1715687419, inBytes=71406656, outBytes=0, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #26: "ovn-185198-0-out-1", type=ESP, add_time=1715687401, inBytes=0, outBytes=17201159, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #14: "ovn-922aca-0-in-1", type=ESP, add_time=1715687004, inBytes=116384250, outBytes=0, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #13: "ovn-922aca-0-out-1", type=ESP, add_time=1715686986, inBytes=0, outBytes=986900228, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #6: "ovn-f72f26-0-in-1", type=ESP, add_time=1715686855, inBytes=115781441, outBytes=98, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
006 #5: "ovn-f72f26-0-out-1", type=ESP, add_time=1715686833, inBytes=9320, outBytes=29002449, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
sh-5.1# ip xfrm state; echo ' '; ip xfrm policy
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0x7f7ddcf5 reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6158d9a0f4a28598500e15f81a40ef715502b37ecf979feb11bbc488479c8804598011ee 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x18564, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0xda57e42e reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x810bebecef77951ae8bb9a46cf53a348a24266df8b57bf2c88d4f23244eb3875e88cc796 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0xf84f2fcf reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x0f242efb072699a0f061d4c941d1bb9d4eb7357b136db85a0165c3b3979e27b00ff20ac7 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0x9523c6ca reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xe075d39b6e53c033f5225f8be48efe537c3ba605cee2f5f5f3bb1cf16b6c53182ecf35f7 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x10fb2
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0x459d8516 reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xee778e6db2ce83fa24da3b18e028451bbfcf4259513bca21db832c3023e238a6b55fdacc 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x3ec45, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x3142f53a reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6238fea6dffdd36cbb909f6aab48425ba6e38f9d32edfa0c1e0fc6af8d4e3a5c11b5dfd1 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0xeda1ccb9 reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xef84a90993bd71df9c97db940803ad31c6f7d2e72a367a1ec55b4798879818a6341c38b6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x02c3c0dd reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x858ab7326e54b6d888825118724de5f0c0ad772be2b39133c272920c2cceb2f716d02754 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x26f8e
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0xc9535b47 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xd7a83ff4bd6e7704562c597810d509c3cdd4e208daabf2ec074d109748fd1647ab2eff9d 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x53d4c, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0xb66203c8 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xc207001a7f1ed7f114b3e327308ddbddc36de5272a11fe0661d03eaecc84b6761c7ec9c4 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0x2e4d4deb reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x91e399d83aa1c2626424b502d4b8dae07d4a170f7ef39f8d1baca8e92b8a1dee210e2502 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0x52cc9c8d reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xb605451f32f5dd7a113cae16e6f1509270c286d67265da2ad14634abccf6c90f907e5c00 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0xe2735
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0x973119c3 reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x87d13e67b948454671fb8463ec0cd4d9c38e5e2dd7f97cbb8f88b50d4965fb1f21b36199 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x2af9a, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0x4c3580ff reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x2c09750f51e86d60647a60e15606f8b312036639f8de2d7e49e733cda105b920baade029 128
lastused 2024-05-14 14:36:43
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0xa3e469dc reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x1d5c5c232e6fd4b72f3dad68e8a4d523cbd297f463c53602fad429d12c0211d97ae26f47 128
lastused 2024-05-14 14:18:42
anti-replay esn context:
seq-hi 0x0, seq 0xb, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 000007ff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0xdee8476f reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x5895025ce5b192a7854091841c73c8e29e7e302f61becfa3feb44d071ac5c64ce54f5083 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1f1a3
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
 
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir in priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir in priority 1 ptype main 
sh-5.1# cat /etc/ipsec.conf 
# /etc/ipsec.conf - Libreswan 4.0 configuration file
#
# see 'man ipsec.conf' and 'man pluto' for more information
#
# For example configurations and documentation, see https://libreswan.org/wiki/
 
config setup
# If logfile= is unset, syslog is used to send log messages too.
# Note that on busy VPN servers, the amount of logging can trigger
# syslogd (or journald) to rate limit messages.
#logfile=/var/log/pluto.log
#
# Debugging should only be used to find bugs, not configuration issues!
# "base" regular debug, "tmi" is excessive and "private" will log
# sensitive key material (not available in FIPS mode). The "cpu-usage"
# value logs timing information and should not be used with other
# debug options as it will defeat getting accurate timing information.
# Default is "none"
# plutodebug="base"
# plutodebug="tmi"
#plutodebug="none"
#
# Some machines use a DNS resolver on localhost with broken DNSSEC
# support. This can be tested using the command:
# dig +dnssec DNSnameOfRemoteServer
# If that fails but omitting '+dnssec' works, the system's resolver is
# broken and you might need to disable DNSSEC.
# dnssec-enable=no
#
# To enable IKE and IPsec over TCP for VPN server. Requires at least
# Linux 5.7 kernel or a kernel with TCP backport (like RHEL8 4.18.0-291)
# listen-tcp=yes
# To enable IKE and IPsec over TCP for VPN client, also specify
# tcp-remote-port=4500 in the client's conn section.
 
# if it exists, include system wide crypto-policy defaults
include /etc/crypto-policies/back-ends/libreswan.config
 
# It is best to add your IPsec connections as separate files
# in /etc/ipsec.d/
include /etc/ipsec.d/*.conf
sh-5.1# cat /etc/ipsec.d/openshift.conf 
# Generated by ovs-monitor-ipsec...do not modify by hand!
 
 
config setup
    uniqueids=yes
 
conn %default
    keyingtries=%forever
    type=transport
    auto=route
    ike=aes_gcm256-sha2_256
    esp=aes_gcm256
    ikev2=insist
 
conn ovn-f72f26-0-in-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-f72f26-0-out-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-1184d9-0-in-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-1184d9-0-out-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-922aca-0-in-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-922aca-0-out-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-185198-0-in-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-185198-0-out-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
sh-5.1# 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The e2e-aws-ovn-ipsec-upgrade job is currently optional (always_run: false) because the job is not reliable and its success rate is low. It must be made a mandatory CI lane after its relevant issues are fixed.

The CNO rolls out the IPsec machine config (MC) plugin to enable IPsec for the cluster, but it does not check the status of the master and worker role machine config pools to confirm that the plugin was successfully installed on the cluster nodes.
Hence the CNO should be made to listen for MachineConfigPool status updates and set the network operator condition accordingly, based on the IPsec MC plugin rollout status.
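
As a hedged sketch (not the controller code itself), the signal the CNO would need to consume can be inspected manually from the MachineConfigPool status conditions, for example:

$ oc get machineconfigpool master worker
$ oc get machineconfigpool worker -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}'   # "True" once the pool has rolled out the rendered config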

Epic Goal

  • Review design and development PRs that require feedback from NE team.

Why is this important?

  • Customer requires certificates to be managed by cert-manager on configured/newly added routes.

Acceptance Criteria

  • All PRs are reviewed and merged.

Dependencies (internal and external)

  1. CFE team dependency for addressing review suggestions.

Done Checklist

  • DEV - All related PRs are merged.

Update the API godoc to document that manual intervention is required when using .spec.tls.externalCertificate. Something simple like: "The router service account needs to be granted read-only access to this secret; please refer to the OpenShift docs for additional details."
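
For illustration only, a minimal sketch of the manual step the godoc would point at, assuming the default router service account (router in openshift-ingress) and a hypothetical secret my-route-cert in namespace my-app:

$ oc create role secret-reader --verb=get,list,watch --resource=secrets --resource-name=my-route-cert -n my-app
$ oc create rolebinding router-secret-reader --role=secret-reader --serviceaccount=openshift-ingress:router -n my-app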

 

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the "Discussion Needed: Service Delivery Architecture Overview" checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the "Discussion Needed: Service Delivery Architecture Overview" checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn't have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. ...

Open questions::

1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Drive the technical part of the Kubernetes 1.32 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.19 cannot be released without Kubernetes 1.32

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

 

Slack Discussion Channel - https://redhat.enterprise.slack.com/archives/C07V32J0YKF

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.30
  • target is 4.18 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.19 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that CCO uses to v1.32, which keeps it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • To make sure that Hive imports of other OpenShift components do not break when those rebase
  • To avoid breaking other OpenShift components importing from CCO.
  • To pick up upstream improvements

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. Kubernetes 1.32 is released

Previous Work (Optional):

  1. Similar previous epic CCO-595

Done Checklist

  • CI - CI is running, tests are automated and merged.

As a developer, I want to upgrade the Kubernetes dependencies to 1.32

  • to ensure compatibility with the OpenShift cluster

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.32
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.32. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.32. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.19 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.29, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platforms with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.18/master branch.

Dependencies (internal and external)

  1. ART team creating the Go 1.31 image for the upgrade to Go 1.31.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic? -> Yes, updated below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

User or Developer story

As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.32 k8s rebase so that MCO tracks the same k8s version as the rest of the OpenShift cluster on Kubernetes 1.32.

Engineering Details

  • Update the go.mod, go.sum and vendored dependencies to point to the kube 1.32 libraries. This includes all direct Kubernetes-related libraries as well as openshift/api, openshift/client-go, openshift/library-go and openshift/runtime-utils.
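
A minimal sketch of that bump, assuming the usual module paths (the k8s.io client libraries follow the 0.x scheme, so Kubernetes 1.32 maps to v0.32.x):

$ go get k8s.io/api@v0.32.0 k8s.io/apimachinery@v0.32.0 k8s.io/client-go@v0.32.0
$ go get github.com/openshift/api@master github.com/openshift/client-go@master github.com/openshift/library-go@master github.com/openshift/runtime-utils@master
$ go mod tidy && go mod vendor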

Acceptance Criteria:

  • All k8s.io related dependencies should be upgraded to 1.32.
  • openshift/api, openshift/client-go, openshift/library-go and openshift/runtime-utils should be upgraded to the latest commit from the master branch
  • All CI tests must be passing

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Following up on this issue, support for CapacityReservations has been added, but support for Capacity Blocks for ML, which is essential to launch capacity blocks, is still missing.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • MAPI supports CapacityReservations, but it is missing the functionality to support AWS Capacity Blocks.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Add support for the Dxv6 machine series so it can be used for OpenShift deployments on Azure

Goals (aka. expected user outcomes)

As a user, I can deploy OpenShift in Azure using the Dxv6 machine series so that both the Control Plane and Compute Nodes can run on this machine series

Requirements (aka. Acceptance Criteria):

The new machine series that will be available in Azure soon, Dxv6, can be selected at install time to deploy OpenShift on Azure

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Out of Scope

These machine series will be available for use with OpenShift once they are declared GA by Microsoft

Background

ARO will need to support the Dxv6 instance types. These are currently in preview. The specific instance types are:

Documentation Considerations

Usual documentation to list these machine series as tested 

Interoperability Considerations

This feature will be consumed by ARO later

Epic Goal

  • Test, validate and list the Dxv6 machine series as supported machines for OpenShift deployments on Azure

Why is this important?

  • ARO will need to support the Dxv6 instance types. These are currently in preview

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which a cluster-admin can use to monitor progress. The status command/API should also contain data to alert users about potential issues that can make updates problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking longer.
    Customers often have to dig deeper to find the nodes for further debugging. 
    The ask has been to bubble this up on the update progress window.
  2. oc update status ?
    From the UI we can see the progress of the update. From the oc CLI we can see this from "oc get clusterversion",
     but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

     

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new command `oc adm upgrade status` command which is backed by an API.  Please find the mock output of the command output attached in this card.

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process. 
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Description of problem:

Upgrade of the ota-stage cluster from 4.14.1 to 4.15.0-ec.2 got stuck because of the operator-lifecycle-manager-packageserver ClusterOperator, which never reached the desired version (likely because its Pods are CrashLooping, which is a separate issue discussed now on Slack; OCPBUGS-23538 was filed for it).

However, I would expect CVO to enter its "waiting for operator-lifecycle-manager-packageserver up to 40 minutes" state, eventually hit that deadline, and signal the upgrade as stuck via a Failing=True condition, but that did not happen and CVO does not signal anything problematic in this stuck state.

Version-Release number of selected component (if applicable):

upgrade from 4.14.1 to 4.15.0-ec.2

How reproducible:

Unsure

Steps to Reproduce:

1. upgrade from 4.14.1 to 4.15.0-ec.2 and hope you get stuck the way ota-stage did

Actual results:

$ OC_ENABLE_CMD_UPGRADE_STATUS=true ./oc adm upgrade status
An update is in progress for 2h8m20s: Working towards 4.15.0-ec.2: 695 of 863 done (80% complete), waiting on operator-lifecycle-manager-packageserver

Expected results:

$ oc adm upgrade status
Failing=True
  Reason: operator-lifecycle-manager-packageserver is stuck (or whatever is the message)

An update is in progress for 2h8m20s: Working towards 4.15.0-ec.2: 695 of 863 done (80% complete), waiting on operator-lifecycle-manager-packageserver

Additional info

Attached CVO log and the waited-on CO yaml dump

Health insights have a lifecycle that is not suitable for the async producer/consumer architecture USC has right now, where update informers send individual insights to the controller that maintains the API instance. Health insights are expected to disappear and appear following the trigger condition, and this needs to be respected through controller restart, API unavailability etc. Basically this means that the informer should ideally work as a standard, full-reconciliation controller over a set of its insights.

We should also have an easy method to test the health insight lifecycle: an easy on/off switch for an artificial health insight to be reported or not, to avoid relying on real problematic conditions for testing the controller operation. Something like the existence of a label or an annotation on a resource that triggers a health insight directly.
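
As a sketch only (the exact trigger is still an open design choice in this card), the on/off switch could be exercised with an annotation on ClusterVersion; the annotation key below is hypothetical:

$ oc annotate clusterversion version usc.openshift.io/inject-test-health-insight=true    # hypothetical key: artificial insight should appear in the API
$ oc annotate clusterversion version usc.openshift.io/inject-test-health-insight-        # remove the annotation: the insight should disappear again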

Definition of Done

  • When a resource watched by USC (so ClusterVersion ATM) has a certain annotation, USC should produce an artificial health insight
  • The properties of the health insight do not matter but must clearly indicate the health insight is artificial and intended for testing
  • All these scenarios must work:
    • When insight is not in API, USC is running and the annotation is added to CV -> insight is added to API
    • When insight is not in API, USC is running, annotation is not on CV, insight is manually added to API -> insight is removed from API
    • When insight is in the API, USC is running, annotation is removed from CV -> insight is removed from API
    • When insight is in the API, USC is running, insight is manually removed from API -> insight is added back to the API
    • When insight is not in API, USC is stopped, annotation is added to CV, USC is started -> insight is added to API
    • When insight is in the API, USC is stopped, annotation is removed from CV, USC is started -> insight is removed from API
  • In all cases where there is an existing insight in the API and the annotation was never observed removed from the CV, any "refresh" by USC must respect the original properties of the insight (start time, uid, etc)
  • Identical sync mechanism should be used for Status Insights

After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3

The reason is that PROGRESSING=True is not detected for co/machine-config as the status command checks only operator.Status.Versions[name=="operator"] and it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
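
For illustration, both version entries can be inspected directly on the ClusterOperator; the operator-image entry is the pull spec the status command would need to compare:

$ oc get clusteroperator machine-config -o json | jq '.status.versions[] | select(.name=="operator-image")'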

 

For grooming:

It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.

One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:

oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5 

Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.
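
As another hedged option (more useful to a human operator than to the status command itself), the expected operator image can be read from the target release payload:

$ oc adm release info --image-for=machine-config-operator "$(oc get clusterversion version -o jsonpath='{.status.desired.image}')"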

We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.

 

manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.

 

oc#1920 started the work for the status command to recognize the migration, and we need to extend that work to cover the following (comments from Petr's review):

  • "Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)": confusing. We should tell "multi-arch" migration somehow. Or even better: from the current arch to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)" if we could get the origin arch from CV or somewhere else.
    • We have spec.desiredUpdate.architecture since forever, and can use that being Multi as a partial hint.  MULTIARCH-4559 is adding tech-preview status properties around architecture in 4.18, but tech-preview, so may not be worth bothering with in oc code.  Two history entries with the same version string but different digests is probably a reliable-enough heuristic, coupled with the spec-side hint.
  • "Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see if we could find a simple way to hand this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm. But it seems acceptable to Petr that we show N/A for the migration.
    • I think I get "the 97% completion will be reached so fast." now as only MCO has the operator-image pull spec. Other COs claim the completeness immaturely. With that said, "N/A" sounds like the most possible way for now.
  • Node status like "All control plane nodes successfully updated to 4.18.0-ec.3" for control planes and "ip-10-0-17-117.us-east-2.compute.internal Completed". It is technically hard to detect the transaction during migration as MCO annotates only the version. This may become a separate card if it is too big to finish with the current one.
  • "targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once. Now it is in the each iteration of the for loop. We should also comment about why we do it with this hacky way.

Implement a new Informer controller in the Update Status Controller to watch Node resources in the cluster and maintain an update status insight for each. The informer will need to interact with additional resources such as MachineConfigPools and MachineConfigs, e.g. to discover the OCP version tied to the config that is being reconciled on the Node, but should not attempt to maintain the MachineConfigPool status insights. Generally the node status insight should carry enough data for any client to be able to render a line like the ones oc adm upgrade status currently shows:

NAME                                      ASSESSMENT    PHASE      VERSION       EST   MESSAGE
build0-gstfj-ci-prowjobs-worker-b-9lztv   Degraded      Draining   4.16.0-ec.2   ?     failed to drain node: <node> after 1 hour. Please see machine-config-controller logs for more informatio
build0-gstfj-ci-prowjobs-worker-d-ddnxd   Unavailable   Pending    ?             ?     Machine Config Daemon is processing the node
build0-gstfj-ci-tests-worker-b-d9vz2      Unavailable   Pending    ?             ?     Not ready
build0-gstfj-ci-tests-worker-c-jq5rk      Unavailable   Updated    4.16.0-ec.3   -     Node is marked unschedulable

The basic expectations for Node status insights are described in the design docs but the current source of truth for the data structure is the NodeStatusInsight structure from https://github.com/openshift/api/pull/2012 .

Definition of Done

  • During the upgrade, the status api contains a Node status insight for each Node in the cluster
  • Do not bother with the status insight lifecycle (when a Node is removed from the cluster, the status insight should technically disappear, but do not address that in this card; a suitable lifecycle mechanism for this does not exist yet and OTA-1418 will address it)
  • Overall the functionality should match what the client-based oc adm upgrade status command checks
  • The NodeStatusInsight should have correctly populated: name, resource, poolResource, scopeType, version, estToComplete and message fields, following the existing logic from oc adm upgrade status
  • Health insights are out of scope
  • Status insights for MCPs are out of scope
  • The Updating condition has a similar meaning and interpretation like in the other insights.
    • When its status is False, it will contain a reason which needs to be interpreted. Three known reasons are Pending, Updated and Paused:
      • Pending: Node will eventually be updated but has not started yet
      • Updated: Node already underwent the update.
      • Paused: Node is running an outdated version but something is pausing the process (like parent MCP .spec.paused field)
    • When Updating=True, there are also three known reasons: Draining, Updating and Rebooting.
      • Draining: MCO drains the node so it can be updated and rebooted
      • Updating: MCO applies the new config and prepares the node to be rebooted into the new OS version
      • Rebooting: MCO is rebooting the node, after which it (hopefully) becomes ready again
  • The Degraded and Unavailable condition logic should match the existing assessment logic from oc adm upgrade status

Extend the Control Plane Informer in the Update Status Controller so it watches ClusterOperator resources in the cluster and maintains an update status insight for each.

The actual API structure for an update status insights needs to be taken from whatever state https://github.com/openshift/api/pull/2012 is at the moment. The story does not include the actual API form nor how it is exposed by the cluster (depending on the state of API approval, the USC may still expose the API as a ConfigMap or an actual custom resource), it includes just the logic to watch ClusterOperator resources and producing a matching set of cluster operator status insights.

The basic expectations for cluster operator status insights are described in the design docs

Definition of Done

  • During the control plane upgrade, the status API contains a ClusterOperator status insight for each platform ClusterOperator in the cluster (hopefully we can use the same selection logic as in the prototype)
  • Outside of control plane update, the cluster operator status insights are not present
  • Updating condition:
    • If a CO is updating right now (Progressing=True && does not have a target operator version), then Updating=True and a suitable reason
    • Otherwise, if it has target version, Updating=False and Reason=Updated
    • Otherwise, if it does not have target version, Updating=False and Reason=Pending
  • Healthy condition
    • Corresponds to the existing checks in the client prototype, taking into account the thresholds (Healthy=True if the "bad" condition is not holding long enough yet, but we may invent a special Reason for this case)
  • This card does *not* involve creating Health Insights when there are unhealthy operators
  • This card does *not* involve updating a ClusterVersion status insight from the CO-related data (such as completeness percentage or duration estimate)

OTA-1266 and OTA-1268 created sufficient content for us to put together an OpenShift enhancement document and an API proposal and get the necessary reviews started.

Definition of Done

  • Enhancement PR opened
  • API PR opened

On the call to discuss the oc adm upgrade status roadmap to a server-side implementation (notes) we agreed on the basic architectural direction and we can start moving in that direction:

  • status API will be backed by a new controller
  • new controller will be a separate binary but delivered in the CVO image (=release payload) to avoid needing new ClusterOperator
  • new operator will maintain a singleton resource of a new UpgradeStatus CRD - this is the interface to the consumers

Let's start building this controller; we can implement the controller to perform the functionality currently present in the client, and just expose it through an API. I am not sure how to deal with the fact that we won't have the API merged until it merges into o/api, which is not soon. Maybe we can implement the controller over a temporary fork of o/api and rely on manually inserting the CRD into the cluster when we test the functionality? Not sure.

We need to avoid committing to implementation details and investing effort into things that may change though.

Definition of Done

  • CVO repository has a new controller (a new cluster-version-operator cobra subcommand sounds like a good option; an alternative would be a completely new binary included in the CVO image)
  • The payload contains manifests (SA, RBAC, Deployment) to deploy the new controller when DevPreviewNoUpgrade feature set is enabled (but not TechPreview)
  • The controller uses properly scoped minimal necessary RBAC through a dedicated SA
  • The controller will react on ClusterVersion changes in the cluster through an informer
  • The controller will maintain a single ClusterVersion status insight as specified by the Update Health API Draft
  • The controller does not need to maintain all fields precisely: it can use placeholders or even ignore fields that need more complicated logic over more resources (estimated finish, completion, assessment)
  • The controller will publish the serialized CV status insight (in YAML or JSON) through a ConfigMap (this is a provisional measure until we can get the necessary API and client-go changes merged) under a key that identifies the kube resource ("cv-version")
  • The controller only includes the necessary types code from o/api PR together with the necessary generated code (like deepcopy). These local types will need to be replaced with the types eventually merged into o/api and vendored to o/cluster-version-operator

Testing notes

This card only brings a skeleton of the desired functionality to the DevPreviewNoUpgrade feature set. Its purpose is mainly to enable further development by putting the necessary bits in place so that we can start developing more functionality. There's not much point in automating testing of any of the functionality in this card, but it should be useful to start getting familiar with how the new controller is deployed and what are its concepts.

For seeing the new controller in action:

1. Launch a cluster that includes both the code and manifests. As of Nov 11, #1107 is not yet merged so you need to use launch 4.18,openshift/cluster-version-operator#1107 aws,no-spot
2. Enable the DevPreviewNoUpgrade feature set. CVO will restart and will deploy all functionality gated by this feature set, including the USC. It can take a bit of time, ~10-15m should be enough though.
3. Eventually, you should be able to see the new openshift-update-status-controller Namespace created in the cluster
4. You should be able to see a update-status-controller Deployment in that namespace
5. That Deployment should have one replica running and being ready. It should not crashloop or anything like that. You can inspect its logs for obvious failures and such. At this point, its log should, near its end, say something like "the ConfigMap does not exist so doing nothing"
6. Create the ConfigMap that mimics the future API (make sure to create it in the openshift-update-status-controller namespace): oc create configmap -n openshift-update-status-controller status-api-cm-prototype
7. The controller should immediately-ish insert a usc-cv-version key into the ConfigMap. Its content is a YAML-serialized ClusterVersion status insight (see design doc). As of OTA-1269 the content is not that important, but the (1) reference to the CV (2) versions field should be correct.
8. The status insight should have a condition of Updating type. It should be False at this time (the cluster is not updating).
9. Start upgrading the cluster (it's a cluster bot cluster with ephemeral 4.18 version so you'll need to use --to-image=pullspec and probably force it)
10. While updating, you should be able to observe the controller activity in the log (it logs some diffs), but also the content of the status insight in the ConfigMap changing. The versions field should change appropriately (and startedAt too), and the Updating condition should become True.
11. Eventually the update should finish and the Updating condition should flip to False again.
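
To inspect the insight produced in steps 6-10 above, the serialized content can be read back from the ConfigMap (key name as described in step 7):

$ oc get configmap -n openshift-update-status-controller status-api-cm-prototype -o jsonpath='{.data.usc-cv-version}'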

Some of these will turn into automated testcases, but it does not make sense to implement that automation while we're using the ConfigMap instead of the API.

Spun out of https://issues.redhat.com/browse/MCO-668

This aims to capture the work required to rotate the MCS-ignition CA + cert.

 

Original description copied from MCO-668:

Today in OCP there is a TLS certificate generated by the installer , which is called "root-ca" but is really "the MCS CA".

A key derived from this is injected into the pointer Ignition configuration under the "security.tls.certificateAuthorities" section, and this is how the client verifies it's talking to the expected server.

If this key expires (and by default the CA has a 10 year lifetime), newly scaled up nodes will fail in Ignition (and fail to join the cluster).

The MCO should take over management of this cert, and the corresponding user-data secret field, to implement rotation.

Reading:

 - There is a section in the customer facing documentation that touches on this: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html

 - There's a section in the customer facing documentation for this: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html that needs updating for clarification.
 
 - There's a pending PR to openshift/api: https://github.com/openshift/api/pull/1484/files

 - Also see old (related) bug: https://issues.redhat.com/browse/OCPBUGS-9890 
 

 - This is also separate to https://issues.redhat.com/browse/MCO-499 which describes the management of kubelet certs

The CA/cert generated by the installer is not currently managed and also does not preserve the signing key, so the cert controller we are adding in the MCO (leveraged from library-go) throws away everything and starts fresh. Normally this happens fairly quickly, so both the MCS and the user-data secrets are updated together. However, in certain cases (such as agent-based installations) where a bootstrap node joins the cluster late, it will have the old CA from the installer, and unfortunately the MCS will have a TLS cert signed by the new CA - resulting in invalid TLS cert errors.

To account for such cases, we have to ensure the first CA embedded in any machine matches the format expected by the cert controller. To do this, we'll have to do the following in the installer:

  • Have the bootstrap MCC generate the CA/TLS cert using the cert controller, and populate them into the right places (this card)
  • Make changes in the installer to remove the creation of the CA/cert, since the bootstrap MCC will now handle this (https://issues.redhat.com/browse/MCO-1458) 
  • Template out all root-ca artifacts in the format expected by the library-go cert controller. This would involve adding certain annotations on the artifacts (with respect to validity of the cert and some other ownership metadata)
  • The root CA signing key is currently discarded by the installer, so this will have to be a new template in the installer.

The machinesets in the machine-api namespace reference a user-data secret (one per pool, and it can be customized) which stores the initial Ignition stub configuration pointing to the MCS, and the TLS cert. Today this does not get updated after creation.
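
For reference, a hedged sketch of where that stub and CA live today; worker-user-data is the installer's default secret name and may differ for customized pools:

$ oc get secret -n openshift-machine-api worker-user-data -o jsonpath='{.data.userData}' | base64 -d | jq '.ignition.security.tls.certificateAuthorities'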

 

The MCO now has the ability to manage some fields of the machineset object as part of the managed bootimage work. We should extend that to also sync the updated user-data secrets for the Ignition TLS cert.

 

The MCC should be able to parse both install-time-generated machinesets and user-created ones, so as not to break compatibility. One way users use this today is with a custom secret + machineset to set Ignition fields the MCO does not manage, for example to partition disks for different device types for nodes in the same pool. Extra care should be taken not to break this use case.

Feature Overview (aka. Goal Summary)  

As a cluster admin, I can use a single command to review the full upgrade checklist before I trigger an update.
Create an update precheck command that helps customers identify potential issues before triggering an OpenShift cluster upgrade, without blocking the upgrade process. This tool aims to reduce upgrade failures and support tickets by surfacing common issues beforehand.

Goals (aka. expected user outcomes)

  • Enable users (especially those with limited OpenShift expertise) to identify potential upgrade issues before starting the upgrade
  • Reduce the number of failed upgrades and support tickets
  • Provide clear, actionable information about cluster state relevant to upgrades
  • Help customers make informed decisions about when to initiate upgrades

Requirements (aka. Acceptance Criteria):

  • Check Pod Disruption Budgets (PDBs):
    • Identify existing PDBs that might impact the upgrade
    • Display information about PDBs in a way that's understandable to users with limited Kubernetes experience
    • workaround - Check DVO PDB checks
  • Image Registry Access Verification:
    • Validate access to required image repositories
    • Pre-check ability to pull images needed for the upgrade
    • Verify connectivity to public registries or repository of choice
    • workaround : Image pinning GA 
  • Node Health Verification:
    • Check for unavailable nodes
    • Identify nodes in maintenance mode
    • Detect unscheduled nodes
    • Verify overall node health status
  • Core Platform Component Health:
    • Verify health of control plane workloads
    • Check core platform operators' health
  • Alert Analysis:
    • List any active critical alerts
    • Display relevant warning alerts
    • Focus on alerts that could impact upgrade success
  • Version-Specific Checks:
    • Include checks specific to the target upgrade version
    • Verify requirements for new features or changes between versions
    • Check networking-related requirements (e.g., SDN to OVN migrations)
  • Output Requirements:
    • Provide clear, understandable output for users without deep OpenShift knowledge
    • Don't block upgrades even if issues are found
    • Present information in an easily digestible format
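
For illustration, several of the checks above already have rough manual equivalents today (a sketch of commands an admin could run by hand, not the proposed implementation):

$ # PDBs that currently allow zero disruptions, a common cause of stuck node drains
$ oc get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
$ # node health: look for NotReady or SchedulingDisabled nodes
$ oc get nodes
$ # core platform operator health: look for Degraded or not-Available operators
$ oc get clusteroperators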

==============

New additions in 2025

  • MCP status 
    • Check the maxUnavailable 
    • Compare maxUnavailable to the request level or current load level (if above request level) and determine if this is the correct setting
    • Check to see if the MCPs are paused
  • Make a note if etcd is backed up 
  • Other operators
    • Note which operators are set to manual vs automatic update
    • Check to determine the next update of all OLM based operators
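
As a concrete illustration of the MCP checks above, a minimal sketch using standard oc output (the column paths assume the usual MachineConfigPool schema):

$ oc get mcp -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused,MAXUNAVAILABLE:.spec.maxUnavailable,MACHINES:.status.machineCount,UPDATED:.status.updatedMachineCount

A paused pool or an overly small maxUnavailable would be flagged by the precheck rather than blocking the update.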

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Self-managed
Classic (standalone cluster) standalone
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all All
Connected / Restricted Network All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

  • Blocking upgrade execution
  • Checking entire cluster state
  • Verifying non-platform workloads
  • Automated issue resolution
  • Comprehensive cluster health checking
  • Extensive operator compatibility verification beyond core platform
  • ACM integration
    • Although Operations will use ACM for day-2 operations, Customer Engineering will use the CLI for patching, updating, prechecks, etc.

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

  • Target users may have limited Kubernetes/OpenShift expertise
  • Many users coming from VMware background
  • Customers often don't have TAM or premium support
  • Users may not be familiar with platform-specific concepts
  • Need to accommodate users who prefer not to read extensive documentation

Documentation Considerations

Add docs for precheck command

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

OCPSTRAT-1834 is requesting an oc precheck command that helps customers identify potential issues before triggering an OpenShift cluster upgrade. For 4.18, we built a tech-preview oc adm upgrade recommend (OTA-1270, product docs) that sits in this "Anything I should think about before updating my cluster [to 4.y.z]?" space, and this Epic is about extending that subcommand with alerts to deliver the coverage requested by OCPSTRAT-1834.

Why is this important?

We currently document some manual checks for customer admins to run before launching an update. For example, RFE-5104 asks us to automate the checks we hope customers are making against critical alerts. But updating the production OpenShift Update Service is complicated, and it's easier to play around in a tech-preview oc subcommand while we get a better idea of what information is helpful and which presentation approaches are most accessible. 4.18's OTA-902 / cvo#1907 folded the Upgradeable condition in as a client-side Conditional Update risk, and this Epic proposes to continue in that direction by retrieving update-relevant alerts and folding those in as additional client-side Conditional Update risks.

Scenarios

As a cluster administrator interested in launching an OCP update, I want to run an oc command that talks to me about my next-hop options, including any information related to known regressions with those target releases, and also including any information about things I should consider addressing in my current cluster state.

Dependencies

The initial implementation can be delivered unilaterally by the OTA updates team. The implementation may surface ambiguous or hard-to-actuate alert messages, and those messages will need to be improved by the component team responsible for maintaining that alert. 

Contributing Teams (and contacts)

  • Development - OTA
  • Documentation - no docs required
  • QE - OTA
  • PX - OTA
  • Others -

Acceptance Criteria

OCPSTRAT-1834 customer is happy

Drawbacks or Risk

Client-side Conditional Update risks are helpful for cluster administrators who use that particular client. But admins who use older oc or who are using the in-cluster web-console and similar will not see risks known only to newer oc. If we can clearly tie a particular cluster state to update risk, declaring that risk via the OpenShift Update Service would put the information in front of all cluster administrators, regardless of which update interface they use.

However, trialing update risks client-side in tech-preview oc and then possibly promoting them to risks served by the OpenShift Update Service in the future might help us identify cluster state that's only weakly coupled to update success but still interesting enough to display. Or help us find more accessible ways of displaying that context before putting the message in front of large chunks of the fleet.

Done - Checklist

  • CI Testing - Tests are merged and completing successfully
  • Documentation - No docs.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

oc adm upgrade recommend should retrieve alerts from the cluster (similar to how oc adm upgrade status already does), and inject them as conditional update risks (similar to how OTA-902 / cvo#1907 injected Upgradeable issues). The set of alerts to include is:

Definition of done / test-plan:

  1. Find a cluster with both update recommendations and some of the mentioned alerts firing.
  2. Run OC_ENABLE_CMD_UPGRADE_RECOMMEND=true OC_ENABLE_CMD_UPGRADE_RECOMMEND_PRECHECK=true oc adm upgrade recommend.
  3. Confirm that the command calls out the expected alerts for all update targets.
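
For reference, a rough manual equivalent of the alert-gathering step (a sketch that assumes the in-cluster Thanos Querier route in openshift-monitoring and a logged-in user with monitoring access):

$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $(oc whoami -t)" "https://${HOST}/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing",severity=~"critical|warning"}' | jq '.data.result | length'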

Description of problem

The precheck functionality introduced by OTA-1426 flubbed the Route lookup implementation, and crashes with a seg-fault. Thanks to Billy Pearson for reporting.

Version-Release number of selected component

4.19 code behind both the OC_ENABLE_CMD_UPGRADE_RECOMMEND=true and OC_ENABLE_CMD_UPGRADE_RECOMMEND_PRECHECK=true feature gates.

How reproducible

Every time.

Steps to Reproduce

$ export OC_ENABLE_CMD_UPGRADE_RECOMMEND=true  # overall 'recommend' command still tech-preview
$ export OC_ENABLE_CMD_UPGRADE_RECOMMEND_PRECHECK=true  # pre-check functionality even more tech-preview
$ ./oc adm upgrade recommend

Actual results

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3bd3cdd]

goroutine 1 [running]:
github.com/openshift/client-go/route/clientset/versioned/typed/route/v1.NewForConfig(0x24ec35c?)
	/go/src/github.com/openshift/oc/vendor/github.com/openshift/client-go/route/clientset/versioned/typed/route/v1/route_client.go:31 +0x1d
github.com/openshift/oc/pkg/cli/admin/upgrade/recommend.(*options).alerts(0xc000663130, {0x5ed4828, 0x8350a80})
	/go/src/github.com/openshift/oc/pkg/cli/admin/upgrade/recommend/alerts.go:34 +0x170
github.com/openshift/oc/pkg/cli/admin/upgrade/recommend.(*options).precheck(0x0?, {0x5ed4828, 0x8350a80})
...

Expected results

Successful precheck execution and reporting.

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Introduce a "Created Time" column to the Job listing in the OpenShift Container Platform (OCP) console to enhance the ability to sort jobs by their creation date. This feature will help users efficiently manage and navigate through numerous jobs, particularly in environments with frequent CronJob executions and a high volume of job runs.

 

Acceptance Criteria:

  1. Add a "Created Time" column to the Job listing in the OCP console. 
  1. Display the creation timestamp in a format consistent with the console's date and time standards.
  2. Enable sorting of jobs by the "Created Time" column.

Feature Overview (aka. Goal Summary)  

We need to maintain our dependencies across all the libraries we use in order to stay in compliance. 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

Inconsistency in the loader/spinner/dots component used throughout the unified console. The dots animation is used widely through the spoke clusters, but it is not a Patternfly component (this component was originally inherited from the CoreOS console). Spoke clusters also use skeleton states on certain pages, which is a Patternfly component. Hub uses a mix of the dots animation first for half a second, and then spinners and skeletons.

 

Currently there is a discussion with PF to update with clearer guidelines. According to the current PF guidelines, we should be using the large spinner if we cannot anticipate what data is being loaded, and the skeleton state if we do know. Link to doc

Currently the console is using TypeScript 4, which is preventing us from upgrading to NodeJS-22. Due to that, we need to update to TypeScript 5 (not necessarily the latest version).

 

AC:

  • Update TypeScript to version 5
  • Update ES build target to ES-2021

 

Note: In case of higher complexity we should split the story into multiple stories, one per console package.
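
A minimal sketch of the dependency bump, assuming the console repo's existing yarn workflow (the exact 5.x version and the per-package type-check task are assumptions to be settled during the work):

$ yarn add --dev typescript@^5
$ # then bump "target" to "ES2021" in the relevant tsconfig.json files and re-run the package's type-check task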

In Console 4.18 we introduced an initial Content Security Policy (CSP) implementation (CONSOLE-4263).

This affects both the Console web application and any dynamic plugins loaded by Console. In production, CSP violations are sent to the telemetry service for analysis (CONSOLE-4272).

We need a reliable way to detect new CSP violations as part of our automated CI checks. We can start with testing the main dashboard page of Console and expand to more pages as necessary.

Acceptance criteria:

  • Console project provides a script to test for CSP violations.
  • CSP violation test script does not report any errors for Console.

As a developer I want to make sure we are running the latest version of webpack in order to take advantage of the latest benefits and also keep current so that future updating is as painless as possible.

We are currently on v4.47.0.

Changelog: https://webpack.js.org/blog/2020-10-10-webpack-5-release/

By updating to version 5 we will need to update following pkgs as well:

  • html-webpack-plugin
  • webpack-bundle-analyzer
  • copy-webpack-plugin
  • fork-ts-checker-webpack-plugin

AC: Update webpack to version 5 and determine what should be the ideal minor version.
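
A hedged sketch of the upgrade itself, assuming yarn and the plugin list above (exact versions and the "build" script name are assumptions to be pinned as part of the story):

$ yarn add --dev webpack@^5 webpack-cli html-webpack-plugin copy-webpack-plugin fork-ts-checker-webpack-plugin webpack-bundle-analyzer
$ yarn build   # run the repo's build task to verify the production build still succeeds with the new toolchain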

Epic Goal

  • Migrate all components to functional components
  • Remove all HOC patterns
  • Break the file down into smaller files
  • Improve type definitions
  • Improve naming for better self-documentation
  • Address any React anti-patterns like nested components, or mirroring props in state.
  • Address issues with handling binary data
  • Add unit tests to these components

Acceptance Criteria

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The GenericSecretForm component needs to be refactored to address several tech debt issues:

  • Rename to OpaqueSecretForm
  • Refactor into a function component
  • Remove i18n withTranslation HOC pattern
  • Improve type definitions

The KeyValueEntryForm component needs to be refactored to address several tech debt issues:

  • Rename to OpaqueSecretFormEntry
  • Refactor into a function component
  • Remove i18n withTranslation HOC pattern
  • Improve type definitions

The BasicAuthSubform component needs to be refactored to address several tech debt issues:

  • Rename to BasicAuthSecretForm
  • Refactor into a function component
  • Remove i18n withTranslation HOC pattern
  • Improve type definitions

The SourceSecretForm component needs to be refactored to address several tech debt issues:

  • Rename to AuthSecretForm
  • Refactor into a function component
  • Remove i18n withTranslation HOC pattern
  • Improve type definitions

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Provide quality user experience for customers connecting their Pods and VMs to the underlying physical network through OVN Kubernetes localnet.

Why is this important?

This is a continuation of https://issues.redhat.com/browse/SDN-5313.

It covers the UDN API for localnet and other improvements.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
    • This must be done downstream too
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • OVN Kubernetes secondary networks with the localnet topology can be created through ClusterUserDefinedNetworks
  • When possible, user input is validated and any configuration issue is shown on the UDN. Alternatively some issues can be shown on CNI ADD events on Pod
  • Definition of these networks can be changed even if there are Pods connected to them. When that happens, the UDN is marked as degraded until all the "old" pods are gone. The mutable fields should be: MTU, VLAN, physnet name
  • A single "bridge-mappings" "localnet" can be referenced from multiple different UDNs
  • The default MTU set for localnet is 1500
  • Pod requesting UDN without a VLAN is able to connect to services running on the host's network
  • (stretch) The "physnet" mapping is a "supported API" and available to users - so they can connect to the machine network without a need to configure a custom bridge-mapping we should just always request user to configure the mapping themselves, until we understand all the implications of non-NORMAL mode on br-ex and how it works with local access / bondings / ...
  • (stretch) Scheduling is managed by the platform - if a UDN requests a localnet (as in bridge-mappins.localnet), the Pod requesting this UDN will be only scheduled on a node with this resource available. This can use the same mechanism as the SR-IOV operator - combination of device plugins and "k8s.v1.cni.cncf.io/resourceName" annotation

...

IPAM is not in the scope of this epic. See RFE-6947.

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 
 
 

 

Feature Overview (aka. Goal Summary)  

In order for Managed OpenShift Hosted Control Planes to run as part of the Azure Redhat OpenShift, it is necessary to support the new AKS design for secrets/identities.

Goals (aka. expected user outcomes)

Hosted Cluster components use the secrets/identities provided/referenced in the Hosted Cluster resources creation.

Requirements (aka. Acceptance Criteria):

All OpenShift Hosted Cluster components running with the appropriate managed or workload identity.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Managed
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported ARO/HCP topologies
Connected / Restricted Network All supported ARO/HCP topologies
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported ARO/HCP topologies
Operator compatibility All core operators
Backport needed (list applicable versions) OCP 4.18.z
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify)  

Background

This is a follow-up to OCPSTRAT-979 required by an AKS sweeping change to how identities need to be handled.

Documentation Considerations

Should only affect ARO/HCP documentation rather than Hosted Control Planes documentation.

Interoperability Considerations

Does not affect ROSA or any of the supported self-managed Hosted Control Planes platforms

Goal

Scenarios

  1. CAPZ supports authenticating with NewUserAssignedIdentityCredential.
  2. ASO supports authenticating with NewUserAssignedIdentityCredential.
  3. Cloud Provider supports authenticating with NewUserAssignedIdentityCredential.
  4. Azure Disk CSI Driver supports authenticating with NewUserAssignedIdentityCredential.
  5. Azure File CSI Driver supports authenticating with NewUserAssignedIdentityCredential.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. External - dependent on upstream communities accepting the changes needed to support authenticating with NewUserAssignedIdentityCredential.
  2. External - dependent on Microsoft having the SDK ready prior to HyperShift's work on this epic.

Previous Work (Optional):

Open questions:

  1. This information is retrieved from a 1P Microsoft application; to my knowledge, there is no way for HyperShift to test this in our current environments?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As an ARO HCP user, I want to be able to:

  • authenticate with NewUserAssignedIdentityCredential for Cloud Provider

so that I can

  • be compliant with Microsoft security standards when using managed identities to authenticate with.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Updated upstream code in Cloud Provider to support authenticating with NewUserAssignedIdentityCredential
  • Pull the upstream Cloud Provider PR into OpenShift's Cloud Provider once the Cloud Provider PR is merged

Out of Scope:

N/A

Engineering Details:

  • N/A

 This does not require a design proposal.
 This does not require a feature gate.

User Story:

As an ARO HCP user, I want to be able to:

  • authenticate with NestedCredentialsObject for CAPZ

so that I can

  • be compliant with Microsoft security standards when using managed identities to authenticate with.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Updated upstream code in CAPZ to support authenticating with NestedCredentialsObject
  • Pull the upstream CAPZ PR into OpenShift's CAPZ once the CAPZ PR is merged

Out of Scope:

N/A

Engineering Details:

 This does not require a design proposal.
 This does not require a feature gate.

Goal

  • Today, the current HyperShift Azure API for Control Plane Managed Identities (MI) stores the client ID and its certificate name for each MI. The goal for this epic is to modify this API to instead allow a NestedCredentialsObject to be stored for each Control Plane MI.
  • In ARO HCP, CS will store the NestedCredentialsObject for each Control Plane MI, in its JSON format, in Azure Key Vault under a secret name for each MI. The secret name for a Control Plane MI will be provided to the HyperShift Azure API (i.e. HostedCluster). The control plane operator will read and parse the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID for each Control Plane MI and either pass or use this data to use ClientCertificate authentication for each Control Plane component that needs to authenticate with Azure Cloud.

Why is this important?

  • As part of the msi-dataplane repository walk-through, a gap was found in the way ARO HCP is approaching authentication as managed identities for control plane components. The gap was that we're not overriding the ActiveDirectoryAuthorityHost as requested by the MSI team when authenticating as a managed identity. This prompted a wider discussion with HyperShift, which led to the proposal here: allowing HyperShift to use the full nested credentials objects and leverage the fields they need within the struct.

Scenarios

  1. The HyperShift Azure API supports only a secret name for each Control Plane MI (instead of a client ID and certificate name today).
  2. The Control Plane Operator, using the SecretsStore CSI Driver, will retrieve the NestedCredentialsObject from Azure Key Vault and mount it to a volume in any pod needing to authenticate with Azure Cloud.
  3. The Control Plane Operator, possibly through a parsing function from the library-go or msi-dataplane repo, will parse the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID from the NestedCredentialsObject and either use or pass this data along to authenticate with ClientCertificate. This will be done for each control plane component needing to authenticate to Azure Cloud.
  4. Remove the filewatcher functionality from HyperShift and in OpenShift repos (CIO, CIRO, CNO/CNCC)

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. External - dependent on upstream communities accepting the changes needed to support ActiveDirectoryAuthorityHost in ClientCertificateCredentialOptions.
  2. External - dependent on Microsoft having the SDK ready prior to HyperShift's work on this epic.

Previous Work (Optional):

  1. Previous Microsoft work:
    1. https://github.com/Azure/msi-dataplane/pull/29
    2. https://github.com/Azure/msi-dataplane/pull/30 

Open questions:

  1. This information is retrieved from a 1P Microsoft application; to my knowledge, there is no way for HyperShift to test this in our current environments?
  2. Can HyperShift get a mock/real example of the JSON structure that would be stored in the Key Vault? (to be used in development, unit testing since we cannot retrieve a real version of this in our current test environments).

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an ARO HCP user, I want to be able to:

  • have the Secret Store CSI driver retrieve the NestedCredentialsObject from an Azure Key Vault based on the control plane component's secret name in the HyperShift Azure API for these control plane components: CAPZ, Cloud Provider, KMS, and CPO.
  • the corresponding HCP pod needing to authenticate with Azure cloud reads the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID from the NestedCredentialsObject and uses the data to authenticate with Azure Cloud

so that

  • the NestedCredentialsObject is mounted in a volume in the pod needing to authenticate with Azure Cloud
  • the ClientCertificate authentication is using the right fields needed for managed identities in ARO HCP.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation 
  • Update all the SecretProviderClasses to pull from the new HyperShift Azure API field holding the secret name
  • Update each HyperShift HCP component to use UserAssignedIdentityCredentials
  • Remove the filewatcher functionality from HyperShift

Out of Scope:

Updating any external OpenShift components that run in the HCP

Engineering Details:

 This does not require a design proposal.
This does not require a feature gate.

As an ARO HCP user, I want to be able to:

  • have the Secret Store CSI driver retrieve the UserAssignedIdentityCredentials from an Azure Key Vault based on the control plane component's secret name in the HyperShift Azure API for these control plane components: CNO, CIRO, CSO, and CIO.
  • the corresponding HCP pod needing to authenticate with Azure cloud can read the file path to the UserAssignedIdentityCredentials object and use the data to authenticate with Azure Cloud

so that

  • the UserAssignedIdentityCredentials is mounted in a volume in the pod needing to authenticate with Azure Cloud

Acceptance Criteria:

Description of criteria:

  • Upstream documentation 
  • Update all the SecretProviderClasses to pull from the new HyperShift Azure API field holding the secret name
  • Update each HyperShift HCP component to use UserAssignedIdentityCredentials
  • Remove the filewatcher functionality from OpenShift repos (CIO, CIRO, CNO/CNCC)

Out of Scope:

Updating any HyperShift-only components that run in the HCP

Engineering Details:

 This does not require a design proposal.
This does not require a feature gate.

User Story:

As an ARO HCP user, I want to be able to:

  • add the secret name for each control plane managed identity in the HyperShift Azure API 

so that I can

  • provide the secret name to the Secrets CSI driver.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation 
  • Deprecate storing client ID and certificate name for each control plane managed identity in the HyperShift Azure API
  • Make the client ID and certificate name for each control plane managed identity in the HyperShift Azure API optional instead of required
  • Add the capability to store the secret name for each control plane managed identity in the HyperShift Azure API; this functionality should be behind the tech preview flag

Out of Scope:

Removing the deprecated client ID and certificate name fields. This should be done at a later date when CS only supports the new API

Engineering Details:

This does not require a design proposal.
This requires a feature gate.

Summary

The installation process for the OpenShift Virtualization Engine (OVE) has been identified as a critical area for improvement to address customer concerns regarding its complexity compared to competitors like VMware, Nutanix, and Proxmox. Customers often struggle with disconnected environments, operator configuration, and managing external dependencies, making the initial deployment challenging and time-consuming. 

To resolve these issues, the goal is to deliver a streamlined, opinionated installation workflow that leverages existing tools like the Agent-Based Installer, the Assisted Installer, and the OpenShift Appliance (all sharing the same underlying technology) while pre-configuring essential operators and minimizing dependencies, especially the need for an image registry before installation.

By focusing on enterprise customers, particularly VMware administrators working in isolated networks, this effort aims to provide a user-friendly, UI-based installation experience that simplifies cluster setup and ensures quick time-to-value.

Objectives and Goals

Primary Objectives

  • Simplify the OpenShift Virtualization installation process to reduce complexity for enterprise customers coming from VMware vSphere.
  • Enable installation in disconnected environments with minimal prerequisites.
  • Eliminate the dependency on a pre-existing image registry in disconnected installations.
  • Provide a user-friendly, UI-driven installation experience for users used to VMware vSphere.

Goals

  • Deliver an installation experience leveraging existing tools like the Agent-Based Installer, Assisted Installer, and OpenShift Appliance, i.e. the Assisted Service.
  • Pre-configure essential operators for OVE and minimize external day 1 dependencies (see OCPSTRAT-1811 "Agent Installer interface to install Operators") 
  • Ensure successful installation in disconnected environments with standalone OpenShift, with minimal requirements and no pre-existing registry

Personas

Primary Audience 

VMware administrators transitioning to OpenShift Virtualization in isolated/disconnected environments.

Pain Points

  • Lack of UI-driven workflows; writing YAML files is a barrier for the target user (virtualization platforms admins)
  • Complex setup requirements (e.g., image registries in disconnected environments).
  • Difficulty in configuring network settings interactively.
  • Lack of understanding when to use a specific installation method
  • Hard time finding the relevant installation method (docs or at console.redhat.com)

Technical Requirements

Image Registry Simplification

  • Eliminate the dependency on an existing external image registry for disconnected environments.
  • Support a workflow similar to the OpenShift Appliance model, where users can deploy a cluster without external dependencies.

Agent-Based Installer Enhancements

  • Extend the existing UI to capture all essential data points (e.g., cluster details, network settings, storage configuration) without requiring YAML files.
  • Install without a pre-existing registry in disconnected environment
  • Install required operators for virtualization
  • OpenShift Virtualization Reference Implementation Guide v1.0.2
  • List of Operators:
    • OpenShift Virtualization Operator
    • Machine and Node Configuration
    • Machine Config Operator
    • Node Health Check Operator
    • Fence Agents Remediation Operator
    • Additional Operators
    • Node Maintenance Operator
    • OpenShift Logging
    • MetalLB
    • Migration Toolkit for Virtualization
    • Migration Toolkit for Containers
    • Compliance Operator
    • Kube Descheduler Operator
    • NUMA Resources Operator
    • Ansible Automation Platform Operator
    • Network
    • NMState Operator
    • Node Failure
    • Self Node Remediation Operator
    • Disaster Discovery
    • OADP
    • ODF
  • Note: we need each operator owner to enable the operator to allow its installation via the installer. We won't block the release due to not having the full list of operators included and they'll be added as required and prioritized with each team.

User experience requirements

The first area of focus is a disconnected environment. We target these environments with the Agent-Based Installer

The current docs for installing on disconnected environment are very long and hard to follow.

Installation without pre-existing Image Registry

The image registry is required in disconnected installations before the installation process can start. We must simplify this point so that users can start the installation with one image, without having to explicitly stand up a registry first.

This isn't a new requirement; in the past we analyzed options for this and even did a POC. We could revisit this point, see Deploy OpenShift without external registry in disconnected environments.

The OpenShift Appliance can in fact be installed without a registry. 

Additionally, we started work in this direction AGENT-262 (Strategy to complete installations where there isn't a pre-existing registry).

We also had the field (Brandon Jozsa) doing a POC which was promising:

https://gist.github.com/v1k0d3n/cbadfb78d45498b79428f5632853112a 

User Interface (no configuration files)

The type of user coming from VMware vSphere expects a UI. They aren't used to writing YAML files, and this has been identified as a blocker for some of them. We must provide a simple UI to stand up a cluster.

Proposed Workflow 

Simplified Disconnected Installation:

This document written by Zane Bitter as the Staff Engineer of this area contains an interactive workflow proposal.

This is the workflow proposed in the above document.

 

PRD and notes from regular meetings

Epic Goal

  • Setup a workflow to generate an ISO that will contain all the relevant pieces to install an OVE cluster

Why is this important?

  • As per OCPSTRAT-1874, the user must be able to install an OVE cluster into a disconnected environment, with the help of a UI, and without being explicitly required to set up an external registry

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

 

Previous work:

Dependencies (internal and external)

  1. ...

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This sub-command will be used to generate the ignition file based on the interactive disconnected workflow.

This command will be invoked by the builder script (currently within the appliance tool) to support generating the ISO.

It will also consume, in the future, the eventual (portion of the) install configuration that the user will provide via the connected UI (above the sea level).

Description (2/20/25):
Create a script in agent-installer-utils/tools/ove-builder to build a live ISO using the appliance. The script will:

1. Checkout the appliance code.
2. Generate appliance-config.yaml.
3. Build the live ISO with the command:

sudo podman run --rm -it --pull newer --privileged --net=host -v $APPLIANCE_ASSETS:/assets:Z $APPLIANCE_IMAGE build live-iso

4. Take the pull secret as input
5. Unpack the ISO, embed TUI and Web UI components
6. Repack the ISO.
 
Refer to Appliance PR #335 for how release images are added in
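
A hypothetical invocation of such a script (the script name and flags below are illustrative only, not the final interface):

$ ./tools/ove-builder/build-live-iso.sh --version 4.19.0 --arch x86_64 --pull-secret ~/pull-secret.json
$ # expected output: an appliance live ISO with the agent TUI (and, as a stretch goal, the assisted web UI) embedded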

Acceptance Criteria:

  • The script takes version, arch, pull secret as mandatory inputs, generates appliance-config.yaml
  • The script also successfully generates appliance ISO.
  • Agent TUI is a part of this ISO
  • Assisted web UI is a part of this ISO. Stretch goal
    ----------------------------------------

Original description:

 

Deploy a script (in dev-scripts) that will take as input a release image version.

The script should perform the following tasks (possibly within a container):

  • Extract the release installer (via oc extract)
  • Generate the unconfigured ignition (via the openshift-install agent create unconfigured-ignition)
  • Download the release ISO
  • Embed the generated ignition in the ISO

This will be used as a starting point for generating the image that the user will download to install the OVE cluster, and subsequently could be expanded with the additional images required (for example, the UI)
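
A rough sketch of those steps as individual commands (assuming recent oc, openshift-install, and coreos-installer binaries; the release pullspec and the generated ignition file name are illustrative):

$ RELEASE_IMAGE=quay.io/openshift-release-dev/ocp-release:4.19.0-x86_64   # illustrative pullspec
$ oc adm release extract --command=openshift-install --to=. "${RELEASE_IMAGE}"
$ ./openshift-install agent create unconfigured-ignition
$ ISO_URL=$(./openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.iso.disk.location')
$ curl -Lo rhcos-live.iso "${ISO_URL}"
$ coreos-installer iso ignition embed -i unconfigured-agent.ign rhcos-live.iso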

 

Notes:

  • we should also find a way to embed the agent TUI
  • not required to address disconnected registry right now

Epic Goal

  • Allow the user to select a host to be Node 0 interactively after booting the ISO. On each host the user would be presented with a choice between two options:
  1. Select this host as the rendezvous host (it will become part of the control plane)
  2. The IP address of the rendezvous host is: [Enter IP]

(If the former option is selected, the IP address should be displayed so that it can be entered in the other hosts.)

Why is this important?

  • Currently, when using DHCP the user must determine which IP address is assigned to at least one of the hosts prior to generating the ISO. (OpenShift requires infinite DHCP leases anyway, so no extra configuration is required but it does mean trying to manually match data with an external system.) AGENT-385 would extend a similar problem to static IPs that the user is planning to configure interactively, since in that case we won't have the network config to infer them from. We should permit the user to delay collecting this information until after the hosts are booted and we can discover it for them.

Scenarios

  1. In a DHCP network, the user creates the agent ISO without knowing which IP addresses are assigned to the hosts, then selects one to act as the rendezvous host after booting.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. AGENT-7

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As an admin, I want to be able to:

  • Have an interactive generic installation image that I can use for all nodes. Since it is a single image for all the nodes, I need to be able to select on boot whether the node is node0 (and future master) or a regular node.
  • Have the TUI checks take into account whether the node is node0 or not to perform additional checks (like a connectivity check to the rendezvous IP)

so that I can achieve

  • Interactive installation with a single image

Acceptance Criteria:

Description of criteria:

  • A dialog is presented on boot asking whether this node should be the one that controls the installation (node0)
  • On regular nodes additional connectivity checks are performed towards rendezvous IP
  • TUI writes Node0 configuration so the blocked node0 services can proceed (after network configuration and registry checks)

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Goal

Remove all persistent volumes and claims. Also check whether there are any CNS volumes that could be removed; the PV/PVC deletion should account for that.

Why is this important?

When an OpenShift cluster on vSphere with CSI volumes is destroyed the volumes are not deleted, leaving behind multiple objects within vSphere. This leads to storage usage by orphan volumes that must be manually deleted.

Multiple customers have requested this feature and we need this feature for CI. PV(s) are not cleaned up and leave behind CNS orphaned volumes that cannot be removed.

Epic Goal

The goal of this epic is, upon destroy, to remove all persistent volumes and claims. Also check whether there are any CNS volumes that could be removed; the PV/PVC deletion should account for that.

Why is this important?

  • Multiple customers have requested this feature and we need this feature for CI. PV(s) are not cleaned up and leave behind CNS orphaned volumes that cannot be removed.

As an OpenShift engineer I want the installer to make sure there are no leftover CNS volumes, so we are not leaking volumes that could be taking disk space or alerting in vCenter.

Acceptance criteria

  • Check whether there are CNS volumes and whether there are still PVs from a previous delete. If there are still CNS volumes and no PVs, try to delete them; if unsuccessful, just return the list.
  • An "are you sure" warning
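
A hedged sketch of the kind of check involved, using the PV list on the cluster side and govc's CNS volume commands on the vCenter side (assuming a govc build with CNS support; output handling is illustrative):

$ # vSphere CSI PVs still known to the cluster
$ oc get pv -o json | jq -r '.items[] | select(.spec.csi.driver == "csi.vsphere.vmware.com") | .metadata.name'
$ # CNS volumes still present in vCenter, candidates for cleanup on destroy
$ govc volume.ls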

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Some Kubernetes clusters do not have direct Internet access and rely solely on proxies for communication, so OLM v1 needs to support proxies to enable this communication.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Some Kubernetes clusters do not have direct Internet access and rely solely on proxies for communication. This may be done for isolation, testing or to enhance security and minimise vulnerabilities. This is a fully supported configuration in OpenShift, with origin tests designed to validate functionality in proxy-based environments. Supporting proxies is essential to ensure your solution operates reliably within these secure and compliant setups.
To address this need, we have two key challenges to solve:

  1. Enable catalogd and the operator-controller to work with OpenShift proxy configurations. 
  2. Implement a solution to pass the proxy configuration for the solutions managed by OLM so that they work well with the proxy.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  1. Enable catalogd and the operator-controller to work with OpenShift proxy configurations. 
  2. Implement a solution to pass the proxy configuration for the solutions managed by OLM so that they work well with the proxy.
  3. Trusted CA support
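
For context, a minimal sketch of how the cluster-wide proxy settings that catalogd and operator-controller need to honor can be inspected (these are typically surfaced to workloads as HTTP_PROXY / HTTPS_PROXY / NO_PROXY environment variables):

$ oc get proxy/cluster -o jsonpath='{.status.httpProxy}{"\n"}{.status.httpsProxy}{"\n"}{.status.noProxy}{"\n"}'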

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network Restricted Network
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions) 4.18
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Proxy support for operator-controller when communicating with catalogd.
  • Adding code to support proxies upstream (the environment variable mechanism can be used).
  • Support for CSV-defined proxies.
  • oc-mirror support (which should already be there)

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

OpenShift’s centralized proxy control via the proxies.config.openshift.io (a.k.a. proxy.config.openshift.io) resource makes managing proxies across a cluster easier. At the same time, vanilla Kubernetes requires a manual and decentralized proxy configuration, making it more complex to manage, especially in large clusters. There is no native Kubernetes solution that can adequately address the need for centralized proxy management.

Kubernetes lacks a built-in unified API like OpenShift’s proxies.config.openshift.io, which can streamline proxy configuration and management across any Kubernetes vendor. Consequently, Kubernetes requires more manual work to ensure the proxy configuration is consistent across the cluster, and this complexity increases with the scale of the environment. As such, vanilla Kubernetes does not provide a solution that can natively address proxy configuration across all clusters and vendors without relying on external tools or complex manual processes (such as that devised by OpenShift).

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Support OpenShift Cluster Proxy

See RFC for more details 

Why is this important?

  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

It looks like OLMv1 doesn't handle proxies correctly, aws-ovn-proxy job is permafailing https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy/1861444783696777216

I suspect it's on the OLM operator side, are you looking at the cluster-wide proxy object and wiring it into your containers if set?
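To make the question concrete, a hedged sketch of the kind of wiring being asked about: an operator deployment consuming the cluster-wide proxy settings as the standard environment variables (the deployment name, namespace, image, and values are illustrative, not the actual OLM manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: operator-controller                   # illustrative name
  namespace: openshift-operator-controller    # illustrative namespace
spec:
  selector:
    matchLabels:
      app: operator-controller
  template:
    metadata:
      labels:
        app: operator-controller
    spec:
      containers:
      - name: manager
        image: example.com/operator-controller:latest   # placeholder image
        env:                                  # values would be taken from the cluster Proxy object status
        - name: HTTP_PROXY
          value: http://proxy.example.com:3128
        - name: HTTPS_PROXY
          value: http://proxy.example.com:3128
        - name: NO_PROXY
          value: .cluster.local,.svc,127.0.0.1,localhost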

Feature Overview (aka. Goal Summary)  

Once CCM was moved out-of-tree for Azure, the 'azurerm_user_assigned_identity' resource the Installer creates is no longer required. To make sure the Installer only needs the minimum permissions required to deploy OpenShift on Azure, this resource created at install time needs to be removed.

Goals (aka. expected user outcomes)

The installer no longer creates the 'azurerm_user_assigned_identity' resource, which is not required for the Nodes anymore.

Requirements (aka. Acceptance Criteria)

The Installer only creates the minimum permissions required to deploy OpenShift on Azure

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Once CCM was moved out-of-tree, this permission is not required anymore. We are implementing this change in 4.19 and backporting it to 4.18.z.

At the same time, for customers running previous OpenShift releases, we will test upgrades between EUS releases (4.14.z - 4.16.z - 4.18.z) where the `azurerm_user_assigned_identity` resource has been removed beforehand, to ensure the upgrade process works with no issues and OpenShift does not report any issues because of this change.

Customer Considerations

A KCS will be created for customers running previous OpenShift releases who want to remove this resource

Documentation Considerations

The new permissions requirements will be documented

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Remove automatic (opinionated) creation (and attachment) of identities to Azure nodes
  • Allow API to configure identities for nodes

Why is this important?

  • Creating and attaching identities to nodes requires elevated permissions
  • The identities are no longer required (or used) so we can reduce the required permissions

Scenarios

  1. Users want to do a default ipi install that just works without the User Access Admin role
  2. Users want to BYO user-assigned identity (requires some permissions)
  3. Users want to use a system assigned identity

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as GA: update the docs, add testing, and remove the feature gate to enable it by default.

Requirements (aka. Acceptance Criteria):

Tests and CI must pass. We should identify all OCP shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

 

  1. As a storage vendor I want my customers to benefit from the VolumeGroupSnapshot feature included in my CSI driver.
  2. As a backup/DR software vendor I want  to use the VolumeGroupSnapshot feature.
  3. As a customer I want access to use VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs or use a backup/DR solution that leverages VolumeGroupSnapshot

Out of Scope

CSI drivers development/support for this feature.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

This allows backup vendors to implement advanced features by taking snapshots of multiple volumes at the same time, a common use case in virtualisation.
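As an illustration of the API shape, a minimal sketch based on the upstream KEP linked above: a group snapshot selects multiple PVCs by label and snapshots them together. The API version, namespace, class name, and label are illustrative and may differ from what ships in OCP.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1   # version may differ at GA
kind: VolumeGroupSnapshot
metadata:
  name: app-group-snapshot
  namespace: my-app                                 # illustrative namespace
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass # illustrative class name provided by the CSI driver
  source:
    selector:
      matchLabels:
        app: my-app                                 # all PVCs carrying this label are snapshotted together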

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

We already have TP doc content. Update the OCP drivers table to include this capability, and check whether any new driver beyond ODF supports it.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.

Epic Goal

Support the upstream "VolumeGroupSnapshot" feature in OCP as GA, i.e. test it and have docs for it.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. External: the feature is currently scheduled for GA in Kubernetes 1.32, i.e. OCP 4.19.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Enable OpenShift to be deployed on Confidential VMs on GCP using AMD SEV-SNP technology

Goals (aka. expected user outcomes)

Users deploying OpenShift on GCP can choose to deploy Confidential VMs using AMD SEV-SNP technology to rely on confidential computing to secure the data in use

Requirements (aka. Acceptance Criteria):

As a user, I can choose OpenShift Nodes to be deployed with the Confidential VM capability on GCP using AMD SEV-SNP technology at install time
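For illustration only, a hedged install-config sketch modeled on the existing GCP Confidential VM (SEV) support; the exact enum value for SEV-SNP is an assumption here and may differ in the final implementation, and machine types and project values are placeholders:

apiVersion: v1
baseDomain: example.com
platform:
  gcp:
    projectID: my-project
    region: us-central1
controlPlane:
  platform:
    gcp:
      type: n2d-standard-8                 # SEV-SNP needs a supported machine family, e.g. N2D
      onHostMaintenance: Terminate         # confidential VMs cannot be live-migrated
      confidentialCompute: AMDEncryptedVirtualizationNestedPaging   # assumed SEV-SNP value; existing SEV support uses "Enabled"
compute:
- name: worker
  platform:
    gcp:
      type: n2d-standard-4
      onHostMaintenance: Terminate
      confidentialCompute: AMDEncryptedVirtualizationNestedPaging   # assumed value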

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

This is a piece of a higher-level effort to secure data in use with OpenShift on every platform

Documentation Considerations

Documentation on how to use this new option must be added as usual

Epic Goal

  • Add support to deploy Confidential VMs on GCP using AMD SEV-SNP technology

Why is this important?

  • As part of the Zero Trust initiative we want to enable OpenShift to support data in use protection using confidential computing technologies

Scenarios

  1. As a user I want all my OpenShift Nodes to be deployed as Confidential VMs on Google Cloud using SEV-SNP technology

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. We enabled Confidential VMs for GCP using SEV technology already - OCPSTRAT-690

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.

In phase 1, we provided tech preview for GCP.

In phase 2, GCP support goes to GA and AWS goes to TP.

In phase 3, AWS support goes to GA.

In phase 4, vSphere opt-in goes to TP and GCP goes to opt-out.

Requirements

This epic will encompass work required to switch boot image updates on GCP to be opt-out. 

The origin tests should be:

This should land before MCO-1584 lands.

Based on discussion on the enhancement, we have decided that we'd like to add an explicit opt-out option and a status field for the ManagedBootImages knob in the MachineConfiguration object. 

More context here:

https://github.com/openshift/enhancements/pull/1761#discussion_r1987873170
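For context, a sketch of the ManagedBootImages knob on the cluster MachineConfiguration object as it exists today; the explicit opt-out mode and the status field discussed above are still under design in the linked enhancement, so the comment below marks them as assumptions:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
    - resource: machinesets
      apiGroup: machine.openshift.io
      selection:
        mode: All    # an explicit opt-out mode (and a corresponding status field) is what this card proposes; exact name TBD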

The boot image controller should ensure `-user-data` stub secrets are at least spec 3. This requires the cert management work to land first.

To ensure maximum coverage and minimum risk, we will only attempt to upgrade stub secrets that are currently spec 2. While we could potentially upgrade all stubs to the newest spec supported by the MCO (which at the moment is 3.4.0), this may cause issues like https://issues.redhat.com/browse/MCO-1589 for some boot images that only support early spec 3 ignition (certain older boot images can only do 3.0.0 and 3.1.0 ignition). Newer boot images can support all spec 3 stubs, so to preserve scaling ability as much as we can, we'll leave spec 3 stubs as they are for the moment.

Feature Overview (aka. Goal Summary)  

As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of the Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:

  • Reducing cloud costs through instance selection and scaling/descaling
  • Support GPUs, spot instances, mixed compute types and other compute types.
  • Automatic node lifecycle management and upgrades

This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.

The goal is to roll this out with ROSA-HCP (AWS) first, since it has the more mature Karpenter ecosystem, followed by the ARO-HCP (Azure) implementation (refer to OCPSTRAT-1498).

This feature will be delivered in 3 Phases:

  • Dev Preview: Autonode with HCP (OCPSTRAT-943) – targeting OCP 4.19
  • Preview (Tech Preview): Autonode for ROSA-HCP (OCPSTRAT-1946) - TBD (2025)
  • GA: Autonode for ROSA-HCP – TBD (2025)

The Dev Preview release will expose AutoNode capabilities on Hosted Control Planes for AWS (note this is not meant to be productized on self-managed OpenShift). It includes the following capabilities:

  • Service Consumer opts-in to AutoNode on Day 1 and Day 2
  • Service Provider lifecycles Karpenter management side
  • Cluster Admin gains access to Karpenter CRDs and default nodeClass
  • Cluster Admin creates a NodePool and scale out workloads
  • Service Consumer signals cluster control plane upgrade
  • Expose Karpenter metrics to Cluster Admin

Goals (aka. expected user outcomes)

  1. Run Karpenter in management cluster and disable CAS
  2. Automate node provisioning in workload cluster
  3. Automate lifecycle management in workload cluster
  4. Reduce cost for heterogeneous compute workloads
  5. Additional Karpenter features

OpenShift AutoNode (a.k.a. Karpenter) Proposal

Requirements (aka. Acceptance Criteria):

As a cluster-admin or SRE I should be able to configure Karpenter with OCP on AWS. Both the CLI and the UI should enable users to configure Karpenter and disable CAS.

  1. Run Karpenter in management cluster and disable CAS
  2. OCM API 
    • Enable/Disable Cluster autoscaler
    • Enable/disable AutoNode feature
    • New ARN role configuration for Karpenter
    • Optional: New managed policy or integration with existing nodepool role permissions
  3. Expose NodeClass/Nodepool resources to users. 
  4. secure node provisioning and management, machine approval system for Karpenter instances
  5. HCP Karpenter cleanup/deletion support
  6. ROSA CAPI fields to enable/disable/configure Karpenter
  7. Write end-to-end tests for karpenter running on ROSA HCP

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Managed, i.e. ROSA-HCP
Classic (standalone cluster) N/A
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all MNO
Connected / Restricted Network Connected
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64, ARM (aarch64)
Operator compatibility  
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) yes - console
Other (please specify) OCM, rosa-cli, ACM, cost management

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Self-managed standalone OCP, self-hosted HCP, ROSA classic are out-of-scope.
  • Creating a multi-provider cost/pricing operator compatible with CAPI is beyond the scope of this Feature. That may take more time.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

  • Karpenter.sh is an open-source node provisioning project built for Kubernetes. It is designed to simplify Kubernetes infrastructure by automatically launching and terminating nodes based on the needs of your workloads. Karpenter can help you to reduce costs, improve performance, and simplify operations.
  • Karpenter works by observing the unscheduled pods in your cluster and launching new nodes to accommodate them. Karpenter can also terminate nodes that are no longer needed, which can help you save money on infrastructure costs.
  • Karpenter's architecture has karpenter-core and a karpenter-provider (for AWS, karpenter-provider-aws) as components. 
    The core contains the cloud-agnostic scheduling and consolidation logic that works out how to satisfy pending pods at the lowest cost by provisioning and re-provisioning nodes, while the provider contains the AWS-specific code that talks to the cloud API (see the NodePool sketch below).
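To ground the NodeClass/NodePool resources mentioned in the requirements above, a minimal NodePool sketch in the upstream karpenter.sh API; the requirement values, limits, and the nodeClass name are illustrative, and the exact CRD surface exposed to ROSA HCP cluster admins may be a subset of this:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]     # mixed compute: spot and on-demand
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # e.g. the default nodeClass provided by the service
  limits:
    cpu: "100"                            # cap total provisioned capacity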

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

  • Ability to enable AutoNode/Karpenter during installation and post-cluster installation
  • Ability to run AutoNode/Karpenter and Cluster Autoscaler at the same time
  • Use with an OpenShift cluster with Hosted Control Planes
  • CAPI to enable/disable/configure AutoNode/Karpenter
  • Have AutoNode/Karpenter perform data plane upgrades
  • Designed for FIPS / FIPS compatible
  • Enable cost effective mixed compute with auto-provisioning from/to zero
  • Provide Karpenter metrics for monitoring and reporting purposes

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

  • Migration guides from using CAS to Karpenter
  • Performance testing to compare CAS vs Karpenter on ROSA HCP
  • API documentation for NodePool and EC2NodeClass configuration

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

https://docs.google.com/document/d/1ID_IhXPpYY4K3G_wa1MYJxOb3yz5FYoOj3ONSkEDsZs/edit?tab=t.0#heading=h.yvv1wy2g0utk

Goal

  • Validate the implementation details are prod ready before removing the feature gate

Why is this important?

Scenarios

  • Remove feature gate
  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • create/update Karpenter resources directly without dealing with unstructured types  

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Karpenter CRDs are vendored in Hypershift code
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

This is currently hardcoded to this value and clobbered by the controller in the default ec2nodeclass. https://github.com/openshift/hypershift/blob/f21f250592ea74b38e8d79555ab720982869ef5e/karpenter-operator/controllers/karpenter/karpenter_controller.go#L362

We need to default to a more neutral name and possibly stop clobbering it so it can be changed.

I see creating the role as an admin task. For HyperShift "the project" this can be automated by the CLI, so it will be created when the CLI creates a cluster, with a known name, e.g. karpenter-role-infra-id.

While VAP is OK for implementing shared ownership, it has some drawbacks: e.g. it forces us to change the built-in CEL rules of the API for required fields, which is a maintenance burden and error prone. Besides, it doesn't give us control over future API pivots we might need to execute to satisfy business needs, e.g. expose a field for dual stream support which requires picking a different AMI, or expose a field for kubeletconfig that lets us include that in the payload generation...
We should be ready to pivot to having our own class which exposes only a subset of the upstream one, with a controller that just renders the upstream one.

Feature Overview (aka. Goal Summary)  

The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.

BYO Identity will help facilitate CLI-only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes. 

Goals (aka. expected user outcomes)

Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.

The OpenShift OAuth server is still available as the default option, with the ability to plug in the external OIDC provider as a Day-2 configuration.

Requirements (aka. Acceptance Criteria):

  1. The customer should be able to tie into RBAC functionality, similar to how it is closely aligned with OpenShift OAuth 
  2.  

Use Cases (Optional):

  1. As a customer, I would like to integrate my OIDC Identity Provider directly with the OpenShift API server.
  2. As a customer in multi-cluster cloud environment, I have both K8s and non-K8s clusters using my IDP and hence I need seamless authentication directly to the OpenShift/K8sAPI using my Identity Provider 
  3.  

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

Items listed for GA: https://issues.redhat.com/browse/OCPSTRAT-1804

  • Multiple IDPs
  • Removing resources related to Oauth server when it is disabled. 
  • Metrics equivalence. We provided access attempts for OAuth. With KAS we cannot provide exactly the same functionality, but we are looking to provide ways to review the audit log and get this information. 
  • Exec plugins other than oc 
  • OIDC workflows other than the Auth Code flow (such as the device code grant workflow, implicit flow, etc.).

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.

 
Why is this important? (mandatory)

OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.

 
Scenarios (mandatory) 

  • As a customer, I want to integrate my OIDC Identity Provider directly with OpenShift so that I can fully use its capabilities in machine-to-machine workflows.
  • As a customer in a hybrid cloud environment, I want to seamlessly use my OIDC Identity Provider across all of my fleet.

 
Dependencies (internal and external) (mandatory)

  • Support in the console/console-operator (already completed)
  • Support in the OpenShift CLI `oc` (already completed)

Contributing Teams(and contacts) (mandatory) 

  • Development - OCP Auth
  • Documentation - OCP Auth
  • QE - OCP Auth
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • external OIDC provider can be configured to be used directly via the kube-apiserver to issue tokens
  • built-in oauth stack no longer operational in the cluster; respective APIs, resources and components deactivated
  • changing back to the built-in oauth stack possible

Drawbacks or Risk (optional)

  • Enabling an external OIDC provider to an OCP cluster will result in the oauth-apiserver being removed from the system; this inherently means that the two API Services it is serving (v1.oauth.openshift.io, v1.user.openshift.io) will be gone from the cluster, and therefore any related data will be lost. It is the user's responsibility to create backups of any required data.
  • Configuring an external OIDC identity provider for authentication by definition means that any security updates or patches must be managed independently from the cluster itself, i.e. cluster updates will not resolve security issues relevant to the provider itself; the provider will have to be updated separately. Additionally, new functionality or features on the provider's side might need integration work in OpenShift (depending on their nature).

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

In order to remove the RoleBindingRestriction CRD from the cluster, as outlined in the updates to the original OIDC enhancement proposal in CNTRLPLANE-69, we will have to make it such that CVO no longer manages it. This means updating the cluster-authentication-operator such that it is responsible for ensuring the CRD is present on the cluster.

The CAO and KAS-o both need to work and enable structured authentication configuration for the KAS static pods.

CAO:

  • a controller tracks the auth CR for auth type OIDC
  • generates structured auth config object and serializes it into a configmap
  • syncs the configmap into openshift-config

KAS-o:

  • a config observer tracks the auth CR for type OIDC
  • syncs the auth configmap from openshift-config into openshift-kube-apiserver and enables the `--authentication-config` CLI arg for the KAS pods
  • the auth-metadata and webhook-authenticator config observers remove their resources and CLI args accordingly
  • a revision controller syncs that configmap into a static file
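For reference, the configmap content generated by the CAO would be a structured authentication configuration along these lines; this is a sketch using the upstream AuthenticationConfiguration type, and the issuer URL, audience, and claim mappings are illustrative values only:

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://login.example.com/tenant-id/v2.0   # illustrative issuer URL
    audiences:
    - openshift-client-id                           # illustrative client ID
  claimMappings:
    username:
      claim: email                                  # illustrative claim
      prefix: ""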

Description of problem:
This is a bug found during pre-merge testing of the 4.18 epic AUTH-528 PRs and filed for better tracking, per the existing "OpenShift - Testing Before PR Merges - Left-Shift Testing" google doc workflow.

co/console degraded with AuthStatusHandlerDegraded after OCP BYO external oidc is configured and then removed (i.e. reverted back to OAuth IDP).

Version-Release number of selected component (if applicable):

Cluster-bot build which is built at 2024-11-25 09:39 CST (UTC+800)
build 4.18,openshift/cluster-authentication-operator#713,openshift/cluster-authentication-operator#740,openshift/cluster-kube-apiserver-operator#1760,openshift/console-operator#940

How reproducible:

Always (tried twice, both hit it)

Steps to Reproduce:

1. Launch a TechPreviewNoUpgrade standalone OCP cluster with above build. Configure htpasswd IDP. Test users can login successfully.

2. Configure BYO external OIDC in this OCP cluster using Microsoft Entra ID. KAS and console pods can roll out successfully. oc login and console login to Microsoft Entra ID can succeed.

3. Remove BYO external OIDC configuration, i.e. go back to original htpasswd OAuth IDP:
[xxia@2024-11-25 21:10:17 CST my]$ oc patch authentication.config/cluster --type=merge -p='
spec: 
  type: ""
  oidcProviders: null
'
authentication.config.openshift.io/cluster patched

[xxia@2024-11-25 21:15:24 CST my]$ oc get authentication.config  cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2024-11-25T04:11:59Z"
  generation: 5
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: e814f1dc-0b51-4b87-8f04-6bd99594bf47
  resourceVersion: "284724"
  uid: 2de77b67-7de4-4883-8ceb-f1020b277210
spec:
  oauthMetadata:
    name: ""
  serviceAccountIssuer: ""
  type: ""
  webhookTokenAuthenticator:
    kubeConfig:
      name: webhook-authentication-integrated-oauth
status:
  integratedOAuthMetadata:
    name: oauth-openshift
  oidcClients:
  - componentName: cli
    componentNamespace: openshift-console
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 95fbae1d-69a7-4206-86bd-00ea9e0bb778
      issuerURL: https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/v2.0
      oidcProviderName: microsoft-entra-id


KAS and console pods indeed can roll out successfully; and now oc login and console login indeed can succeed using the htpasswd user and password:
[xxia@2024-11-25 21:49:32 CST my]$ oc login -u testuser-1 -p xxxxxx
Login successful.
...

But co/console degraded, which is weird:
[xxia@2024-11-25 21:56:07 CST my]$ oc get co | grep -v 'True *False *False'
NAME                                       VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.18.0-0.test-2024-11-25-020414-ci-ln-71cvsj2-latest   True        False         True       9h      AuthStatusHandlerDegraded: Authentication.config.openshift.io "cluster" is invalid: [status.oidcClients[1].currentOIDCClients[0].issuerURL: Invalid value: "": oidcClients[1].currentOIDCClients[0].issuerURL in body should match '^https:\/\/[^\s]', status.oidcClients[1].currentOIDCClients[0].oidcProviderName: Invalid value: "": oidcClients[1].currentOIDCClients[0].oidcProviderName in body should be at least 1 chars long]

Actual results:

co/console degraded, as above.

Expected results:

co/console is normal.

Additional info:

    

In order to fully transition the management of the RoleBindingRestriction CRD to the cluster-authentication-operator, we also need to update the openshift/installer to use the cluster-authentication-operator render subcommand to add the RoleBindingRestriction CRD to the set of manifests applied during cluster bootstrapping.

Without the RoleBindingRestriction CRD included in the set of bootstrap manifests, the authorization.openshift.io/RestrictSubjectBindings admission plugin will prevent the creation of system:* RoleBindings during the installation process.

Feature Overview
This is a TechDebt and doesn't impact OpenShift Users.
As the autoscaler has become a key feature of OpenShift, there is a requirement to continue to expand its use, bringing all the features to all the cloud platforms and contributing to the community upstream. This feature is to track the initiatives associated with the Autoscaler in OpenShift.

Goals

  • Scale from zero available on all cloud providers (where available)
  • Required upstream work
  • Work needed as a result of rebase to new kubernetes version

Requirements

Requirement Notes isMvp?
vSphere autoscaling from zero   No
Upstream E2E testing   No 
Upstream adapt scale from zero replicas   No 
     

Out of Scope

n/a

Background, and strategic fit
Autoscaling is a key benefit of the Machine API and should be made available on all providers

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update docs to mention any change to where the features are available.

Epic Goal

  • Update the scale from zero autoscaling annotations on MachineSets to conform with the upstream keys, while also continuing to accept the openshift specific keys that we have been using.

Why is this important?

  • This change makes our implementation of the cluster autoscaler conform to the API that is described in the upstream community. This reduces the mental overhead for someone that knows kubernetes but is new to openshift.
  • This change also reduces the maintenance burden that we carry in the form of addition patches to the cluster autoscaler. By changing our controllers to understand the upstream annotations we are able to remove extra patches on our fork of the cluster autoscaler, making future maintenance easier and closer to the upstream source.

Scenarios

  1. A user is debugging a cluster autoscaler issue by examining the related MachineSet objects, they see the scale from zero annotations and recognize them from the project documentation and from upstream discussions. The result is that the user is more easily able to find common issues and advice from the upstream community.
  2. An openshift maintainer is updating the cluster autoscaler for a new version of kubernetes, because the openshift controllers understand the upstream annotations, the maintainer does not need to carry or modify a patch to support multiple varieties of annotation. This in turn makes the task of updating the autoscaler simpler and reduces burden on the maintainer.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Scale from zero autoscaling must continue to work with both the old openshift annotations and the newer upstream annotations.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - OpenShift code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - OpenShift documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - OpenShift build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - OpenShift documentation merged: <link to meaningful PR>

please note, the changes described by this epic will happen in OpenShift controllers and as such there is no "upstream" relationship in the same sense as the Kubernetes-based controllers.

User Story

As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.

Background

To help the transition to the upstream scale-from-zero annotations, we need to have the CAS recognize both sets of annotations, preferring the upstream ones, for at least one release cycle. This will give us a deprecation window for the old annotations.
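For illustration, a sketch of a MachineSet carrying both sets of scale-from-zero annotations during the deprecation window; the annotation keys are the upstream cluster-autoscaler capacity keys and the legacy OpenShift keys to the best of our knowledge, and the values and names are illustrative:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-us-east-1a                                # illustrative
  namespace: openshift-machine-api
  annotations:
    # upstream cluster-autoscaler keys (preferred)
    capacity.cluster-autoscaler.kubernetes.io/cpu: "4"
    capacity.cluster-autoscaler.kubernetes.io/memory: "16G"
    # legacy OpenShift keys (accepted during the deprecation window)
    machine.openshift.io/vCPU: "4"
    machine.openshift.io/memoryMb: "16384"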

Steps

  • update CAS to recognize both annotations
  • add a unit test to ensure prioritization works properly

Stakeholders

  • openshift eng

Definition of Done

  • CAS can recognize both sets of annotations
  • Docs
  • n/a
  • Testing
  • unit testing for priority behavior

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission. 

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.19, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
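For reference, the per-namespace Pod Security Admission labels that the synchronization mechanism manages look like this (namespace name illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                                         # illustrative
  labels:
    pod-security.kubernetes.io/enforce: restricted     # with 4.19, the "enforce" label is synchronized instead of only warn/audit
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted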

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
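A minimal sketch of what pinning looks like on a workload's pod template, using the required-scc annotation; the deployment name, namespace, image, and chosen SCC are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator              # illustrative
  namespace: openshift-example        # illustrative platform namespace
spec:
  selector:
    matchLabels:
      app: example-operator
  template:
    metadata:
      labels:
        app: example-operator
      annotations:
        openshift.io/required-scc: restricted-v2   # pin the least-privileged SCC the workload needs
    spec:
      containers:
      - name: operator
        image: example.com/operator:latest          # placeholder image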

The following tables track progress.

Progress summary

# namespaces 4.19 4.18 4.17 4.16 4.15 4.14
monitored 82 82 82 82 82 82
fix needed 68 68 68 68 68 68
fixed 39 39 35 32 39 1
remaining 29 29 33 36 29 67
~ remaining non-runlevel 8 8 12 15 8 46
~ remaining runlevel (low-prio) 21 21 21 21 21 21
~ untested 3 2 2 2 82 82

Progress breakdown

# namespace 4.19 4.18 4.17 4.16 4.15 4.14
1 oc debug node pods #1763 #1816 #1818  
2 openshift-apiserver-operator #573 #581  
3 openshift-authentication #656 #675  
4 openshift-authentication-operator #656 #675  
5 openshift-catalogd #50 #58  
6 openshift-cloud-credential-operator #681 #736  
7 openshift-cloud-network-config-controller #2282 #2490 #2496    
8 openshift-cluster-csi-drivers #118 #5310 #135 #524 #131 #306 #265 #75   #170 #459 #484  
9 openshift-cluster-node-tuning-operator #968 #1117  
10 openshift-cluster-olm-operator #54 n/a n/a
11 openshift-cluster-samples-operator #535 #548  
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211  
13 openshift-cluster-version       #1038 #1068  
14 openshift-config-operator #410 #420  
15 openshift-console #871 #908 #924  
16 openshift-console-operator #871 #908 #924  
17 openshift-controller-manager #336 #361  
18 openshift-controller-manager-operator #336 #361  
19 openshift-e2e-loki #56579 #56579 #56579 #56579  
20 openshift-image-registry       #1008 #1067  
21 openshift-ingress   #1032        
22 openshift-ingress-canary   #1031        
23 openshift-ingress-operator   #1031        
24 openshift-insights #1033 #1041 #1049 #915 #967  
25 openshift-kni-infra #4504 #4542 #4539 #4540  
26 openshift-kube-storage-version-migrator #107 #112  
27 openshift-kube-storage-version-migrator-operator #107 #112  
28 openshift-machine-api #1308 #1317 #1311 #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443  
29 openshift-machine-config-operator #4636 #4219 #4384 #4393  
30 openshift-manila-csi-driver #234 #235 #236  
31 openshift-marketplace #578 #561 #570
32 openshift-metallb-system #238 #240 #241    
33 openshift-monitoring #2298 #366 #2498   #2335 #2420  
34 openshift-network-console #2545        
35 openshift-network-diagnostics #2282 #2490 #2496    
36 openshift-network-node-identity #2282 #2490 #2496    
37 openshift-nutanix-infra #4504 #4539 #4540  
38 openshift-oauth-apiserver #656 #675  
39 openshift-openstack-infra #4504   #4539 #4540  
40 openshift-operator-controller #100 #120  
41 openshift-operator-lifecycle-manager #703 #828  
42 openshift-route-controller-manager #336 #361  
43 openshift-service-ca #235 #243  
44 openshift-service-ca-operator #235 #243  
45 openshift-sriov-network-operator #995 #999 #1003  
46 openshift-user-workload-monitoring #2335 #2420  
47 openshift-vsphere-infra #4504 #4542 #4539 #4540  
48 (runlevel) kube-system            
49 (runlevel) openshift-cloud-controller-manager            
50 (runlevel) openshift-cloud-controller-manager-operator            
51 (runlevel) openshift-cluster-api            
52 (runlevel) openshift-cluster-machine-approver            
53 (runlevel) openshift-dns            
54 (runlevel) openshift-dns-operator            
55 (runlevel) openshift-etcd            
56 (runlevel) openshift-etcd-operator            
57 (runlevel) openshift-kube-apiserver            
58 (runlevel) openshift-kube-apiserver-operator            
59 (runlevel) openshift-kube-controller-manager            
60 (runlevel) openshift-kube-controller-manager-operator            
61 (runlevel) openshift-kube-proxy            
62 (runlevel) openshift-kube-scheduler            
63 (runlevel) openshift-kube-scheduler-operator            
64 (runlevel) openshift-multus            
65 (runlevel) openshift-network-operator            
66 (runlevel) openshift-ovn-kubernetes            
67 (runlevel) openshift-sdn            
68 (runlevel) openshift-storage            

What

The PodSecurityAdmissionLabelSynchronizationController in cluster-policy-controller should set an annotation that tells us what it decided with respect to a given namespace.

Why

  • Once a customer changes the labels and the label syncer doesn't set them anymore, we don't know what it would have picked.
  • Knowing that is important in case the label syncer starts setting the enforce label.

Note

Feature Overview

  • Add support to custom GCP API endpoints (private and restricted) while deploying OpenShift on GCP

Goals

  • Enable OpenShift to support private and restricted GCP API endpoints while deploying the platform on GCP as we do for AWS already

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Use Cases

This Section:

  • As a user I want to be able to use GCP Private API endpoints while deploying OpenShift so I can be compliant with my company security policies
  • As a user I want to be able to use GCP Restricted API endpoints while deploying OpenShift so I can be compliant with my company security policies

Background, and strategic fit

For users with strict regulatory policies, Private Service Connect allows private consumption of services across VPC networks that belong to different groups, teams, projects, or organizations. Supporting OpenShift to consume these private endpoints is key for these customers to be able to deploy the platform on GCP and be compliant with their regulatory policies.

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Feature Overview

  • Add support to custom GCP API endpoints (private and restricted) while deploying OpenShift on GCP

Goals

  • Enable OpenShift to support private and restricted GCP API endpoints while deploying the platform on GCP as we do for AWS already

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Use Cases

This Section:

  • As a user I want to be able to use GCP Private API endpoints while deploying OpenShift so I can be compliant with my company security policies
  • As a user I want to be able to use GCP Restricted API endpoints while deploying OpenShift so I can be compliant with my company security policies

Background, and strategic fit

For users with strict regulatory policies, Private Service Connect allows private consumption of services across VPC networks that belong to different groups, teams, projects, or organizations. Supporting OpenShift to consume these private endpoints is key for these customers to be able to deploy the platform on GCP and be compliant with their regulatory policies.

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

User Story:

As a (user persona), I want to be able to:

  • Add the Tech Preview Feature Gate to the installer for custom endpoints
  • Validate the custom endpoints feature gate in the installer
  • Capability 3

so that I can achieve

  • An installer feature gate to ensure users know that this feature is not yet slated for release
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Ensure that the custom endpoints are valid before use.
  • Reach the endpoints provided.
  • Capability 3

so that I can achieve

  • Ensuring that custom endpoint connectivity is not the reason for any installation issues.
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Add the feature to API 
  • Add tech preview tags for the feature 
  • Capability 3

so that I can achieve

  • Protect installs using this feature; the feature will touch many aspects of OpenShift.
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Enter the custom endpoints via the install config
  • Capability 2
  • Capability 3

so that I can achieve

  • Initiate an install where the custom endpoints for GCP APIs can be used.
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • The user enters the data into the install-config. The data is validated. 
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.
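For illustration, a hedged sketch of how custom GCP API endpoints might be entered in the install-config described in this story; the field name is modeled on the existing platform.aws.serviceEndpoints pattern and the endpoint URLs are placeholders, so the actual GCP field name and shape may differ:

apiVersion: v1
baseDomain: example.com
platform:
  gcp:
    projectID: my-project
    region: us-central1
    serviceEndpoints:                          # assumed field name, modeled on platform.aws.serviceEndpoints
    - name: compute
      url: https://compute.p.googleapis.com    # illustrative Private Service Connect endpoint
    - name: iam
      url: https://iam.p.googleapis.com        # illustrative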

User Story:

As a (user persona), I want to be able to:

  • Use the custom endpoints in MAPI that were set in the installer. 
  • Override the endpoints for:
    • compute
    • tagging

so that I can achieve

  • Using the same custom endpoints for all of the services in the cluster. 
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • The services with endpoints to be overwritten can be found:
    • pkg/cloud/gcp/actuators/services
  • Each time a service client is created (New), the `withEndpoint` option should be passed when applicable.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Pass the custom endpoints to all cluster components
  • Capability 2
  • Capability 3

so that I can achieve

  • All cluster components should use the same api endpoints.
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • Fill out the API infra config with the custom endpoints when the user has supplied them via the install-config
  • Bring in the API changes as a vendor update.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Add the ability to choose subnets for IngressControllers with LoadBalancer-type Services for AWS in the Installer. This install config should be applied to the default IngressController and all future IngressControllers (the design is similar to installconfig.platform.aws.lbtype).

Why is this important?

Cluster Admins may have dedicated subnets for their load balancers due to security reasons or infrastructure constraints. With the implementation in NE-705, Cluster Admins will be able to specify subnets for IngressControllers for Day 2 operations. Service Delivery needs a way to configure IngressController subnets for the default IngressController for ROSA. 

Scenarios

If the cluster is spread across multiple subnets, there should be a way to select the subnet while creating an IngressController of type LoadBalancerService.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Network Edge Epic (NE-705)

Previous Work (Optional):

  1. Slack Thread discussion

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

  • As an openshift-install user I want to be able to continue to use aws.subnets during deprecation (a warning will show)
  • As an openshift developer, I only want a single code path for subnets (via upconversion)

Acceptance Criteria:

Description of criteria:

  • Validation that both fields are not simultaneously specified
  • When aws.subnets is specified, it's upconverted into aws.vpc.subnets
  • Existing pkg/types/aws/Subnets type is renamed to DeprecatedSubnets
  • Remove all (or as many possible) usages of DeprecatedSubnets, replaced with the new vpc.Subnets field
    • We may need to keep usage of DeprecatedSubnets for certain validations
  • Warning when using deprecated field

(optional) Out of Scope:

.

Engineering Details:

 

User Story:

Static validations (no API connection required)

Acceptance Criteria:

Description of criteria:

Some validations are extracted from API validation (i.e. the installer does not handle CEL at this time) and XValidation markers (i.e. defined in the enhancement proposal).

(optional) Out of Scope:

Validations that require access to the AWS API will go in pkg/asset (different card)

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As an openshift-install user, I want to be able to specify AWS subnets with roles.

Acceptance Criteria:

Description of criteria:

  • Installconfig has installconfig.platform.aws.vpc.subnets field
  • vpc.subnets conforms to API defined in the enhancement
  • godoc/oc explain text is written and generated (see explain docs)

(optional) Out of Scope:

Validations will be handled in a different card.

Engineering Details:

  • The type for subnets is:
    • id - string
    • roles - slice of SubnetRoles
  • The list of subnet roles is defined in the enhancement; they include ClusterNode, EdgeNode, ControlPlaneExternalLBSubnetRole, etc. A minimal sketch of the shape follows below.
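
A minimal sketch of what installconfig.platform.aws.vpc.subnets could carry, using hypothetical type and role names (the authoritative definitions are in the enhancement and the installer API):

```go
package main

import "fmt"

// Hypothetical stand-ins for the install-config types described in the
// enhancement; the real names and the full role list may differ.
type SubnetRoleType string

const (
	ClusterNodeSubnetRole            SubnetRoleType = "ClusterNode"
	EdgeNodeSubnetRole               SubnetRoleType = "EdgeNode"
	ControlPlaneExternalLBSubnetRole SubnetRoleType = "ControlPlaneExternalLB"
)

type SubnetRole struct {
	Type SubnetRoleType
}

type Subnet struct {
	ID    string       // AWS subnet ID, e.g. subnet-0123456789abcdef0
	Roles []SubnetRole // optional; empty means the subnet is untyped
}

func main() {
	// Roughly the shape installconfig.platform.aws.vpc.subnets would carry.
	subnets := []Subnet{
		{ID: "subnet-0abc", Roles: []SubnetRole{{Type: ClusterNodeSubnetRole}}},
		{ID: "subnet-0def", Roles: []SubnetRole{{Type: ControlPlaneExternalLBSubnetRole}}},
	}
	for _, s := range subnets {
		fmt.Printf("%s -> %v\n", s.ID, s.Roles)
	}
}
```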

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once the AWS shim/sync layer is implemented, use the architecture for other clouds in phase 2 & phase 3

Acceptance Criteria

When customers use CAPI, there must be no negative effect from switching over to CAPI. Migration of Machine resources must be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To bring MAPI and CAPI to feature parity and unblock conversions between MAPI and CAPI resources

Why is this important?

  • Blocks migration to Cluster API

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

VolumeSize on the block device mapping spec in MAPA is currently optional (and if it is not set we send an empty value to AWS and let it choose for us), whereas in CAPA it is required and has a minimum of 8 GiB.

We need to determine an appropriate behaviour for when the value is unset.
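
A minimal sketch of a possible defaulting path, using placeholder types for the MAPA block device mapping and the CAPA volume; the actual default (shown here as an assumed 120 GiB) should come from what the installer has historically set, per the Steps below:

```go
package main

import "fmt"

// Hypothetical stand-ins for the MAPI (MAPA) and CAPI (CAPA) volume types.
type MAPIBlockDevice struct {
	VolumeSize *int64 // optional in MAPA; nil means "let AWS choose"
}

type CAPAVolume struct {
	Size int64 // required in CAPA, minimum 8 GiB
}

// Assumed default; the real value should be derived from what the installer
// has historically set on the root volume.
const defaultRootVolumeSizeGiB = int64(120)

const capaMinimumSizeGiB = int64(8)

func convertVolume(in MAPIBlockDevice) (CAPAVolume, error) {
	size := defaultRootVolumeSizeGiB
	if in.VolumeSize != nil {
		size = *in.VolumeSize
	}
	if size < capaMinimumSizeGiB {
		return CAPAVolume{}, fmt.Errorf("volume size %d GiB is below the CAPA minimum of %d GiB", size, capaMinimumSizeGiB)
	}
	return CAPAVolume{Size: size}, nil
}

func main() {
	out, err := convertVolume(MAPIBlockDevice{}) // unset size -> default applied
	fmt.Println(out, err)
}
```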

Steps

  • Check historically on the installer to see what value it typically has set (in the AMI, and if that changed overtime)
  • Determine an appropriate minimum size for the root volume in OpenShift
  • When not set, default the CAPA volume size to an appropriate value based on the above
  • Adjust conversion logic based on the above

Stakeholders

  • Cluster Infra

Definition of Done

  • Machines with no volume size in MAPI can be converted to CAPI
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

To enable CAPI MachineSets to still mirror MAPI MachineSets accurately, and to enable MAPI MachineSets to be implemented by CAPI MachineSets in the future, we need to implement a way to convert CAPI Machines back into MAPI Machines.

These steps assume that the CAPI Machine is authoritative, or that there is no MAPI Machine.

Behaviours

  • If no Machine exists in MAPI
    • But the CAPI Machine is owned, and that owner exists in MAPI
      • Create a MAPI Machine to mirror the CAPI Machine
      • MAPI Machines should set authority to CAPI on create
  • If a MAPI Machine exists
    • Convert infrastructure template from InfraMachine to providerSpec
    • Update spec and status fields of MAPI Machine to reflect CAPI Machine
  • On failures
    • Set Synchronized condition to False and report error on MAPI resource
  • On success
    • Set Synchronized condition to True on MAPI resource
    • Set status.synchronizedGeneration to match the auth resource generation (a minimal sketch of this bookkeeping follows below)
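
A minimal sketch of that success/failure bookkeeping, using a hypothetical trimmed-down status type; the real condition and field names on the MAPI resources may differ:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical subset of the MAPI Machine status touched by the sync
// controller; real field names may differ.
type MachineStatus struct {
	Conditions             []metav1.Condition
	SynchronizedGeneration int64
}

const synchronizedCondition = "Synchronized"

// setSynchronized records the outcome of one sync attempt on the MAPI
// resource: True plus the observed generation on success, False plus the
// error message on failure.
func setSynchronized(status *MachineStatus, authGeneration int64, syncErr error) {
	cond := metav1.Condition{
		Type:               synchronizedCondition,
		Status:             metav1.ConditionTrue,
		Reason:             "ResourceSynchronized",
		Message:            "CAPI and MAPI resources are in sync",
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	if syncErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "SynchronizationFailed"
		cond.Message = syncErr.Error()
	} else {
		status.SynchronizedGeneration = authGeneration
	}
	// Replace any existing condition of the same type.
	for i, c := range status.Conditions {
		if c.Type == synchronizedCondition {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}

func main() {
	s := &MachineStatus{}
	setSynchronized(s, 3, nil)
	fmt.Printf("%+v\n", s)
}
```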

Steps

  • Implement conversion based on the behaviours outlined above using the CAPI to MAPI conversion library

Stakeholders

  • Cluster Infra

Definition of Done

  • When a CAPI MachineSet scales up and is mirrored in MAPI, the CAPI Machine gets mirrored into MAPI
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

For the MachineSet controller, we need to implement a forward conversion, converting the MachineAPI MachineSet to ClusterAPI.

This will involve creating the CAPI MachineSet if it does not exist, and managing the Infrastructure templates.

This card covers the case where MAPI is currently authoritative.

Behaviours

  • Create Cluster API mirror if not present
    • CAPI mirror should be paused on create
    • Names of mirror should be 1:1 with original
  • Manage InfraTemplate creation by converting MAPI providerSpec
    • InfraTemplate naming should be based on a hash so templates can be deduplicated (see the naming sketch after this list)
    • InfraTemplate naming should be based on parent resources
    • InfraTemplate should have ownerReference to CAPI MachineSet
    • If template has changed, remove ownerReference from old template. If no other ownerReferences, remove template.
    • Should be identifiable as created by the sync controller (annotated?)
  • Ensure CAPI MachineSet spec and status overwritten with conversion from MAPI
  • Ensure Labels/Annotations copied from MAPI to CAPI
  • On failures
    • Set Synchronized condition to False and report error on MAPI resource
  • On success
    • Set Synchronized condition to True on MAPI resource
    • Set status.synchronizedGeneration to match the auth resource generation
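
A minimal sketch of the hash-based InfraTemplate naming mentioned above, with a placeholder providerSpec type standing in for the converted MAPI spec:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Hypothetical providerSpec payload; in practice this is the decoded MAPI
// providerSpec for the MachineSet.
type providerSpec struct {
	InstanceType string
	AMI          string
}

// templateName derives a deterministic InfraTemplate name from the parent
// MachineSet name plus a short hash of the converted spec, so identical
// specs map to the same template and can be deduplicated.
func templateName(machineSetName string, spec providerSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%s-%x", machineSetName, sum[:4]), nil
}

func main() {
	name, _ := templateName("worker-us-east-1a", providerSpec{InstanceType: "m6i.xlarge", AMI: "ami-0123"})
	fmt.Println(name) // e.g. worker-us-east-1a-3fa1b2c4
}
```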

Steps

  • Implement MAPI to CAPI conversion by leveraging library for conversion and applying above MachineSet level rules

Stakeholders

  • Cluster Infra

Definition of Done

  • When a MAPI MachineSet exists, a CAPI MachineSet is created and kept up to date if there are changes
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

As QE have tried to test upstream CAPI pausing, we've hit a few issues with running the migration controller and cluster CAPI operator on a real cluster vs envtest.

This card captures the work required to iron out these kinks and get things running (i.e. not crashing).

I also think we want an e2e or some sort of automated testing to ensure we don't break things again.

 

Goal: Stop the CAPI operator crashing on startup in a real cluster.

 

Non-goals: get the entire conversion flow running from CAPI -> MAPI and MAPI -> CAPI. We still need significant feature work before we're there.

Background

For the Machine controller, we need to implement a forward conversion, converting the MachineAPI Machine to ClusterAPI.

This will involve creating the CAPI Machine if it does not exist, and managing the Infrastructure Machine.

This card covers the case where MAPI is currently authoritative.

Behaviours

  • Create Cluster API mirror if not present
    • CAPI mirror should be paused on create
    • Names of mirror should be 1:1 with original
  • Manage InfraMachine creation by converting MAPI providerSpec
    • InfraMachine should be named based on the name of the Machine (to mirror CAPI behaviour)
    • InfraMachine should have appropriate owner references and finalizers created
  • Ensure CAPI Machine spec and status overwritten with conversion from MAPI
  • Ensure Labels/Annotations copied from MAPI to CAPI
  • On failures
    • Set Synchronized condition to False and report error on MAPI resource
  • On success
    • Set Synchronized condition to True on MAPI resource
    • Set status.synchronizedGeneration to match the auth resource generation

Steps

  • Implement behaviours described above to convert MAPI Machines to CAPI Machines using conversion library

Stakeholders

  • Cluster Infra

Definition of Done

  • MAPI Machines create paused CAPI mirrors
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To bring MAPI and CAPI to feature parity and unblock conversions between MAPI and CAPI resources

Why is this important?

  • Blocks migration to Cluster API

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

 

As a user I want to be able to add labels/taints/annotations to my machines and have them propagate to the nodes. This will allow me to use the labels for other tasks, e.g. selectors.

Background

Currently,  MAPI supports propagating labels from machines to nodes, but CAPI does not. When we move to CAPI we will lose this feature.

_See https://issues.redhat.com/browse/OCPBUGS-37236_

Relevant upstream issues:

Steps

 

  • Understand why the discrepancy exists
  • Determine how much work it would be for the NodeLink controller to copy the labels
  • Chat with upstream to see if the idea of unrestricted label propagation through some mechanism is palatable.
  • Come back to the group and decide a course of action.

Stakeholders

  • Our users, who currently have this feature.

Definition of Done

Epic Goal

This is the epic tracking the work to collect a list of TLS artifacts (certificates, keys and CA bundles).

This list will contain a set of required and optional metadata. Required metadata examples are ownership (name of Jira component) and the ability to auto-regenerate a certificate after it has expired while offline. In most cases metadata can be set via annotations on the secret/configmap containing the TLS artifact.

Components not meeting the required metadata will fail CI - i.e. when a pull request makes a component create a new secret, the secret is expected to have all necessary metadata present to pass CI.

This will be enforced by the PR "WIP API-1789: make TLS registry tests required".

Description of problem:

    In order to make TLS registry tests required we need to make sure all OpenShift variants are using the same metadata for kube-apiserver certs. Hypershift uses several certs stored in the secret without accompanying metadata (namely component ownership). 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In order to keep track of existing certs/CA bundles and ensure that they adhere to requirements we need to have a TLS artifact registry setup.

The registry would:

  • have a test which automatically collects existing certs/CA bundles from secrets/configmaps/files on disk
  • have a test which collects the necessary metadata from them (from cert contents or annotations)
  • ensure that new certs match the expected metadata and have the necessary annotations set when a new cert is added (a minimal collection sketch follows this list)
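
A minimal sketch of the collection side, using the standard client-go APIs; the annotation key below is a hypothetical placeholder for whatever required metadata keys the registry defines:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical annotation key; the registry defines the real set of required
// metadata keys (ownership, offline regeneration behaviour, etc.).
const ownershipAnnotation = "example.openshift.io/owning-component"

// collectUnowned walks TLS secrets in a namespace and reports the ones
// missing the required ownership metadata, roughly what a registry test
// would flag as a CI failure.
func collectUnowned(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	secrets, err := client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var unowned []string
	for _, s := range secrets.Items {
		if s.Type != corev1.SecretTypeTLS {
			continue
		}
		if s.Annotations[ownershipAnnotation] == "" {
			unowned = append(unowned, s.Name)
		}
	}
	return unowned, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	missing, err := collectUnowned(context.Background(), client, "openshift-kube-apiserver")
	fmt.Println(missing, err)
}
```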

Ref: API-1622

Feature Overview (aka. Goal Summary)  

To improve automation, governance and security, AWS customers extensively use AWS Tags to track resources. Customers want the ability to change user tags on day 2 without having to recreate the cluster to add or modify one or more tags.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

  • Cluster administrator can add one or more tags to an existing cluster. 
  • Cluster administrator can remove one or more tags from an existing cluster.
  • Cluster administrator can add one or more tags just to machine-pool / node-pool in the ROSA with HCP cluster.
  • All ROSA client interfaces (ROSA CLI, API, UI) can utilise the day2 tagging feature on ROSA with HCP clusters
  • All OSD client interfaces (API, UI, CLI) can utilize the day2 tagging feature on ROSA with HCP clusters
  • This feature does not affect the Red Hat owned day1 tags built into OCP/ROSA (there are 10 reserved spaces for tags, of the 50 available, leaving 40 spaces for customer provided tags)

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Following capabilities are available for AWS on standalone and HCP clusters.
  • OCP automatically tags the cloud resources with the Cluster's External ID. 
  • Tags added by default on Day 1 are not affected.
  • All existing active AWS resources in the OCP clusters have the tagging changes propagated.
  • All new AWS resources created by OCP reflect the changes to tagging.
  • Hive to support additional list of key=value strings on MachinePools
    • These are AWS user-defined / custom tags, not to be confused with node labels
    • ROSA CLI can accept a list of key=value strings with additional tag values
      • it currently can do this during cluster-install
    • The default tag(s) is/are still applied
    • NOTE: AWS limit of 50 tags per object (2 used automatically by OCP, with a third to be added soon; 10 reserved for Red Hat overall, as at least 2-3 are used by Managed Services) - customers can only specify 40 tags max!
    • Must be able to modify tags after creation 
  • Support for OpenShift 4.15 onwards.

Out-of-scope

This feature will only apply to ROSA with Hosted Control Planes, and ROSA Classic / standalone is excluded.

Why is this important?

  • Customers want to use custom tagging for
    • access controls
    • chargeback/showback
    • cloud IAM conditional permissions

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of the Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:

  • Reducing cloud costs through instance selection and scaling/descaling
  • Support GPUs, spot instances, mixed compute types and other compute types.
  • Automatic node lifecycle management and upgrades

This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.

The goal is to roll this out with ROSA-HCP (AWS) first, since it has a more mature Karpenter ecosystem, followed by the ARO-HCP (Azure) implementation (refer to OCPSTRAT-1498).

This feature will be delivered in 3 Phases:

  • Dev Preview: Autonode with HCP (OCPSTRAT-943) – targeting OCP 4.19
  • Preview (Tech Preview): Autonode for ROSA-HCP (OCPSTRAT-1946) - TBD (2025)
  • GA: Autonode for ROSA-HCP – TBD (2025)

The Dev Preview release will expose AutoNode capabilities on Hosted Control Planes for AWS (note this is not meant to be productized on self-managed OpenShift) as APIs for Managed Services (ROSA) to consume. It includes the following capabilities:

  • Service Consumer opts-in to AutoNode on Day 1 and Day 2 (out of scope for Dev Preview)
  • Service Provider lifecycles Karpenter management side
  • Cluster Admin gains access to Karpenter CRDs and default nodeClass
  • Cluster Admin creates a NodePool and scale out workloads
  • Service Consumer signals cluster control plane upgrade (TBD for Dev Preview but potentially out of scope for Dev Preview, i.e. may slip to Tech Preview)
  • Expose Karpenter metrics to Cluster Admin (out of scope for Dev Preview, Targeting Tech Preview)

Goals (aka. expected user outcomes)

  1. Run Karpenter in the management cluster and disable CAS
  2. Automate node provisioning in the workload cluster
  3. Automate lifecycle management in the workload cluster
  4. Reduce cost for heterogeneous compute workloads
  5. Additional Karpenter features

OpenShift AutoNode (a.k.a. Karpenter) Proposal

Requirements (aka. Acceptance Criteria):

As a cluster-admin or SRE I should be able to configure Karpenter with OCP on AWS. Both the CLI and UI should enable users to configure Karpenter and disable CAS.

  1. Run Karpenter in management cluster and disable CAS
  2. OCM API 
    • Enable/Disable Cluster autoscaler
    • Enable/disable AutoNode feature
    • New ARN role configuration for Karpenter
    • Optional: New managed policy or integration with existing nodepool role permissions
  3. Expose NodeClass/NodePool resources to users
  4. Secure node provisioning and management, including a machine approval system for Karpenter instances
  5. HCP Karpenter cleanup/deletion support
  6. ROSA CAPI fields to enable/disable/configure Karpenter
  7. Write end-to-end tests for Karpenter running on ROSA HCP

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Managed, i.e. ROSA-HCP
  • Classic (standalone cluster): N/A
  • Hosted control planes: yes
  • Multi node, Compact (three node), Single node (SNO), or all: MNO
  • Connected / Restricted Network: Connected
  • Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64, ARM (aarch64)
  • Operator compatibility:
  • Backport needed (list applicable versions): No
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): yes - console
  • Other (please specify): OCM, rosa-cli, ACM, cost management

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Self-managed standalone OCP, self-hosted HCP, ROSA classic are out-of-scope.
  • Creating a multi-provider cost/pricing operator compatible with CAPI is beyond the scope of this Feature. That may take more time.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

  • Karpenter.sh is an open-source node provisioning project built for Kubernetes. It is designed to simplify Kubernetes infrastructure by automatically launching and terminating nodes based on the needs of your workloads. Karpenter can help you to reduce costs, improve performance, and simplify operations.
  • Karpenter works by observing the unscheduled pods in your cluster and launching new nodes to accommodate them. Karpenter can also terminate nodes that are no longer needed, which can help you save money on infrastructure costs.
  • Karpenter's architecture has karpenter-core and a karpenter-provider as components.
    The core holds the cloud-agnostic provisioning and consolidation logic, while the AWS provider holds the AWS-specific code that does the resource calculation to reduce cost by re-provisioning nodes.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

  • Ability to enable AutoNode/Karpenter during installation and post-cluster installation
  • Ability to run AutoNode/Karpenter and Cluster Autoscaler at the same time
  • Use with an OpenShift cluster with Hosted Control Planes
  • CAPI to enable/disable/configure AutoNode/Karpenter
  • Have AutoNode/Karpenter perform data plane upgrades
  • Designed for FIPS / FIPS compatible
  • Enable cost effective mixed compute with auto-provisioning from/to zero
  • Provide Karpenter metrics for monitoring and reporting purposes

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

  • Migration guides from using CAS to Karpenter
  • Performance testing to compare CAS vs Karpenter on ROSA HCP
  • API documentation for NodePool and EC2NodeClass configuration

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • Instances created by karpenter can automatically become Nodes

Why is this important?

  • Reduce operational burden.

Scenarios

For CAPI/MAPI-driven machine management, the cluster-machine-approver uses machine.status.ips to match the CSRs. In Karpenter there are no Machine resources.

We'll need to implement something similar. Some ideas (a minimal sketch of the first follows this list):

  • Explore using the nodeClaim resource info like status.providerID to match the CSRs
  • Store the requesting IP when the ec2 instances query ignition and follow similar comparison criteria than machine approver to match CSRs
  • Query AWS to get info and compare info to match CSRs
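
A minimal sketch of the first idea, matching a kubelet CSR to a Karpenter NodeClaim by node name; the NodeClaim shape here is a simplified stand-in for the real API type, and a real approver would verify much more (usages, groups, and for serving CSRs the requested SANs against cloud-provider data):

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, trimmed-down view of a Karpenter NodeClaim; the real type
// lives in the Karpenter API group.
type NodeClaim struct {
	Name       string
	NodeName   string // status.nodeName once the instance registers
	ProviderID string // status.providerID, e.g. aws:///us-east-1a/i-0123456789abcdef0
}

// approveCSR sketches the matching idea: a kubelet CSR's subject common name
// is "system:node:<nodeName>", so approve only if some NodeClaim has claimed
// that node name.
func approveCSR(csrCommonName string, claims []NodeClaim) (bool, error) {
	nodeName, ok := strings.CutPrefix(csrCommonName, "system:node:")
	if !ok {
		return false, fmt.Errorf("unexpected CSR common name %q", csrCommonName)
	}
	for _, c := range claims {
		if c.NodeName == nodeName {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	claims := []NodeClaim{{Name: "default-abc12", NodeName: "ip-10-0-1-23.ec2.internal", ProviderID: "aws:///us-east-1a/i-0abc"}}
	ok, err := approveCSR("system:node:ip-10-0-1-23.ec2.internal", claims)
	fmt.Println(ok, err)
}
```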

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Instances created by karpenter can automatically become Nodes

so that I can achieve

  • Reduce operational burden.

Acceptance Criteria:

Description of criteria:

  • For CAPI/MAPI-driven machine management, the cluster-machine-approver uses machine.status.ips to match the CSRs. In Karpenter there are no Machine resources.

We'll need to implement something similar. Some ideas:

– Explore using the nodeClaim resource info like status.providerID to match the CSRs
– Store the requesting IP when the ec2 instances query ignition and follow similar comparison criteria than machine approver to match CSRs
– Query AWS to get info and compare info to match CSRs

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Instances created by karpenter can automatically become Nodes

so that I can achieve

  • Reduce operational burden.

Acceptance Criteria:

Description of criteria:

 https://github.com/openshift/hypershift/pull/5349 introduced a new controller to implement auto-approval for kubelet client CSRs. We need to extend it to also approve serving CSRs, since they are not auto-approved by the cluster-machine-approver.

 

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

  • Codify and enable usage of a prototype for HCP working with Karpenter on the management side.

Why is this important?

  • A first usable version is critical to democratize knowledge and develop internal feedback.

Acceptance Criteria

  • Deploying a cluster with --auto-node results in Karpenter running on the management side, with the CRDs and a default ec2NodeClass installed within the guest cluster
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

https://docs.google.com/document/d/1ID_IhXPpYY4K3G_wa1MYJxOb3yz5FYoOj3ONSkEDsZs/edit?tab=t.0#heading=h.yvv1wy2g0utk

Goal

  • As a service provider, I want the cluster admin to only manipulate fields of the nodeclass API that won't impact the service's ability to operate, e.g. userdata and ami can't be changed.
  • As a service provider, I want to be the sole authoritative source of truth for input that impacts the ability to operate AutoNode.

Why is this important?

  • The way we implement this will have UX implications for the cluster admin, which has a direct impact on customer satisfaction.

Scenarios

We decided to start by using validating admission policies to implement ownership of ec2NodeClass fields, so we can restrict CRUD on those fields to a particular service account (a minimal sketch follows the caveats below).
This has some caveats:

  • If a field that the service owns is required in the API, we need to let the cluster admin set it on creation even though a controller will clobber it during reconciliation. To mitigate this we might want to change the upstream CEL validations of the ec2NodeClass API
  • The raw userdata is exposed to the cluster admin via ec2NodeClass.spec.userdata
  • Since we enforce the values for userdata and ami via controller reconciliation, there's potential room for race conditions
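
A minimal sketch of such a policy built with the admissionregistration/v1 types, assuming a hypothetical service-account name and using spec.userData as an example of a provider-owned field; the group, version and field names are assumptions, and a real policy would guard optional fields with has():

```go
package main

import (
	"encoding/json"
	"fmt"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// CEL: the field may only change when the request comes from the
	// provider-owned service account (the account name is an assumption).
	expr := `object.spec.userData == oldObject.spec.userData || ` +
		`request.userInfo.username == 'system:serviceaccount:kube-system:karpenter-operator'`

	policy := admissionregistrationv1.ValidatingAdmissionPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "restrict-ec2nodeclass-userdata"},
		Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
			MatchConstraints: &admissionregistrationv1.MatchResources{
				ResourceRules: []admissionregistrationv1.NamedRuleWithOperations{{
					RuleWithOperations: admissionregistrationv1.RuleWithOperations{
						Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Update},
						Rule: admissionregistrationv1.Rule{
							APIGroups:   []string{"karpenter.k8s.aws"},
							APIVersions: []string{"*"},
							Resources:   []string{"ec2nodeclasses"},
						},
					},
				}},
			},
			Validations: []admissionregistrationv1.Validation{{
				Expression: expr,
				Message:    "spec.userData is managed by the service provider",
			}},
		},
	}

	out, _ := json.MarshalIndent(policy, "", "  ")
	fmt.Println(string(out))
}
```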

If using validating admission policies for this proves not to be satisfactory, we'll need to consider alternatives, e.g.:

  • Having an additional dedicated CRD for an openshiftnodeclass that translates into the ec2NodeClass, and completely preventing the cluster admin from interacting with the latter via VAP.
  • Having our own class similar to EKS so we can fully manage the operational input in the backend.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bring https://issues.redhat.com/browse/CNF-11805 to a GA solution:

 - End-to-end CI with sunny- and rainy-day scenarios: ideally we want to make sure that the control plane cannot be affected by any workload, and, worst case, clearly document the workload profiles that will put the control plane at risk.

Business justification: request from NEPs and end customers (not listed in the title, please check the tracker links) who want to have a single type of server in their clusters, those servers hosting more than 200 CPUs (512 is now common for AMD-based servers, and we can expect more in the future). Having 3 servers dedicated just to running the control plane is neither OPEX nor CAPEX efficient, as 90%+ of the server capacity will never be used. This is even worse for 30-node clusters (a very common size).

Finally, our HUB clusters (hosting ACM) and infrastructure clusters (hosting HCP, for instance) are tiny clusters, and the above applies: we cannot require customers to use servers with fewer than 32 CPUs (for instance).

 

Phase 1 goals:

Come up with a proposal of what level of isolation and workload limitations (only easy workloads? DPDK?) are acceptable on the control plane nodes. In other words, let's take all the low-hanging fruit and see what the platform can do while maintaining control plane stability.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Currently we protect containers from their own mistakes wrt networking by redirecting all network sends to separate CPUs via RPS.
  • This is causing load on the reserved / control plane level
  • We should stop configuring this protection mechanism by default and only leave it opt-in for emergency purposes.

Why is this important?

  • Containers that use kernel networking heavily are causing high CPU load on the reserved (system and control plane) CPUs. This affects the stability of the whole node. Our opinion is that containers should pay for the resources they use with their own CPU time.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

BU Priority Overview

To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners DU applications on ARM-based servers.

Goals

  • Introduce ARM CPUs for RAN DU scenario (SNO deployment)  with a feature parity to Intel Ice Lake/SPR-EE/SPR-SP w/o QAT for DU with:
    • STD-kernel (RT-Kernel is not supported by RHEL)
    • SR-IOV and DPDK over SR-IOV
    • PTP (OC, BC). Partner asked for LLS-C3; according to Fujitsu, ptp4l and phc2sys need to work with the NVIDIA Aerial SDK
  • Characterize ARM-based RAN DU solution performance and power metrics (unless performance parameters are specified by partners,  we should propose them, see Open questions)
  • Productize ARM-based RAN DU solution by 2024 (partner’s expectation).

State of the Business

Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons) it is an attractive place to make substantial improvements using economies of scale.

There are currently three main obvious thrusts to how to go about this:

  • Introducing tools that improve overall observability w.r.t. power utilization of the network.
  • Improvement of existing RAN architectures via smarter orchestration of workloads, fine tuning hardware utilization on a per site basis in response to network usage, etc.
  • Introducing alternative architectures which have been designed from the ground up with lower power utilization as a goal.

This BU priority focuses on the third of these approaches.

BoM

Out of scope

Open questions:

  • What are the latency KPIs? Do we need a RT-kernel to meet them?
  • What page size is expected?
  • What are the performance/throughout requirements?

Reference Documents:

Planning call notes from Apr 15

Epic Goal

The PerformanceProfile currently allows the user to select either the standard kernel (by default) or the realtime kernel, using the realTimeKernel field. However, for some use cases (e.g. Nvidia based ARM server) a kernel with 64k page size is required. This is supported through the MachineConfig kernelType, which currently supports the following options:

  • default (standard kernel with 4k pages)
  • realtime (realtime kernel with 4k pages)
  • 64k-pages (standard kernel with 64k pages)

At some point it is likely that 64k page support will be added to the realtime kernel, which would likely mean another "realtime-64k-pages" option (or similar) would be added.

The purpose of this epic is to allow the 64k-pages (standard kernel with 64k pages) option to be selected in the PerformanceProfile and make it easy to support new kernelTypes added to the MachineConfig. There is a workaround for this today, by applying an additional MachineConfig CR, which overrides the kernelType, but this is awkward for the user.

One option to support this in the PerformanceProfile would be to deprecate the existing realTimeKernel option and replace it with a new kernelType option. The kernelType option would support the same values as the MachineConfig kernelType (i.e. default, realtime, 64k-pages). The old option could be supported for backwards compatibility - attempting to use both options at the same time would be treated as an error. Another option would be to add a new kernelPageSize option (with values like default or 64k) and then internally map that to the MachineConfig kernelType (after validation that the combination of kernel type and page size was allowed).

This will require updates to the customer documentation and to the performance-profile-creator to support the new option.

This might also require updates to the workloadHints kernel-related sections.

Why is this important?

  • This makes it easier for the user to select the kernel page size.

Scenarios

  1. User wants to use standard kernel with default page size (4k).
  2. User wants to use standard kernel with 64k page size.
  3. User wants to use realtime kernel with default page size (4k).
  4. Future: user wants to use realtime kernel with 64k page size.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. N/A

Previous Work (Optional):

  1. N/A

Open questions::

  1. Need to decide the best option for specifying the kernel type and kernel page size in the PerformanceProfile.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

https://docs.kernel.org/mm/vmemmap_dedup.html
Reading this doc, we can see a mapping of kernel page size and hugepages support.
We want to enforce that the combination of hugepages and kernelPageSize entered by the users in the performance profile is valid (done by validation webhook).

Acceptance criteria:

  • Introduce a new kernelPageSize field to the performance profile API:
    for x86/amd64, the only valid value is 4K, while for aarch64, the valid values are 4K and 64K.
  • Adding this to the performance-profile.md doc.
  • A smart validation should be integrated into the validation webhook:
    64k should be used only on aarch64 nodes and is currently supported only with the non-real-time kernel. We should also reject invalid inputs on both architectures (see the sketch after this list).
  • The default value is 4K if none is specified.
  • Modifying the kernelType selection process, adding the new kernel type 64k-pages alongside default and realtime.
  •  Verification on an ARM cluster.
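
A minimal sketch of the per-architecture rule the webhook would enforce, with hypothetical names standing in for the PerformanceProfile API field and the operator's real validation code:

```go
package main

import "fmt"

// Hypothetical names; the real API field and error wording live in the
// Node Tuning Operator's PerformanceProfile validation webhook.
type KernelPageSize string

const (
	PageSize4K  KernelPageSize = "4k"
	PageSize64K KernelPageSize = "64k"
)

// validateKernelPageSize enforces the per-architecture rules: amd64 only
// supports 4k pages, aarch64 supports 4k and 64k, and 64k pages are only
// allowed with the non-real-time kernel for now.
func validateKernelPageSize(arch string, pageSize KernelPageSize, realTimeKernel bool) error {
	if pageSize == "" {
		pageSize = PageSize4K // default when unspecified
	}
	switch arch {
	case "amd64":
		if pageSize != PageSize4K {
			return fmt.Errorf("kernelPageSize %q is not supported on %s", pageSize, arch)
		}
	case "arm64":
		if pageSize != PageSize4K && pageSize != PageSize64K {
			return fmt.Errorf("kernelPageSize %q is not supported on %s", pageSize, arch)
		}
		if pageSize == PageSize64K && realTimeKernel {
			return fmt.Errorf("64k pages are not supported with the real-time kernel")
		}
	default:
		return fmt.Errorf("unsupported architecture %q", arch)
	}
	return nil
}

func main() {
	fmt.Println(validateKernelPageSize("arm64", PageSize64K, false)) // <nil>
	fmt.Println(validateKernelPageSize("amd64", PageSize64K, false)) // error
}
```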

PERSONAS:

The following personas are borrowed from Hypershift docs used in the user stories below.

  • Cluster service consumer: The user empowered to request control planes, request workers, and drive upgrades or modify externalized configuration. Likely not empowered to manage or access cloud credentials or infrastructure encryption keys. In the case of managed services, this is someone employed by the customer.
  • Cluster service provider: The user hosting cluster control planes, responsible for up-time. UI for fleet wide alerts, configuring AWS account to host control planes in, user provisioned infra (host awareness of available compute), where to pull VMs from. Has cluster admin management. In the case of managed services, this persona represents Red Hat SRE.

USER STORY:

  • As a cluster service consumer, I want to provision hosted control planes and clusters without the Image Registry, so that my hosted clusters do not contain resources from a component I do not use, such as workloads, storage accounts, pull-secrets, etc, which allows me to save on computing resources
  • As a cluster service provider, I want users to be able to disable the Image Registry so that I don't need to maintain hosted control plane components that users don't care about.

ACCEPTANCE CRITERIA:

What is "done", and how do we measure it? You might need to duplicate this a few times.
 
Given a
When  b
Then  c
 
CUSTOMER EXPERIENCE:

Only fill this out for Product Management / customer-driven work. Otherwise, delete it.

  • Does this feature require customer facing documentation? YES/NO
    • If yes, provide the link once available
  • Does this feature need to be communicated with the customer? YES/NO
      • How far in advance does the customer need to be notified?
      • Ensure PM signoff that communications for enabling this feature are complete
  • Does this feature require a feature enablement run (i.e. feature flags update) YES/NO
    • If YES, what feature flags need to change?
      • FLAG1=valueA
    • If YES, is it safe to bundle this feature enablement with other feature enablement tasks? YES/NO

 

BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

  • ADR: a
  • Design Doc: b
  • Wiki: c
  • Similar Work PRs: d
  • Subject Matter Experts: e
  • PRD: f

 

NOTES:

If there's anything else to add.

User Story

To reconcile capabilities at runtime correctly, we need to constantly reconcile the CVO capabilities in the HostedClusterConfigOperator.

This was introduced in https://github.com/openshift/hypershift/pull/1687 and causes the capabilities to reset after being set initially after the installation.

This logic needs to take into account the enabled/disabled state of the image registry and the desired state of the hypershift control plane config CRD to render the capabilities correctly. 

Ref from ARO-13685 where this was simply removed: https://github.com/openshift/hypershift/pull/5315/files#diff-c888020c5dc46c458d818b931db97131d6e35b90661fe1030a39ebeac8859b19L1224

 

Definition of Done

Mark with an X when done; strikethrough for non-applicable items. All items
must be considered before closing this issue.

[ ] Ensure all pull request (PR) checks, including ci & e2e, are passing
[ ] Document manual test steps and results
[ ] Manual test steps executed by someone other than the primary implementer or a test artifact such as a recording are attached
[ ] All PRs are merged
[ ] Ensure necessary actions to take during this change's release are communicated and documented
[ ] Troubleshooting Guides (TSGs), ADRs, or other documents are updated as necessary

User Story

When the cluster is installed and configured, the CVO container has init containers that apply the initial CRDs and remove some manifests from CVO.

Depending on the configuration state of the image registry, it needs to remove or keep the capability as enabled. The bootstrap init container needs to set the initial capabilities as AdditionalEnabledCapabilities and the baseline capability set as None.

Note there is a gotcha on the ordering of the capabilities: they always need to be sorted in ascending order for the API to accept them; otherwise they are silently ignored.
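
A minimal sketch of that capabilities stanza using the openshift/api config/v1 types; which capabilities end up in the enabled list for a given hosted cluster is an assumption here:

```go
package main

import (
	"fmt"
	"sort"

	configv1 "github.com/openshift/api/config/v1"
)

// bootstrapCapabilitiesSpec builds the ClusterVersion capabilities stanza the
// bootstrap init container would apply: baseline set to None, everything that
// should stay enabled listed explicitly, sorted ascending so the API server
// does not silently drop the list.
func bootstrapCapabilitiesSpec(enabled []configv1.ClusterVersionCapability) *configv1.ClusterVersionCapabilitiesSpec {
	sorted := append([]configv1.ClusterVersionCapability(nil), enabled...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return &configv1.ClusterVersionCapabilitiesSpec{
		BaselineCapabilitySet:         configv1.ClusterVersionCapabilitySetNone,
		AdditionalEnabledCapabilities: sorted,
	}
}

func main() {
	// Example only: a hosted cluster with the image registry disabled would
	// simply omit the ImageRegistry capability from this list.
	spec := bootstrapCapabilitiesSpec([]configv1.ClusterVersionCapability{
		configv1.ClusterVersionCapabilityIngress,
		configv1.ClusterVersionCapabilityConsole,
	})
	fmt.Printf("%+v\n", spec)
}
```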

Note, we need to implement the same change in V1 and V2 of the operator:

1. 

https://github.com/openshift/hypershift/blob/99c34c1b6904448fb065cd65c7c12545f04fb7c9/control-plane-operator/controllers/hostedcontrolplane/cvo/reconcile.go#L341-L366

2. 

https://github.com/openshift/hypershift/blob/99c34c1b6904448fb065cd65c7c12545f04fb7c9/control-plane-operator/controllers/hostedcontrolplane/v2/cvo/deployment.go#L55-L60

 

XREF implementation from ARO-13685: https://github.com/openshift/hypershift/pull/5315/files#diff-520b6ecfad21e6c9bc0cbf244ff694cf5296ffc8c0318cb2248eb7185a36cd8aR363-R366

 

Definition of Done

Mark with an X when done; strikethrough for non-applicable items. All items
must be considered before closing this issue.

[ ] Ensure all pull request (PR) checks, including ci & e2e, are passing
[ ] Document manual test steps and results
[ ] Manual test steps executed by someone other than the primary implementer or a test artifact such as a recording are attached
[ ] All PRs are merged
[ ] Ensure necessary actions to take during this change's release are communicated and documented
[ ] Troubleshooting Guides (TSGs), ADRs, or other documents are updated as necessary

User Story

Currently the ImagePolicyConfig has the internal image registry URL hardcoded.

When the image registry is disabled, we need to ensure that this is properly accounted for, as the registry won't be reachable under this address anymore.

Definition of Done

Mark with an X when done; strikethrough for non-applicable items. All items
must be considered before closing this issue.

[ ] Ensure all pull request (PR) checks, including ci & e2e, are passing
[ ] Document manual test steps and results
[ ] Manual test steps executed by someone other than the primary implementer or a test artifact such as a recording are attached
[ ] All PRs are merged
[ ] Ensure necessary actions to take during this change's release are communicated and documented
[ ] Troubleshooting Guides (TSGs), ADRs, or other documents are updated as necessary

Enhancement https://github.com/openshift/enhancements/pull/1729

We want to add the disabled capabilities to the hosted cluster CRD as described in the above enhancement.

AC:

  • Updated API is merged in the Hypershift repo
  • Updated API can be consumed from cluster service and follow-up tasks in this epic

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Epic Goal

This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we do not forget to improve it in a safer and more maintainable way.

Why is this important?

Maintainability and debuggability, and in general fighting technical debt, are critical to keeping velocity and ensuring overall high quality.

Scenarios

  1. N/A

Acceptance Criteria

  • depends on the specific card

Dependencies (internal and external)

  • depends on the specific card

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479 
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566 

 Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

List of component:

  • assisted-image-service
  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent
  • assisted-service-rhel9
  • assisted-service-rhel8

A lot of the time our pipelines, as well as other teams' pipelines, are stuck because they are unable to provision hosts with different architectures to build the images.

Because we currently don't use the multi-arch images we build with Konflux, we will stop building multi-arch for now and re-add those architectures when we need them.

Currently each component is in its own application, which means we need a Release object for each one.
We want to have a single application that has all of our components in order to be able to release with a single Release object.

The way to do this is:

  1. Create the new components
  2. Don't merge the PRs Konflux will create for those components (if possible just close them; otherwise put them on hold and note that they should be closed/ignored)
  3. Open PRs to update the existing pipelines to point to the new components, this requires updating:
    1. filename
    2. labels for application and component
    3. pipeline name
    4. output-image
  4. Open PRs to update SaaS integration to use the new downstream images
  5. Delete the old components (this doesn't open PRs because the pipelines no longer exist)

All components need to build images for the following architectures:

  • linux/x86_64
  • linux/arm64
  • linux/ppc64le
  • linux/s390x

List of components:

  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent
  • assisted-image-service
  • assisted-service-rhel9
  • assisted-service-rhel8

In order to release our products with Konflux we need to pass the registry-standard EnterpriseContractPolicy.
There are a few small things we need to configure for all of our components:

Integration uses latest tags, so update the push-event pipelines to also push latest tag for the following components:

  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent

The ignores need to be listed in the tekton pipelines.
The relevant repos are:

  • openshift/assisted-service
  • openshift/assisted-installer

Currently our base images use ubi9 and MintMaker wants to upgrade it to 9.5, because MintMaker looks at the repo and tries to upgrade to the newest tag.
We should change the base image to use a repo that is just for rhel9.4.

Adapt the current Dockerfiles and add them to the upstream repos.

List of components:

  • assisted-installer
  • assisted-installer-reporter (aka assisted-installer-controller)
  • assisted-installer-agent

Add the dockerfiles for the following components:

  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent
  • assisted-image-service
  • assisted-service-rhel9
  • assisted-service-rhel8

List of repo

  - assisted-events-stream
  - assisted-image-service
  - assisted-service
  - auto-report
  - bug-master-bot
  - jira-unfurl-bot (auto-merge)
  - prow-jobs-scraper
  - assisted-installer (all config)
  - assisted-installer-agent (all config)

Consolidate the catalogd and operator-controller kustomize configuration dirs. e.g.

```
$ tree -d config
config
├── catalogd
└── operator-controller
```

Goal: We should be able to render separate overlays for each project; however, the kustomize configuration files could be de-duplicated.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As an SRE, I want to be able to:

  • track clusters for which a request serving node scaling manual override has been put in place

so that I can achieve

  • better track snowflaking on the fleet

Acceptance Criteria:

Description of criteria:

  • Metric exposed for clusters that have a manual request serving node scaling override annotation set

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

What

  • Maintain kube-rbac-proxy by taking care of issues and PRs.
  • Push the project towards k8s/sig-auth ownership

Why

  • It is a widely used and core component in OpenShift and within other companies.

If possible we might be able to reuse upstream tests, based on decisions from PODAUTO-323.

Epic Goal

  • Update all golang dependencies
  • Ensure periodic CI jobs for new version
  • Sunset periodic CI jobs for version(s) no longer supported

Why is this important?

  • To ensure we are using latest vendor code
  • To reduce security vulnerabilities in vendor code
  • To ensure we are regularly testing the latest version
  • To reduce costs from testing old, unsupported versions.

Scenarios

  1. ...

Acceptance Criteria

  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a developer, I want to update all go dependencies:

  • to reduce the risk of security vulnerabilities
  • to reduce the risk of incompatibility when handling urgent updates such as CVEs

Note: As this is the first time we are doing this, we should provide best-effort here by upgrading the dependencies that have no conflict, and creating cards for the ones that need more effort. Those cards can be linked to the 4.20 Regular Maintenance epic if/when created so we can better plan for the additional effort.

Goal

The OpenShift API currently defines two CustomResourceDefinitions (APIs) related to Insights. These are:

Those CRDs/APIs are currently in v1alpha1 version, which means that they are available only in the techpreview clusters (featureSet=TechPreviewNoUpgrade). Note that the two CRDs share definitions/attributes. The goal of this epic is to promote these CRDs to some higher version (v1beta or v1)

Previous Work 

Previous work (on the CRDs/APIs) is tracked in the:

 

We would like to add an option to specify the PVC to be used by on-demand gathering jobs for storing the final archive. We plan to extend the existing API of the DataGather and InsightsDataGather CRD to allow specifying the PVC name and mount path

Capture the necessary accidental work to get CI / Konflux unstuck during the 4.19 cycle

Due to capacity problems on the s390x environment, the Konflux team recommended disabling the s390x platform from the PR pipeline.

 

Slack thread

Dev and QE CI improvements to cover upgrades of both the management and the hosted clusters

User Story:

As an OCP developer, I want to be able to:

  • Verify that an OCP upgrade from 4.17 to 4.18 will work without issues.

so that I can

  • Catch any issues with the upgrade before it makes it to the field

Acceptance Criteria:

Description of criteria:

  • Periodic CI test in place to test the upgrade

Goal

Description

As an Admin I want to have an easy view of the alerts that are firing in my cluster.
If I have the same alert firing many times, it is very hard to identify the issues.

We can simplify the existing Alerts page to make it much clearer by the following quick fix:

1. Aggregate the alerts by alert name and severity
2. For each aggregated line add the "Total alerts number"
3. When pressing on the aggregated line it can be expanded with the list of alerts
4. Optional: Add the namespace label to the expanded list of alerts, where each alert can have a different namespace. Note: Not all alerts have this label.

Initial mockup by Foday Kargbo:
https://www.figma.com/proto/JlYlMG19eF8IJasy7DF4ND/Alert-Grouping-(Quick-fix)?page-id=32%3A1503&node-id=197-850&viewport=972%2C-400%2C0.59&t=sCVxsT84aZy1kLu2-1&scaling=min-zoom&content-scaling=fixed

Acceptance Criteria

Given we have several alerts of the same name and severity

When we go to the Alerts page and view the alerts

Then we would see a single line for each "alert name" and "severity" plus the number of times it is shown, and I can click on the line to expand it and get the full list of alerts of that name and severity.
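
For reference, the aggregation described above maps naturally onto the ALERTS metric that Prometheus already exposes; a query along these lines (illustrative only, not necessarily what the console implementation will use) yields one series per alert name and severity with the total count:

count by (alertname, severity) (ALERTS{alertstate="firing"})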

User Stories

  • High-Level goal-based user story, with context.
    "As a <VM owner/cluster administrator>, I want <to Achieve Some Goal>, so that <Some Reason/Context>."
  • another user story

Non-Requirements

  • List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

  • Any additional details or decisions made/needed

Epic Goal

  • Update the OCP Console frontend React dependency to a more recent version, as the current version has been end-of-life for over a year.

Why is this important?

  • The longer we wait to make this update, the harder it will be. It's important to stay current so that we can be more nimble with tech debt and dependencies.

Scenarios

  1. As a developer, I am assigned a high-priority feature. I find that React must be updated as part of the feature. I also find that because there are many breaking changes between our version and the latest, an update would be out of scope for this story. The feature must be deferred until we can address the tech debt.
  2. A major issue is found in our current version of React, and an update is required. The scope of this update work has ballooned over the time since our last update. We have to drop other important work to prioritize this update and again, other important features or work are deferred.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • OCP Console React dependency and all other tangential dependencies have been updated to an agreed-upon recent version.
  • Other stakeholders, like plugin consumers, should not be affected by this change.

Previous Work (Optional):

  1. https://issues.redhat.com/browse/CONSOLE-3316

Open questions:

  1. Should this be accomplished as a swarm activity where we spend a sprint addressing all blockers?
  2. Should we take a cautious approach and resolve blockers over time until we reach a point where an update is feasible?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

@types/react-* was not updated with react itself, hiding many type errors

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

https://docs.google.com/spreadsheets/d/15jtfdjgAZZf3F8jtdCcrsJ-YeygZWDale7j4x7xLWEQ/edit?usp=sharing

Epic Goal

  • Goal is to locate and replace old custom components with PatternFly components.

Why is this important?

  • Custom components require supportive CSS to mimic the visual theme of PatternFly. Over time these supportive styles have grown and become interspersed throughout the console codebase, requiring ongoing effort to carry along, update, and keep consistent across product areas and packages.
  • Also, custom components can have varying behaviors that diverge from PatternFly components, causing bugs and creating inconsistencies across the product.
  • Future PatternFly version upgrades will be more straightforward and require less work.

Acceptance Criteria

  • Identify custom components that have a PatternFly equivalent component.
  • Create stories which will address those updates and fixes
  • Update integration tests if necessary.

Open questions:

Location:

  • pipelines-plugin/src/components/repository/form-fields/CopyPipelineRunButton.tsx

PF component: 

AC:

  • Replace react-copy-to-clipboard with PatternFly ClipboardCopyButton component
  • Remove react-copy-to-clipboard as a dependency

Description

As a user already accustomed to PatternFly-based web applications, when I use the CodeEditor within OpenShift Console, I expect the same experience as the PF CodeEditor

Acceptance Criteria

  1. The CodeEditor adopts the PatternFly design
  2. The YAML language support remains the same

Additional Details:

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.

When running ./build-frontend.sh, I am getting the following warnings in the build log:

warning " > cypress-axe@0.12.0" has unmet peer dependency "axe-core@^3 || ^4".
warning " > cypress-axe@0.12.0" has incorrect peer dependency "cypress@^3 || ^4 || ^5 || ^6".

To fix:

  • upgrade cypress-axe to a version that supports our current Cypress version (13.10.0) and install axe-core to resolve the warnings; see the sketch below for one possible sequence
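
A possible sequence (the cypress-axe version is an assumption and should be checked against its peer-dependency table before merging):

yarn upgrade cypress-axe@^1.5.0
yarn add --dev axe-core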

As part of improving the security, traceability, and reliability of our build processes, we are enabling hermetic builds for the Cluster Observability Operator (COO). Hermetic builds ensure that the build process is self-contained, with no external dependencies fetched during runtime, thus protecting against potential supply chain threats. This EPIC will involve modifying the current build pipelines, adding the necessary configuration to enforce hermetic builds, and verifying that all dependencies are properly declared in advance to meet security best practices.

Description: Modify the Containerfile for the ui-monitoring to ensure compatibility with hermetic builds. The build should not rely on external dependencies and should ensure all required resources are available locally before the build starts.

Acceptance Criteria:

  • Update the Containerfile to ensure no external dependencies are fetched during the build process.
  • Explicitly declare all dependencies required by ui-monitoring.
  • Implement prefetching steps for external dependencies if necessary.
  • Verify the hermetic build runs successfully with all dependencies fetched in advance.

Epic Goal

  • Eliminate the need to use the openshift/installer-aro fork of openshift/installer during the installation of an ARO cluster.

Why is this important?

  • Maintaining the fork is time-consuming for the ARO team and causes delays in rolling out new releases of OpenShift to ARO.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CORS-1888
  2. CORS-2743
  3. CORS-2744
  4. https://github.com/openshift/installer/pull/7600/files

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The installer makes heavy use of its data/data directory, which contains hundreds of files in various subdirectories that are mostly used for inserting into ignition files. From these files, autogenerated code is created that includes the contents in the installer binary.

Unfortunately, subdirectories that do not contain .go files are not regarded as Go packages and are therefore not included when building the installer as a library: https://go.dev/wiki/Modules#some-needed-files-may-not-be-present-in-populated-vendor-directory

This is currently handled in the installer fork repo by deleting the compile-time autogeneration and instead doing a one-time autogeneration that is checked in to the repo: https://github.com/openshift/installer-aro/pull/27/commits/26a5ed5afe4df93b6dde8f0b34a1f6b8d8d3e583

Since this does not exist in the upstream installer, we will need some way to copy the data/data associated with the current installer version into the wrapper repo - we should probably encapsulate this in a make vendor target. The wiki page above links to https://github.com/goware/modvendor which unfortunately doesn't work, because it assumes you know the file extensions of all of the files (e.g. .c, .h), and it can't handle directory names matching the glob. We could probably easily fix this by forking the tool and teaching it to ignore directories in the source. Alternatively, John Hixson has a script that can do something similar.
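
One possible shape for the copy step behind such a make vendor target, assuming the wrapper repo consumes github.com/openshift/installer as a Go module (the destination path is an assumption for illustration):

go mod vendor
# go mod vendor omits non-Go files, so copy the installer's data/data tree in manually
INSTALLER_DIR="$(go list -m -f '{{.Dir}}' github.com/openshift/installer)"
mkdir -p vendor/github.com/openshift/installer/data
cp -r "${INSTALLER_DIR}/data/data" vendor/github.com/openshift/installer/data/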

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We are constantly bumping up against quotas when trying to create new ServicePrincipals per test. Example:

=== NAME  TestCreateClusterV2
    hypershift_framework.go:291: failed to create cluster, tearing down: failed to create infra: ERROR: The directory object quota limit for the Tenant has been exceeded. Please ask your administrator to increase the quota limit or delete objects to reduce the used quota. 

We need to create a set of ServicePrincipals to use during testing, and we need to reuse them while executing the e2e-aks.

When adding assignContributorRole to assign contributor roles for the appropriate scopes to existing SPs, we missed assigning the role over the DNS RG scope.

Goal

  • All Hypershift enhancements, bug fixes and new feature development specific to IBM Power.

Why is this important?

  • To synchronise Hypershift with Power.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MULTIARCH-3434
  2. https://issues.redhat.com/browse/MULTIARCH-4666

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Successfully create a PowerVS Hypershift cluster.
  • The connection between VPC and PowerVS Instance should be through Transit Gateway.
  • All cloud connection related checks should no longer be present.

so that I can achieve

  • A successful transition from the deprecated IBM Cloud Connection to the IBM Transit Gateway, ensuring continued compatibility and improved network connectivity in the codebase.

Acceptance Criteria:

Description of criteria:

  • Update the documentation replacing cloud connection with transit gateway. Do mention the past usage of cloud connection and the reason for the update.
  •  Identify and refactor code dependencies on the deprecated IBM Cloud Connection.
  •  Ensure seamless integration with IBM Transit Gateway by conducting end-to-end testing.
  • Collaborate with stakeholders to confirm compliance and resolve any issues during implementation.

With the new changes in the PowerVS workspace delete process, we need to make sure all the child resources are cleaned up before attempting to delete the PowerVS instance.
Child resources:

  • DHCP server and network
  • Instances - taken care by ibmpowervsmachine CR delete
  • Boot image - taken care by ibmpowervsimage CR delete

The error propagation is, generally speaking, not 1-to-1. The operator status will generally capture the pool status, but the full error from the Controller/Daemon does not fully bubble up to the pool/operator, and journal logs with errors generally don't get bubbled up at all. This is very confusing for customers/admins working with the MCO without a full understanding of the MCO's internal mechanics:

  1. The real error is hard to find
  2. The error message is often generic and ambiguous
  3. The solution/workaround is not clear at all

 

Using “unexpected on-disk state” as an example, this can be caused by any amount of the following:

  1. An incomplete update happened, and something rebooted the node
  2. The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
  3. The user modified something manually
  4. Another operator modified something manually
  5. Some other service/network manager overwrote something MCO writes

Etc. etc.

 

Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:

 

  1. Disambiguating different error cases that share the same message
  2. Adding more error catching, including journal logs and rpm-ostree errors
  3. Propagating full error messages further up the stack, up to the operator status in a clear manner
  4. Adding actionable fix/information messages alongside the error message

 

With a side objective of observability, including reporting all the way to the operator status items such as:

  1. Reporting the status of all pools
  2. Pointing out current status of update/upgrade per pool
  3. What the update/upgrade is blocking on
  4. How to unblock the upgrade

Approaches can include:

  1. Better error messaging starting with common error cases
  2. Disambiguating config mismatches
  3. Capturing rpm-ostree logs from previous boot, in case of osimageurl mismatch errors
  4. Capturing full daemon error message back to pool/operator status
  5. Adding a new field to the MCO operator spec, that attempts to suggest fixes or where to look next, when an error occurs
  6. Adding better alerting messages for MCO errors

  • Options

 

MCD will send an alert when a node fails to pivot to another MachineConfig, which could prevent an OS upgrade from succeeding. The alert contains information on what logs to look for.

The alert describes the following:

"Error detected in pivot logs on {{ $labels.node }} , upgrade may be blocked. For more details:  oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon "

It is possible that admin may not be able to interpret exact action to be taken after looking at MCD pod logs. Adding runbook (https://github.com/openshift/runbooks) can help admin in better troubleshooting and taking appropriate action.

 

Acceptance Criteria:

  • Runbook doc is created for MCDPivotError alert
  • Created runbook link is accessible to cluster admin with MCDPivotError alert

 

MCO will send an alert when, for 15 minutes, a specific node is using more memory than is reserved.

The alert describes the following:

"summary: "Alerts the user when, for 15 minutes, a specific node is using more memory than is reserved"
            description: "System memory usage of {{ $value | humanize }} on {{ $labels.node }} exceeds 95% of the reservation. Reserved memory ensures system processes can function even when the node is fully allocated and protects against workload out of memory events impacting the proper functioning of the node. The default reservation is expected to be sufficient for most configurations and should be increased (https://docs.openshift.com/container-platform/latest/nodes/nodes/nodes-nodes-managing.html) when running nodes with high numbers of pods (either due to rate of change or at steady state).""

It is possible that admin may not be able to interpret exact action to be taken after looking at the alert and the cluster state. Adding runbook (https://github.com/openshift/runbooks) can help admin in better troubleshooting and taking appropriate action.

 

Acceptance Criteria:

  • Runbook doc is created for SystemMemoryExceedsReservation alert
  • Created runbook link is accessible to cluster admin with SystemMemoryExceedsReservation alert

 

MCC sends an alert when a node fails to reboot within a span of 5 minutes. This is to make sure that the admin takes appropriate action, if required, by looking at the pod logs. The alert contains information on where to look for the logs.

Example alert looks like:

 Reboot failed on {{ $labels.node }} , update may be blocked. For more details:  oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon

It is possible that admin may not be able to interpret exact action to be taken after looking at MCC pod logs. Adding runbook (https://github.com/openshift/runbooks) can help admin in better troubleshooting and taking appropriate action.

 

 

Acceptance Criteria:

  • Runbook doc is created for MCDRebootError alert
  • Created runbook link is accessible to cluster admin with MCDRebootError alert

 

Currently, there are mainly three update paths built in parallel within the MCO. They separately take care of non-image updates, image updates, and updates for pools that have opted in to On-Cluster Layering. As a new bootc update path will be added with the introduction of this enhancement, the MCO is looking for a better way to manage these four update paths, which handle different types of updates but also share a lot in common (e.g. checking reconcilability). Interest and proposals around refactoring the MCD functions and creating a unified update interface have been raised several times in previous discussions:

Description of problem:

When the reboot process is broken, an MCDRebootError alert should be raised. Nevertheless, the alert is not raised, and the MCP is degraded with the wrong message:

E1028 17:22:38.515751   45330 writer.go:226] Marking Degraded due to: failed to update OS to quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3: error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3: error: Old and new refs are equal: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3

If the reboot process is fixed the node cannot be recovered and remains stuck reporting the " Old and new refs are equal" error.


    

Version-Release number of selected component (if applicable):

IPI on AWS:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2024-10-28-052434   True        False         8h      Error while reconciling 4.18.0-0.nightly-2024-10-28-052434: an unknown error has occurred: MultipleErrors


    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable OCL
    2. Break the reboot

$ oc debug  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host sh -c "mount -o remount,rw /usr; mv /usr/bin/systemd-run /usr/bin/systemd-run2"
Starting pod/sregidor-ver1-w48rv-worker-a-rln2vcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

    3. Wait for an MCDRebootError alert to be raised and check that the MCP is degraded with the message: "reboot command failed, something is seriously wrong"
    

Actual results:


   The MCDRebootError alert is not raised and the MCP is degraded with the wrong message

  - lastTransitionTime: "2024-10-28T16:40:43Z"
    message: 'Node ip-10-0-51-0.us-east-2.compute.internal is reporting: "failed to
      update OS to quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3:
      error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3:
      error: Old and new refs are equal: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3\n:
      exit status 1"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

    

Expected results:

   The alert should be raised and the MCP should be degraded with the right message.
    

Additional info:

    If OCL is disabled this functionality works as expected.
    

These are items that the team has prioritized to address in 4.18.

In OCP 4.7 and before, you were able to see the MCD logs of the previous container post-upgrade. It seems that we no longer can in newer versions. I am not sure if this is a change in kube pod logging behaviour, in how the pod gets shut down and brought up, or something in the MCO.

 

This however makes it relatively hard to debug in newer versions of the MCO, and in numerous bugs we could not pinpoint the source of the issue since we no longer have necessary logs. We should find a way to properly save the previous boot MCD logs if possible.

Today the MCO bootstraps with a bootstrap MCC/MCS to generate and serve master configs. When the in-cluster MCC comes up, it then tries to regen the same MachineConfig via the in-cluster MCs at the time.

 

This often causes a drift and causes the install to fail. See https://github.com/openshift/machine-config-operator/issues/2114 and https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx for more context. For the most recent occurrence of this, see: https://github.com/openshift/machine-config-operator/pull/3513

 

Early on this helped us see differences between bootstrap and in-cluster behaviour more easily, but we do have the bootstrap machineconfig on-disk on the masters. In theory, we should just be able to use that directly and attempt to consolidate the changes.

 

In the case of a drift, instead of failing, we can consider doing an immediate update to the latest version.

In newer versions of OCP, we have changed our draining mechanism to only fail after 1 hour. This also means that the event which captures the failing drain was also moved to the failure at the 1hr mark.

 

Today, upgrade tests often fail with timeouts related to drain errors (PDB or other). There exists no good way to distinguish which pods are failing and for what reason, so we cannot easily aggregate this data in CI to tackle issues related to PDBs and improve upgrade and CI pass rates.

 

If the MCD, upon a drain run failure, emits the failing pod and reason (PDB, timeout) as an event, it would be easier to write a test to aggregate this data.

 

Context in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1633635861184300 

Background

This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.

Goal

If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.

 

In https://issues.redhat.com/browse/MCO-1469, we are migrating my helper binaries into the MCO repository. I had to make changes to several of my helpers in the original repository to address bugs and other issues in order to unblock https://github.com/openshift/release/pull/58241. Because of the changes I requested during the PR review to make the integration easier, it may be a little tricky to incorporate all of my changes into the MCO repository, but it is still doable.

Done When:

  • The latest changes to zacks-openshift-helpers are incorporated into the MCO repository versions of the relevant helper binaries.

This epic has been repurposed for handling bugs and issues related to the DataImage API (see comments by Zane and the Slack discussion below). Some issues have already been added; more will be added to improve the stability and reliability of this feature.

Reference links :
Issue opened for IBIO : https://issues.redhat.com/browse/OCPBUGS-43330
Slack discussion threads :
https://redhat-internal.slack.com/archives/CFP6ST0A3/p1729081044547689?thread_ts=1728928990.795199&cid=CFP6ST0A3
https://redhat-internal.slack.com/archives/C0523LQCQG1/p1732110124833909?thread_ts=1731660639.803949&cid=C0523LQCQG1

Description of problem:

After deleting a BaremetalHost which has a related DataImage, the DataImage is still present. I'd expect that together with the bmh deletion the dataimage gets deleted as well. 

Version-Release number of selected component (if applicable):

4.17.0-rc.0    

How reproducible:

100%    

Steps to Reproduce:

    1. Create BaremetalHost object as part of the installation process using Image Based Install operator

    2. Image Based Install operator will create a dataimage as part of the install process
     
    3. Delete the BaremetalHost object 

    4. Check the DataImage assigned to the BareMetalHost     

Actual results:

While the BaremetalHost was deleted the DataImage is still present:

oc -n kni-qe-1 get bmh
No resources found in kni-qe-1 namespace.

 oc -n kni-qe-1 get dataimage -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: DataImage
  metadata:
    creationTimestamp: "2024-09-24T11:58:10Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-09-24T14:06:15Z"
    finalizers:
    - dataimage.metal3.io
    generation: 2
    name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
    namespace: kni-qe-1
    ownerReferences:
    - apiVersion: metal3.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: BareMetalHost
      name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
      uid: 0a8bb033-5483-4fe8-8e44-06bf43ae395f
    resourceVersion: "156761793"
    uid: 2358cae9-b660-40e6-9095-7daabb4d9e48
  spec:
    url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
  status:
    attachedImage:
      url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
    error:
      count: 0
      message: ""
    lastReconciled: "2024-09-24T12:03:28Z"
kind: List
metadata:
  resourceVersion: ""
    

Expected results:

    The DataImage gets deleted when the BaremetalHost owner gets deleted.

Additional info:

This is impacting automated test pipelines which use ImageBasedInstall operator as the cleanup stage gets stuck waiting for the namespace deletion which still holds the DataImage. Also deleting the DataImage gets stuck and it can only be deleted by removing the finalizer.

oc  get namespace kni-qe-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c33,c2
    openshift.io/sa.scc.supplemental-groups: 1001060000/10000
    openshift.io/sa.scc.uid-range: 1001060000/10000
  creationTimestamp: "2024-09-24T11:40:03Z"
  deletionTimestamp: "2024-09-24T14:06:14Z"
  labels:
    app.kubernetes.io/instance: clusters
    cluster.open-cluster-management.io/managedCluster: kni-qe-1
    kubernetes.io/metadata.name: kni-qe-1
    name: kni-qe-1-namespace
    open-cluster-management.io/cluster-name: kni-qe-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: kni-qe-1
  resourceVersion: "156764765"
  uid: ee984850-665a-4f5e-8f17-0c44b57eb925
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some resources are remaining: dataimages.metal3.io has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some content in the namespace has finalizers remaining: dataimage.metal3.io
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating
     


Tracking all things Konflux-related for the Metal Platform Team.
Full enablement should happen during the OCP 4.19 development cycle.

Description of problem:


The host that gets used in production builds to download the iso will change soon.

It would be good to allow this host to be set through configuration from the release team / ocp-build-data

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Tracking here all the work that needs to be done to configure the Ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.20.
This also includes CI configuration, tooling, and documentation updates.

All the configuration bits need to happen at least one sprint BEFORE 4.20 branching (current target: April 18, 2025).
Docs tasks can be completed after the configuration tasks.
The CI tasks need to be completed RIGHT AFTER 4.20 branching happens.

tag creation is now automated during OCP tags creation

builder creation has been automated

We've been installing the sushy package because of dependencies in the sushy-oem-idrac and proliantutils packages, and then installing sushy from source, de facto updating the code in the image at build time.
This is not a great practice, and we also want to remove the dependency on the package entirely.
Both sushy-oem-idrac and proliantutils don't get many updates, so it should be fine to install them from source, and we can always roll back if things go sideways.

Feature goal (what are we trying to solve here?)

CAPI Agent Control Plane Provider and CAPI Bootstrap Provider will provide an easy way to install clusters through CAPI.

 

These providers will not be generic OpenShift providers, as they are geared towards bare metal. They will leverage the Assisted Installer ZTP flow and will benefit bare-metal users by avoiding the need to provision a bootstrap node (as opposed to a regular OpenShift install, where the bootstrap node is required), while complying better with the CAPI interface.

milestones:

  • [spike] create PoC with full provisioning flow
    • install control plane nodes (SNO, compact)
    • add workers
    • coupled with metal3

 

  • production ready code
    • review naming/repository to be hosted etc
    • review project structure
    • review

DoD (Definition of Done)

  • Can install a new cluster via CAPI by using the OpenShift Agent Control Plane provider
  • Can upgrade a managed cluster via CAPI
  • Can decommission a managed cluster via CAPI
  • Support scaling/autoscaling

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Sylva
    • ...

Reasoning (why it’s important?)

  • ..

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • ..

 With the standard image (non liveISO) flow, we decoupled from BMHs.

The challenge now is that we cannot set the label from the controller directly, as the data will be available in-host only.

We should now:

 
 
The flow goes like this: M3Machine will set providerID on itself from the BMH and from the label on the node; it expects them to be the same, else it won't succeed. This is how it shares data with the host (eventually).

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

To allow for installs on platforms where ISOs are not easily used/supported the assisted installer should have an install flow that requires only a disk image and a user-data service.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Installation using a disk image must work as expected.

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
      • Orange/Sylva
    • How many customers asked for it?
      • Not sure
    • Can we have a follow-up meeting with the customer(s)?
      • Probably

 

  • Catching up with OpenShift

    • This is also a standard way to install openshift so having parity with that in the assisted installer also has value.

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

 

This allows assisted installer to more closely map to the assumptions for installing using CAPI. It also allows us to more easily support installs on platforms where booting a live-iso isn't an option or is very difficult (OCI and AWS, for example)

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
      • This is the default method for installing kubernetes using CAPI and also for installing OCP on most cloud platforms.

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

 

I don't know. I assume some data about installs using this method vs ISO methods might exist, but I don't have it.

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

N/A this isn't so much a feature of the installer as a way of installing.

Today users can customize installer args when we run coreos-installer.

The new install process should parse those args and preserve the functionality in a way that is seamless for the user.

The most relevant ones from https://coreos.github.io/coreos-installer/cmd/install/ seem to be:

  • --copy-network
  • --append-karg
  • --delete-karg
  • --console

We shouldn't have to worry about the partitioning flags since we're not writing a new disk and the options for fetching ignitions shouldn't apply since we're always getting ignition from the service and writing it locally.
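
For context, an existing-flow invocation might look like the following (illustrative values; the flags themselves are the documented coreos-installer install options listed above), and these are the arguments the new flow needs to parse and honor:

coreos-installer install /dev/sda \
  --ignition-file /opt/install/node.ign \
  --copy-network \
  --append-karg nosmt \
  --delete-karg quiet \
  --console ttyS0,115200n8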
 

When discovery is running on the real device rather than in a live ISO, assisted-installer needs to know how to install to the local disk using ostree rather than using coreos-installer.

Create that new install flow.

The agent or installer may be able to detect if we're running on the install target disk or a live ISO.

If they detect we're not running on a live ISO they should switch automatically to the new install flow.

If it's not possible to detect this we'll need some kind of API for the user to choose a flow.
 

Feature goal (what are we trying to solve here?)

Allow users to do a basic OpenShift AI installation with one click in the "operators" page of the cluster creation wizard, similar to how the ODF or MCE operators can be installed.

DoD (Definition of Done)

This feature will be done when users can click on the "OpenShift AI" check box on the operators page of the cluster creation wizard, and end up with an installation that can be used for basic tasks.

Does it need documentation support?

Yes.

Feature origin (who asked for this feature?)

  • Internal request

Feature usage (do we have numbers/data?)

  • According to existing data most of the existing RHOAI clusters have been installed using assisted installer.

Feature availability (why should/shouldn't it live inside the UI/API?)

  • It needs to be in the API because that is the way to automate it in assisted installer.
  • It needs to be in the UI because we want the user experience to be just clicking one check-box.

Currently installing the OpenShift AI operator requires at least one supported GPU. For NVIDIA GPUs it is also necessary to disable secure boot, because otherwise it isn't possible to load the NVIDIA drivers. This ticket is about adding that validation, so that the problem will be detected and reported to the user before installing the cluster.
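
For illustration (an assumed approach, not the agreed implementation), the two host facts the validation needs can be checked on the host with standard tools:

# Is an NVIDIA GPU present?
lspci -nn | grep -i nvidia

# Is Secure Boot enabled? mokutil prints "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state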

 Dan Clark mentions that:

In both OCP 3 and 4, MTU issues present in very bizarre behavior.

In OCP 4, the 2 SSL warning pages for the console would load and then just a white screen of doom forever

In the latest case, he was using ABI with the interface MTUs set to 9000 using the NMState interface config, but the network switches were not configured for jumbo frames. The result was that none of the control plane nodes ever rebooted. According to Dan 'generally it's SSL frames like OAUTH that have "no fragment" marked' and thus get lost.

External configuration issues that "present in very bizarre behaviour" are ideal candidates for assisted installer validations. We already verify connectivity between nodes using ping. We could verify the expected MTU by additionally doing (the equivalent of):

mtu="$(ip -j link show "${ifname}" | jq '.[0].mtu')"
ping "${ip_addr}" -c 3 -M do -s $((mtu - 28)) -I "${ifname}"
ping6 "${ip6_addr}%${ifname}" -c 3 -M do -s $((mtu - 48)) -I "${ifname}"

i.e. send a maximum-sized packet with No Fragment set and see if we get a response. This will be sufficient to validate connectivity even in cases where ICMP "Fragmentation Needed" packets are dropped (and therefore Path-Based MTU Discovery will not work).

Probably we should limit this to interfaces where the MTU is set explicitly? Rather than interrogate the NMState data, perhaps verifying all interfaces where the MTU is not 1500 would be the simplest approach.

Feature goal (what are we trying to solve here?)

Deprecate high_availability_mode as it was replaced by control_plane_count

DoD (Definition of Done)

high_availability_mode is no longer used in our code

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • It doesn't have a meaning anymore once control_plane_count was introduced

Currently, the monitoring stack is configured using a ConfigMap. In OpenShift, though, the best practice is to configure operators using custom resources.

Why this matters

  • We can add [cross]validation rules to CRD fields to avoid misconfigurations
  • End users get a much faster feedback loop. No more applying the config and scanning logs if things don't look right. The API server will give immediate feedback
  • Organizational users (such as ACM) can manage a single resource and observe its status

To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.

One proposal for a minimal DoD was:

  • We have a feature gate
  • We have outlined our idea and approach in an enhancement proposal. This does not have to be complete, just outline how we intend to implement this. OpenShift members have reviewed this and given their general approval. The OEP does not need to be complete or merged,
  • We have CRD scaffolding that CVO creates and CMO watches
  • We have a clear idea for a migration path. Even with a feature gate in place we may not simply switch config mechanisms, i.e. we must have a mechanism to merge settings from the config maps and CR, with the CR taking precedence.
  • We have at least one or more fields, CMO can act upon. For example
    • a bool field telling CMO to use the config map for configuration
    • ...

Feature parity should be planned in one or more separate epics.

This story covers the implementation of our initial CRD in CMO. When the feature gate is enabled, CMO watches a singleton CR (name tbd) and acts on changes. The initial feature could be a boolean flag (defaulting to true) that tells CMO to merge the configmap settings. If a user sets this flag to false, the config map is ignored and default settings are applied.
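
A very rough sketch of the shape of such a CR, purely to illustrate the idea (the group, version, kind, and field name below are placeholders, since the actual name is still to be decided):

apiVersion: monitoring.openshift.io/v1alpha1   # placeholder group/version
kind: ClusterMonitoring                        # placeholder kind; singleton instance
metadata:
  name: cluster
spec:
  # Placeholder field: when true (the default), CMO continues to merge settings
  # from the cluster-monitoring-config ConfigMap; when false, the ConfigMap is
  # ignored and default settings apply.
  useConfigMap: true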

The `PrometheusPossibleNarrowSelectors` alert was added to help identify label selector misuses after the Prometheus v3 update (more details below).

Setting Prometheus/Thanos log level to "debug" (see
https://docs.openshift.com/container-platform/latest/observability/monitoring/configuring-the-monitoring-stack.html#setting-log-levels-for-monitoring-components_configuring-the-monitoring-stack)
should provide insights into the affected queries and relabeling configs.

See attached PR for how to fix.
If assistance is needed, please leave a comment.

With Prometheus v3, the classic histogram's "le" and summary's "quantile" label values will be floats.

All queries (in Alerts, Recording rules, dashboards, or interactive ones) with selectors that assume "le"/"quantile" values to be integers only should be adjusted.
Same applies to Relabel Configs.

Queries:

foo_bucket{le="1"} may need to be turned into foo_bucket{le=~"1(.0)?"}
foo_bucket{le=~"1|3"} may need to be turned into foo_bucket{le=~"1|3(.0)?"}

(same applies to the "quantile" label)

Relabel configs:

    - action: foo
      regex: foo_bucket;(1|3|5|15.5)
      sourceLabels:
      - __name__
      - le

may need to be adjusted

    - action: foo
      regex: foo_bucket;(1|3|5|15.5)(\.0)?
      sourceLabels:
      - __name__
      - le

(same applies to the "quantile" label)

Also, from upstream Prometheus:

Aggregation by the `le` and `quantile` labels for vectors that contain the old and
new formatting will lead to unexpected results, and range vectors that span the
transition between the different formatting will contain additional series.
The most common use case for both is the quantile calculation via
`histogram_quantile`, e.g.
`histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m])))`.
The `histogram_quantile` function already tries to mitigate the effects to some
extent, but there will be inaccuracies, in particular for shorter ranges that
cover only a few samples.

A warning about this should suffice, as adjusting the queries would be difficult, if not impossible. Additionally, it might complicate things further.

See attached PRs for examples.

A downstream check to help surface such misconfigurations was added. An alert will fire for configs that aren't enabled by default and that may need to be adjusted.

For more details, see https://docs.google.com/document/d/11c0Pr2-Zn3u3cjn4qio8gxFnu9dp0p9bO7gM45YKcNo/edit?tab=t.0#bookmark=id.f5p0o1s8vyjf

This should give us an idea of what changes, if any, we need in our stack to support and ship Prometheus 3.
Additionally, it would be great to get an idea of what improvements we could ship based on Prometheus 3.

See if we want to add the checks in https://github.com/openshift/prometheus/pull/227 (last commits) to thanos ruler and querier.

Queries with broken selectors may be sent to thanos querier without necessarily reaching the checks added into Prometheus (no alert on prometheus would be triggered)

Thanos ruler may be evaluating rules with broken selectors.

The check will consist of adding the same code into the Prometheus parser that Thanos imports.

Add fallbackScrapeProtocol to ScrapeClass in Prometheus Operator

And have it set downstream.

(no need to wait for a prometheus-operator release, we can cherry-pick it once merged; that would be considered an extra test for the change)
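
Once the upstream field exists, the downstream setting could look roughly like this (a sketch only; the exact field name and placement depend on the final upstream API):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  scrapeClasses:
  - name: default
    default: true
    # Proposed field: scrape protocol to fall back to when a target does not
    # negotiate a supported one
    fallbackScrapeProtocol: PrometheusText0.0.4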

Or whatever the name of that alert ends up being.

Warn about possible misuses of selectors on the "le" and/or "quantile" label values.

See https://issues.redhat.com/browse/MON-4129 for what the runbook could include

The history of this epic starts with this PR, which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single vs manifest-listed. Imagestreams today have the `importMode` flag set to `Legacy` by default to avoid breaking the behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, it gives rise to problems when using imagestreams with the default behavior of importing a single-manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload

Some open questions:

  • What happens to existing imagestreams on upgrades
  • How do we handle CVO managed imagestreams (IMO, CVO managed imagestreams should always set importMode to preserveOriginal as the images are associated with the payload)

 

Add feature-gated automation tests to exercise this functionality:

  • test that import mode is set based on the CV's desired Architecture
  • test that if the image.config.openshift.io config spec has import mode Legacy, imagestreams that are created without specifying the import mode has them set to Legacy
  • test that if the image.config.openshift.io config spec has import mode PreserveOriginal, imagestreams that are created without specifying the import mode has them set to PreserveOriginal

Once these tests have run for a release, this signal will give us enough confidence to make the featuregate enabled by default.
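
For reference, the setting the tests need to assert lands on each tag's import policy; on a cluster with a multi payload, a newly created imagestream would be expected to look roughly like this (illustrative object, not a test fixture):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/example/app:latest
    importPolicy:
      importMode: PreserveOriginal   # expected to default to Legacy on a single-arch payload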

Epic Goal

Bump vendored Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) to v0.32.0 or newer version.

Why is this important?

Keep vendored packages up to date.
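
A minimal sketch of the bump itself (the module list follows the epic goal; repository-specific verification steps are not shown):

go get k8s.io/api@v0.32.0 k8s.io/apimachinery@v0.32.0 k8s.io/client-go@v0.32.0
go mod tidy
go mod vendor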

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to ToDo status

  • Priority+ is set by engineering
  • Epic must be Linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • -Release Technical Enablement - Provide necessary release enablement
    details and documents.-

Dependencies (internal and external)

1. Other vendored dependencies (such as openshift/api and controller-runtime) may also need to be updated to Kubernetes 1.32.

Previous Work (Optional)

1. NE-1875.

Open questions

None.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem

The openshift/cluster-ingress-operator repository vendors k8s.io/* v0.31.1. OpenShift 4.19 is based on Kubernetes 1.32.

Version-Release number of selected component (if applicable)

4.19.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.19/go.mod.

Actual results

The k8s.io/* packages are at v0.31.1.

Expected results

The k8s.io/* packages are at v0.32.0 or newer.

Epic Goal

This epic is a placeholder for code refactoring, code cleanups, and other maintenance work in cluster-ingress-operator, keepalived, and other components that the Network Ingress & DNS team maintains that doesn't already have an associated Jira issue.

This work has no impact on docs, QE, or PX. This epic strictly tracks internal code changes that are not customer facing and that otherwise have no Jira issues to track them.

See the child issues for specific maintenance work under this epic.

Why is this important?

Our code needs refactoring to remove duplicated logic, as well as cleanups to improve readability. These code changes will make it easier to find and fix defects and extend the code with new features.

Acceptance Criteria

All child issues under this epic are closed.

Dependencies (internal and external)

None. If any major dependencies are discovered in the course of the work under this epic, we will split that work into a separate epic and plan it appropriately.

Previous Work (Optional)

NE-1182 is an example of previous refactoring and code cleanups.

Open questions

None.

Move the FakeCache, FakeController, and FakeClientRecorder test helpers, which are currently duplicated in cluster-ingress-operator's gateway-service-dns and gatewayapi controller test code, into a dedicated package, so that the controllers' respective tests can import it instead of each having copies of the same helpers.

Epic Goal

  • Rebasebot needs to support a customizable system for running repository-specific tooling before/during/after a rebase. This is primarily required for automatic rebases of CAPI provider repositories. Other uses for this feature are expected.

Why is this important?

  • Further automation of our rebase process will allow us to focus more on development instead of maintenance.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a developer, I want rebasebot to periodically update the Cluster API repositories.

Background

Created this card to track merging my PR, which was hit with a few blockers. We ran the required rebases as rehearsal jobs but did not merge the PR.

Steps

Stakeholders

  • Cluster Infra

Definition of Done

  • Cluster API rebases run as a periodic CI job
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • This epic tracks the work needed to move the ClusterImagePolicy API from v1alpha1 to v1 in OCP 4.19. It does not include any new feature requests, which will be tracked by other epics—just the API upgrade process, CI jobs and related tasks.

Why is this important?

  • Moving the ClusterImagePolicy API to a stable version v1 and announcing that OpenShift now supports Sigstore verification are key steps in helping customers strengthen their software supply chain.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This story is to track the CI jobs that need to be added to test `ClusterImagePolicy` and `ImagePolicy` API. 

Covered configuration:

testing the verification coordinates configured via image.config.openshift.io/cluster: supported scopes come from allowedRegistries (https://access.redhat.com/solutions/6958257)

Covered root of trust type: public key (with Rekor)

The tests should cover all supported configurations. Future tests are tracked in OCPNODE-2951.
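
For illustration, a ClusterImagePolicy using the covered root of trust (public key with Rekor) could look roughly like the sketch below. This reflects the v1alpha1 shape; the v1 schema tracked by this epic may differ, and the scope and key data are placeholders:

apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: example-sigstore-policy
spec:
  scopes:
    - quay.io/example/signed-app              # placeholder scope
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: <base64-encoded cosign public key>      # placeholder
        rekorKeyData: <base64-encoded Rekor public key>   # placeholder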

Different from case failures, this task will focus on case enhancement. Track the automation work not covered by QE e2e tasks, such as:

  • cases created/updated from bugs
  • cases created/updated from docs
  • old cases which are not automated yet
  • case automation has not failed, but can be improved

 

Problem:

This epic covers the scope of automation-related stories in ODC

Goal:

Automation enhancements for ODC

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Automation enhancements as per the perspective merged
  2. Tests to be updated according to the default setting of having only Admin perspective

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:

Update CI config to reflect video in CI results
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Track work that needs to happen in 4.18 but was not part of the original planning.

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority+ is set by engineering
  • Epic must be Linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This is a followup to https://issues.redhat.com//browse/OCPBUGS-44694 to make the monitor resilient in all configurations, including things like 5 node control planes. Instead of relying on a longer fall time, we can just let HAProxy report its own ability to reach any backend, which means under ordinary circumstances this check will never fail. There will no longer be any issue with pathologically bad call chains where we happen to hit backends that are down but haven't been detected yet.

We will use this to address tech debt in OLM in the 4.10 timeframe.

 

Items to prioritize are:

CI e2e flakes

 

The operator framework portfolio consists of a number of projects that
have an upstream project. The verify-commits makefile target had been
introduced in each of the downstream  projects to assist with the
downstreaming efforts. The commit checking logic was eventually moved
into a container and was rolled out as the verify-commits-shared test,
which ran alongside the verify-commits test to ensure that it worked
as expected.

We are now confident that the verify-commits-shared test is running as
expected, and can replace the logic used in the verify-commits test with
that of the verify-commits-shared test, which will allow us to remove
the verify-commits Makefile targets from each of the repos, simplifying
the code base and reducing duplication of code.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To enable iterative development we might want to put some of the new features under TechPreviewNoUpgrade feature set while maintaining a set of stable features at the same time

Why is this important?

  • Before OCP 4.18 the entirety of OLMv1 was under the TechPreviewNoUpgrade feature set which allowed us to make breaking API changes without having to provide an upgrade path or breaking customers. Starting from OCP 4.18 OLMv1 is part of the default OCP payload and default feature set which means that we need to maintain API compatibility.
    At the same time OLMv1 is still in active development and we are looking to introduce more features and deeper integration with OCP (such as OCP web console integration). To enable iterative development we might want to put some of the new features under TechPreviewNoUpgrade feature set while maintaining a set of stable features at the same time. Effectively this means that OLMv1 will work both with TechPreviewNoUpgrade and without it but will have a different set of features.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Modify cluster-olm-operator to allow it to respect feature gates

AC:

  • cluster-olm-operator watches OCP's FeatureGate and reconciles OLMv1 components accordingly

See RFC for additional background and details
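
For context, the cluster-scoped FeatureGate singleton named "cluster" is the resource the operator would watch. A minimal sketch of the TechPreviewNoUpgrade opt-in it would react to (illustrative only):

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade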

Goal

As documented in OCPCLOUD-1578, OpenShift would like to migrate from Machine API to Cluster API and eventually remove Machine API. This effort is going to require work from all affected platforms including OpenStack. This epic tracks the implementation of the OpenStack part of the mapi2capi and capi2mapi translation layer being added to cluster-capi-operator, based on the scoping done in OSASINFRA-3440.

Note that it is important that we implement both MAPI to CAPI and CAPI to MAPI for all platforms including OpenStack. This ensures we will always have a mirror copy on both sides, which is particularly useful for third-party components that don't have a way to interact with CAPI yet. In these situations, users can create CAPI resources while these components can (until updated) continue to fetch the MAPI mirror and work out the state of the cluster that way.

Why is this important?

Repeating from OCPCLOUD-1578:

  • MAPI is widely used and as OpenShift switches over to using CAPI we need a painless and simple way to get our users over to using CAPI
  • Users should either not notice, or not care that the migration has occurred
  • Must ensure a smooth and staged migration with provider migration being independent from one another

From an OpenStack perspective, we simply need to ensure we follow suit with other platforms.

Scenarios

  • Given an existing deployment being upgraded to 4.18, we should automatically translate the existing MAPO resources to CAPO resources.

Acceptance Criteria

  • Patches merged to openshift/cluster-capi-operator
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

TBD.

Previous Work (Optional):

Open questions::

TBD.

We need to add support for fake MAPI and CAPI cluster/machineset/machine builders to the openshift/cluster-api-actuator-pkg package. These already exist for other providers, so this is likely to be an exercise in copy-pasting-tweaking.

We don't use a feature gate for determining whether to deploy CAPO or not. Rather, we check for a whole feature set: specifically, the TechPreviewNoUpgrade. However, the MachineAPIMigration controller from cluster-capi-operator does use a feature gate, MachineAPIMigration, which is not included in the TechPreviewNoUpgrade feature set and must be enabled manually using the CustomNoUpgrade feature set. This leaves us in a bit of a catch-22. We can resolve this by ensuring CAPO is also deployed when the CustomNoUpgrade feature set is used.

See https://github.com/openshift/cluster-capi-operator/pull/238 for the changes to all the other controllers.
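
A minimal sketch of enabling the MachineAPIMigration gate through the CustomNoUpgrade feature set, which is the configuration the paragraph above says CAPO deployment should also account for (illustrative only):

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
      - MachineAPIMigration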

This EPIC groups the tasks that need to be finished so we can deliver Hosted Control Plane on OpenStack as TechPreview.

Some tasks were initially in this EPIC and were de-prioritized to be done later, once we have customer feedback. What remains in the current list are things we think can be achieved within the 4.19 cycle.

This EPIC will be used by QE to test the quality of the product, and the outcome will have a direct impact on whether this can be TechPreview or not. The HCP team will only accept it as TechPreview if we have QE coverage.

The initial scenario that QE agreed to start with is the following:

  • OpenStack management cluster deployed via a ShiftOnStack cluster (a regular cluster), on any supported version of OSP (17.1 or 18.0 at this point). The cluster is using a dedicated OpenStack project (e.g. hcp-hub).
  • One or multiple HostedClusters deployed on one or multiple namespaces using different projects for the Nodepools and the OpenStack resources associated with the Hosted Clusters (e.g. hcp-spoke1, hcp-spoke-2, etc).
  • The use case is described and drawn here: https://docs.google.com/document/d/178_FTiLCnMBfAdDoaqm15YCuBkNof7SGZueBZOx1cTE/edit?tab=t.0
  • Against that HostedCluster, the regular OpenShift conformance suite can be executed (including CSI cinder and CSI manila). The OpenStack e2e tests can also be executed.
  • Note: to execute the Hypershift e2e tests, an AWS account is required for dynamic DNS provisioning (route53).

CAPO doesn't currently support Glance image creation.

This will have to be via upload for now rather than web-download because:

  • the RHCOS image is gzipped
  • glance can automatically decompress an image if the image_decompression plugin is enabled, but
  • we have no way to determine whether the image_decompression plugin is enabled, and
  • glance will report success but create a corrupt image if it is not

This task focuses on ensuring that all OpenStack resources automatically created by Hypershift for Hosted Control Planes are tagged with a unique identifier, such as the HostedCluster ID. These resources include, but are not limited to, servers, ports, and security groups. Proper tagging will enable administrators to clearly identify and manage resources associated with specific OpenShift clusters.

Acceptance Criteria:

  1. Tagging Mechanism: All OpenStack resources created by Hypershift (e.g., servers, ports, security groups) should be tagged with the relevant Cluster ID or other unique identifiers.
  2. Automated Tagging: The tagging should occur automatically when resources are provisioned by Hypershift.
  3. Consistency: Tags should follow a standardized naming convention for easy identification (e.g., cluster-id: <ClusterID>).
  4. Compatibility: Ensure the solution is compatible with the current Hypershift setup for OpenShift Hosted Control Planes, without disrupting functionality.
  5. Testing: Create automated tests or manual test procedures to verify that the resources are properly tagged when created.
  6. Documentation: Update relevant documentation to inform administrators about the new tagging system, how to filter resources by tags, and any related configurations.
  • Make a table with available CLI options
  • Explain different scenarios/architectures
  • Make it prettier to read

Once we rebase downstream CAPO to 0.12, ORC will need to be taken care of: Hypershift will have to deploy ORC, since CAPO won't.

Valuable slack thread giving details on what and how to do: https://redhat-external.slack.com/archives/C01C8502FMM/p1738877046453869

We broke the 4.18 CI jobs because now ORC is being deployed but it's not in the release payload. We should only install ORC on 4.19 and beyond.

When deploying a HostedCluster, etcd will be using the default CSI StorageClass but we can override it with --etcd-storage-class.

Customers should use a local storage CSI Storage class in production to avoid performance issues in their clusters.

 

Spike / Document: https://docs.google.com/document/d/1qdYHb7YAQJKSDLOziqLG8jaBQ80qEMchsozgVnI7qyc

 

This task also includes a doc change in Hypershift repo but depends on OSASINFRA-3681.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Feature gates must demonstrate completeness and reliability.

As per https://github.com/openshift/api?tab=readme-ov-file#defining-featuregate-e2e-tests:

  1. Tests must contain either [OCPFeatureGate:<FeatureGateName>] or the standard upstream [FeatureGate:<FeatureGateName>].
  2. There must be at least five tests for each FeatureGate.
  3. Every test must be run on every TechPreview platform we have jobs for. (Ask for an exception if your feature doesn't support a variant.)
  4. Every test must run at least 14 times on every platform/variant.
  5. Every test must pass at least 95% of the time on every platform/variant.

If your FeatureGate lacks automated testing, there is an exception process that allows QE to sign off on the promotion by commenting on the PR.

The introduced functionality is not that complex. The only newly introduced ability is to modify the CVO log level using the API. However, we should still introduce an e2e test or tests to demonstrate that the CVO correctly reconciles the new configuration API. 

The tests may be:

  • Check whether the CVO notices a new configuration in a reasonable time.
  • Check whether the CVO increments the observedGeneration correctly.
  • Check whether the CVO changes its log level correctly.
  • TODO: Think of more cases.

Definition of Done:

  • e2e test/s exists to ensure that the CVO is correctly reconciling the new configuration API

Before implementing such changes in the API, we need to send an enhancement proposal containing the design changes we suggest in openshift/api/config/v1/types_cluster_version.go to allow changing the log level of the CVO through an API configuration.

Definition of Done:

The ClusterVersionOperator API has been introduced in the DevPreviewNoUpgrade feature set. Enable the CVO in standalone OpenShift to change its log level based on the new API.

Definition of Done:

  • New DevPreviewNoUpgrade manifests are introduced in the OCP payload
  • The CVO is correctly reconciling the new CR and all of its new fields
  • The CVO is changing its log level based on the log level in the new API
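
A hedged sketch of what the new DevPreviewNoUpgrade CR might look like when raising the log level; the API group, version, and field names shown here are assumptions and may not match the final API:

apiVersion: operator.openshift.io/v1alpha1   # assumed group/version
kind: ClusterVersionOperator
metadata:
  name: cluster
spec:
  # assumed field; expected to accept Normal, Debug, Trace, TraceAll
  operatorLogLevel: Debug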

The CVO is configurable using a manifest file. We now can proceed to HyperShift. Introduce the HyperShift API changes described in the enhancement.

 

Definition of Done:

  • API changes are merged in HyperShift.

Description

This is a placeholder epic to group refactoring and maintenance work required in the monitoring plugin

Background

As the monitoring plugin migrated away from nginx to a lightweight Go backend, changes in CMO were required to remove the ConfigMap that stored the nginx configuration.

In order to gracefully handle CMO upgrades, the ConfigMap needs to be deleted if it exists in 4.18. For 4.19 this logic can be removed, as the ConfigMap is not there anymore.

Previous PR: https://github.com/openshift/cluster-monitoring-operator/pull/2412

Outcomes

  • If not already completed:
    • swap the Docker images in the `monitoring-plugin` to use the go binary
    • Update the CI for the `monitoring-plugin` to have FIPS checks similar to the other plugins (example)
  • The logic for deleting the ConfigMap is removed from CMO for 4.19
  • The code allowing for a backwards compatible image in CMO is adjusted or removed 

Steps

  1. Remove the plugin nginx config ConfigMap deletion from CMO code

This epic will serve as a parent issue for all the bugs and RFEs for the incidents section in the monitoring plugin

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

OCP conformance has a test that monitors how many times a container exits, and it appears managed-upgrade-operator restarts up to 10 times in a run

Job https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58078/rehearse-58078-periodic-ci-openshift-release-master-nightly-4.18-e2e-rosa-sts-ovn/1849815251021729792

 

 {  1 containers with multiple restarts

namespace/openshift-managed-upgrade-operator node/ip-10-0-24-14.us-west-2.compute.internal pod/managed-upgrade-operator-775587df-sbgds uid/e7ac3064-a9f6-4864-8b5f-58918b30c54f container/managed-upgrade-operator restarted 10 times at:
non-zero exit at 2024-10-25 15:42:48.147733774 +0000 UTC m=+151.061055807: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 15:48:00.093547308 +0000 UTC m=+463.006869341: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 15:53:08.363301669 +0000 UTC m=+771.276623702: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 15:58:15.514874824 +0000 UTC m=+1078.428196857: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:03:18.539399543 +0000 UTC m=+1381.452721576: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:08:25.585326381 +0000 UTC m=+1688.498648414: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:13:35.622986532 +0000 UTC m=+1998.536308555: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:18:39.307340212 +0000 UTC m=+2302.220662235: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:23:49.56614045 +0000 UTC m=+2612.479462473: cause/Error code/1 reason/ContainerExit
non-zero exit at 2024-10-25 16:29:00.143738226 +0000 UTC m=+2923.057060299: cause/Error code/1 reason/ContainerExit}

Goal:

  • Deliver fixes for prioritized selinux-policy bugzillas and make every effort to resolve all remaining selinux-policy bugzillas as a part of RHEL 10.1 and RHEL 9.7 during CY24Q4 and CY25Q1. All bugzillas were scoped by both maintainers and QE, agreed on, and acks granted.

Key result:

  • Bugs are in verified state.

Acceptance criteria:

  • All scoped prioritized 10.1 and 9.7 bugzillas are resolved according to the milestones mentioned in the RHEL-10.1/RHEL-9.7 release plans.
  • Non-prioritized 10.1 and 9.7 bugzillas are resolved according to the milestones mentioned in the RHEL-10.1/RHEL-9.7 release plans.

Description of problem:

OpenShift 4.19 clusters cannot be created on PowerVS CI.

The only cluster operator active is
cloud-controller-manager                   4.19.0-0.nightly-multi-2025-01-24-235841   True        False         False      14m     

    

Version-Release number of selected component (if applicable):

rhcos-9-6-20250121-0-ppc64le-powervs.ova.gz
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Use the IPI installer to create an OpenShift cluster.
    

Actual results:


    

Expected results:


    

Additional info:

[core@localhost ~]$ sudo systemctl status afterburn-hostname.service --no-pager -l
× afterburn-hostname.service - Afterburn Hostname
     Loaded: loaded (/etc/systemd/system/afterburn-hostname.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2025-02-11 13:33:59 UTC; 9min ago
    Process: 1920 ExecStart=/usr/bin/afterburn --provider powervs --hostname=/etc/hostname (code=exited, status=1/FAILURE)
   Main PID: 1920 (code=exited, status=1/FAILURE)
        CPU: 5ms

Feb 11 13:33:59 localhost.localdomain systemd[1]: Starting Afterburn Hostname...
Feb 11 13:33:59 localhost.localdomain afterburn[1920]: Error: failed to run
Feb 11 13:33:59 localhost.localdomain afterburn[1920]: Caused by:
Feb 11 13:33:59 localhost.localdomain afterburn[1920]:     0: fetching metadata from provider
Feb 11 13:33:59 localhost.localdomain afterburn[1920]:     1: failed to create temporary directory
Feb 11 13:33:59 localhost.localdomain afterburn[1920]:     2: Permission denied (os error 13) at path "/tmp/afterburn-n4dSuU"
Feb 11 13:33:59 localhost.localdomain systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=1/FAILURE
Feb 11 13:33:59 localhost.localdomain systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Feb 11 13:33:59 localhost.localdomain systemd[1]: Failed to start Afterburn Hostname.


[core@localhost ~]$ sudo grep afterburn /var/log/audit/audit.log
type=AVC msg=audit(1739280490.147:59): avc:  denied  { create } for  pid=3394 comm="afterburn" name="afterburn-tZsJKP" scontext=system_u:system_r:afterburn_t:s0 tcontext=system_u:object_r:tmp_t:s0 tclass=dir permissive=0
type=SYSCALL msg=audit(1739280490.147:59): arch=c0000015 syscall=39 success=no exit=-13 a0=7fffee6b8120 a1=1ff a2=16 a3=16 items=0 ppid=1 pid=3394 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="afterburn" exe="/usr/bin/afterburn" subj=system_u:system_r:afterburn_t:s0 key=(null)ARCH=ppc64le SYSCALL=mkdir AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
type=SERVICE_START msg=audit(1739280490.147:60): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=afterburn-hostname comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'UID="root" AUID="unset"
type=AVC msg=audit(1739280839.555:63): avc:  denied  { create } for  pid=1920 comm="afterburn" name="afterburn-n4dSuU" scontext=system_u:system_r:afterburn_t:s0 tcontext=system_u:object_r:tmp_t:s0 tclass=dir permissive=0
type=SYSCALL msg=audit(1739280839.555:63): arch=c0000015 syscall=39 success=no exit=-13 a0=7ffff52ed5a0 a1=1ff a2=16 a3=16 items=0 ppid=1 pid=1920 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="afterburn" exe="/usr/bin/afterburn" subj=system_u:system_r:afterburn_t:s0 key=(null)ARCH=ppc64le SYSCALL=mkdir AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
type=SERVICE_START msg=audit(1739280839.565:64): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=afterburn-hostname comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'UID="root" AUID="unset"

    

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we recommend installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest of using Openshift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance Criteria

  • Since api and library-go are the last projects for removal, remove only alibaba specific code and vendoring

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/azure-disk-csi-driver

Update OCP release number in OLM metadata manifests of:

  • local-storage-operator
  • aws-efs-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • secrets-store-csi-driver-operator
  • smb-csi-driver-operator

OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56

We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/csi-driver-smb

Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.
Please wait until openshift/api, openshift/library-go, and openshift/client-go are updated to the newest Kubernetes release! There may be non-trivial changes in these libraries.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)

 

Transferred ownership to AOPE team:

  • secrets-store-csi-driver-operator (covered by STOR-2310)

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • github.com/openshift/alibaba-disk-csi-driver-operator
  • github.com/openshift/csi-driver-shared-resource-operator

The following operators were migrated to csi-operator, do not update these obsolete repos:

  • github.com/openshift/aws-efs-csi-driver-operator
  • github.com/openshift/aws-ebs-csi-driver-operator
  • github.com/openshift/azure-disk-csi-driver-operator
  • github.com/openshift/azure-file-csi-driver-operator
  • github.com/openshift/openstack-cinder-csi-driver-operator
  • github.com/openshift/csi-driver-manila-operator

tools/library-bump.py  and tools/bump-all  may be useful. For 4.16, this was enough:

mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16" 

4.17 perhaps needs an older prometheus:

../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17" 

4.18 special:

Add "spec.unhealthyEvictionPolicy: AlwaysAllow" to all PodDisruptionBudget objects of all our operators + operands. See WRKLDS-1490 for details

There has been a change in the library-go function `WithReplicasHook`. See https://github.com/openshift/library-go/pull/1796.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/ibm-vpc-block-csi-driver

This epic is part of the 4.18 initiatives we discussed, it includes:

  1. Expanding external testing sources beyond openshift/kubernetes
  2. Test graduation from informing -> blocking
  3. Enforcing 95% pass rate on newly added tests to OCP in Component Readiness
  4. Finding regressions in tests for low frequency but high importance variants

The user is running openshift-tests from git against a much older cluster; the tests image won't have k8s-tests-ext, but this doesn't produce an error until we try to run it.

Looks like there's no check that the image extraction was successful.

 

INFO[0002] Run image extract for release image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:143e5aa28ec8a2345e934373b7bfe234f50a825691c67286aecc031424c38c43" and src "/usr/bin/k8s-tests-ext.gz" 
INFO[0013] Completed image extract for release image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:143e5aa28ec8a2345e934373b7bfe234f50a825691c67286aecc031424c38c43" in 11.020672636s 
Suite run returned error: encountered errors while extracting binaries: failed to decompress external binary "/usr/bin/k8s-tests-ext.gz": failed to open gzip file: open /home/shmoran/.cache/openshift-tests/quay_io_openshift-release-dev_ocp-release_sha256_bada2d7626c8652e0fb68d3237195cb37f425e960347fbdd747beb17f671cf13_0cfab9de1754/k8s-tests-ext.gz: no such file or directory
error running options: encountered errors while extracting binaries: failed to decompress external binary "/usr/bin/k8s-tests-ext.gz": failed to open gzip file: open /home/shmoran/.cache/openshift-tests/quay_io_openshift-release-dev_ocp-release_sha256_bada2d7626c8652e0fb68d3237195cb37f425e960347fbdd747beb17f671cf13_0cfab9de1754/k8s-tests-ext.gz: no such file or directoryerror: encountered errors while extracting binaries: failed to decompress external binary "/usr/bin/k8s-tests-ext.gz": failed to open gzip file: open /home/shmoran/.cache/openshift-tests/quay_io_openshift-release-dev_ocp-release_sha256_bada2d7626c8652e0fb68d3237195cb37f425e960347fbdd747beb17f671cf13_0cfab9de1754/k8s-tests-ext.gz: no such file or directory 

origin should support calling an external test binary implemented using openshift-tests-extension. There's an external test binary already in the hyperkube repo: https://github.com/openshift/kubernetes/tree/master/openshift-hack/cmd/k8s-tests-ext

 

Here's the existing external binary using the legacy interface:

https://github.com/openshift/origin/blob/master/pkg/test/ginkgo/cmd_runsuite.go#L174-L179

That can just be removed and replaced with k8s-tests-ext.

 

MVP requires for k8s-tests:

  • Support for list and run-tests:
    • openshift-tests runs the extension's `list` command and supports the ExtensionTestSpec JSON format instead of the legacy one
    • Run the tests with the main loop alongside origin's existing tests
    • Process the ExtensionTestResult format, and convert it to JUnit and origin's internal representations (origin's testCase structure)

Additional things for later:

  • Support suites by getting them from the "info" command and CEL expressions to filter tests
  • Handle the informing/blocking lifecycle for tests (informing shouldn't make openshift-tests exit 1)

The design should be flexible enough to allow a scheduling algorithm that takes into account available resources/isolation, but the first pass doesn't need to implement it yet.

`openshift-tests images` identifies the locations of all test images referenced by the test suite and outputs a mirror list for use with 'oc image mirror' to copy those images to a private registry.

We need to make this work with external binaries, including making the `LocationFor` helper work with OTE-enabled binaries.

If an extension wants to produce artifacts, we need to tell it where to write them, i.e. via an EXTENSION_ARTIFACT_DIR environment variable.

 

The annotation code in origin and k8s-tests should be removed and replaced, or refactored to at least not inject the annotations into the test names themselves.  After TRT-1840 and TRT-1852 you can skip based on labels and other criteria. Skip information should be decided at run-time, and should not require revendoring.

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

Created a service for the DNS server for secondary networks in OpenShift Virtualization using MetalLB, but the IP is still pending; when accessing the service from the UI, it crashes.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1. Create an IP pool (for example 1 IP) for MetalLB and fully utilize the IP range (with another service); an illustrative pool definition is shown under Additional info below
    2. Allocate a new IP using the oc expose command like below
    3. Check the service status on the UI
    

Actual results:

UI crash

Expected results:

Should show the service status

Additional info:

oc expose -n openshift-cnv deployment/secondary-dns --name=dns-lb --type=LoadBalancer --port=53 --target-port=5353 --protocol='UDP'
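
For step 1, a single-address MetalLB pool that can be exhausted by one existing Service can be defined roughly like this (illustrative sketch; the name and address are placeholders):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: single-ip-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.10-192.0.2.10   # a range of exactly one IP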

Description of problem:

    The period is placed inside the quotes of the missingKeyHandler i18n error 

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always when there is a missingKeyHandler error

Steps to Reproduce:

    1. Check browser console
    2. Observe the period is placed inside the quotes
    3.
    

Actual results:

    It is placed inside the quotes

Expected results:

    It should be placed outside the quotes

Additional info:

    

Description of problem:

    Due to recent changes, using the oc 4.17 adm node-image commands on a 4.18 OCP cluster doesn't work

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. oc adm node-image create / monitor
    2.
    3.
    

Actual results:

    The commands fail

Expected results:

    The commands should work as expected

Additional info:

    

Description of problem:

In the dev console, select a project that has alerts (example: openshift-monitoring), silence one alert (example: Watchdog), go to the "Silence details" page, and click the Watchdog link under the "Firing alerts" section. "No Alert found" is shown instead of the alert details page; see screen recording: https://drive.google.com/file/d/1lUKLoHpmBKuzd8MmEUaUJRPgIkI1LjCj/view?usp=drive_link

the issue happens with 4.18.0-0.nightly-2025-01-04-101226/4.19.0-0.nightly-2025-01-07-234605; there is no issue with 4.17

Checked the Watchdog link under the "Firing alerts" section: there is `undefined` in the link where the namespace (openshift-monitoring) should be, as in 4.17

4.19
https://${console_url}/dev-monitoring/ns/undefined/alerts/1067612101?alertname=Watchdog&namespace=openshift-monitoring&severity=none

4.18
https://${console_url}/dev-monitoring/ns/undefined/alerts/1086044860?alertname=Watchdog&namespace=openshift-monitoring&severity=none

4.17

https://${console_url}/dev-monitoring/ns/openshift-monitoring/alerts/3861382580?namespace=openshift-monitoring&prometheus=openshift-monitoring%2Fk8s&severity=none&alertname=Watchdog

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always for 4.18+

Steps to Reproduce:

1. see the description

Actual results:

"No Alert found" shows   

Expected results:

no error

Description of problem: [UDN pre-merge testing]  not able to create layer3 UDN from CRD on dualstack cluster

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. On a dualstack cluster, created a UDN namespace with label

2. Attempted to create a layer3 UDN from CRD

 

$ cat /tmp/e2e-test-udn-networking-udn-bhmwv-n3wn8mngresource.json
{
    "kind": "List",
    "apiVersion": "v1",
    "metadata": {},
    "items": [
        {
            "apiVersion": "k8s.ovn.org/v1",
            "kind": "UserDefinedNetwork",
            "metadata": {
                "name": "udn-network-77827-ns1",
                "namespace": "e2e-test-udn-networking-udn-b6ldh"
            },
            "spec": {
                "layer3": {
                    "mtu": 1400,
                    "role": "Primary",
                    "subnets": [
                        {
                            "cidr": "10.150.0.0/16",
                            "hostSubnet": 24
                        },
                        {
                            "cidr": "2010:100:200::0/48",
                            "hostSubnet": 64
                        }
                    ]
                },
                "topology": "Layer3"
            }
        }
    ]
}

3.  got the following error message:

 The UserDefinedNetwork "udn-network-77827-ns1" is invalid: spec.layer3.subnets[1]: Invalid value: "object": HostSubnet must < 32 for ipv4 CIDR

 

subnets[1] is the IPv6 subnet, but its hostSubnet was validated against the IPv4 CIDR rule

Actual results:  Not able to create UDN in UDN namespace on dualstack cluster

Expected results:  should be able to create UDN in UDN namespace on dualstack cluster

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

oc-mirror throws an error when trying to mirror release payloads using a digest with graph=true for nightly, rc, and ec builds, and does not generate signatures for ec and rc builds.
    

Version-Release number of selected component (if applicable):

    [fedora@knarra-fedora knarra]$ ./oc-mirror version
W0210 13:19:40.701416  143622 mirror.go:102] 

⚠️  oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2

WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202502041931.p0.gc7144d5.assembly.stream.el9-c7144d5", GitCommit:"c7144d5d2c2b0345f163299ed04a400f2f93d340", GitTreeState:"clean", BuildDate:"2025-02-04T20:04:49Z", GoVersion:"go1.22.9 (Red Hat 1.22.9-1.module+el8.10.0+22500+aee717ef) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

    Always
    

Steps to Reproduce:

    1. use the nightly image set config as shown below and see that it throws an error
    2. use the rc candidate as shown below and see that it throws an error and does not generate signature-related files
    3. use the ec candidate as shown below and see that it throws an error and does not generate signature related files.
    

Actual results:

    2025/02/10 12:15:49  [INFO]   : === Results ===
2025/02/10 12:15:49  [INFO]   :  ✓  191 / 191 release images mirrored successfully
2025/02/10 12:15:49  [INFO]   : 📄 Generating IDMS file...
2025/02/10 12:15:49  [INFO]   : internlrelease/working-dir/cluster-resources/idms-oc-mirror.yaml file created
2025/02/10 12:15:49  [INFO]   : 📄 No images by tag were mirrored. Skipping ITMS generation.
2025/02/10 12:15:49  [INFO]   : 📄 No catalogs mirrored. Skipping CatalogSource file generation.
2025/02/10 12:15:49  [INFO]   : 📄 No catalogs mirrored. Skipping ClusterCatalog file generation.
2025/02/10 12:15:49  [INFO]   : 📄 Generating UpdateService file...
2025/02/10 12:15:49  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2025/02/10 12:15:49  [ERROR]  : unknown image : reference name is empty 

    

Expected results:

     No errors should be seen, and for the ec and rc candidates signatures should be generated as well.
    

Additional info:

with nightly payload and graph==true:
===============================
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    release: quay.io/openshift-release-dev/ocp-release@sha256:e0907823bc8989b02bb1bd55d5f08262dd0e4846173e792c14e7684fbd476c0d

with rc payload and graph==true:
===========================
[fedora@knarra-fedora knarra]$ cat /tmp/internal.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    release: quay.io/openshift-release-dev/ocp-release@sha256:f0de3be10be2f5fc1a5b1c208bcfe5d3a71a70989cacbca57ebf7c5fe6e14b09
    graph: true

with ec payload and graph==true:
============================
[fedora@knarra-fedora knarra]$ cat /tmp/internal.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    release: quay.io/openshift-release-dev/ocp-release@sha256:aa3e0a3a94babd90535f8298ab274b51a9bce6045dda8c3c8cd742bc59f0e2d9
    graph: true

    

Description of problem:

when the TechPreviewNoUpgrade feature gate is enabled, the console will show a customized 'Create Project' modal to all users.
In the customized modal, the 'Display name' and 'Description' values the user typed do not take effect

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-16-065305    

How reproducible:

Always when TechPreviewNoUpgrade feature gate is enabled    

Steps to Reproduce:

1. Enable TechPreviewNoUpgrade feature gate
$ oc patch  featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
2. any user login to console and create a project from web, set 'Display name' and 'Description' then click on 'Create' 
3. Check created project YAML
$ oc get project ku-5 -o json | jq .metadata.annotations
{
  "openshift.io/description": "",
  "openshift.io/display-name": "",
  "openshift.io/requester": "kube:admin",
  "openshift.io/sa.scc.mcs": "s0:c28,c17",
  "openshift.io/sa.scc.supplemental-groups": "1000790000/10000",
  "openshift.io/sa.scc.uid-range": "1000790000/10000"
} 

Actual results:

display-name and description are all empty    

Expected results:

display-name and description should be set to the values user had configured    

Additional info:

once TP is enabled, the customized Create Project modal looks like https://drive.google.com/file/d/1HmIlm0u_Ia_TPsa0ZAGyTloRmpfD0WYk/view?usp=drive_link

Description of problem:

    When deploying with endpoint overrides, the block CSI driver will try to use the default endpoints rather than the ones specified.

Description of problem:
[IPSEC] 'encapsulation=no' is not supported in OVN, 'Never' option in API might need to be removed

Version-Release number of selected component (if applicable):
Pre-merge testing build openshift/cluster-network-operator#2573
How reproducible:
Always

Steps to Reproduce:

1. % oc patch networks.operator.openshift.io cluster --type=merge -p '{ "spec":{ "defaultNetwork":{ "ovnKubernetesConfig":{ "ipsecConfig":{ "mode":"Full", "full":{"encapsulation": "Never"}}}}}}'

From https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1732621519461779,
encapsulation=no is not supported currently.
2.

3.

Actual results:
No 'encapsulation=no' parameter is added in /etc/ipsec.d/openshift.conf; the default setting (auto) is still used.

Expected results:
Suggest removing 'Never' from the API.

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

Users cannot install 4.19 OCP clusters with 4.19 oc-mirror; it fails with the error below

[core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ sudo crictl ps
FATA[0000] validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ sudo crictl img
FATA[0000] validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service
Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2568]: Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get "https://registry.build02.ci.openshift.org/v2/": dial tcp 34.74.144.21:443: i/o timeout
Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap podman[2568]: 2024-12-27 04:04:04.637824679 +0000 UTC m=+243.178748520 image pull-error  registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get "https://registry.build02.ci.openshift.org/v2/": dial tcp 34.74.144.21:443: i/o timeout
Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2107]: Pull failed. Retrying registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1...
Dec 27 04:05:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2656]: time="2024-12-27T04:05:04Z" level=warning msg="Failed, retrying in 1s ... (1/3). Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get \"https://registry.build02.ci.openshift.org/v2/\": dial tcp 34.74.144.21:443: i/o timeout"
Dec 27 04:06:05 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2656]: time="2024-12-27T04:06:05Z" level=warning msg="Failed, retrying in 1s ... (2/3). Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e
    

Version-Release number of selected component (if applicable):

     Running command: 'oc-mirror' version --output=yaml
W1227 14:24:55.919668     102 mirror.go:102] 

⚠️  oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2

clientVersion:
  buildDate: "2024-12-17T11:21:05Z"
  compiler: gc
  gitCommit: 27a04ae182eda7a668d0ad99c66a5f1e0435010b
  gitTreeState: clean
  gitVersion: 4.19.0-202412170739.p0.g27a04ae.assembly.stream.el9-27a04ae
  goVersion: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime
  major: ""
  minor: ""
  platform: linux/amd64
    

How reproducible:

     Always
    

Steps to Reproduce:

    1.  Install OCP4.19 cluster via oc-mirror 4.19
    2.
    3.
    

Actual results:

     Users see the error as described in the Description
    

Expected results:

    Installation should be successful.
    

Additional info:

     More details in jira https://issues.redhat.com/browse/OCPQE-27853
      More details in thread https://redhat-internal.slack.com/archives/C050P27C71S/p1735550241970219
    

Currently the location of the cache directory can be set via the environment variable `OC_MIRROR_CACHE`. The only problem is that the env var is not easily discoverable by users. It would be better if we had a command-line option (e.g. `-cache-dir <dir>`) that is discoverable via `-help`.
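For illustration, the current env-var approach versus the proposed flag could look like this (a sketch; the paths, config file, and registry are placeholders, and the flag does not exist yet):

# current: cache location controlled by the environment variable
OC_MIRROR_CACHE=/var/cache/oc-mirror oc-mirror -c imageset-config.yaml --workspace file://ws docker://registry.example.com/mirror --v2

# proposed: the same behaviour via a discoverable option, e.g.
# oc-mirror -c imageset-config.yaml --workspace file://ws docker://registry.example.com/mirror --v2 -cache-dir /var/cache/oc-mirror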

Description of problem:

For the same image, v1 creates more than one tag while v2 creates only one; when the delete is executed, some tags created by v1 may remain.

Version-Release number of selected component (if applicable):

./oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-309-g63a5556", GitCommit:"63a5556a", GitTreeState:"clean", BuildDate:"2024-10-23T02:42:55Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. mirror catalog package for v1:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
      packages:
        - name: sandboxed-containers-operator

 `oc-mirror -c config-catalog-v1.yaml docker://localhost:5000/catalog --dest-use-http`

2. mirror same package for v2:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
      packages:
        - name: sandboxed-containers-operator

 `oc-mirror -c config-catalog-v2.yaml --workspace file://ws docker://localhost:5000/catalog  --v2 --dest-tls-verify=false`

3. generate the delete image list with config :

kind: DeleteImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
delete:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
      packages:
        - name: sandboxed-containers-operator  


  
`oc-mirror delete -c config-d-catalog-v2.yaml --workspace file://ws docker://localhost:5000/catalog  --v2 --dest-tls-verify=false --generate`

 4. Execute the delete action: 

`oc-mirror delete  --delete-yaml-file  ws/working-dir/delete/delete-images.yaml docker://localhost:5000/catalog  --v2 --dest-tls-verify=false ` 

Actual results:

1. v1 has more than 1 tags:
[fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    81  100    81    0     0  24382      0 --:--:-- --:--:-- --:--:-- 27000
{
    "name": "catalog/openshift4/ose-kube-rbac-proxy",
    "tags": [
        "cb9a8d8a",
        "d07492b2"
    ]
}

2. v2 only has 1 tag:
[fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   148  100   148    0     0  60408      0 --:--:-- --:--:-- --:--:-- 74000
{
    "name": "catalog/openshift4/ose-kube-rbac-proxy",
    "tags": [
        "cb9a8d8a",
        "f6c37678f1eb3279e603f63d2a821b72394c52d25c2ed5058dc29d4caa15d504",
        "d07492b2"
    ]
}

4. After the delete, we can see that tags still remain:
[fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    70  100    70    0     0  28056      0 --:--:-- --:--:-- --:--:-- 35000
{
    "name": "catalog/openshift4/ose-kube-rbac-proxy",
    "tags": [
        "d07492b2"
    ]
}

Expected results:

4. All the tags for the same image should be deleted.

Additional info:

Component Readiness has found a potential regression in the following test:

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router converges when multiple routers are writing conflicting status [Suite:openshift/conformance/parallel]

Significant regression detected.
Fishers Exact probability of a regression: 99.98%.
Test pass rate dropped from 99.51% to 93.75%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-29T00:00:00Z
End Time: 2024-11-05T23:59:59Z
Success Rate: 93.75%
Successes: 40
Failures: 3
Flakes: 5

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.51%
Successes: 197
Failures: 1
Flakes: 6

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&LayeredProduct=none&Network=ovn&NetworkAccess=default&Platform=vsphere&Procedure=none&Scheduler=default&SecurityMode=default&Suite=unknown&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Router&columnGroupBy=Architecture%2CNetwork%2CPlatform%2CTopology&component=Networking%20%2F%20router&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20vsphere%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Network%3Aovn&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&includeVariant=Topology%3Amicroshift&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-11-05%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-29%2000%3A00%3A00&testId=openshift-tests%3Aa6ab2b14071c8929c9c8bd3205a62482&testName=%5Bsig-network%5D%5BFeature%3ARouter%5D%5Bapigroup%3Aroute.openshift.io%5D%20The%20HAProxy%20router%20converges%20when%20multiple%20routers%20are%20writing%20conflicting%20status%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

Description of problem:

After the upgrade to OpenShift Container Platform 4.17, aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics is observed reporting a target down state. Checking the newly created container shows the logs below, which may explain the reported effect.

$ oc logs aws-efs-csi-driver-controller-5b8d5cfdf4-zwh67 -c kube-rbac-proxy-8211
W1119 07:53:10.249934       1 deprecated.go:66] 
==== Removed Flag Warning ======================

logtostderr is removed in the k8s upstream and has no effect any more.

===============================================
		
I1119 07:53:10.250382       1 kube-rbac-proxy.go:233] Valid token audiences: 
I1119 07:53:10.250431       1 kube-rbac-proxy.go:347] Reading certificate files
I1119 07:53:10.250645       1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9211
I1119 07:53:10.250944       1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9211
I1119 07:54:01.440714       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:19.860038       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:31.432943       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:49.852801       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:01.433635       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:19.853259       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:31.432722       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:49.852606       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:01.432707       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:19.853137       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:31.440223       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:49.856349       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:01.432528       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:19.853132       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:31.433104       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:49.852859       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:01.433321       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:19.853612       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.17

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.17
2. Install aws-efs-csi-driver-operator
3. Create efs.csi.aws.com CSIDriver object and wait for the aws-efs-csi-driver-controller to roll out.

Actual results:

The below Target Down Alert is being raised

50% of the aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics targets in Namespace openshift-cluster-csi-drivers namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

Expected results:

The ServiceMonitor endpoint should be reachable and properly responding with the desired information to monitor the health of the component.

Additional info:


Description of problem:
We are getting the below error on OCP 4.16 while creating an EgressFirewall with an uppercase dnsName:
~~~
* spec.egress[4].to.dnsName: Invalid value: "TESTURL.infra.example.com": spec.egress[4].to.dnsName in body should match '^(\*\.)?([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)+[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.?$'
~~~
    When I check the code
    https://github.com/openshift/ovn-kubernetes/blob/release-4.15/go-controller/pkg/crd/egressfirewall/v1/types.go#L80-L82
    types.go
    ~~~
    // dnsName is the domain name to allow/deny traffic to. If this is set, cidrSelector and nodeSelector must be unset.
    // +kubebuilder:validation:Pattern=^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$
    DNSName string `json:"dnsName,omitempty"`
    ~~~
    https://github.com/openshift/ovn-kubernetes/blob/release-4.16/go-controller/pkg/crd/egressfirewall/v1/types.go#L80-L85
    types.go
    ~~~
    // dnsName is the domain name to allow/deny traffic to. If this is set, cidrSelector and nodeSelector must be unset.
    // For a wildcard DNS name, the '*' will match only one label. Additionally, only a single '*' can be
    // used at the beginning of the wildcard DNS name. For example, '*.example.com' will match 'sub1.example.com'
    // but won't match 'sub2.sub1.example.com'.
    // +kubebuilder:validation:Pattern=`^(\*\.)?([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`
    DNSName string `json:"dnsName,omitempty"`
    ~~~

The code looks like it supports upper case.
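To see the mismatch concretely, the two patterns (as reconstructed above) can be checked against the uppercase name with a quick shell test; this is an illustration, not output from the cluster:

# requires GNU grep with -P (PCRE) support
name='TESTURL.infra.example.com'
# types.go pattern (mixed case allowed): matches
echo "$name" | grep -Pq '^(\*\.)?([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$' && echo "types.go pattern: match"
# pattern from the admission error (lowercase labels only): does not match
echo "$name" | grep -Pq '^(\*\.)?([a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.)+[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?\.?$' || echo "CRD pattern: no match"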
Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Deploy the cluster with OCP 4.16.x

2. Create an EgressFirewall with an uppercase dnsName:
~~~
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: TEST.redhat.com
    ports:
    - port: 80
      protocol: TCP
    - port: 443
      protocol: TCP
~~~

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
  • Don't presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

When the cluster is created with a secure proxy enabled and the certificate is set in configuration.proxy.trustCA, the cluster fails to complete provisioning.

Version-Release number of selected component (if applicable):

    4.19.0-0.nightly-2024-12-10-040415

How reproducible:

    always

Steps to Reproduce:

    1. Create a cluster with a secure proxy; the certificate is set in .spec.configuration.proxy.trustCA.
    

Actual results:

    cluster does not complete provisioning

Expected results:

    cluster completes provisioning

Additional info:

    root cause is that the certificate in additionalTrustBundle isn't propagated into the ingress proxy.
slack:
https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1734047816669079?thread_ts=1734023627.636019&cid=G01QS0P2F6W


Description of problem:

    SecretProviderClass Doesn't Allow Object Encoding

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    SecretProviderClass Doesn't Allow Object Encoding Per CP MI

Expected results:

    SecretProviderClass Allows Object Encoding Per CP MI

Additional info:

    

Description of problem:

    I get this build warning when building:

warning Pattern ["asn1js@latest"] is trying to unpack in the same destination "~/.cache/yarn/v6/npm-asn1js-2.0.26-0a6d435000f556a96c6012969d9704d981b71251-integrity/node_modules/asn1js" as pattern ["asn1js@^2.0.26"]. This could result in non-deterministic behavior, skipping.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Run ./clean-frontend && ./build-frontend.sh
    2. Observe build output
    3.
    

Actual results:

    I get the warning

Expected results:

    No warning

Additional info:

    

Description of problem:

  For the vSphere multi-NIC setting, there should be validation that allows a vSphere network count of 1 to 10, but the installer did not print any error when 11 NICs were set in the install-config (see the sketch below).
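A hypothetical install-config excerpt that should be rejected (network names are placeholders and other required topology fields are omitted):

platform:
  vsphere:
    failureDomains:
    - name: fd-1
      topology:
        networks:          # valid range should be 1-10 entries
        - segment-01
        - segment-02
        - segment-03
        # ...continuing up to 11 entries, which the installer currently accepts without a validation error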

Version-Release number of selected component (if applicable):

   4.18.0-0.nightly-2025-01-25-163410

How reproducible:

    setting 11 nics under install-config.platform.vsphere.failureDomains.topology.networks and continue installation

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    installation continues and then fails

Expected results:

    The installer should print a validation error.

Additional info:

    

Moving the bug https://bugzilla.redhat.com/show_bug.cgi?id=1802553 to this Jira card

We had a customer that triggered an upgrade from 4.1.27 to 4.3 while intermediate 4.2 versions were in a Partial state. We have asked the customer for CVO details to better understand the procedure taken, but we might need to implement a way to either stop the upgrade when the customer makes a mistake, or block the upgrade if the customer changes the channel in the console to a version that the upgrade path does not support, as in this case.

As per https://github.com/openshift/enhancements/blob/master/enhancements/update/eus-upgrades-mvp.md#ota---inhibit-minor-version-upgrades-when-an-upgrade-is-in-progress

OTA - Inhibit minor version upgrades when an upgrade is in progress

We should inhibit minor version upgrades via Upgradeable=False whenever an existing upgrade is in progress. This prevents retargetting of upgrades before we've reached a safe point.

Imagine:

Be running 4.6.z.
Request an update to 4.7.z'.
CVO begins updating to 4.7.z'.
CVO requests recommended updates from 4.7.z', and hears about 4.8.z".
User accepts recommended update to 4.8.z" before the 4.7.z' OLM operator had come out to check its children's max versions against 4.8 and set Upgradeable=False.
Cluster core hits 4.8.z" and some OLM operators fail on compat violations.

This should not inhibit further z-stream upgrades, but we should be sure that we catch the case of 4.6.z to 4.7.z to 4.7.z+n to 4.8.z whenever 4.7.z was not marked as Complete.

 

Update:

Eventually, the output of this card:

  • y-then-y: blocked (originally required in this card description).
  • z-then-y: blocked (not originally required in this card description but during the implementation we think it is good to have).
  • y-then-z or z-then-z: accepted because we need to always allow z-upgrade to include fixes.
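For reference, an administrator could observe the inhibition through the ClusterVersion Upgradeable condition while an update is in progress (an illustrative command, not part of the original card):

oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'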

This issue tracks updating k8s and related OpenShift APIs to a recent version, to keep in line with other MAPI providers.

The python3-inotify package is installed as a dependency, but it should be explicitly mentioned in the packages list to avoid issues like okd-project/okd#2113

Description of problem:

Filter 'Name' on resource list page doesn't align well when language is set to Chinese/Japanese/Korean
    

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-02-26-172353
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Set the language to Chinese, Japanese or Korean. Check the 'Name' column filter on a resource list page, e.g. the pods list page.
    2.
    3.
    

Actual results:

1. Filter 'Name' on resource list page doesn't align well.
screenshots: https://drive.google.com/drive/folders/1w1-sXSj3RdIWIVbAGl8q0hZaHuFHHCJM?usp=drive_link
    

Expected results:

1. Should align well and keep consistent with other filter input.
    

Additional info:


    

Description of problem:

Tracking per-operator fixes for the following related issues static pod node, installer, and revision controllers:

https://issues.redhat.com/browse/OCPBUGS-45924
https://issues.redhat.com/browse/OCPBUGS-46372
https://issues.redhat.com/browse/OCPBUGS-48276

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that blip Degraded=True during any payload job run.

This card captures machine-config operator that blips Degraded=True during upgrade runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843023092004163584   

Reasons associated with the blip: RenderConfigFailed   

For now, we put an exception in the test, but it is expected that teams take action to fix these issues and remove the exceptions after the fixes go in.

Exceptions are defined here: 


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

https://redhat-internal.slack.com/archives/C01CQA76KMX/p1734698666158289?thread_ts=1734688838.123979&cid=C01CQA76KMX

 

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following tests:

install should succeed: infrastructure
install should succeed: overall

Significant regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 99.24% to 89.63%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-12-13T00:00:00Z
End Time: 2024-12-20T12:00:00Z
Success Rate: 89.63%
Successes: 120
Failures: 17
Flakes: 27

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.24%
Successes: 939
Failures: 8
Flakes: 99

View the test details report for additional context.

Description of problem:

Tracking per-operator fixes for the following related issues static pod node, installer, and revision controllers:

https://issues.redhat.com/browse/OCPBUGS-45924
https://issues.redhat.com/browse/OCPBUGS-46372
https://issues.redhat.com/browse/OCPBUGS-48276

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When a new configuration is picked up by the authentication operator, it rolls out a new revision of oauth-server pods. However, since the pods define `terminationGracePeriodSeconds`, the old-revision pods would still be running even after the oauth-server deployment reports that all pods have been updated to the newest revision, which triggers the authentication operator to report Progressing=false.

The above behavior might cause tests (and possibly other observers) that expect the newer revision to still be routed to the old pods, causing confusion.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Trigger new oauth-server rollout
2. Observe the authentication operator reporting Progressing while also watching the number of pods in the openshift-authentication namespace

Actual results:

CAO reports Progressing=false even though there are more than the expected number of pods running.

Expected results:

CAO waits to report Progressing=false until only the new revision of pods is running in the openshift-authentication namespace.

Additional info:

-

Description of problem:

OLMv1 is being GA'd with OCP 4.18, together with OLMv0. The long(ish)-term plan right now is for OLM v0 and v1 to be able to coexist on a cluster. However, their access to installable extensions is through different resources. v0 uses CatalogSource and v1 uses ClusterCatalog. We expect to see catalog content begin to diverge at some point, but don't have a specific timeline for it yet.

oc-mirror v2 should generate ClusterCatalog YAML along with CatalogSource YAML.

We will also work with the docs team to document that the ClusterCatalog YAML only needs to be applied when managing Operator catalogs with OLM v1.
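For reference, a minimal sketch of the two resources side by side; the field layout reflects the CatalogSource and GA ClusterCatalog APIs as understood here, and the mirror registry reference is a placeholder rather than real oc-mirror output:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource            # consumed by OLM v0
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: mirror.example.com/redhat/redhat-operator-index:v4.18
---
apiVersion: olm.operatorframework.io/v1
kind: ClusterCatalog           # consumed by OLM v1
metadata:
  name: redhat-operator-index
spec:
  source:
    type: Image
    image:
      ref: mirror.example.com/redhat/redhat-operator-index:v4.18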

Version-Release number of selected component (if applicable):

    4.18+

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    ClusterCatalog resource is generated for operators.

Additional info:

    OLMv1 currently only works for a small subset of operators in the catalog.

Description of problem:

The below error occurs while creating a Persistent Volume using LUN IDs:

 

Error "Invalid value: 300: must be between 0 and 255, inclusive" for field "spec.fc.lun".

Actual results:

Persistent volume is not creating.

Expected results:

The Persistent Volume should be created when using Fibre Channel.

The "oc adm pod-network" command for working with openshift-sdn multitenant mode is now totally useless in OCP 4.17 and newer clusters (since it's only useful with openshift-sdn, and openshift-sdn no longer exists as of OCP 4.17). Of course, people might use a new oc binary to talk to an older cluster, but probably the built-in documentation should make it clearer that this is not a command that will be useful to 99% of users.

If it's possible to make "pod-network" not show up as a subcommand in "oc adm -h" that would probably be good. If not, it should probably have a description like "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]", and likewise, the longer descriptions of the pod-network subcommands should talk about "OpenShift-SDN Multitenant mode" rather than "the redhat/openshift-ovs-multitenant network plugin" (which is OCP 3 terminology), and maybe should explicitly say something like "this has no effect when using the default OpenShift Networking plugin (OVN-Kubernetes)".

Description of problem:

 

The e2e-ibmcloud-operator presubmit job for the cluster-ingress-operator repo introduced in https://github.com/openshift/release/pull/56785  always fails due to DNS. Note that this job has `always_run: false` and `optional: true` so it requires calling /test e2e-ibmcloud-operator on a PR to make it appear. These failures are not blocking any PRs from merging. Example failure.

The issue is that IBM Cloud has DNS propagation issues, similar to the AWS DNS issues (OCPBUGS-14966), except:

  1. There isn't a way to adjust the IBMCloud DNS SOA TTL because IBMCloud DNS is managed by a 3rd party (cloudflare I think, slack ref).
  2. Our AWS E2E tests run on AWS test runner clusters, whereas our IBM Cloud E2E tests run on the same AWS test runner clusters (DNS resolution isn't as reliable in the AWS test runner cluster for IBM Cloud DNS names)

The PR https://github.com/openshift/cluster-ingress-operator/pull/1164 was an attempt at fixing the issue by both resolving the DNS name inside of the cluster and allowing for a couple-minute "warmup" interval to avoid negative caching. I found (via https://github.com/openshift/cluster-ingress-operator/pull/1132) that the SOA TTL is ~30 minutes, so if you trigger negative caching you will have to wait 30 minutes for the IBM DNS resolver to refresh the DNS name.

However, I found that if you wait ~7 minutes for the DNS record to propagate and don't query the DNS name, it will work after that 7 minute wait (I call it the "warmup" period).

The tests affected are any tests that use a DNS name (wildcard or load balancer record):

  • TestManagedDNSToUnmanagedDNSIngressController
  • TestUnmanagedDNSToManagedDNSIngressController
  • TestUnmanagedDNSToManagedDNSInternalIngressController
  • TestConnectTimeout

The two paths I can think of are:

  1. Continue https://github.com/openshift/cluster-ingress-operator/pull/1164 and adjust the warm up time to 7+ minutes
  2. Or just skip these tests for IBM Cloud (admit we can't use IBMCloud DNS records in testing)

Version-Release number of selected component (if applicable):

4.19    

How reproducible:

90-100%    

Steps to Reproduce:

    1. Run /test e2e-ibmcloud-operator

Actual results:

    Tests are flaky

Expected results:

    Tests should work reliably

Additional info:

    

Description of problem:

The following error is returned in IPI baremetal installs with recent 4.18 builds. In the bootstrap VM, HTTPS is not configured on port 6180, which is used in the boot ISO URL.

openshift-master-1: inspection error: Failed to inspect hardware. Reason: unable to start inspection: HTTP POST https://[2620:52:0:834::f1]:8000/redfish/v1/Managers/7fffdce9-ff4a-4e6a-b598-381c58564ca5/VirtualMedia/Cd/Actions/VirtualMedia.InsertMedia returned code 500. Base.1.0.GeneralError: Failed fetching image from URL https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6180/redfish/boot-93d79ad0-0d56-4c8f-a299-6dc1b3f40e74.iso: HTTPSConnectionPool(host='2620:52:0:834:f112:3cff:fe47:3a0a', port=6180): Max retries exceeded with url: /redfish/boot-93d79ad0-0d56-4c8f-a299-6dc1b3f40e74.iso (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)'))) Extended information: [{'@odata.type': '/redfish/v1/$metadata#Message.1.0.0.Message', 'MessageId': 'Base.1.0.GeneralError'}]"
  

Version-Release number of selected component (if applicable):

 4.18 ec.4, 4.18.0-0.nightly-2024-11-27-162407

How reproducible:

    100%

Steps to Reproduce:

    1. trigger ipi baremetal install with dual stack config using virtual media
    2. 
    3.
    

Actual results:

    inspection fails at fetching boot iso 

Expected results:

    

Additional info:

# port 6180 used in ironic ipv6 url is not configured for https. Instead, ssl service is running 
# at https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6183. 
# May be introduced by OCPBUGS-39404.

[root@api core]# cat /etc/metal3.env 
AUTH_DIR=/opt/metal3/auth
IRONIC_ENDPOINT="http://bootstrap-user:pJ0R9XXsxUfoYVK2@localhost:6385/v1"
IRONIC_EXTERNAL_URL_V6="https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6180/"
METAL3_BAREMETAL_OPERATOR_IMAGE="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e142d5989415da3c1035d04f84fa765c127bf2cf3406c4612e36607bb03384d9"  


[root@api core]# echo "" | openssl s_client -connect localhost:6180
CONNECTED(00000003)
405CE187787F0000:error:0A00010B:SSL routines:ssl3_get_record:wrong version number:ssl/record/ssl3_record.c:354:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 5 bytes and written 295 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

Description of problem:

When creating a new Shipwright Build via the form (ODC-7595), the form shows all the params available on the build strategy. 

 

Expected results:

The parameters should be hidden behind an "Advanced" button.

Additional info:

Except for the following parameters, the rest of the parameters for each build strategy should be hidden and moved to the advanced section.

  • source-to-image: builder-image
  • buildah: no-params

Description of problem:

Checked on 4.18.0-0.nightly-2024-12-07-130635/4.19.0-0.nightly-2024-12-07-115816: in the admin console, go to an alert details page; "No datapoints found." is shown on the alert details graph. See the picture for the CannotRetrieveUpdates alert: https://drive.google.com/file/d/1RJCxUZg7Z8uQaekt39ux1jQH_kW9KYXd/view?usp=drive_link

The issue exists in 4.18+; there is no such issue with 4.17.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always on 4.18+    

Steps to Reproduce:

1. see the description
    

Actual results:

"No datapoints found." on alert details graph

Expected results:

show correct graph

Description of problem:

As a user, when we attempt to "ReRun" a resolver-based PipelineRun from the OpenShift Console, the UI errors with the message "Invalid PipelineRun configuration, unable to start Pipeline." Slack thread: https://redhat-internal.slack.com/archives/CG5GV6CJD/p1730876734675309

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

1. Create a resolver based pipelinerun
2. Attempt to "ReRun" the same from Console     

Actual results:

    ReRun Errors

Expected results:

    ReRun should be triggered successfully

Additional info:

    

Description of problem:

The alertmanager-user-workload Service Account has "automountServiceAccountToken: true"    

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Always

Steps to Reproduce:

    1. Enable Alertmanager for user-defined monitoring.
    2. oc get sa -n openshift-user-workload-monitoring alertmanager-user-workload -o yaml
    3.
    

Actual results:

    Has "automountServiceAccountToken: true"

Expected results:

    Has "automountServiceAccountToken: false" or no mention of automountServiceAccountToken.

Additional info:

    It is recommended to not enable token automount for service accounts in general.
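For illustration, the expected ServiceAccount shape would be (a sketch showing only the relevant field):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: alertmanager-user-workload
  namespace: openshift-user-workload-monitoring
automountServiceAccountToken: false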

Description of problem:

 The notification drawer close button throws a runtime error when clicked.   

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    Always

Steps to Reproduce:

    1. Click the notification drawer icon in the masthead
    2. Click the close button in the notification drawer
    3.
    

Actual results:

A runtime error is thrown    

Expected results:

The notification drawer closes

Additional info:

This was a regression introduced by https://github.com/openshift/console/pull/14680

When deploying with an endpoint override for the resourceController, the Power VS machine API provider will ignore the override.

Description of problem:

On the route create form/YAML, the Path field is set to "/" by default; when creating a passthrough route, an error is prompted about "spec.path" not being supported, and the user needs to clear this field.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-13-193731
    

How reproducible:


    

Steps to Reproduce:

    1. Go to the Networking -> Routes page, click "Create Route", and check the Path field.
    2.Choose Passthrough type for insecure set and click "Create" button.
    3.
    

Actual results:

1. It has "/" set by default.
2. Error info shows up: 
Error "Invalid value: "/": passthrough termination does not support paths" for field "spec.path".
    

Expected results:

1. The Path field should be empty by default.
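For reference, a passthrough route manifest that passes validation omits spec.path entirely (a minimal sketch; names are placeholders):

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example
spec:
  to:
    kind: Service
    name: example
  tls:
    termination: passthrough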

    

Additional info:


    

Managed services marks a couple of nodes as "infra" so user workloads don't get scheduled on them.  However, platform daemonsets like iptables-alerter should run there – and the typical toleration for that purpose should be:

 tolerations:
- operator: Exists

instead the toleration is

tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule" 

 

Examples from other platform DS:

 

$ for ns in openshift-cluster-csi-drivers openshift-cluster-node-tuning-operator openshift-dns openshift-image-registry openshift-machine-config-operator openshift-monitoring openshift-multus openshift-multus openshift-multus openshift-network-diagnostics openshift-network-operator openshift-ovn-kubernetes openshift-security; do echo "NS: $ns"; oc get ds -o json -n $ns|jq '.items.[0].spec.template.spec.tolerations'; done
NS: openshift-cluster-csi-drivers
[
  {
    "operator": "Exists"
  }
]
NS: openshift-cluster-node-tuning-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-dns
[
  {
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-image-registry
[
  {
    "operator": "Exists"
  }
]
NS: openshift-machine-config-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-monitoring
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-diagnostics
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-operator
[
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-ovn-kubernetes
[
  {
    "operator": "Exists"
  }
]
NS: openshift-security
[
  {
    "operator": "Exists"
  }
] 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Requests allow up to 30s for etcd to respond.  Readiness probes only allow 9s for etcd to respond.  When etcd latency is between 10-30s, standard requests will succeed, but due to the readiness probe configuration we lose every apiserver endpoint at the same time.  This requires correction in the pod definitions and the load balancers.  Making the ongoing readiness check `readyz?exclude=etcd` should correct the issue.
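For illustration, the changed readiness check would look roughly like this on an apiserver pod (a sketch under assumptions: the exact probe stanza, port, and thresholds differ per component and are managed by the respective operators):

readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: /readyz?exclude=etcd
  periodSeconds: 5
  failureThreshold: 3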

 

Off the top of my head this will include

  1. kube-apiserver operator
  2. authentication operator
  3. openshift-apiserver operator
  4. MCO apiserver-watch
  5. metal LB
  6. https://github.com/multi-arch/ocp-remote-ci/pull/39
  7. where LBs are defined for aws, azure, and gcp

 

This is a low cost, low risk, high benefit change.

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The UDN Details page shows a "group" of attributes called "Layer configuration". That does not really add any benefit and the name is just confusing. Let's just remove the grouping and keep the attributes flat.

Version-Release number of selected component (if applicable):

rc.4

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

The tests below fail on an IPv6-primary dual-stack cluster because the router deployed for the tests is not prepared for dual stack:

[sig-network][Feature:Router][apigroup:image.openshift.io] The HAProxy router should serve a route that points to two services and respect weights [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should respond with 503 to unrecognized hosts [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should serve routes that were created from an ingress [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io][apigroup:operator.openshift.io] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host for overridden domains with a custom value [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host with a custom value [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should run even if it has no access to update status [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

That is confirmed by accessing the router pod and checking connectivity locally:

sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://127.0.0.1/Letter"                      
200
sh-4.4$ echo $?
0
   
sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://fd01:0:0:5::551/Letter" 
000
sh-4.4$ echo $?
3             
sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://[fd01:0:0:5::551]/Letter"
000
sh-4.4$ echo $?
7    

The default router deployed in the cluster supports dual stack. Hence it is possible, and required, to update the router image configuration used in the tests so that it answers on both IPv4 and IPv6.

Version-Release number of selected component (if applicable): https://github.com/openshift/origin/tree/release-4.15/test/extended/router/
How reproducible: Always.
Steps to Reproduce: Run the tests in ipv6primary dualstack cluster.
Actual results: Tests failing as below:

    <*errors.errorString | 0xc001eec080>:
    last response from server was not 200:
{
    s: "last response from server was not 200:\n",
}
occurred
Ginkgo exit error 1: exit with code 1
    

Expected results: Test passing

Description of problem:

    After OCPBUGS-13726, Hypershift honors ImageConfig provided by the user in the HostedCluster.
Providing both allowedRegistries and blockedRegistries is forbidden https://github.com/openshift/api/blob/1e963d8dc4663f4f004f44fd58459381a771bdb5/config/v1/types_image.go#L126
If we do that in HyperShift, it will block any NodePool creation but no error is visible in the HostedCluster, so it is not easy to identify the error.
The error should instead be visible in the existing HC condition ValidHostedControlPlaneConfiguration
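For comparison, a configuration that passes validation sets only one of the two lists, e.g. (same placeholder values as the reproducer below):

oc patch hc -n $HC_NS $HC_NAME --type=merge \
  -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["docker.io"]}}}}}'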

Version-Release number of selected component (if applicable):

    4.15.6

How reproducible:

    Always

Steps to Reproduce:

    1. Create HC
    2. oc patch hc -n $HC_NS $HC_NAME -p '{"spec":{"configuration":{"image":{"registrySources":{"allowedRegistries":["docker.io"], "blockedRegistries":["test.io"]}}}}}' --type=merge
     3. New node pools do not come up and the following condition is visible in the NodePool:
    

 

  - lastTransitionTime: "2024-04-11T12:49:32Z"
    message: 'Failed to generate payload: error getting ignition payload: failed to
      execute machine-config-controller: machine-config-controller process failed:
      exit status 255'
    observedGeneration: 1
    reason: InvalidConfig
    status: ""
    type: ValidGeneratedPayload

 

Actual results:

    The HC condition reports success:

  - lastTransitionTime: "2024-04-11T08:59:01Z"
    message: Configuration passes validation
    observedGeneration: 4
    reason: AsExpected
    status: "True"
    type: ValidHostedControlPlaneConfiguration

Expected results:

    The above HC condition should be failed.

Additional info:

    

Description of problem:

The IBM Cloud CAPI provider has a metrics-bind-addr argument which, it appears, is being dropped in a future cluster-api release.

Drop the argument now to account for when this argument is no longer available.

Version-Release number of selected component (if applicable):

4.19

How reproducible:

100%

Steps to Reproduce:

    1. Update IBM Cloud CAPI and Cluster-API components (Cluster-API to 1.9.3)
    2. Attempt to create cluster for IBM Cloud, using CAPI support

Actual results:

time="2025-01-24T20:50:00Z" level=info msg="Running process: ibmcloud infrastructure provider with args [--provider-id-fmt=v2 -v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:35849 --leader-elect=false --webhook-port=44443 --webhook-cert-dir=/tmp/envtest-serving-certs-3963032230 --namespace=openshift-cluster-api-guests --kubeconfig=/root/cluster-deploys/capi-drop-v1-1/.clusterapi_output/envtest.kubeconfig]"
time="2025-01-24T20:50:00Z" level=debug msg="unknown flag: --metrics-bind-addr"
time="2025-01-24T20:50:00Z" level=debug msg="Usage of cluster-deploys/capi-drop-v1-1/cluster-api/cluster-api-provider-ibmcloud:"
time="2025-01-24T20:50:00Z" level=debug msg="      --add_dir_header                       If true, adds the file directory to the header of the log messages"
time="2025-01-24T20:50:00Z" level=debug msg="      --alsologtostderr                      log to standard error as well as files (no effect when -logtostderr=true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --diagnostics-address string           The address the diagnostics endpoint binds to. Per default metrics are served via https and withauthentication/authorization. To serve via http and without authentication/authorization set --insecure-diagnostics. If --insecure-diagnostics is not set the diagnostics endpoint also serves pprof endpoints and an endpoint to change the log level. (default \":8443\")"
time="2025-01-24T20:50:00Z" level=debug msg="      --health-addr string                   The address the health endpoint binds to. (default \":9440\")"
time="2025-01-24T20:50:00Z" level=debug msg="      --insecure-diagnostics                 Enable insecure diagnostics serving. For more details see the description of --diagnostics-address."
time="2025-01-24T20:50:00Z" level=debug msg="      --kubeconfig string                    Paths to a kubeconfig. Only required if out-of-cluster."
time="2025-01-24T20:50:00Z" level=debug msg="      --leader-elect                         Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager."
time="2025-01-24T20:50:00Z" level=debug msg="      --log-flush-frequency duration         Maximum number of seconds between log flushes (default 5s)"
time="2025-01-24T20:50:00Z" level=debug msg="      --log-json-info-buffer-size quantity   [Alpha] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this."
time="2025-01-24T20:50:00Z" level=debug msg="      --log-json-split-stream                [Alpha] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this."
time="2025-01-24T20:50:00Z" level=debug msg="      --log-text-info-buffer-size quantity   [Alpha] In text format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this."
time="2025-01-24T20:50:00Z" level=debug msg="      --log-text-split-stream                [Alpha] In text format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this."
time="2025-01-24T20:50:00Z" level=debug msg="      --log_backtrace_at traceLocation       when logging hits line file:N, emit a stack trace (default :0)"
time="2025-01-24T20:50:00Z" level=debug msg="      --log_dir string                       If non-empty, write log files in this directory (no effect when -logtostderr=true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --log_file string                      If non-empty, use this log file (no effect when -logtostderr=true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --log_file_max_size uint               Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)"
time="2025-01-24T20:50:00Z" level=debug msg="      --logging-format string                Sets the log format. Permitted formats: \"json\" (gated by LoggingBetaOptions), \"text\". (default \"text\")"
time="2025-01-24T20:50:00Z" level=debug msg="      --logtostderr                          log to standard error instead of files (default true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --namespace string                     Namespace that the controller watches to reconcile cluster-api objects. If unspecified, the controller watches for cluster-api objects across all namespaces."
time="2025-01-24T20:50:00Z" level=debug msg="      --one_output                           If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --provider-id-fmt string               ProviderID format is used set the Provider ID format for Machine (default \"v2\")"
time="2025-01-24T20:50:00Z" level=debug msg="      --service-endpoint string              Set custom service endpoint in semi-colon separated format: ${ServiceRegion1}:${ServiceID1}=${URL1},${ServiceID2}=${URL2};${ServiceRegion2}:${ServiceID1}=${URL1}"
time="2025-01-24T20:50:00Z" level=debug msg="      --skip_headers                         If true, avoid header prefixes in the log messages"
time="2025-01-24T20:50:00Z" level=debug msg="      --skip_log_headers                     If true, avoid headers when opening log files (no effect when -logtostderr=true)"
time="2025-01-24T20:50:00Z" level=debug msg="      --stderrthreshold severity             logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=true) (default 2)"
time="2025-01-24T20:50:00Z" level=debug msg="      --sync-period duration                 The minimum interval at which watched resources are reconciled. (default 10m0s)"
time="2025-01-24T20:50:00Z" level=debug msg="      --tls-cipher-suites strings            Comma-separated list of cipher suites for the webhook server and metrics server (the latter only if --insecure-diagnostics is not set to true). If omitted, the default Go cipher suites will be used. "
time="2025-01-24T20:50:00Z" level=debug msg="                                             Preferred values: TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256. "
time="2025-01-24T20:50:00Z" level=debug msg="                                             Insecure values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA, TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_3DES_EDE_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_RC4_128_SHA."
time="2025-01-24T20:50:00Z" level=debug msg="      --tls-min-version string               The minimum TLS version in use by the webhook server and metrics server (the latter only if --insecure-diagnostics is not set to true)."
time="2025-01-24T20:50:00Z" level=debug msg="                                             Possible values are VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13. (default \"VersionTLS12\")"
time="2025-01-24T20:50:00Z" level=debug msg="  -v, --v Level                              number for the log level verbosity (default 0)"
time="2025-01-24T20:50:00Z" level=debug msg="      --vmodule pattern=N,...                comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)"
time="2025-01-24T20:50:00Z" level=debug msg="      --webhook-cert-dir string              The webhook certificate directory, where the server should find the TLS certificate and key. (default \"/tmp/k8s-webhook-server/serving-certs/\")"
time="2025-01-24T20:50:00Z" level=debug msg="      --webhook-port int                     The webhook server port the manager will listen on. (default 9443)"
time="2025-01-24T20:50:00Z" level=debug msg="unknown flag: --metrics-bind-addr"

Expected results:

Successful cluster deployment using updated IBM Cloud CAPI

Additional info:

IBM Cloud CAPI needs to be updated to resolve known bugs; doing so requires updating Cluster-API to 1.9.3:
https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/commit/ffa141c7edd53a58396403b7c0e3995c91580161

However, this appears to break the metrics-bind-addr argument that the installer currently passes to IBM Cloud CAPI within IPI.

IBM Cloud is working on a patch to drop that argument, which will allow updating IBM Cloud CAPI to a newer level for bug fixes.
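
For illustration, a hedged sketch of the same provider invocation from the log above with only the removed flag dropped; the remaining flags are taken verbatim from the failing invocation, and the paths are shortened placeholders:

```
cluster-api-provider-ibmcloud \
  --provider-id-fmt=v2 -v=2 \
  --health-addr=127.0.0.1:35849 \
  --leader-elect=false \
  --webhook-port=44443 \
  --webhook-cert-dir=/tmp/envtest-serving-certs \
  --namespace=openshift-cluster-api-guests \
  --kubeconfig=./.clusterapi_output/envtest.kubeconfig
```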

Description of problem:

    When overriding the VPC endpoint in a PowerVS deployment, the VPC endpoint override is ignored by MAPI

Version-Release number of selected component (if applicable):

    

How reproducible:

    easily

Steps to Reproduce:

    1. Deploy a disconnected cluster
    2. The network operator will fail to come up
    3.
    

Actual results:

    Deploy fails and endpoint is ignored

Expected results:

    Deploy should succeed with endpoint honored

Additional info:

    

Description of problem:

A user without CSR read permission cannot view the Nodes page in the OpenShift console.

Version-Release number of selected component (if applicable):

    4.16.30

How reproducible:

    Always

Steps to Reproduce:

    1. Create a 4.16.30 ROSA/OSD cluster
    2. Add a user to the dedicated-admins group; dedicated-admins has permission to get nodes but no CSR-related permissions.
    3. Open the console and go to Compute -> Nodes

Actual results:

The page only shows an error, without other content:

certificatesigningrequests.certificates.k8s.io is forbidden: User "xxxx" cannot list resource "certificatesigningrequests" in API group "certificates.k8s.io" at the cluster scope     

Expected results:

The user should still be able to view nodes.    

Additional info:

OSD-28173 - card for SRE tracking.
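
A possible stopgap on the cluster side, until the console tolerates the missing permission, is to grant read-only CSR access to dedicated-admins. This is a hedged sketch; the role name is arbitrary:

```
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csr-reader   # arbitrary name used for this sketch
rules:
- apiGroups: ["certificates.k8s.io"]
  resources: ["certificatesigningrequests"]
  verbs: ["get", "list", "watch"]
EOF
oc adm policy add-cluster-role-to-group csr-reader dedicated-admins
```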

Description of problem:

    If the bootstrap fails, the installer will try to get the VM console logs via the AWS SDK which requires the ec2:GetConsoleOutput permission.

Version-Release number of selected component (if applicable):

    all versions where we enabled VM console log gathering

How reproducible:

    always

Steps to Reproduce:

    1. Use minimal permissions and force a bootstrap failure
    2.
    3.
    

Actual results:

                level=info msg=Pulling VM console logs
                level=error msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-xgq2j8ch-f93c7-minimal-perm is not authorized to perform: ec2:GetConsoleOutput on resource: arn:aws:ec2:us-west-1:460538899914:instance/i-0fa40c9966e9f1ab9 because no identity-based policy allows the ec2:GetConsoleOutput action. Encoded authorization failure message: XYfLhyZ0pKnDzJrs9ZbOH8z8YkG03aPhT6U57EoqiLH8iS5PZvFgbgONlBuZfDswNpaNBVOfZcdPc1dWYoIsoPTXtQ_n32tzrdxloK7qpVbvkuesHtb8ytV8iLkpmOGyArMqp7Muphn2yXG9DQ5aijx-zQh_ShwirruMTZWhkZdx7_f1WtfjnCBVJGRwAc-rMZ_Xh82-jjxQlQbtBfgJ8COc3kQm7E_iJ1Ngyrcmu6bmVKCS6cEcGIVwRi03PZRtiemfejZfUT7yhKppB-zeeRm5bBWWVmRiuJhswquIW4dH0E9obNvq76-C0b2PR_V9ep-t0udUcypKGilqzqT1DY51gaP66GlSEfN5b4CTLTQxEhE73feZn4xEK0Qq4MkatPFJeGsUcxY5TXEBsGMooj4_D7wPFwkY46QEle41oqs-KNCWEifZSlV5f4IUyiSear85LlUIxBS9-_jfitV90Qw7MZM4z8ggIinQ_htfvRKgnW9tjREDj6hzpydQbViaeAyBod3Q-qi2vgeK6uh7Q6kqK3f8upu1hS8I7XD_TH-oP-npbVfkiPMIQGfy3vE3J5g1AyhQ24LUjR15y-jXuBOYvGIir21zo9oGKc0GEWRPdZr4suSbbx68rZ9TnTHXfwa0jrhIns24uwnANdR9U2NStE6XPJk9KWhbbz6VD6gRU72qbr2V7QKPiguNpeO_P5uksRDwEBWxDfQzMyDWx1zOhhPPAjOQRup1-vsPpJhkgkrsdhPebN0duz6Hd4yqy0RiEyb1sSMaQn_8ac_2vW9CLuWWbbt5qo2WlRllo3U7-FpvlP6BRGTPjv5z3O4ejrGsnfDxm7KF0ANvLU0KT2dZvKugB6j-Kkz56HXHebIzpzFPRpmo0B6H3FzpQ5IpzmYiWaQ6sNMoaatmoE2z420AJAOjSRBodqhgi2cVxyHDqHt0E0PQKM-Yt4exBGm1ZddC5TUPnCrDnZpdu2WLRNHMxEBgKyOzEON_POuDaOP0paEXFCflt7kNSlBRMRqAbOpGI_F96wlNmDO58KZDbPKgdOfomwkaR5icdeS-tQyQk2PnhieOTNL1M5hQZpLrzWVeJzZEtmZ_0vsePUdvXYusvL828ldyg8VCwq-B2oGD_ym_iPCINBC7sIy8Q0HVb5v5dzbs4l2UKcC7OzTG-TMlxphV20DqNmC5yCnHEdmnleNA48J69HdTMw_G7N9mo5IrXw049MjvYnia4NwarMGUvoBYnxROfQ2jprN7_BW-Cdyp2Ca2P9uU9AeSubeeQdzieazkXNeR9_4Su_EGsbQm Instance=i-0fa40c9966e9f1ab9
              

Expected results:

    No failures.

Additional info:

See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/57437/rehearse-57437-pull-ci-openshift-installer-master-e2e-aws-ovn-user-provisioned-dns/1860020284598259712 for an example of a failed job    
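
A hedged sketch of granting the missing action to the installer user (user name, policy name, and file name are placeholders):

```
cat <<'EOF' > get-console-output.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:GetConsoleOutput",
      "Resource": "*"
    }
  ]
}
EOF
aws iam put-user-policy --user-name "$INSTALLER_USER" \
  --policy-name allow-get-console-output \
  --policy-document file://get-console-output.json
```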

Description of problem:

   The OpenShift installer fails to create a cluster on an OpenStack Single-stack IPv6 environment - failed to run cluster api system

Version-Release number of selected component (if applicable):

Installer version:
 openshift-install version
openshift-install 4.18.0-rc.3
built from commit 0f87b38910a84cfe3243fb878436bc052afc3187
release image registry.ci.openshift.org/ocp/release@sha256:668c92b06279cb5c7a2a692860b297eeb9013af10d49d2095f2c3fe9ad02baaa
WARNING Release Image Architecture not detected. Release Image Architecture is unknown
release architecture unknown
default architecture amd64

RHOSO version:

[zuul@controller-0 ~]$ oc get openstackversions.core.openstack.org
NAME           TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
controlplane   18.0.4-trunk-20241112.1   18.0.4-trunk-20241112.1   18.0.4-trunk-20241112.1 

How reproducible:

    Always

Steps to Reproduce:

    1. Prepare openstack infra for openshift installation with Single-stack IPv6 (see the install-config.yaml below)
    2. openshift-install create cluster

install-config.yaml:

 

apiVersion: v1
baseDomain: "shiftstack.local"
controlPlane:
  name: master
  platform:
    openstack:
      type: "master"
  replicas: 3
compute:
- name: worker
  platform:
    openstack:
      type: "worker"
  replicas: 2
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr: fd01::/48
    hostPrefix: 64
  machineNetwork:
    - cidr: "fd2e:6f44:5dd8:c956::/64"
  serviceNetwork:
    - fd02::/112
  networkType: "OVNKubernetes"
platform:
  openstack:
    cloud:            "shiftstack"
    region:           "regionOne"
    apiVIPs: ["fd2e:6f44:5dd8:c956::5"]
    ingressVIPs: ["fd2e:6f44:5dd8:c956::7"]
    controlPlanePort:
      fixedIPs:
        - subnet:
            name: "subnet-ssipv6"
pullSecret: <omitted> 
sshKey:     <omitted>

 

 

Actual results:

The openshift-install fails to start the controlplane - kube-apiserver:

INFO Started local control plane with envtest
E0109 13:17:36.425059   30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=0
E0109 13:17:38.365005   30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=1
E0109 13:17:40.142385   30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=2
E0109 13:17:41.947245   30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=3
E0109 13:17:43.761197   30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=4
DEBUG Collecting applied cluster api manifests...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run local control plane: unable to start control plane itself: failed to start the controlplane. retried 5 times: timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)    

 

Additional info:

After the openshift-install failure, we observe that the local kube-apiserver attempts to use an IPv4 service network (10.0.0.0/24), even though our environment exclusively supports IPv6:

    $ cat ostest/.clusterapi_output/kube-apiserver.log
I0109 13:17:36.402549   31041 options.go:228] external host was not specified, using fd01:0:0:3::97
E0109 13:17:36.403397   31041 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""
I0109 13:17:38.351573   31096 options.go:228] external host was not specified, using fd01:0:0:3::97
E0109 13:17:38.352116   31096 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""
I0109 13:17:40.129451   31147 options.go:228] external host was not specified, using fd01:0:0:3::97
E0109 13:17:40.130026   31147 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""
I0109 13:17:41.517490   31203 options.go:228] external host was not specified, using fd01:0:0:3::97
E0109 13:17:41.518118   31203 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""
I0109 13:17:43.750048   31258 options.go:228] external host was not specified, using fd01:0:0:3::97
E0109 13:17:43.750649   31258 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""

 

$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if174: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
    link/ether 0a:58:19:b4:10:b3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fd01:0:0:3::97/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::858:19ff:feb4:10b3/64 scope link
       valid_lft forever preferred_lft forever 

Component Readiness has found a potential regression in the following test:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Significant regression detected.
Fishers Exact probability of a regression: 99.95%.
Test pass rate dropped from 99.06% to 93.75%.

Sample (being evaluated) Release: 4.18
Start Time: 2025-01-06T00:00:00Z
End Time: 2025-01-13T16:00:00Z
Success Rate: 93.75%
Successes: 45
Failures: 3
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.06%
Successes: 210
Failures: 2
Flakes: 0

View the test details report for additional context.

From the test details link, two of the three referenced failures are as follows:

    [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "OperatorHubSourceError",
          "alertstate": "firing",
          "container": "catalog-operator",
          "endpoint": "https-metrics",
          "exported_namespace": "openshift-marketplace",
          "instance": "[fd01:0:0:1::1a]:8443",
          "job": "catalog-operator-metrics",
          "name": "community-operators",
          "namespace": "openshift-operator-lifecycle-manager",
          "pod": "catalog-operator-6c446dcbbb-sxvjz",
          "prometheus": "openshift-monitoring/k8s",
          "service": "catalog-operator-metrics",
          "severity": "warning"
        },
        "value": [
          1736751753.045,
          "1"
        ]
      }
    ]

This appears to happen sporadically in CI lately: https://search.dptools.openshift.org/?search=OperatorHubSourceError&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Though overall it looks quite rare.

What is happening to cause these alerts to fire?

At this moment, it's a regression for 4.18 and thus a release blocker. I suspect it will clear naturally, but it might be a good opportunity to look for a reason why. Could use some input from OLM on what exactly is happening in the runs such as these two:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6/1878675368131432448

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6/1877545344619778048
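
For anyone triaging these runs, a hedged way to check what OLM reports for the affected catalog source (the alert fires on catalog source connection problems); the label selector is the one OLM normally applies to catalog registry pods:

```
oc get catalogsource community-operators -n openshift-marketplace \
  -o jsonpath='{.status.connectionState.lastObservedState}{"\n"}'
oc get pods -n openshift-marketplace -l olm.catalogSource=community-operators
```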

Description of problem:

2024/10/08 12:50:50  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/10/08 12:50:50  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/10/08 12:50:50  [INFO]   : ⚙️  setting up the environment for you...
2024/10/08 12:50:50  [INFO]   : 🔀 workflow mode: diskToMirror 
2024/10/08 12:52:19  [INFO]   : 🕵️  going to discover the necessary images...
2024/10/08 12:52:19  [INFO]   : 🔍 collecting release images...
2024/10/08 12:52:19  [ERROR]  : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory
2024/10/08 12:52:19  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/10/08 12:52:19  [ERROR]  : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory 


    

Version-Release number of selected component (if applicable):

    [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202409120935.p0.gc912303.assembly.stream.el9-c912303", GitCommit:"c9123030d5df99847cf3779856d90ff83cf64dcb", GitTreeState:"clean", BuildDate:"2024-09-12T09:57:57Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}


    

How reproducible:

    Always
    

Steps to Reproduce:

    1. Install 4.17 cluster and 4.17 oc-mirror
    2. Now use the ImageSetConfig.yaml below and perform mirror2disk using the command below
[root@bastion-dsal oc-mirror]# cat imageset.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.14
      minVersion: 4.14.20
      maxVersion: 4.14.20
      shortestPath: true
    graph: true
oc-mirror -c /tmp/imagesetconfig.yaml file:///home/fedora/test-oc-mirror/release-images --v2
    3. Now perform disk2mirror using the command below
    oc-mirror -c /tmp/imagesetconfig.yaml --from file:///home/fedora/test-oc-mirror/release-images docker://localhost:5000 --v2 --dest-tls-verify=false --dry-run
    

Actual results:

    When performing disk2mirror errors are seen as below
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/imageset.yaml --from file:///home/fedora/test-oc-mirror/release-images docker://localhost:5000 --v2 --dest-tls-verify=false --dry-run

2024/10/08 12:50:50  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/10/08 12:50:50  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/10/08 12:50:50  [INFO]   : ⚙️  setting up the environment for you...
2024/10/08 12:50:50  [INFO]   : 🔀 workflow mode: diskToMirror 
2024/10/08 12:52:19  [INFO]   : 🕵️  going to discover the necessary images...
2024/10/08 12:52:19  [INFO]   : 🔍 collecting release images...
2024/10/08 12:52:19  [ERROR]  : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory
2024/10/08 12:52:19  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/10/08 12:52:19  [ERROR]  : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory 


    

Expected results:

    No errors should be seen when performing disk2mirror
    

Additional info:

    If nested paths are not used for the file destination, i.e. using file://test-oc-mirror instead of file:///home/fedora/test-oc-mirror/release-images, things work fine and the above error is not seen.
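
For comparison, a hedged sketch of the non-nested invocations that work, reusing the commands from the steps above:

```
oc-mirror -c /tmp/imageset.yaml file://test-oc-mirror --v2
oc-mirror -c /tmp/imageset.yaml --from file://test-oc-mirror \
  docker://localhost:5000 --v2 --dest-tls-verify=false --dry-run
```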
    

Description of problem:

The data in the table columns overlaps on the Helm rollback page when the screen width is reduced
    

Version-Release number of selected component (if applicable):

    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Open a quick start window
    2. Create a helm chart and upgrade it
    3. Now select the rollback option from the action menu or kebab menu of the helm chart
    

Actual results:

The Rollback page layout is broken: the table column data overlaps
    

https://drive.google.com/file/d/1YXz80YsR5pkRG4dQmqFxpTkzgWQnQWLe/view?usp=sharing

Expected results:

The UI should look similar to the BuildConfig page with the quick start panel open
    

https://drive.google.com/file/d/1UYxdRdV2kGC1m-MjBifTNdsh8gtpYnaU/view?usp=sharing

Additional info:

    

Description of problem:

There is a new menu item "UserDefinedNetworks" under "Networking"; the page shows a 404 error after navigating to it.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-20-085127
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Go to Networking->UserDefinedNetworks page.
    2.
    3.
    

Actual results:

1. 404 error is shown on the page :
404: Page Not Found
The server doesn't have a resource type "UserDefinedNetwork" in "k8s.ovn.org/v1". Try refreshing the page if it was recently added.
    

Expected results:

1. Should not show 404 error.
    

Additional info:
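
For triage, a hedged check of whether the API resource the page expects is actually present on the cluster (the CRD name assumes the standard plural form):

```
oc api-resources --api-group=k8s.ovn.org
oc get crd userdefinednetworks.k8s.ovn.org
```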


    

Description of problem:

A cluster with a default (empty) `configs.spec.samplesRegistry` field but with whitelist entries in `image.spec.registrySources.allowedRegistries` causes the openshift-samples ClusterOperator to go into a degraded state.

Version-Release number of selected component (if applicable):

4.13.30, 4.13.32

How reproducible:

 100%

Steps to Reproduce:

1. Add the whitelist entries in image.spec.registrySources.allowedRegistries:
~~~
oc get image.config/cluster -o yaml

spec:
  registrySources:
    allowedRegistries:
    - registry.apps.example.com
    - quay.io
    - registry.redhat.io
    - image-registry.openshift-image-registry.svc:5000
    - ghcr.io
    - quay.apps.example.com
~~~

2. Delete the pod, so it recreates:
~~~
oc delete pod -l name=cluster-samples-operator -n openshift-cluster-samples-operator
~~~

3. The openshift-samples ClusterOperator goes into a degraded state:
~~~
# oc get co openshift-samples
NAME                VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
openshift-samples   4.13.30   True        True          True       79m     Samples installation in error at 4.13.30: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
~~~

4. The configs.samples spec is empty:
~~~
# oc get configs.samples.operator.openshift.io  cluster -o jsonpath='{.spec}{"\n"}'
{"architectures":["x86_64"],"managementState":"Managed"}
~~~

Actual results:

The openshift-samples ClusterOperator goes into a degraded state.

Expected results:

The openshift-samples ClusterOperator should remain healthy.

Additional info:

We had a bug (https://bugzilla.redhat.com/show_bug.cgi?id=2027745) earlier which was fixed in OCP 4.10.3 as per the errata (https://access.redhat.com/errata/RHSA-2022:0056).

One of my customers faced this issue when upgrading the cluster from 4.12 to 4.13.32.

As a workaround, adding the lines below under `image.config.spec` helped.
~~~
allowedRegistriesForImport:
- domainName: registry.redhat.io
  insecure: false
~~~
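
Equivalently, the workaround above can be applied with a single patch; a hedged sketch:

```
oc patch image.config.openshift.io/cluster --type=merge \
  -p '{"spec":{"allowedRegistriesForImport":[{"domainName":"registry.redhat.io","insecure":false}]}}'
```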

 

Description of problem:

    vmware-vsphere-csi-driver-operator panics when the vCenter address is incorrect

Version-Release number of selected component (if applicable):

  Vsphere IPI 4.17

How reproducible:

    100%

 

$ oc logs vmware-vsphere-csi-driver-operator-74f65d4444-ljt78 -n openshift-cluster-csi-drivers

E1007 09:22:39.958324       1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors:
  line 1: cannot unmarshal !!seq into config.CommonConfigYAML
I1007 09:22:39.958515       1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
I1007 09:22:39.959600       1 config.go:283] Config initialized
W1007 09:22:39.959738       1 vspherecontroller.go:910] vCenter vcenter1.vmware.gsslab.pnq2.redhat.com is missing from vCenter map
E1007 09:22:39.959815       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 731 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf0f80, 0x54f9210})
        k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc000b0f4e8, 0x1, 0xc001adc000?})
        k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2cf0f80?, 0x54f9210?})
        runtime/panic.go:770 +0x132

Actual results:

The storage ClusterOperator goes Degraded with the panic:

$ oc get co storage
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.17.0    True        False         True       6d21h   VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught:...

Expected results:

    The vmware-vsphere-csi-driver-operator should not panic even if the configuration is missing or incorrect.

Additional info:
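
For triage, a hedged way to inspect the vCenter address the operator sees; the field paths below are assumptions based on the usual config locations, not taken from this bug:

```
oc get infrastructure cluster \
  -o jsonpath='{.spec.platformSpec.vsphere.vcenters[*].server}{"\n"}'
oc get cm cloud-provider-config -n openshift-config \
  -o jsonpath='{.data.config}' | grep -i -E 'server|virtualcenter'
```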

    

Missing RBAC causes an error when OVN-Kubernetes tries to annotate the network ID on the NADs. The regression was noticed when test coverage for secondary networks was added.

Description of problem:

v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

With the default install-config (no zones field), `create manifests` panics.

Version-Release number of selected component (if applicable):

 4.19   

How reproducible:

Always    

Steps to Reproduce:

1. openshift-install create install-config
2. Enable CAPI in install-config.yaml:
   featureSet: CustomNoUpgrade
   featureGates: ["ClusterAPIInstall=true"]
3. openshift-install create manifests

Actual results:

INFO Consuming Install Config from target directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x3b93434]goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/manifests/ibmcloud.GenerateClusterAssets(0xc0017a3080, 0xc001999700, {0xc0006847e9, 0x31})
        /go/src/github.com/openshift/installer/pkg/asset/manifests/ibmcloud/cluster.go:73 +0x754
github.com/openshift/installer/pkg/asset/manifests/clusterapi.(*Cluster).Generate(0x2718f500, {0x5?, 0x89d1e67?}, 0xc00135bd10)
        /go/src/github.com/openshift/installer/pkg/asset/manifests/clusterapi/cluster.go:142 +0x7ef
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c12180, {0x224f8690, 0xc000ffbe00}, {0x224d4a90, 0x2718f500}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x6e2
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0xc000c12180, {0x224f8690?, 0xc000ffbe00?}, {0x224d4a90, 0x2718f500}, {0x27151c80, 0x6, 0x6})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x4e
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc0006c6be0, {0x224f8690, 0xc000ffbe00}, {0x27151c80, 0x6, 0x6})
        /go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x16b
main.newCreateCmd.runTargetCmd.func3({0x7ffde172857f?, 0x1?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:307 +0x6a
main.newCreateCmd.runTargetCmd.func4(0x2715d2c0, {0xc000a32a60?, 0x4?, 0x8986d1f?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:321 +0x102
github.com/spf13/cobra.(*Command).execute(0x2715d2c0, {0xc000a32a20, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:989 +0xa91
github.com/spf13/cobra.(*Command).ExecuteC(0xc0009cef08)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1041
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:67 +0x390
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:39 +0x168
    

Expected results:

Creating the manifests succeeds.

Additional info:

The default install-config.yaml without the zones field:
  controlPlane:
    architecture: amd64
    hyperthreading: Enabled
    name: master
    platform: {}
    replicas: 3
  compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform: {}
    replicas: 3
  metadata:
    name: maxu-capi1
  platform:
    ibmcloud:
      region: eu-gb    

After adding the zones field, create manifests succeeds.

 controlPlane:
   architecture: amd64
   hyperthreading: Enabled
   name: master
   platform:
     ibmcloud:
       zones:
       - eu-gb-1
       - eu-gb-2
       - eu-gb-3
   replicas: 3
 compute:
 - architecture: amd64
   hyperthreading: Enabled
   name: worker
   platform:
     ibmcloud:
       zones:
       - eu-gb-1
       - eu-gb-2
       - eu-gb-3
   replicas: 3
 metadata:
   name: maxu-capi2
 platform:
   ibmcloud:
     region: eu-gb  

 

The cluster-baremetal-operator sets up a number of watches for resources using Owns() that have no effect because the Provisioning CR does not (and should not) own any resources of the given type or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace are different from that of the Provisioning CR.

The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.

The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.

See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.

 

Description of problem:

Trying to set up a disconnected HCP cluster with a self-managed image registry.

After the cluster installed, all the imagestream failed to import images.
With error:
```
Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client
```

The imagestream import goes through the openshift-apiserver, which reaches out to the target registry.

After logging in to the openshift-apiserver pod in the HCP namespace, I figured out that no external network can be reached over HTTPS.

Version-Release number of selected component (if applicable):

4.14.35    

How reproducible:

    always

Steps to Reproduce:

    1. Install the hypershift hosted cluster with above setup
    2. The cluster can be created successfully and all the pods on the cluster can be running with the expected images pulled
    3. Check the internal image-registry
    4. Check the openshift-apiserver pod from management cluster
    

Actual results:

All the imagestreams failed to sync from the remote registry.
$ oc describe is cli -n openshift
Name:            cli
Namespace:        openshift
Created:        6 days ago
Labels:            <none>
Annotations:        include.release.openshift.io/ibm-cloud-managed=true
            include.release.openshift.io/self-managed-high-availability=true
            openshift.io/image.dockerRepositoryCheck=2024-11-06T22:12:32Z
Image Repository:    image-registry.openshift-image-registry.svc:5000/openshift/cli
Image Lookup:        local=false
Unique Images:        0
Tags:            1

latest
  updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d  ! error: Import failed (InternalError): Internal error occurred: [122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-1@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-2@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-3@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-4@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-5@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://quay.io/v2/": http: server gave HTTP response to HTTPS client]


Access the external network from the openshift-apiserver pod:
sh-5.1$ curl --connect-timeout 5 https://quay.io/v2
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received
sh-5.1$ curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received

sh-5.1$ env | grep -i http.*proxy
HTTPS_PROXY=http://127.0.0.1:8090
HTTP_PROXY=http://127.0.0.1:8090

Expected results:

The openshift-apiserver should be able to talk to the remote https services.

Additional info:

It works after adding the registry to NO_PROXY:

sh-5.1$ NO_PROXY=122610517469.dkr.ecr.us-west-2.amazonaws.com curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
Not Authorized
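
If this needs to be made persistent at the HostedCluster level, a hedged sketch, assuming the HostedCluster exposes the standard proxy configuration under spec.configuration.proxy (HC_NS and HC_NAME are placeholders):

```
oc patch hostedcluster -n "$HC_NS" "$HC_NAME" --type=merge \
  -p '{"spec":{"configuration":{"proxy":{"noProxy":"122610517469.dkr.ecr.us-west-2.amazonaws.com"}}}}'
```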
 

 

Description of problem:

  OWNERS file updated to include prabhakar and Moe as owners and reviewers

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    This is to facilitate easy backports via automation.

Description of problem:

    The secret created with Basic authentication has an incorrect type: it shows as Opaque. Compared with the same behavior on OCP 4.18, the type should show as kubernetes.io/basic-auth.

Version-Release number of selected component (if applicable):

    4.19.0-0.nightly-2025-01-16-064700

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Secrets page and create a Source secret,
       e.g.: /k8s/ns/default/secrets/~new/source
    2. Make sure the Authentication type is set to 'Basic authentication'
    3. Create the secret, and check the secret type on the Secrets list and Secret details pages
    

Actual results:

    The secret created with Basic authentication has its type shown as Opaque, which is incorrect

Expected results:

    Compared with the same behavior on OCP 4.18, it should be shown as kubernetes.io/basic-auth

Additional info:
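
A hedged way to verify the created type from the CLI (the secret name and namespace are placeholders from the steps above):

```
oc get secret my-source-secret -n default -o jsonpath='{.type}{"\n"}'
# expected (matching 4.18 behavior): kubernetes.io/basic-auth
```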

    

Description of problem:

Our carry patch, intended to retry retriable requests that fail due to a leader change, will retry any etcd error with code "Unavailable": https://github.com/openshift/kubernetes/blob/4b2db1ec33faa3ffc305e5ffa7376908cc955370/staging/src/k8s.io/apiserver/pkg/storage/etcd3/etcd3retry/retry_etcdclient.go#L135-L145. However, this includes reasons like "timeout" and does not distinguish between writes and reads. So a "timeout" error on a writing request might be retried, even though a "timeout" observed by a client does not indicate that the effect of the write has not been persisted.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The current output message of `oc adm top pvc -n xxxx` is
"error: no persistentvolumeclaims found in xxxx namespace", even though persistent volume claims exist in the namespace (they are just not mounted), which could mislead the end user.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-02-11-161912    

How reproducible:

Always    

Steps to Reproduce:

1. Create a StorageClass with Immediate volumeBindingMode and a PVC that consumes it
2. Check the PVC: `oc get pvc -n testropatil` shows the persistent volume claim/volume exists
3. Check the output message of `oc adm top pvc`:
oc adm top pvc --insecure-skip-tls-verify=true -n testropatil
error: no persistentvolumeclaims found in testropatil namespace.

sc_pvc.yaml:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
parameters:
  encrypted: "true"
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc-fs
  namespace: testropatil
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  storageClassName: mysc
  resources:
    requests:
      storage: 1Gi

oc get pvc
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
mypvc-fs       Bound    pvc-8257a331-c808-47c0-b127-53c817f090a7   1Gi        RWO            mysc           <unset>                 4s

oc get pv | grep "testropatil"
pvc-8257a331-c808-47c0-b127-53c817f090a7   1Gi        RWO            Delete           Bound    testropatil/mypvc-fs                        mysc           <unset>                          82s

Actual results:

error: no persistentvolumeclaims found in testropatil namespace

Expected results:

error: no persistentvolumeclaims found or mounted in xxxx namespace

Additional info:

 

Description of problem:

oc-mirror fails to mirror images when specific bundles are selected in the ImageSetConfiguration

Version-Release number of selected component (if applicable):

oc-mirror version 
W1121 06:10:37.581138  159435 mirror.go:102] ⚠️  oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411191706.p0.g956fc31.assembly.stream.el9-956fc31", GitCommit:"956fc318cc67769aedb2db8c0c4672bf7ed9f909", GitTreeState:"clean", BuildDate:"2024-11-19T18:08:35Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.el9_5) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. Mirror the images with bundles selected in the ImageSetConfiguration:
 kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
    packages:
    - name: cluster-kube-descheduler-operator
      bundles:
      - name: clusterkubedescheduleroperator.v5.0.1
      - name: clusterkubedescheduleroperator.4.13.0-202309181427
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.14
    packages:
    - name: 3scale-community-operator
      bundles:
      - name: 3scale-community-operator.v0.11.0


oc-mirror -c /tmp/config-73420.yaml file://out73420 --v2 

 

Actual results:

1. The following error is hit:
oc-mirror -c /tmp/ssss.yaml file:///home/cloud-user/outss --v2
2024/11/21 05:57:40  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/21 05:57:40  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/21 05:57:40  [INFO]   : ⚙️  setting up the environment for you...
2024/11/21 05:57:40  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/11/21 05:57:40  [INFO]   : 🕵️  going to discover the necessary images...
2024/11/21 05:57:40  [INFO]   : 🔍 collecting release images...
2024/11/21 05:57:40  [INFO]   : 🔍 collecting operator images...
 ✓   (43s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 
 ⠼   (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14 
 ✗   (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14 
2024/11/21 05:58:44  [ERROR]  : filtering on the selected bundles leads to invalidating channel "threescale-2.11" for package "3scale-community-operator": cha ✗   (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14

Expected results:

1. No error should be seen.

Additional info:

No such issue is seen with the older version 4.18.0-ec3.

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    every time

Steps to Reproduce:

    1. Create a dashboard with a bar chart and sort the query result in ascending order.
    2. 
    3.
    

Actual results:

The bar goes outside of the chart border.

Expected results:

The bar should not go outside of the chart border.

Additional info:

    

screenshot: https://drive.google.com/file/d/1xPRgenpyCxvUuWcGiWzmw5kz51qKLHyI/view?usp=drive_link

Description of problem:

`sts:AssumeRole` is required when creating a Shared-VPC cluster [1]; otherwise it causes the error:

 level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": aws.hostedZone: Invalid value: "Z01991651G3UXC4ZFDNDU": unable to retrieve hosted zone: could not get hosted zone: Z01991651G3UXC4ZFDNDU: AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-1c2w7jv2-ef4fe-minimal-perm-installer is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::641733028092:role/ci-op-1c2w7jv2-ef4fe-shared-role
level=fatal msg=	status code: 403, request id: ab7160fa-ade9-4afe-aacd-782495dc9978
Installer exit with code 1

[1]https://docs.openshift.com/container-platform/4.17/installing/installing_aws/installing-aws-account.html

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-03-174639
    

How reproducible:

Always
    

Steps to Reproduce:

1. Create install-config for Shared-VPC cluster
2. Run openshift-install create permissions-policy
3. Create cluster by using the above installer-required policy.

    

Actual results:

See description
    

Expected results:

sts:AssumeRole is included in the policy file when Shared VPC is configured.
    

Additional info:

The configuration of Shared-VPC is like:
platform:
  aws:
    hostedZone:
    hostedZoneRole:
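
A hedged sanity check that the installer credentials can assume the shared-account role referenced by hostedZoneRole (the ARN is a placeholder):

```
aws sts assume-role \
  --role-arn "arn:aws:iam::<shared-account-id>:role/<hosted-zone-role>" \
  --role-session-name shared-vpc-policy-check
```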

    

TL;DR: I suspect that a combination of resource constraints and thousands of systemd services can trigger an underlying race in the MCO, in the interplay between rpm-ostree, the machine-config operator, and the actual node reboot.

Here's how an rpm-ostree update plays out according to my understanding:

Note: Red Hat viewers, please see https://drive.google.com/file/d/1zYirNlsbFsBRNRZ8MVLugdsTsMV6yCQR/view?usp=sharing for an ASCII graph of this flow that wasn't disfigured by Jira's horrible UI ...

1. rpm-ostree kargs starts rpm-ostree start-daemon and instructs it via D-Bus.
2. The daemon writes the staged changes to /run/ostree/staged-deployment.
3. ostree-finalize-staged.path monitors /run/ostree/staged-deployment and, when the file appears, triggers ostree-finalize-staged.service.
4. ostree-finalize-staged.service depends on, starts and runs ostree-finalize-staged-hold.service (starting -> started).
5. With that dependency met, ostree-finalize-staged.service reports "Finished" (its start is a no-op; it remains active (exited)).
6. On service stop (e.g. at reboot), "Stopping ostree-finalize-staged.service" runs /usr/bin/ostree admin finalize-staged in ExecStop, which reads /run/ostree/staged-deployment and actually applies the staged changes.
7. ostree-finalize-staged.service is stopped.
8. ostree-finalize-staged-hold.service goes stopping -> stopped.

In the journal, on a healthy run from a lab of mine, this plays out as:

Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment... # <--- comes from ostree-finalize-staged-hold.service, pulled in via dep relationship with ostree-finalize-staged.service
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- remains running now
Feb 19 15:04:28 ocp-on-osp.workload.bos2.lab systemd[1]: Finished OSTree Finalize Staged Deployment.                       # <--- is a noop, remains active (exited) - comes from ostree-finalize-staged.service
(...)
Feb 19 15:06:50 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping OSTree Finalize Staged Deployment...                     # <--- stop logic starting now meaining ostree admin finalize-staged
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: ostree-finalize-staged.service: Deactivated successfully.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped OSTree Finalize Staged Deployment.
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment... # <--- pulled in via dep, starts before, stops after
Feb 19 15:06:54 ocp-on-osp.workload.bos2.lab systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.    # <--- same

In the attached support case, this logic fails with the following log messages:

Jan 08 00:28:44 worker0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-39e251cb69d9f430a38dd14b0d3bae3c.
Jan 08 00:28:44 worker0 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Jan 08 00:28:44 worker0 root[15374]: machine-config-daemon[10214]: reboot successful
Jan 08 00:28:44 worker0 systemd-logind[1796]: The system will reboot now!
Jan 08 00:28:44 worker0 ovs-vswitchd[1947]: ovs|00075|connmgr|INFO|br-int<->unix#2: 2317 flow_mods in the 7 s starting 10 s ago (2111 adds, 206 deletes)
Jan 08 00:28:44 worker0 kernel: ice 0000:6c:00.0 ens14f0: Setting MAC 36:4a:79:c2:61:e7 on VF 34. VF driver will be reinitialized
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Reset indication received from the PF
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Scheduling reset task
Jan 08 00:28:44 worker0 kernel: iavf 0000:6c:05.2: Removing device
Jan 08 00:28:44 worker0 systemd[1]: Started Hold /boot Open for OSTree Finalize Staged Deployment.
Jan 08 00:28:44 worker0 systemd-logind[1796]: System is rebooting.
Jan 08 00:28:44 worker0 systemd[1]: Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed to queue unit startup job: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start' is included in transaction).
Jan 08 00:28:44 worker0 systemd[1]: ostree-finalize-staged.path: Failed with result 'resources'.   # <------------- here
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopping Hold /boot Open for OSTree Finalize Staged Deployment...
(...)
Jan 08 00:29:01 worker0 systemd[1]: Stopped Hold /boot Open for OSTree Finalize Staged Deployment.
(...)

Message ` Requested transaction contradicts existing jobs: Transaction for ostree-finalize-staged.service/start is destructive (ostree-finalize-staged-hold.service has 'stop' job queued, but 'start'
is included in transaction).` means that ostree-finalize-staged requires ostree-finalize-staged-hold to be running, but a reboot was triggered right after ostree-finalize-staged-hold completed, and before ostree-finalize-staged resumed its own start sequence again. At that point, the dependency relationship is never fulfilled, and ostree-finalize-staged can never start. However, rpm-ostree applies all changes in the ExecStop of ostree-finalize-staged, and that's why the changes are never applied.

Here's the sequence of what's happening when it's going wrong:
0) ostree-finalize-staged.path sees that /run/ostree/staged-deployment is created and it wants to start ostree-finalize-staged.service
1) ostree-finalize-staged.service wants to run, but requires to start ostree-finalize-staged-hold.service
2) ostree-finalize-staged-hold.service starting as requested by ostree-finalize-staged.service
3) reboot triggered
4) ostree-finalize-staged-hold.service started
5) ----> ASSUMPTION: reboot queues stop job for ostree-finalize-staged-hold.service
6) ostree-finalize-staged.service wants to run, but requires ostree-finalize-staged-hold.service to be started; since that unit now has a stop job queued,
   ostree-finalize-staged.service can never start
7) ostree-finalize-staged.path: Failed with result 'resources' (because what it wanted to start, ostree-finalize-staged.service, can't start)

Regardless of why the systemd hold service is delayed so much that the reboot affects it ...

A race can occur here because rpm-ostree kargs (or any rpm-ostree update operation) blocks only until the rpm-ostree daemon writes /run/ostree/staged-deployment (the changes are staged in this file).
rpm-ostree then returns.
Everything else happens asynchronously: ostree-finalize-staged.path detects that the file was created, it starts ostree-finalize-staged.service which in ExecStop runs /usr/bin/ostree admin finalize-staged.
This last command actually applies the staged changes.
The MCO daemon only blocks on rpm-ostree kargs; it does not wait for ostree-finalize-staged.service to be active.
This means the MCO reboots the node immediately after the rpm-ostree kargs changes are staged, without waiting for the service that actually applies the changes to be active.
Here lies the potential for a race condition which - I believe - the customer is hitting due to a very specific node configuration.
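For illustration only, a pre-reboot guard along these lines would close that window (a sketch, not the MCO's actual logic; it assumes the staged-deployment marker path mentioned above):

#!/bin/bash
# Only reboot once the service that will apply the staged deployment is active.
if [ -e /run/ostree/staged-deployment ]; then
  until systemctl is-active --quiet ostree-finalize-staged.service; do
    sleep 1
  done
fi
systemctl reboot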

While I cannot reproduce the exact error message from the customer's journal, it's very easy to reproduce a failure by exploiting this async mechanism.

Spawn a cluster with cluster bot, e.g. `launch 4.16.30 aws,no-spot`, then run the following test:

#!/bin/bash

TIMEOUT=900

echo "#####################################"
echo "Creating MCP for a single worker node"
echo "#####################################"

first_node=$(oc get nodes -l node-role.kubernetes.io/worker= -o name | head -1)
oc label "${first_node}" node-role.kubernetes.io/worker-test=

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-test
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-test]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-test: ""
EOF

echo "#####################################"
echo "   Creating reproducer conditions    "
echo "#####################################"

f=$(mktemp)
cat <<EOF>$f
[Service]
ExecStartPre=/bin/bash -c "echo 'exec start pre'; /bin/sleep 15; echo 'exec start pre end'"
EOF

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-smoketest-prep
spec:
  baseOSExtensionsContainerImage: ""
  config:
    ignition:
      config:
        replace:
          verification: {}
      proxy: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,$(cat $f | base64 -w0)
          verification: {}
        group: {}
        mode: 600
        path: /etc/systemd/system/ostree-finalize-staged-hold.service.d/override.conf
EOF

echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s

echo "#####################################"
echo "      Updating kernel arguments      "
echo "#####################################"
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-test
  name: 99-worker-test-kernelarg-smoketest
spec:
  kernelArguments:
    - abc=def
EOF

echo "Sleeping for a bit ..."
sleep 60
echo "Waiting for MCP to be updated"
oc wait --for=condition=Updated=true mcp/worker-test --timeout=${TIMEOUT}s &
oc wait --for=condition=Degraded=true mcp/worker-test --timeout=${TIMEOUT}s &
wait -n

echo "#####################################"
echo "             End of test             "
echo "#####################################"
oc get mcp

Whereas the above test delays ostree-finalize-staged-hold.service by 15 seconds and does not reproduce the exact customer error message,
it still manages to reproduce an issue where the rpm-ostree staged changes are not applied. The 15-second sleep is of course extreme,
but it demonstrates that any delay in starting the -hold service can lead to a race condition that makes the process fail.

With the above test, the MCP will go into degraded state with:

[akaris@workstation 04029338 (master)]$ oc get mcp worker-test -o yaml | grep abc -C4
    type: Updating
  - lastTransitionTime: "2025-02-21T12:13:03Z"
    message: 'Node ip-10-0-107-203.us-west-2.compute.internal is reporting: "unexpected
      on-disk state validating against rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72:
      missing expected kernel arguments: [abc=def]"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2025-02-21T12:13:03Z"

In the journal:

Feb 21 12:09:11 ip-10-0-107-203 root[8530]: machine-config-daemon[2561]: Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) added; new total=1
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Loaded sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Locked sysroot
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Initiated txn KernelArgs for client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Feb 21 12:09:11 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] connected to transaction progress
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564171    2259 patch_prober.go:28] interesting pod/loki-promtail-z7ll9 container/promtail namespace/openshift-e2e-loki: Readiness probe status=failure output="Get \"http://10.129.2.3:3101/ready\": context deadline exceeded (Client.Timeout exceeded wh
ile awaiting headers)" start-of-body=
Feb 21 12:09:11 ip-10-0-107-203 kubenswrapper[2259]: I0221 12:09:11.564237    2259 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-e2e-loki/loki-promtail-z7ll9" podUID="c31172d3-6a1e-457a-8e66-c49db072c977" containerName="promtail" probeResult="failure" output="Get \"http://10.129.2.3:3101/ready\":
 context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Feb 21 12:09:16 ip-10-0-107-203 systemd[1]: Starting Hold /boot Open for OSTree Finalize Staged Deployment...
Feb 21 12:09:16 ip-10-0-107-203 bash[8598]: exec start pre
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Created new deployment /ostree/deploy/rhcos/deploy/7c3910540178cc5e35b9c819a09cb58831b264384e9ff7a071a73a379dc32718.4
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: sanitycheck(/usr/bin/true) successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Unlocked sysroot
Feb 21 12:09:16 ip-10-0-107-203 rpm-ostree[6705]: Process [pid: 8531 uid: 0 unit: crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope] disconnected from transaction progress
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: client(id:machine-config-operator dbus:1.254 unit:crio-958faf4c29a580eace124aeb7cf8ecffe66b326216226daee367e207e3cb67d3.scope uid:0) vanished; remaining=0
Feb 21 12:09:17 ip-10-0-107-203 rpm-ostree[6705]: In idle state; will auto-exit in 63 seconds
Feb 21 12:09:17 ip-10-0-107-203 root[8608]: machine-config-daemon[2561]: Rebooting node
Feb 21 12:09:17 ip-10-0-107-203 root[8609]: machine-config-daemon[2561]: initiating reboot: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72
Feb 21 12:09:17 ip-10-0-107-203 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-worker-test-bf73a6e5a7892ac4151522bbf11a8f72.
Feb 21 12:09:17 ip-10-0-107-203 root[8612]: machine-config-daemon[2561]: reboot successful
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: The system will reboot now!
Feb 21 12:09:17 ip-10-0-107-203 systemd-logind[913]: System is rebooting.

And here are the MCO daemon logs:

2025-02-21T12:09:11.464509464+00:00 stderr F I0221 12:09:11.464485    2561 update.go:2641] Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def]
2025-02-21T12:09:11.466273927+00:00 stderr F I0221 12:09:11.466231    2561 update.go:2626] Running: rpm-ostree kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1="all" --delete=psi=0 --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=0 --append=abc=def
2025-02-21T12:09:16.367223780+00:00 stdout F Staging deployment...done
2025-02-21T12:09:17.184102954+00:00 stdout F Changes queued for next boot. Run "systemctl reboot" to start a reboot

So we can see here that the MCO daemon reboots the node even though rpm-ostree isn't ready yet, because the service that actually applies the staged changes is not yet active.

Description of problem:

We have two EAP application server clusters, and for each of them a service is created. We have a route configured to the first service. When we update the route programmatically to point to the second service/cluster, the response shows it is still attached to the first service.

Steps to Reproduce:
1. Create two separate clusters of the EAP servers
2. Create one service for the first cluster (hsc1) and one for the second one (hsc2)
3. Create a route for the first service (hsc1)
4. Start both of the clusters and ensure replication works
5. Send a request to the first cluster using the route URL - response should contain identification of the first cluster (hsc-1-xxx)

[2024-08-29 11:30:44,544] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 11:30:44,654] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

6. Update the route programmatically to redirect to the second service (hsc2); an oc equivalent is sketched after the snippet below

...
builder.editSpec().editTo().withName("hsc2").endTo().endSpec();
...
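For reference, an equivalent update and check with oc would look like this (the route name and namespace are placeholders for this reproducer):

oc patch route hsc -n <namespace> --type=merge -p '{"spec":{"to":{"name":"hsc2"}}}'
oc get route hsc -n <namespace> -o jsonpath='{.spec.to.name}{"\n"}'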

7. Send the request again using the same route - in the response there is the same identification of the first cluster

[2024-08-29 11:31:45,098] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

although the service was updated in the route yaml:

...
kind: Service
    name: hsc2

When creating a new route hsc2 for the service hsc2 and using it for the third request, we can see the second cluster was targeted correctly, with its own separate replication working

[2024-08-29 13:43:13,679] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:43:13,790] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:44:14,056] INFO - [ForkJoinPool-1-worker-1] responseString after second route for service hsc2 was used hsc-2-2-614582a9-3c71-4690-81d3-32a616ed8e54 1 with route hsc2-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

I also tried a different approach.
I stopped the test in debug mode after the two requests were executed

[2024-08-30 14:23:43,101] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-30 14:23:43,210] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

Then I manually changed the route yaml to use the hsc2 service and sent the request manually:

curl http://hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com/Counter
hsc-2-2-84fa1d7e-4045-4708-b89e-7d7f3cd48541 1

responded correctly with the second service/cluster.

Then resumed the test run in debug mode and sent the request programmatically

[2024-08-30 14:24:59,509] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

responded with the wrong first service/cluster.

Actual results: Route directs to the same service and EAP cluster

Expected results: After the update the route should direct to the second service and EAP cluster

Additional info:
This issue started to occur from OCP 4.16. Going through the 4.16 release notes and the suggested route configuration didn't reveal any configuration changes that should have been applied.

The code of MultipleClustersTest.twoClustersTest, where this issue was discovered, is available here.

All the logs as well as services and route yamls are attached to the EAPQE jira.

The cloud provider health check runs on every pass through the HCP reconcile loop and results in ~800k calls to DescribeVpcEndpoints per day. This is 25% of our total AWS API call volume in our CI account and is contributing to API throttling.

Description of problem:

The UserDefinedNetwork page lists UDN and CUDN objects.
UDN is namespace-scoped, CUDN is cluster-scoped.

The list view "Namespace" column for CUDN objects presents "All Namespaces", which is confusing, making me think the CUDN selects all namespaces in the cluster.

Version-Release number of selected component (if applicable):

4.18

How reproducible:

100%

Steps to Reproduce:

1. Create CUDN, check "Namespace" column in the UDN page list view
2.
3.

Actual results:

UDN page list view "Namespace" column present "All Namespaces" for CUDN objects

Expected results:

I expect the "Namespace" column to not present "All Namespaces" for CUDN objects because it's confusing.
I think it's better for the "Namespace" column to remain empty for CUDN objects.

Additional info:

The CUDN spec has a namespace selector controlling which namespaces the CUDN affects; I think this is the source of the confusion.
Maybe having the "Namespace" column present "All Namespaces" for cluster-scoped objects makes sense in general, but in this particular case I find it confusing.


Description of problem:

In the ASH ARM template 06_workers.json [1], there is an unused variable "identityName" defined. This is harmless, but a little odd to have in the official UPI installation doc [2], and it might confuse users installing a UPI cluster on ASH.

[1] https://github.com/openshift/installer/blob/master/upi/azurestack/06_workers.json#L52
[2]  https://docs.openshift.com/container-platform/4.17/installing/installing_azure_stack_hub/upi/installing-azure-stack-hub-user-infra.html#installation-arm-worker_installing-azure-stack-hub-user-infra

Suggest removing it from the ARM template.
    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Today, when source images are by digest only, oc-mirror applies a default tag:

  • for operators and additional images it is the digest
  • for helm images it is digestAlgorithm+"-"+digest

This should be unified.
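For example (the mirror registry and repository names are illustrative):

  operator/additional image:  mirror.example.com/ns/repo:<digest>
  helm image:                 mirror.example.com/ns/repo:sha256-<digest>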

Description of problem:

Our carry patch, intended to retry retriable requests that fail due to a leader change, will retry any etcd error with code "Unavailable": https://github.com/openshift/kubernetes/blob/4b2db1ec33faa3ffc305e5ffa7376908cc955370/staging/src/k8s.io/apiserver/pkg/storage/etcd3/etcd3retry/retry_etcdclient.go#L135-L145. But this includes reasons like "timeout", and the retry does not distinguish between writes and reads. So a "timeout" error on a writing request might be retried even though a "timeout" observed by a client does not mean that the effect of the write has not been persisted.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

operator conditions kube-apiserver

Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 97.09% to 91.78%.

Sample (being evaluated) Release: 4.19
Start Time: 2025-02-18T00:00:00Z
End Time: 2025-02-25T12:00:00Z
Success Rate: 91.78%
Successes: 67
Failures: 6
Flakes: 0

Base (historical) Release: 4.18
Start Time: 2025-01-26T00:00:00Z
End Time: 2025-02-25T12:00:00Z
Success Rate: 97.09%
Successes: 334
Failures: 10
Flakes: 0

View the test details report for additional context.

The problem involved may exist in 4.18 and only be appearing in 4.19 because machine set operator jobs are lumped into a larger set, which has shrunk in 4.19. However, there appears to be a common test failure causing this which looks important to the functionality of the job and should be fixed, in addition to the need to clear this red cell.

The test this always seems to fail on is:

E2E Suite: [It] ControlPlaneMachineSet Operator With an active ControlPlaneMachineSet and the instance type is changed should perform a rolling update [Periodic]

Our unit test runtime is slow. It seems to run anywhere from ~16-20 minutes locally. On CI it can take at least 30 minutes to run. Investigate whether or not any changes can be made to improve the unit test runtime.

During review of ARO MiWi permissions, some permissions in the CCM CredentialsRequest for Azure were found to have linked actions whose permissions are missing.

A linked access check is an action performed by Azure Resource Manager during an incoming request. For example, when you issue a create operation to a network interface (Microsoft.Network/networkInterfaces/write) and specify a subnet in the payload, ARM parses the payload, sees you're setting a subnet property, and as a result requires the linked access check Microsoft.Network/virtualNetworks/subnets/join/action on the subnet resource specified in the network interface. If you update a resource but don't include the property in the payload, it will not perform the permission check.

The following permissions were identified as possibly needed in the CCM CredentialsRequest because they are specified as a linked action of one of CCM's existing permissions:

Microsoft.Network/applicationGateways/backendAddressPools/join/action
Microsoft.Network/applicationSecurityGroups/joinIpConfiguration/action
Microsoft.Network/applicationSecurityGroups/joinNetworkSecurityRule/action
Microsoft.Network/ddosProtectionPlans/join/action
Microsoft.Network/gatewayLoadBalancerAliases/join/action
Microsoft.Network/loadBalancers/backendAddressPools/join/action
Microsoft.Network/loadBalancers/frontendIPConfigurations/join/action
Microsoft.Network/loadBalancers/inboundNatRules/join/action
Microsoft.Network/networkInterfaces/join/action
Microsoft.Network/networkSecurityGroups/join/action
Microsoft.Network/publicIPAddresses/join/action
Microsoft.Network/publicIPPrefixes/join/action
Microsoft.Network/virtualNetworks/subnets/join/action

Each permission needs to be validated as to whether it is needed by CCM through any of its code paths.
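A rough way to see which of these linked actions are already listed in the shipped CredentialsRequests (a sketch; it greps all CredentialsRequests rather than assuming the exact name of the CCM one):

oc get credentialsrequests -n openshift-cloud-credential-operator -o yaml | grep -E 'join/action|joinIpConfiguration|joinNetworkSecurityRule'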

Description of problem:

In CAPI, we use a random machineNetwork instead of using the one passed in by the user. 

Description of problem:

Bare Metal UPI cluster

Nodes lose communication with other nodes, and this affects pod communication on those nodes as well. The issue can be fixed with an OVN DB rebuild on the nodes that are hitting it, but eventually the nodes degrade and lose communication again. Note that despite an OVN rebuild fixing the issue temporarily, Host Networking is set to True, so it's using the kernel routing table.

**Update: observed on vSphere with routingViaHost: false, ipForwarding: global configuration as well.

Version-Release number of selected component (if applicable):

 4.14.7, 4.14.30

How reproducible:

Can't reproduce locally but reproducible and repeatedly occurring in customer environment 

Steps to Reproduce:

Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. Tcpdumps on the target pod show that packets are arriving and acknowledged, but never route back to the client pod successfully (SYN/ACK seen at the pod network layer, not at geneve, so the packet is dropped before hitting the geneve tunnel).

Actual results:

Nodes will repeatedly degrade and lose communication despite fixing the issue with an OVN DB rebuild (the rebuild only provides hours/days of respite, not a permanent resolution).

Expected results:

Nodes should not lose communication, and even if they did, it should not happen repeatedly

Additional info:

What's been tried so far
========================

- Multiple OVN rebuilds on different nodes (works but node will eventually hit issue again)

- Flushing the conntrack (Doesn't work)

- Restarting nodes (doesn't work)

Data gathered
=============

- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)

- ovnkube-trace

- SOSreports of two nodes having communication issues before an OVN rebuild

- SOSreports of two nodes having communication issues after an OVN rebuild 

- OVS trace dumps of br-int and br-ex 


====

More data in nested comments below. 

linking KCS: https://access.redhat.com/solutions/7091399 

Description of problem:

ImageStream cannot import image tags when ImageTagMirrorSet is set to NeverContactSource. The same issue does not apply to pods.

Version-Release number of selected component (if applicable):

4.15.35

Steps to Reproduce:

    1. Create a disconnected cluster with no internet access
    2. Create a "pull-through" image registry  [1]   
    3. Create the following ImageTagMirrorSet and ImageDigestMirrorSet

~~~
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: image-mirrors
spec:
  imageDigestMirrors:
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/docker-remote
      source: docker.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/registry.access.redhat.com
      source: registry.access.redhat.com
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/quay.io
      source: quay.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/registry.redhat.io
      source: registry.redhat.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/gcr.io
      source: gcr.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/ghcr.io
      source: ghcr.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/com.redhat.connect.registry
      source: registry.connect.redhat.com
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/nvcr.io
      source: nvcr.io
---
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: image-mirrors
spec:
  imageTagMirrors:
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/docker-remote
      source: docker.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/registry.access.redhat.com
      source: registry.access.redhat.com
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/quay.io
      source: quay.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/registry.redhat.io
      source: registry.redhat.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/gcr.io
      source: gcr.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/ghcr.io
      source: ghcr.io
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/com.redhat.connect.registry
      source: registry.connect.redhat.com
    - mirrorSourcePolicy: NeverContactSource 
      mirrors:
        - <local-registry-url>/nvcr.io
      source: nvcr.io
~~~

    4. Import an image [2]

[1] https://docs.redhat.com/en/documentation/red_hat_quay/3.13/html/use_red_hat_quay/quay-as-cache-proxy
[2] https://docs.openshift.com/container-platform/4.15/openshift_images/image-streams-manage.html#images-imagestream-import-images-image-streams
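For example, step 4 can be exercised with oc import-image (the image stream name and source image are illustrative):

oc import-image ubi9:latest --from=registry.redhat.io/ubi9/ubi:latest --confirm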

Actual results:

Unable to import images

Expected results:

Being able to import images 

A similar issue is reported in OCPBUGS-17975

Description of problem:

    The kube scheduler pod should use the /livez endpoint rather than /healthz for its liveness probe.

Version-Release number of selected component (if applicable):

    

How reproducible:

    N/A

Steps to Reproduce:

    N/A
    

Actual results:

        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz

Expected results:

        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /livez

Additional info:

    

Description of problem:

 [must-gather] should collect the 3rd (optional) CSI driver operators' clustercsidriver resources

Version-Release number of selected component (if applicable):

Client Version: 4.18.0-ec.3
Kustomize Version: v5.4.2
Server Version: 4.18.0-0.nightly-2024-11-05-163516    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an Openshift cluster on azure.
    2. Deploy the smb csi driver operator and create the cluster csidriver.
    3. Use oc adm must-gather  --dest-dir=./gather-test command gather the cluster info.     

Actual results:

  In step 3 the gathered data does not contain the clustercsidriver smb.csi.k8s.io object

$ wangpenghao@pewang-mac  ~  omc get clustercsidriver
NAME                 AGE
disk.csi.azure.com   3h
file.csi.azure.com   3h
 wangpenghao@pewang-mac  ~  oc get clustercsidriver
NAME                 AGE
disk.csi.azure.com   4h45m
efs.csi.aws.com      40m
file.csi.azure.com   4h45m
smb.csi.k8s.io       4h13m

 wangpenghao@pewang-mac  ~  ls -l ~/gather-test/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2a3aa11d261a312215bcba80827ab6c75527f44d1ebde54958e7b7798673787c/cluster-scoped-resources/operator.openshift.io/clustercsidrivers
total 32
-rwxr-xr-x@ 1 wangpenghao  staff  7191 Nov  6 13:55 disk.csi.azure.com.yaml
-rwxr-xr-x@ 1 wangpenghao  staff  7191 Nov  6 13:55 file.csi.azure.com.yaml  

Expected results:

 In step 3 the gathered data should contain the clustercsidriver smb.csi.k8s.io object    

Additional info:

AWS EFS and GCP Filestore also have the same issue.

Description of problem:

The fallback default cert in the router, default_pub_keys.pem, uses SHA1 and fails to load when none of DEFAULT_CERTIFICATE, DEFAULT_CERTIFICATE_PATH, or DEFAULT_CERTIFICATE_DIR is specified on the router deployment.

This isn't an active problem for our supported router scenarios because default_pub_keys.pem is never used, since DEFAULT_CERTIFICATE_DIR is always specified. But it does impact E2E testing, such as when we create router deployments with no default cert: the router attempts to load default_pub_keys.pem, which HAProxy now fails on because it's SHA1.

So this is both a completeness fix and a fix to help make E2E tests simpler in origin.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    100%

Steps to Reproduce:

    1. openssl x509 -in ./images/router/haproxy/conf/default_pub_keys.pem  -noout -text     

Actual results:

...
    Signature Algorithm: sha1WithRSAEncryption
...    

Expected results:

...
    Signature Algorithm: sha256WithRSAEncryption
...

Additional info:

    
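If the bundled fallback cert ever needs regenerating with SHA-256, something along these lines would do it (a sketch; the subject, key size, validity, and the key+cert layout of the bundled file are assumptions):

openssl req -x509 -newkey rsa:2048 -sha256 -nodes -days 3650 \
  -subj "/CN=localhost" -keyout /tmp/default_pub_keys.key -out /tmp/default_pub_keys.crt
cat /tmp/default_pub_keys.key /tmp/default_pub_keys.crt > default_pub_keys.pem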

Description of problem:

    Some codeRefs in console are broken, and runtime errors will occur if these codeRefs are loaded.

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    Always

Steps to Reproduce:

    1. Import every codeRef used by console
    2. Observe runtime errors if any
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

    

Description of problem

Since 4.8, cvo#431 has the CVO checking to see whether the requested image's version matches the requested version, and erroring out if they don't match. For example, asking a 4.17.12 cluster to move to a 4.17.13 pullspec but claiming it will have a 4.17.99 version:

oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"image": "quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd", "version": "4.17.99"}}]'

fails with ReleaseAccepted=False:

$ oc adm upgrade
Cluster version is 4.17.12

ReleaseAccepted=False

  Reason: VerifyPayloadVersion
  Message: Verifying payload failed version="4.17.99" image="quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd" failure=release image version 4.17.13 does not match the expected upstream version 4.17.99

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.17 (available channels: candidate-4.17, candidate-4.18, fast-4.17)

Recommended updates:

  VERSION     IMAGE
  4.17.13     quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd

API godocs should be updated to accurately explain this behavior.
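For comparison, a desiredUpdate whose version matches the payload is accepted; a sketch using the same pullspec:

oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"image": "quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd", "version": "4.17.13"}}]'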

Version-Release number of selected component

The current CVO behavior dates back to 4.8, cvo#431. The current (and incorrect) API godocs date back to 4.13, api#1339.

How reproducible

Every time.

Steps to Reproduce

1. View ClusterVersion API docs, e.g. 4.17's.

Actual results

See strings like "version is ignored if image is specified".

Expected results

Have the actual cluster-version operator behavior accurately described.

Additional info

cvo#431 landing in 4.8:

cvo$ git diff origin/release-4.7..origin/release-4.8 -- pkg/cvo/sync_worker.go | grep -2 FIXME
                        err = fmt.Errorf("release image version %s does not match the expected upstream version %s", payloadUpdate.Release.Version, work.Desired.Version)
                        w.eventRecorder.Eventf(cvoObjectRef, corev1.EventTypeWarning, "VerifyPayloadVersionFailed", "verifying payload failed version=%q image=%q failure=%v", work.Desired.Version, work.Desired.Image, err)
-                       /* FIXME: Ignore for now.  I will make this fatal in a follow-up pivot
                        reporter.Report(SyncWorkerStatus{
                                Generation:  work.Generation,

api#1339 landing in 4.13:

api$ git diff origin/release-4.12..origin/release-4.13 -- config/v1/types_cluster_version.go | grep 'version is ignored if image is specified'
+       // version is ignored if image is specified and required if

Description of problem:

CUDN cannot be created through the UI if there is no existing UDN

Version-Release number of selected component (if applicable):

4.18 rc.6

How reproducible:

Always

Steps to Reproduce:

1. Go to a fresh cluster without any existing UDNs
2. Go to UserDefinedNetworks menu

Actual results:

It only allows the user to create UDN

Expected results:

It should allow both UDN and CUDN

Additional info:

 

Description of problem:

console-demo-plugin adds a `Demo Plugin` tab to the project resource. Every time we click on the `Demo Plugin` tab, it appends a /demo-plugin suffix to the URL path; in this situation visiting any tab just returns an empty page since the URL is invalid.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-02-11-161912    

How reproducible:

Always    

Steps to Reproduce:

1. deploy console-demo-plugin manifests and enable it
$ oc apply -f https://raw.githubusercontent.com/openshift/console/refs/heads/master/dynamic-demo-plugin/oc-manifest.yaml
2. Navigate to any project details page, such as openshift-console
3. switch between tabs including 'Demo Plugin' tab
    

Actual results:

3. After we click on 'Demo Plugin', navigating through any project tab just returns an empty page

Expected results:

3. return correct page    

Additional info:

    

Description of problem:

On azure (or vsphere) TP clusters, upgrade failed from 4.15.0-rc.5 -> 4.15.0-rc.7 or 4.15.0-rc.4 -> 4.15.0-rc.5, stuck on cluster-api.
This seems to only happen on platforms that don't support CAPI; it couldn't be reproduced on aws and gcp.

Version-Release number of selected component (if applicable):

    4.15.0-rc.5-> 4.15.0-rc.7 or 4.15.0-rc.4-> 4.15.0-rc.5

How reproducible:

    always

Steps to Reproduce:

    1.Build a TP cluster 4.15.0-rc.5 on azure(or vsphere)
    2.Upgrade to 4.15.0-rc.7     
    3.
    

Actual results:

Upgrade stuck in cluster-api. 
must-gather: https://drive.google.com/file/d/12ykhEVZvqY_0eNdLwJOWFSxTSdQQrm_y/view?usp=sharing

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-rc.5   True        True          82m     Working towards 4.15.0-rc.7: 257 of 929 done (27% complete), waiting on cluster-api

I0222 04:53:18.733907       1 sync_worker.go:1134] Update error 198 of 929: ClusterOperatorUpdating Cluster operator cluster-api is updating versions (*errors.errorString: cluster operator cluster-api is available and not degraded but has not finished updating to target version)
E0222 04:53:18.733944       1 sync_worker.go:638] unable to synchronize image (waiting 2m44.892272217s): Cluster operator cluster-api is updating versions

$ oc get co     
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-rc.5   True        False         False      99m
baremetal                                  4.15.0-rc.5   True        False         False      123m
cloud-controller-manager                   4.15.0-rc.7   True        False         False      128m
cloud-credential                           4.15.0-rc.5   True        False         False      135m
cluster-api                                4.15.0-rc.5   True        False         False      124m
cluster-autoscaler                         4.15.0-rc.5   True        False         False      123m
config-operator                            4.15.0-rc.7   True        False         False      124m
console                                    4.15.0-rc.5   True        False         False      101m
control-plane-machine-set                  4.15.0-rc.7   True        False         False      113m
csi-snapshot-controller                    4.15.0-rc.5   True        False         False      112m
dns                                        4.15.0-rc.5   True        False         False      115m
etcd                                       4.15.0-rc.7   True        False         False      122m
image-registry                             4.15.0-rc.5   True        False         False      107m
ingress                                    4.15.0-rc.5   True        False         False      106m
insights                                   4.15.0-rc.5   True        False         False      118m
kube-apiserver                             4.15.0-rc.7   True        False         False      108m
kube-controller-manager                    4.15.0-rc.7   True        False         False      121m
kube-scheduler                             4.15.0-rc.7   True        False         False      120m
kube-storage-version-migrator              4.15.0-rc.5   True        False         False      115m
machine-api                                4.15.0-rc.7   True        False         False      111m
machine-approver                           4.15.0-rc.5   True        False         False      124m
machine-config                             4.15.0-rc.5   True        False         False      121m
marketplace                                4.15.0-rc.5   True        False         False      123m
monitoring                                 4.15.0-rc.5   True        False         False      106m
network                                    4.15.0-rc.5   True        False         False      126m
node-tuning                                4.15.0-rc.5   True        False         False      112m
olm                                        4.15.0-rc.5   True        False         False      106m
openshift-apiserver                        4.15.0-rc.5   True        False         False      115m
openshift-controller-manager               4.15.0-rc.5   True        False         False      115m
openshift-samples                          4.15.0-rc.5   True        False         False      111m
operator-lifecycle-manager                 4.15.0-rc.5   True        False         False      123m
operator-lifecycle-manager-catalog         4.15.0-rc.5   True        False         False      123m
operator-lifecycle-manager-packageserver   4.15.0-rc.5   True        False         False      112m
platform-operators-aggregated              4.15.0-rc.5   True        False         False      73m
service-ca                                 4.15.0-rc.5   True        False         False      124m
storage                                    4.15.0-rc.5   True        False         False      107m  

Expected results:

Upgrade is successful

Additional info:

 Upgrade succeeded from 4.15.0-rc.3 -> 4.15.0-rc.4.

Description of problem:

Following a 4.12.53 > 4.13.48 > 4.14.35 cluster upgrade path, a customer scaled up one of their cluster's MachineSets, at which point they spotted the following error:

"ocp-lmwfc-infra-westeurope3b-8c48v: reconciler failed to Create machine: error launching instance: googleapi: Error 400: Invalid value for field 'resource.shieldedInstanceConfig': '{ "enableVtpm": true, "enableIntegrityMonitoring": true}'. Shielded VM Config can only be set when using a UEFI-compatible disk., invalid"

At that point they noticed the following new parameters in their MachineSet: `.spec.template.spec.providerSpec.value.shieldedInstanceConfig`

The above seems to be related to commit 8bc61bd, introduced in RHOCP 4.13:
- https://github.com/openshift/machine-api-provider-gcp/commit/8bc61bddf5cf01fce2462808afad3ab4e773c13e
- https://issues.redhat.com/browse/OCPSTRAT-632

Version-Release number of selected component (if applicable):

4.14.35

Actual results:

As of now, shieldedInstanceConfig seems to be reconciled automatically into the MachineSet, even when the cluster may be using non-UEFI-compatible disks

Expected results:

shieldedInstanceConfig to only be enabled when the cluster is using UEFI-compatible disks

Additional info:

- The customer worked around this by disabling VTPM & IntegrityMonitoring in their MachineSet shieldedInstanceConfig
- The `compute-api.json` seems to suggest shieldedInstanceConfig is enabled by default (which breaks compatibility with non-UEFI-compatible disks); see the excerpt and the inspection sketch below:
$ curl -s https://raw.githubusercontent.com/openshift/machine-api-provider-gcp/refs/heads/release-4.13/vendor/google.golang.org/api/compute/v1/compute-api.json | sed -n '61048,61066p'
    "ShieldedInstanceConfig": {
      "description": "A set of Shielded Instance options.",
      "id": "ShieldedInstanceConfig",
      "properties": {
        "enableIntegrityMonitoring": {
          "description": "Defines whether the instance has integrity monitoring enabled. Enabled by default.",   <<<----------
          "type": "boolean"
        },
        "enableSecureBoot": {
          "description": "Defines whether the instance has Secure Boot enabled. Disabled by default.",
          "type": "boolean"
        },
        "enableVtpm": {
          "description": "Defines whether the instance has the vTPM enabled. Enabled by default.",   <<<----------
          "type": "boolean"
        }
      },
      "type": "object"
    },
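To inspect what the controller reconciled into a given MachineSet, the field path quoted above can be checked directly (the MachineSet name is a placeholder):

oc get machineset <machineset-name> -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.providerSpec.value.shieldedInstanceConfig}{"\n"}'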

Description of problem:

While creating the provisioning resource on a 4.19 UPI cluster, the metal3 pods fail to come up because the init container machine-os-images fails.
    

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-arm64-2025-02-14-044543
    

How reproducible:

Always on arm64 clusters
    

Steps to Reproduce:

    1. On a UPI cluster create provisioning resource
    2. Check the pods being created on openshift-machine-api namespace
    

Actual results:

$ cat provisioning-upi.yaml 
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningNetwork: "Disabled"
  watchAllNamespaces: false

$ oc get pods -n openshift-machine-api
NAME                                                 READY   STATUS                  RESTARTS      AGE
cluster-autoscaler-operator-fd96c97bf-psbt7          2/2     Running                 1 (82m ago)   91m
cluster-baremetal-operator-fb7cbbb65-dp7zb           2/2     Running                 1 (82m ago)   92m
control-plane-machine-set-operator-68596d754-ckfwq   1/1     Running                 1 (82m ago)   91m
ironic-proxy-j7whx                                   1/1     Running                 0             28s
ironic-proxy-nchz6                                   1/1     Running                 0             28s
ironic-proxy-wjgj6                                   1/1     Running                 0             28s
machine-api-operator-7dcd67759b-nqnv8                2/2     Running                 0             92m
metal3-5d77cd875-xpcsf                               0/3     Init:CrashLoopBackOff   1 (6s ago)    34s
metal3-baremetal-operator-9b465b848-j2ztt            1/1     Running                 0             33s
metal3-image-customization-767bd55db5-49pnn          0/1     Init:CrashLoopBackOff   1 (5s ago)    29s

$ oc logs -f metal3-image-customization-767bd55db5-49pnn machine-os-images -n openshift-machine-api
extracting PXE files...
/shared/html/images/coreos-aarch64-initrd.img
/shared/html/images/coreos-aarch64-rootfs.img
/shared/html/images/coreos-aarch64-vmlinuz
 
gzip: /shared/html/images/coreos-aarch64-vmlinuz.gz: not in gzip format
    

Expected results:

All 3 metal pods should be in Running state
    

QE has testing for this which detected OCPBUGS-43357, but we should write our own test and verify this in our e2e as well.

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-telco5g-cnftests/1863644602574049280 and 4.18-e2e-telco5g/1863677472923455488

Undiagnosed panic detected in pod
{  pods/openshift-kube-apiserver_kube-apiserver-cnfdu3-master-1_kube-apiserver.log.gz:E1202 22:12:02.806740      12 audit.go:84] "Observed a panic" panic="context deadline exceeded" panicGoValue="context.deadlineExceededError{}" stacktrace=<}
Undiagnosed panic detected in pod
{  pods/openshift-kube-apiserver_kube-apiserver-cnfdu11-master-2_kube-apiserver.log.gz:E1202 22:11:42.359004      14 timeout.go:121] "Observed a panic" panic=<}

Description of problem:

    1 Clients cannot connect to the kube-apiserver via the kubernetes svc, because the kubernetes svc IP is not in the cert SANs (a quick check is sketched below)
    2 The kube-apiserver-operator generates the apiserver certs and inserts the kubernetes svc IP from the network CR status.ServiceNetwork
    3 When the temporary control plane is down and the network CR is not ready yet, clients will not be able to connect to the apiserver again
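A sketch for checking whether the service IP is present in the serving cert's SANs (the IP shown is the usual first address of the default service network and is an assumption):

echo | openssl s_client -connect 172.30.0.1:443 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'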

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. I have just met this for very rare conditions, especially when the machine performance is poor     
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After adding any directive to the ConsolePlugin CR, a hard refresh is required for the changes to actually take effect, but we are not getting a refresh popover for this.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Enable feature gate (CSP feature is behind the FG in 4.18).
    2. Add "DefaultSrc" directive to any ConsolePlugin CR.
    3.
    

Actual results:

    No refresh popover getting displayed, we need to manually refresh for the changes to get reflected.

Expected results:

    No manual refresh. An automatic popover should be rendered.

Additional info:

    ex: https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md#example
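For context, a sketch of step 2 (the contentSecurityPolicy field layout follows the linked enhancement example and should be treated as an assumption; the plugin name and value are illustrative):

cat <<'EOF' > /tmp/csp-patch.yaml
spec:
  contentSecurityPolicy:
    - directive: DefaultSrc
      values:
        - https://assets.example.com
EOF
oc patch consoleplugin <plugin-name> --type merge --patch-file /tmp/csp-patch.yaml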

Description of problem:

When there is only one console plugin within an operator, we don't display the console plugin name on the CSV details page; we only show an `Enabled` or `Disabled` button.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-29-211757    

How reproducible:

Always    

Steps to Reproduce:

1. Go to the Operators -> OperatorHub page, find 'Node Health Check Operator' and install it; enable or disable the associated console plugin during the operator installation process (either option is OK)
2. Check CSV details page     

Actual results:

2. We only show an 'Enabled' or 'Disabled' button in the Console plugin section

Expected results:

2. We should also display the plugin name regardless of the plugin count; otherwise the user has no idea which plugin is being enabled or disabled

Additional info:

    

The kind folks at Pure Storage tell us that if customers upgrade to 4.18 without the following patch, issues will occur in CSI migration.

Kube 1.31 backport https://github.com/kubernetes/kubernetes/pull/129675

Master branch PR with the full issue description and testing procedure:
https://github.com/kubernetes/kubernetes/pull/129630

Description of problem:

While creating a project and moving to the network step, after selecting "Refer an existing ClusterUserDefinedNetwork", the title "Project name" is not correct; it should be "UserDefinedNetwork name".

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Go to Nodes > Node details > Logs and click one of the Selects above the log and then click outside the menu. Note the menu does not close but should.

Description of problem:

    The audit-logs container was adjusted previously to handle SIGTERM using a trap and a cleanup function. However, due to the behavior described at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/#container-v1-core, the `$$` used to grab the PID is not properly interpreted. This issue is to fix the script to handle this behavior correctly and make the cleanup consistent with the changes to the apply-bootstrap container for a similar ignored-SIGTERM issue.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a hypershift cluster with auditing enabled
    2. Delete apiserver pods and observe the script does not correctly handle sigterm.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    "destroy cluster" doesn't delete the PVC disks which have the label "kubernetes-io-cluster-<infra-id>: owned"

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-11-27-162629

How reproducible:

    Always

Steps to Reproduce:

1. include the step which sets the cluster default storageclass to the hyperdisk one before ipi-install (see my debug PR https://github.com/openshift/release/pull/59306)
2. "create cluster", and make sure it succeeds
3. "destroy cluster"

Note: although we confirmed the issue with disk type "hyperdisk-balanced", we believe other disk types have the same issue. 

Actual results:

    The 2 PVC disks of hyperdisk-balanced type are not deleted during "destroy cluster", although the disks have the label "kubernetes-io-cluster-<infra-id>: owned".

Expected results:

    The 2 PVC disks should be deleted during "destroy cluster", because they have the correct/expected labels according to which the uninstaller should be able to detect them. 

Additional info:

    FYI the PROW CI debug job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59306/rehearse-59306-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.18-installer-rehearse-debug/1861958752689721344
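For reference, a quick way to list leftover disks by that label from the gcloud side (a sketch; project/zone flags are omitted, and the label key is taken from the description above):

gcloud compute disks list --filter="labels.kubernetes-io-cluster-<infra-id>=owned"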

Description of problem:


When setting up the "webhookTokenAuthenticator", the OAuth config "type" is set to "None".
The controller then sets the console configmap with "authType=disabled", which causes the console pod to go into a crash loop due to the disallowed type:

Error:
validate.go:76] invalid flag: user-auth, error: value must be one of [oidc openshift], not disabled.

This worked on 4.14 and stopped working on 4.15.

    

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15
    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The console can't start; it seems this configuration is not allowed for the console.
    

Expected results:


    

Additional info:


    

Description of problem:

Backport https://github.com/prometheus/prometheus/pull/15723
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

We merged this ART PR which bumps base images. And then the bumper reverted the changes here: https://github.com/openshift/operator-framework-operator-controller/pull/88/files.

I still see the ART bump commit in main, but there is an "Add OpenShift specific files" commit on top of it with older images. Actually, we now have two "Add OpenShift specific files" commits in main:

And every UPSTREAM: <carry>-prefixed commit seems to be duplicated on top of the synced changes.

Expected result:

  • Bumper doesn't override/revert UPSTREAM: <carry>-prefixed commit contributed directly into the downstream repos. Order of UPSTREAM: <carry>-prefixed commits should be respected.

Description of problem:

    Currently check-patternfly-modules.sh checks them serially, which could be improved by checking them in parallel. 

Since yarn why does not write to anything, this should be easily parallelizable as there is no race condition with writing back to the yarn.lock
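A minimal sketch of the parallel variant (assuming the script only needs the exit status of each `yarn why`; the module names are illustrative):

pids=()
for pkg in "@patternfly/react-core" "@patternfly/react-table" "@patternfly/react-icons"; do
  yarn why "$pkg" > /dev/null 2>&1 &
  pids+=($!)
done
rc=0
for pid in "${pids[@]}"; do
  wait "$pid" || rc=1
done
exit $rc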

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The StaticPodOperatorStatus API validations permit:
- nodeStatuses[].currentRevision can be cleared and can decrease
- more than one entry in nodeStatuses can have a targetRevision > 0
But both of these signal a bug in one or more of the static pod controllers that write to them.

Version-Release number of selected component (if applicable):

This has been the case ~forever but we are aware of bugs in 4.18+ that are resulting in controllers trying to make these invalid writes. We also have more expressive validation mechanisms today that make it possible to plug the holes.

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

 Bugs in 4.18+ are resulting in some static pod node/installer controllers trying to make these invalid write requests.
    

Expected results:

Add validation rules to help catch these invalid writes.
    

Additional info:

    

Description of problem:

The security job is failing on a new test added in October. I'm not sure we actually need to worry about it since we don't deal with user input so it may not be exploitable, but I think just bumping our logrus module would fix it so we should probably just do that.

Version-Release number of selected component (if applicable):

4.19, but possibly other branches were the security job is enabled too.

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Our e2e setup runs `go install` for a few packages with the `@latest` tag. `go install` does not take `go.mod` into consideration, so on older branches we can pull package versions not compatible with the system Go version.

Version-Release number of selected component (if applicable):

    All branches using Go < 1.23

How reproducible:

    Always on branches <= 4.18

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

                ./test/e2e/e2e-simple.sh ././bin/oc-mirror
                /go/src/github.com/openshift/oc-mirror/test/e2e/operator-test.17343 /go/src/github.com/openshift/oc-mirror
                go: downloading github.com/google/go-containerregistry v0.20.3
                go: github.com/google/go-containerregistry/cmd/crane@latest: github.com/google/go-containerregistry@v0.20.3 requires go >= 1.23.0 (running go 1.22.9; GOTOOLCHAIN=local)             
                /go/src/github.com/openshift/oc-mirror/test/e2e/lib/util.sh: line 17: PID_DISCONN: unbound variable
              
    
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc-mirror/1006/pull-ci-openshift-oc-mirror-release-4.18-e2e/1879913390239911936

Expected results:

    The package version selected is compatible with the system Go version.
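A sketch of the pinning approach (the exact crane tag is an assumption; the point is to pick one whose go.mod does not require a newer Go than the branch toolchain):

CRANE_VERSION=v0.20.2   # assumed: a tag that predates the go >= 1.23.0 requirement
go install "github.com/google/go-containerregistry/cmd/crane@${CRANE_VERSION}"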

Additional info:

    

Description of problem:

    While upgrading the Fusion operator, the IBM team is facing the following error in the operator's subscription:
error validating existing CRs against new CRD's schema for "fusionserviceinstances.service.isf.ibm.com": error validating service.isf.ibm.com/v1, Kind=FusionServiceInstance "ibm-spectrum-fusion-ns/odfmanager": updated validation is too restrictive: [].status.triggerCatSrcCreateStartTime: Invalid value: "number": status.triggerCatSrcCreateStartTime in body must be of type integer: "number"


The question here: "triggerCatSrcCreateStartTime" has been present in the operator for the past few releases and its datatype (integer) hasn't changed in the latest release either. There was one "FusionServiceInstance" CR present in the cluster when this issue was hit, and the value of the "triggerCatSrcCreateStartTime" field was "1726856593000774400".

Version-Release number of selected component (if applicable):

    It impacts the upgrade between OCP 4.16.7 and OCP 4.16.14.

How reproducible:

    Always

Steps to Reproduce:

    1.Upgrade the fusion operator ocp version 4.16.7 to ocp 4.16.14
    2.
    3.
    

Actual results:

    Upgrade fails with error in description

Expected results:

    The upgrade should not fail

Additional info:

    

Description of problem:

The customer is trying to install a self-managed OCP cluster in AWS. This customer uses an AWS VPC DHCP option set that has a trailing dot (.) at the end of the domain name. Due to this setting, the master nodes' hostnames also have a trailing dot, and this causes the OpenShift installation to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

1. Create an AWS VPC with a DHCPOptionSet whose domain name ends with a trailing dot (see the sketch below).
2. Try an IPI installation of the cluster.
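
A minimal sketch of such a DHCP options set (the domain name and IDs are placeholders); note the trailing dot:

    $ aws ec2 create-dhcp-options \
        --dhcp-configurations "Key=domain-name,Values=example.internal."
    $ aws ec2 associate-dhcp-options --dhcp-options-id <dhcp-options-id> --vpc-id <vpc-id>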

Actual results:

    The installation fails because the master node hostnames include the trailing dot.

Expected results:

    The installer should handle (or strip) the trailing dot in the domain name so that AWS master nodes can be created.

Additional info:

    

Description of problem:

    Some bundles in the catalog have been given the property in the FBC (and not in the bundle's CSV), and it does not get propagated through to the Helm chart annotations.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Install elasticsearch 5.8.13

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    cluster is upgradeable

Expected results:

    cluster is not upgradeable

Additional info:

    

Description of problem:

With balance-slb and nmstate a node got stuck on reboot.

[root@master-1 core]# systemctl list-jobs
JOB UNIT                                 TYPE  STATE
307 wait-for-br-ex-up.service            start running
341 afterburn-checkin.service            start waiting
187 multi-user.target                    start waiting
186 graphical.target                     start waiting
319 crio.service                         start waiting
292 kubelet.service                      start waiting
332 afterburn-firstboot-checkin.service  start waiting
306 node-valid-hostname.service          start waiting
293 kubelet-dependencies.target          start waiting
321 systemd-update-utmp-runlevel.service start waiting


systemctl status wait-for-br-ex-up.service
Dec 10 20:11:39 master-1.ostest.test.metalkube.org systemd[1]: Starting Wait for br-ex up event from NetworkManager...

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-04-113014

How reproducible:

Sometimes

Steps to Reproduce:

1. create nmstate config

interfaces:
- name: bond0
  type: bond
  state: up
  copy-mac-from: eno2
  ipv4:
    enabled: false
  link-aggregation:
    mode: balance-xor
    options:
      xmit_hash_policy: vlan+srcmac
      balance-slb: 1
    port:
    - eno2
    - eno3
- name: br-ex
  type: ovs-bridge
  state: up
  ipv4:
    enabled: false
    dhcp: false
  ipv6:
    enabled: false
    dhcp: false
  bridge:
    port:
    - name: bond0
    - name: br-ex
- name: br-ex
  type: ovs-interface
  state: up
  copy-mac-from: eno2
  ipv4:
    enabled: true
    address:
    - ip: "192.168.111.111"
      prefix-length: 24
  ipv6:
    enabled: false
    dhcp: false
- name: eno1
  type: interface
  state: up
  ipv4:
    enabled: false
  ipv6:
    enabled: false
dns-resolver:
  config:
    server:
    - 192.168.111.1
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.111.1
    next-hop-interface: br-ex

2. reboot
3.

Actual results:

systemctl status wait-for-br-ex-up.service
Dec 10 20:11:39 master-1.ostest.test.metalkube.org systemd[1]: Starting Wait for br-ex up event from NetworkManager...

bond0 fails, network is in odd state

[root@master-1 core]# ip -c a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 90:e2:ba:ca:9f:28 brd ff:ff:ff:ff:ff:ff
    altname enp181s0f0
    inet6 fe80::92e2:baff:feca:9f28/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 30:d0:42:56:66:bb brd ff:ff:ff:ff:ff:ff
    altname enp23s0f0
4: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 90:e2:ba:ca:9f:29 brd ff:ff:ff:ff:ff:ff
    altname enp181s0f1
    inet6 fe80::92e2:baff:feca:9f29/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 30:d0:42:56:66:bc brd ff:ff:ff:ff:ff:ff
    altname enp23s0f1
6: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 30:d0:42:56:66:bd brd ff:ff:ff:ff:ff:ff
    altname enp23s0f2
    inet 192.168.111.34/24 brd 192.168.111.255 scope global dynamic noprefixroute eno3
       valid_lft 3576sec preferred_lft 3576sec
    inet6 fe80::32d0:42ff:fe56:66bd/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
7: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 30:d0:42:56:66:be brd ff:ff:ff:ff:ff:ff
    altname enp23s0f3
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 56:92:14:97:ed:10 brd ff:ff:ff:ff:ff:ff
9: ovn-k8s-mp0: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether ae:b9:9e:dc:17:d1 brd ff:ff:ff:ff:ff:ff
10: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether e6:68:4d:df:e0:bd brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e468:4dff:fedf:e0bd/64 scope link
       valid_lft forever preferred_lft forever
11: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
    link/ether 32:5b:1f:35:ce:f5 brd ff:ff:ff:ff:ff:ff
12: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue master ovs-system state DOWN group default qlen 1000
    link/ether aa:c8:8c:e3:71:aa brd ff:ff:ff:ff:ff:ff
13: bond0.104@bond0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master ovs-system state LOWERLAYERDOWN group default qlen 1000
    link/ether aa:c8:8c:e3:71:aa brd ff:ff:ff:ff:ff:ff
14: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 30:d0:42:56:66:bd brd ff:ff:ff:ff:ff:ff
    inet 192.168.111.111/24 brd 192.168.111.255 scope global noprefixroute br-ex
       valid_lft forever preferred_lft forever

   

Expected results:

System reboots correctly.

Additional info:

br-ex up/down re-generates the event

[root@master-1 core]# nmcli device down br-ex ; nmcli device up br-ex

Description of problem:

Customers need to be able to configure the DNS nameservers for the OpenStack subnet created by Hypershift (through Cluster API Provider for OpenStack). Without that, the default subnet wouldn't have DNS nameservers and resolution can fail in some environments.

Version-Release number of selected component (if applicable):

4.19, 4.18

How reproducible:

In default RHOSO 18 we don't have DNS forwarded to the DHCP agent so we need to set the DNS nameservers in every subnet that is created.

Description of problem:

For the PowerVS IPI CAPI installer, creating an OpenShift cluster in the Madrid zone will fail because it cannot import the RHCOS image.  This is due to not using the correct bucket name.
    

Description of problem:

Using the same registry, oc and oc-mirror 4.16 versions but different installer versions in different directory structures, 4.16 returns the following error while generating the ISO file while 4.14 does not:

DEBUG Using internal constant for release image quay.io/openshift-release-dev/ocp-release@sha256:0e71cb61694473b40e8d95f530eaf250a62616debb98199f31b4034808687dae 
ERROR Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:0e71cb61694473b40e8d95f530eaf250a62616debb98199f31b4034808687dae -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config2248417039]' exited with non-zero exit code 1:  
ERROR error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:0e71cb61694473b40e8d95f530eaf250a62616debb98199f31b4034808687dae: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 
ERROR                                              
WARNING Unable to validate the release image architecture, skipping validation

This applies to disconnected environments. It doesn't affect creating the image; it's more of a cosmetic error. The ISO file is still created and works as expected.

Version-Release number of selected component (if applicable):

openshift-install-linux-4.16

How reproducible:

./openshift-install agent create image --dir=./abi/ --log-level=debug

Steps to Reproduce:

1. Create 2 directory structures ocp4.14 and ocp4.16
2. Get from mirror.openshift.com oc client, oc-mirror for OCP 4.16 and openshift-install for both 4.14 and 4.16
3. Clone catalogs for both versions 4.14 and 4.16 to private registry
4. Use same install-config.yaml and agent-config.yaml for both "openshift-install agent create image" command with different versions

Actual results:

Error shows up for version 4.16

Expected results:

No error should show as in version 4.14

Additional info:

There are logs and config files in linked case.
I'm attaching here the files and logs from my own reproduction.    

Description of problem:

    In OCL, while a new image is being built the MCP is reporting Updating=false.


Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-14-090045

How reproducible:

    Always

Steps to Reproduce:

    1. Enable techpreview
    2. Create a MOSC
    
    

Actual results:

    A builder pod is created, a machineosbuild resource is reporting "building" status but the MCP is reporting Updating=false


    Once the MCO starts applying the image to the nodes, then the MCP starts reporting Updating=true.



    For example, using an infra custom pool we get this scenario


# The MOSB reports a building status
$ oc get machineosbuild
NAME                                          PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
mosc-infra-1adf9d0871a38cfcb8a0a3242fb78269   False      True       False       False         False

# The builder pod is running
$ oc get pods
NAME                                                             READY   STATUS    RESTARTS   AGE
build-mosc-infra-1adf9d0871a38cfcb8a0a3242fb78269                2/2     Running   0          53s
kube-rbac-proxy-crio-ip-10-0-17-74.us-east-2.compute.internal    1/1     Running   8          10h


# But the infra pool is reporting Updating=false

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-8336591613a94060cb7d8a1a8319dc8e    True      False      False      0              0                   0                     0                      62m
master   rendered-master-ae24125af2c010fe672af84ce06153d9   True      False      False      3              3                   3                     0                      10h
worker   rendered-worker-1c0c28ca4046a899927f4417754955c6   True      False      False      2              2                   2                     0                      10h


Expected results:

    The pool should report Updating=true when a new MC is rendered and the MCO starts building a new image.

Additional info:

    Currently our automation for OCL relies on the MCP reporting Updating=true while the new image is being built, so the OCL automation cannot be used until this issue is fixed.

Description of problem:

    see https://issues.redhat.com/browse/OCPBUGS-44111

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

   Reported by customer IHAC, see https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1736514939074049

"The timeout needs to be increased for Nutanix IPI installations using OpenShift versions >= 4.16, as image creation that takes more than 5 minutes will fail. OpenShift versions 4.11-4.15 work as expected with the OpenShift installer because the image creation timeout is set to a value greater than 5 minutes."

Version-Release number of selected component (if applicable):

    

How reproducible:

    In some slow Prism-Central env. (slow network, etc.)

Steps to Reproduce:

    In some slow Prism-Central env. (such as slow network), run the installer (4.16 and later) to create a Nutanix OCP cluster. The installation will fail with timeout when trying to upload the RHCOS image to PC.     

Actual results:

The installation failed with timeout when uploading the RHCOS image to PC.    

Expected results:

The installation successfully creates the OCP cluster.    

Additional info:

  In some slow Prism-Central env. (such as slow network), run the installer (4.16 and later) to create a Nutanix OCP cluster. The installation will fail with a timeout when trying to upload the RHCOS image to PC. 

Description of problem:

https://github.com/openshift/console/commit/4b29cd8d77d4dcbf5cec1cd947f4877bd95bb684#diff-b7cc128ed1e2d7ad6eba033d9e76a4d8794bd1820ec5d132dd05c12bc993fa73L95-L109 removed `logoutOpenShift`, but it is still in use (see https://github.com/openshift/console/blob/7ba2bdcadf64b9e51157cb77b4b284cd6654504d/frontend/public/components/masthead-toolbar.jsx#L559).  Cursory investigation shows `logout` alone will not successfully log out kubeadmin in OpenShift.
time="2025-02-21T22:51:33Z" level=warning msg="unable to log into vCenter vcenter-1.ci.ibmc.devcluster.openshift.com, Post \"https://vcenter-1.ci.ibmc.devcluster.openshift.com/sdk\": tls: failed to verify certificate: x509: certificate signed by unknown authority" 

Description of problem:

Request to add a CLI option in hypershift create cluster to pass user configurable ETCD disk size. 
The CLI allows the user to pass an etcd storage class name but not a disk size; this request is to accept a storage size to override the current default of 8Gi. This disk size won't be sufficient for larger clusters; the ROSA-HCP default is already set to 32Gi. It would be a good addition to make it configurable in the CLI for self-hosted installs.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

hypershift create cluster azure --name hosted-cp --etcd-storage-class premium --etcd-storage-size 32Gi    

Additional info:

    

When the CCO updates a CredentialsRequest's status, the current logs are not clear on what's changing:

time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials

We should make it possible to get the CCO to log the diff it's trying to push, even if that requires bumping the operator's log level to debug. That would make it easier to understand hotloops like OCPBUGS-47505.

Description of problem:

    oVirt support was dropped in 4.13, so the Machine API operator no longer needs to reference the oVirt image, nor does it need to know how to ship it

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

AlertmanagerConfig with missing options causes Alertmanager to crash

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

A cluster administrator has enabled monitoring for user-defined projects.
CMO configmap:

~~~
 config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 7d
~~~

A cluster administrator has enabled alert routing for user-defined projects. 

UWM cm / CMO cm 

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true 
      enableAlertmanagerConfig: true
~~~

verify existing config: 

~~~
$ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093  
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
receivers:
- name: Default
templates: []
~~~

create alertmanager config without options "smtp_from:" and "smtp_smarthost"

~~~
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
    - emailConfigs:
        - to: some.username@example.com
      name: custom-rules1
  route:
    matchers:
      - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~

check logs for alertmanager: the following error is seen. 

~~~
ts=2023-09-05T12:07:33.449Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="no global SMTP smarthost set"
~~~ 
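
As a hedged illustration (pod name and rendered config path taken from the outputs above), the generated configuration can be checked offline with amtool, which surfaces the same error before Alertmanager tries to load it:

~~~
$ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- \
    amtool check-config /etc/alertmanager/config_out/alertmanager.env.yaml
~~~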

Actual results:

Alertmanager fails to restart.

Expected results:

The AlertmanagerConfig CR should be pre-validated.

Additional info:

Reproducible with and without user workload Alertmanager.

The problem that I recently noticed with the existing expression is that when we compute the overall burnrate from write and read requests, we take the burnrate computed over read requests and add it to the one computed over write requests. But both of these ratios are calculated against their respective request type, not the total number of requests. This is only correct when the proportions of write and read requests are equal.

For example, let's imagine a scenario where 40% of requests are write requests and their success rate during a disruption is only 50%, while for read requests 1 out of 6 fails (roughly 83% success).

apiserver_request:burnrate1h{verb="write"} would be equal to 2/4 and apiserver_request:burnrate1h{verb="read"} would be 1/6.

The sum of these, as computed by the alert today, would be 2/4 + 1/6 = 2/3, when in reality the overall failure ratio is 2/10 + 1/10 = 3/10. So there is quite a big difference today when we don't account for the total number of requests.

Description of problem:

When the user runs the oc-mirror delete command with `--force-cache-delete=true` after a (M2D + D2M) for catalog operators, it only deletes the manifests in the local cache and does not delete the blobs, which is not expected. From the help information, we should also delete the blobs for catalog operators:
--force-cache-delete        Used to force delete  the local cache manifests and blobs 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}      

How reproducible:

     Always
    

Steps to Reproduce:

    1. Using follow imagesetconfig to do mirror2disk+disk2mirror:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest                        
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  operators:
  - catalog: oci:///test/redhat-operator-index
    packages:
    - name: aws-load-balancer-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator

2. Generate delete file :
cat delete.yaml 
kind: DeleteImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
delete:
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest                        
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  operators:
  - catalog: oci:///test/redhat-operator-index
    packages:
    - name: aws-load-balancer-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator  

3. execute the delete with --force-cache-delete=true
`oc-mirror delete --v2 --delete-yaml-file  out/working-dir/delete/delete-images.yaml --force-cache-delete=true docker://localhost:5000 --dest-tls-verify=false`     

Actual results:

3. Check the local cache, didn't see any blobs deleted. 

Expected results:

 3. Not only should the manifests for the catalog operators be deleted, the blobs should also be deleted. 
    

Additional info:

    This error is resolved upon using  --src-tls-verify=false with the oc-mirror delete --generate command
   More details in the slack thread here https://redhat-internal.slack.com/archives/C050P27C71S/p1722601331671649?thread_ts=1722597021.825099&cid=C050P27C71S
    

Also, when --force-cache-delete is true, the logs show some logs coming from the registry.

Description of problem:

    Root issue reported in: https://issues.redhat.com/browse/RHEL-70334

Version-Release number of selected component (if applicable):

    (4.18, 4.17, 4.16 as reported)

How reproducible:

Difficult. Relies on testing as mentioned in RHEL-70344    

Additional info:

    We're pending a downstream sync of the upstream repositories, and we'll want to account for this refactor eventually as it stands, so, 4.19 is likely a good place to bring it in, and see if it improves testing results in RHEL-70344

Description of problem:

The plugin names are shown as {{plugin}} on the Operator details page

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-26-075648

How reproducible:

Always    

Steps to Reproduce:

1. Prepare an operator that has the `console.openshift.io/plugins` annotation, or create a catalogsource with image quay.io/openshifttest/dynamic-plugin-oprs:latest
        annotations:
          alm-examples: 'xxx'
          console.openshift.io/plugins: '["prometheus-plugin1", "prometheus-plugin2"]' 
2. Install operator, on operator installation page, choose Enable or Disable associated plugins
3. check Operator details page
    

Actual results:

2. on Operator installation page, associated plugin names are correctly shown
3. There is a Console plugins section on the Operator details page; in this section all plugin names are shown as {{plugin}}   

Expected results:

3. The plugin names associated with the operator should be correctly displayed    

Additional info:

    

Description of problem:

    MCO fails to roll out the imagepolicy configuration when ImagePolicy objects exist in different namespaces

Version-Release number of selected component (if applicable):

    

How reproducible:

Create ImagePolicy for testnamespace and mynamespace

apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: p1
  namespace: testnamespace
spec:
  scopes:
  - example.com/global/image
  - example.com
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFVW9GVW9ZQVJlS1hHeTU5eGU1U1FPazJhSjhvKwoyL1l6NVk4R2NOM3pGRTZWaUl2a0duSGhNbEFoWGFYL2JvME05UjYyczAvNnErK1Q3dXdORnVPZzhBPT0KLS0tLS1FTkQgUFVCTElDIEtFWS0tLS0t
    signedIdentity:
      matchPolicy: ExactRepository
      exactRepository:
        repository: example.com/foo/bar

apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: p2
  namespace: mynamespace
spec:
  scopes:
  - registry.namespacepolicy.com
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: Zm9vIGJhcg==
    signedIdentity:
      matchPolicy: ExactRepository
      exactRepository:
        repository: example.com/foo/bar

Steps to Reproduce:

    1. Create the namespace testnamespace and the first ImagePolicy
    2. Create the second namespace and the second ImagePolicy

    

Actual results:

    only the first imagepolicy got rolled out
machineconfig controller log error:  
$ oc logs -f machine-config-controller-c997df58b-9dk8t  
I0108 23:05:09.141699       1 container_runtime_config_controller.go:499] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update namespace policy JSON from imagepolicy: error decoding policy json for namespaced policies: EOF

Expected results:

    both /etc/crio/policies/mynamespace.json and /etc/crio/policies/testnamespace.json created
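
A minimal sketch (the node name is a placeholder) for checking which namespaced policy files were actually rendered on a node:

$ oc debug node/<node-name> -- chroot /host ls -l /etc/crio/policies/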

Additional info:

    

Description of problem:

  OWNERS file updated to include prabhakar and Moe as owners and reviewers

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    This is to facilitate easy backports via automation

Description of problem:

Our carry patch intended to retry retriable requests that fail due to leader change will retry any etcd error with code "Unavailable": https://github.com/openshift/kubernetes/blob/4b2db1ec33faa3ffc305e5ffa7376908cc955370/staging/src/k8s.io/apiserver/pkg/storage/etcd3/etcd3retry/retry_etcdclient.go#L135-L145, but this includes reasons like "timeout" and does not distinguish between writes and reads. So a "timeout" error on a writing request might be retried even though a "timeout" observed by a client does not indicate that the effect of the write has not been persisted.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When deploying a disconnected cluster, creating the ISO with "openshift-install agent create image" fails (authentication required) when the release image resides in a secured local registry.
The actual issue is this:
openshift-install generates a registry config out of the install-config.yaml that contains only the local registry credentials (disconnected deploy), but it does not create an ICSP file to pull the release image from the local registry.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Run an agent-based ISO image creation for a disconnected cluster. Choose a version (nightly) where the image is in a secured registry (such as registry.ci). It will fail with "authentication required".

Steps to Reproduce:

    1. openshift-install agent create image
    2.
    3.
    

Actual results:

Fails with "authentication required"    

Expected results:

    The ISO should be created

Additional info:

    

CBO-installed Ironic unconditionally has TLS, even though we don't do proper host validation just yet (see bug OCPBUGS-20412). Ironic in the installer does not use TLS (mostly for historical reasons). Now that OCPBUGS-36283 added a TLS certificate for virtual media, we can use the same for Ironic API. At least initially, it will involve disabling host validation for IPA.

ISSUE:
The cluster storage operator is in a degraded state because it is unable to find the UUID for the Windows node.

DESCRIPTIONS:
The customer has one Windows node in the OCP environment, which is installed on vSphere. The storage CO is in a degraded state with the following error:
~~~
'VSphereCSIDriverOperatorCRDegraded: VMwareVSphereOperatorCheckDegraded:
unable to find VM win-xx-xx by UUID
~~~
The vSphere CSI driver operator is trying to look up the UUID of that Windows machine, which is not intended.
~~~
2024-09-27T15:44:27.836266729Z E0927 15:44:27.836234 1 check_error.go:147] vsphere driver install failed with unable to find VM win-ooiv8vljg7 by UUID , found existing driver
2024-09-27T15:44:27.860300261Z W0927 15:44:27.836249 1 vspherecontroller.go:499] Marking cluster as degraded: vcenter_api_error unable to find VM win--xx-xx by UUID
~~~
So, the operator pod should exclude the Windows node and should not go into a 'Degraded' state.

As a ARO HCP user, I would like MachineIdentityID to be removed from the Azure HyperShift API since this field is not needed for ARO HCP.

Description of problem:

Clicking the 'create a Project' button on the Getting Started page doesn't work

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Always

Steps to Reproduce:

1. A new normal user login to OCP web console, user will be redirected to Getting Started page
2. try to create a project via 'create a Project' button in message "Select a Project to start adding to it or create a Project."
3.

Actual results:

Clicking the 'create a Project' button doesn't open the project creation modal

Expected results:

As indicated, 'create a Project' should open the project creation modal

Additional info:

 

Description of problem:

 Sometimes the ovs-configuration service cannot be started, with errors as below: 

Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + add_nm_conn br-ex type ovs-bridge conn.interface br-ex 802-3-ethernet.mtu 1500 connection.autoconnect-slaves 1
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli c add save no con-name br-ex type ovs-bridge conn.interface br-ex 802-3-ethernet.mtu 1500 connection.autoconnect-slaves 1 connection.autoconnect no
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9781]: Connection 'br-ex' (eb9fdfa0-912f-4ee2-b6ac-a5040b290183) successfully added.
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli connection show ovs-port-phys0
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + ovs-vsctl --timeout=30 --if-exists del-port br-ex ens1f0np0
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + add_nm_conn ovs-port-phys0 type ovs-port conn.interface ens1f0np0 master br-ex connection.autoconnect-slaves 1
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli c add save no con-name ovs-port-phys0 type ovs-port conn.interface ens1f0np0 master br-ex connection.autoconnect-slaves 1 connection.autoconnect no
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9790]: Error: failed to modify connection.port-type: 'ovs-interface' not among [bond, bridge, ovs-bridge, ovs-port, team, vrf].
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: ++ handle_exit


 However, there is a workaround: removing the existing `ovs-if-br-ex` connection with `nmcli connection delete ovs-if-br-ex` fixes this issue. 

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-07-151850

How reproducible:

    not always

Steps to Reproduce:

    1. Create many bond interfaces via nmstate NNCP
    2. Reboot the worker
    3.
    

Actual results:

    ovs-configuration service cannot be started up 

Expected results:

     ovs-configuration service should be started without any issue

Additional info:

  Not sure whether these bond interfaces affect this issue.
  However, there is a workaround: removing the existing `ovs-if-br-ex` connection with
  `nmcli connection delete ovs-if-br-ex` fixes this issue. 

  [root@openshift-qe-024 ~]# nmcli c 
NAME              UUID                                  TYPE           DEVICE      
ens1f0np0         701f8b4e-819d-56aa-9dfb-16c00ea947a8  ethernet       ens1f0np0   
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       eno1        
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens1f1np1   
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens2f2      
bond12            ba986131-f4d2-460c-b883-a1d6a9ddfdcb  bond           bond12      
bond12.101        46bc3df0-e093-4096-a747-0e6717573f82  vlan           bond12.101  
bond12.102        93d68598-7453-4666-aff6-87edfcf2f372  vlan           bond12.102  
bond12.103        be6013e1-6b85-436f-8ce8-24655db0be17  vlan           bond12.103  
bond12.104        fabf9a76-3635-48d9-aace-db14ae2fd9c3  vlan           bond12.104  
bond12.105        0fab3700-ce50-4815-b329-35af8f830cb1  vlan           bond12.105  
bond12.106        68c20304-f3e9-4238-96d7-5bcce05b3827  vlan           bond12.106  
bond12.107        f1029614-2e6e-4e20-b9b6-79902dd12ac9  vlan           bond12.107  
bond12.108        27669b6f-e24d-4ac2-a8ba-35ca0b6c5b05  vlan           bond12.108  
bond12.109        d421e0bb-a441-4305-be23-d1964cb2bb46  vlan           bond12.109  
bond12.110        c453e70c-e460-4e80-971c-88fac4bd1d9e  vlan           bond12.110  
bond12.111        2952a2c6-deb4-4982-8a4b-2a962c3dda96  vlan           bond12.111  
bond12.112        5efe4b2d-2834-4b0b-adb2-8caf153cef2d  vlan           bond12.112  
bond12.113        2ec39bea-2704-4b8a-83fa-d48e1ef1c472  vlan           bond12.113  
bond12.114        8fc8ae53-cc8f-4412-be7d-a05fc3abdffe  vlan           bond12.114  
bond12.115        58f9e047-fe4f-475d-928f-7dec74cf379f  vlan           bond12.115  
bond12.116        d4d133cb-13cc-43f3-a636-0fbcb1d2b65d  vlan           bond12.116  
bond12.117        3a2d10a1-3fd8-4839-9836-56eb6cab76a7  vlan           bond12.117  
bond12.118        8d1a22da-efa0-4a06-ab6d-6840aa5617ea  vlan           bond12.118  
bond12.119        b8556371-eba8-43ba-9660-e181ec16f4d2  vlan           bond12.119  
bond12.120        989f770f-1528-438b-b696-eabcb5500826  vlan           bond12.120  
bond12.121        b4c651f6-18d7-47ce-b800-b8bbeb28ed60  vlan           bond12.121  
bond12.122        9a4c9ec2-e5e4-451a-908c-12d5031363c6  vlan           bond12.122  
bond12.123        aa346590-521a-40c0-8132-a4ef833de60c  vlan           bond12.123  
bond12.124        c26297d6-d965-40e1-8133-a0d284240e46  vlan           bond12.124  
bond12.125        24040762-b6a0-46f7-a802-a86b74c25a1d  vlan           bond12.125  
bond12.126        24df2984-9835-47c2-b971-b80d911ede8d  vlan           bond12.126  
bond12.127        0cc62ca7-b79d-4d09-8ec3-b48501053e41  vlan           bond12.127  
bond12.128        bcf53331-84bd-400c-a95c-e7f1b846e689  vlan           bond12.128  
bond12.129        88631a53-452c-4dfe-bebe-0b736633d15a  vlan           bond12.129  
bond12.130        d157ffb0-2f63-4844-9a16-66a035315a77  vlan           bond12.130  
bond12.131        a36f8fb2-97d6-4059-8802-ce60faffb04a  vlan           bond12.131  
bond12.132        94aa7a8e-b483-430f-8cd1-a92561719954  vlan           bond12.132  
bond12.133        7b3a2b6e-72ad-4e0a-8f37-6ecb64d1488c  vlan           bond12.133  
bond12.134        68b80892-414f-4372-8247-9276cea57e88  vlan           bond12.134  
bond12.135        08f4bdb2-469f-4ff7-9058-4ed84226a1dd  vlan           bond12.135  
bond12.136        a2d13afa-ccac-4efe-b295-1f615f0d001b  vlan           bond12.136  
bond12.137        487e29dc-6741-4406-acec-47e81bed30d4  vlan           bond12.137  
bond12.138        d6e2438f-2591-4a7a-8a56-6c435550c3ae  vlan           bond12.138  
bond12.139        8a2e21c3-531b-417e-b747-07ca555909b7  vlan           bond12.139  
bond12.140        8e3c5d65-5098-48a5-80c4-778d41b24634  vlan           bond12.140  
bond12.141        7aaca678-27e1-4219-9410-956649313c52  vlan           bond12.141  
bond12.142        6765c730-3240-48c8-ba29-88113c703a88  vlan           bond12.142  
bond12.143        3e9cef84-4cb1-4f17-98eb-de9a13501453  vlan           bond12.143  
bond12.144        ebaa63ee-10be-483d-9096-43252757b7fa  vlan           bond12.144  
bond12.145        1ba28e89-0578-4967-85d3-95c03677f036  vlan           bond12.145  
bond12.146        75ac1594-a761-4066-9ac9-a2f4cc853429  vlan           bond12.146  
bond12.147        b8c7e473-8179-49f7-9ea8-3494ce4a0244  vlan           bond12.147  
bond12.148        4c643923-8412-4550-b43c-cdb831dd28e9  vlan           bond12.148  
bond12.149        418fa841-24ba-4d6f-bc5a-37c8ffb25d45  vlan           bond12.149  
bond12.150        1eb8d1ce-256e-42f3-bacd-e7e5ac30bd9a  vlan           bond12.150  
bond12.151        aaab839b-0fbc-4ba9-9371-c460172566a2  vlan           bond12.151  
bond12.152        de2559c4-255b-45ac-8602-968796e647a6  vlan           bond12.152  
bond12.153        52b5d827-c212-45f1-975d-c0e5456c19e9  vlan           bond12.153  
bond12.154        26fc0abd-bfe5-4f66-a3a5-fadefdadb9df  vlan           bond12.154  
bond12.155        0677f4a8-9260-475c-93ca-e811a47d5780  vlan           bond12.155  
bond12.156        4b4039f4-1e7e-4427-bc3a-92fe37bec27e  vlan           bond12.156  
bond12.157        38b7003e-a20c-4ef6-8767-e4fdfb7cd61b  vlan           bond12.157  
bond12.158        7d073e1b-1cf7-4e49-9218-f96daf97150a  vlan           bond12.158  
bond12.159        3d8c5222-e59c-45c9-acb6-1a6169e4eb6d  vlan           bond12.159  
bond12.160        764bce7a-ec99-4f8b-9e39-d47056733c0c  vlan           bond12.160  
bond12.161        63ee9626-2c17-4335-aa17-07a38fa820d8  vlan           bond12.161  
bond12.162        6f8298ff-4341-42a6-93a8-66876042ca16  vlan           bond12.162  
bond12.163        7bb90042-f592-49c6-a0c9-f4d2cf829674  vlan           bond12.163  
bond12.164        3fd8b04f-8bd0-4e8d-b597-4fd37877d466  vlan           bond12.164  
bond12.165        06268a05-4533-4bd2-abb8-14c80a6d0411  vlan           bond12.165  
bond12.166        4fa1f0c1-e55d-4298-bfb5-3602ad446e61  vlan           bond12.166  
bond12.167        494e1a43-deb2-4a69-90da-2602c03400fb  vlan           bond12.167  
bond12.168        d2c034cd-d956-4d02-8b6e-075acfcd9288  vlan           bond12.168  
bond12.169        8e2467b7-80dd-45b6-becc-77cbc632f1f0  vlan           bond12.169  
bond12.170        3df788a3-1715-4a1c-9f5d-b51ffd3a5369  vlan           bond12.170  
dummy1            b4d7daa3-b112-4606-8b9c-cb99b936b2b9  dummy          dummy1      
dummy2            c99d8aa1-0627-47f3-ae57-f3f397adf0e8  dummy          dummy2      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       enp138s0np0 
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens2f3      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens4f2      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens4f3      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens8f0      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens8f1      
Wired Connection  b7361c63-fb2a-4a95-80f4-c669fd368bbf  ethernet       ens8f3      
lo                ae4bbedd-1a2e-4c97-adf7-4339cf8fb226  loopback       lo          
ovs-if-br-ex      90af89d6-a3b0-4497-b6d0-7d2cc2d5098a  ovs-interface  --  


Description of problem:

When invalid packages are included in the ImageSetConfiguration and the operator catalog ends up without any valid package, d2m fails. The operator catalog image should be skipped when no operators are found for that operator catalog.

With the following ImageSetConfiguration:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
      packages:
       - name: netscaler-operator 
  1. m2d
   ./bin/oc-mirror -c ./alex-tests/alex-isc/pr-1093.yaml file://alex-tests/pr-1093 --v2 

2. I removed the working-dir under the folder pr-1093 and the oc-mirror cache to simulate a d2m from scratch on a disconnected environment where I only have the tarball

    rm -rf ~/.oc-mirror/ && rm -rf ./alex-tests/pr-1093/working-dir

3. d2m

    ./bin/oc-mirror -c ./alex-tests/alex-isc/pr-1093.yaml --from file://alex-tests/pr-1093 docker://localhost:6000 --dest-tls-verify=false --v2

Actual results:

    ./bin/oc-mirror -c ./alex-tests/alex-isc/pr-1093.yaml --from file://alex-tests/pr-1093 docker://localhost:6000 --dest-tls-verify=false --v2

2025/03/07 12:29:46  [INFO]   : :wave: Hello, welcome to oc-mirror
2025/03/07 12:29:46  [INFO]   : :gear:  setting up the environment for you...
2025/03/07 12:29:46  [INFO]   : :twisted_rightwards_arrows: workflow mode: diskToMirror 
2025/03/07 12:30:11  [INFO]   : 🕵  going to discover the necessary images...
2025/03/07 12:30:11  [INFO]   : :mag: collecting release images...
2025/03/07 12:30:11  [INFO]   : :mag: collecting operator images...
2025/03/07 12:30:11  [ERROR]  : [OperatorImageCollector] stat .: no such file or directory
 ✗   () Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.17 
2025/03/07 12:30:11  [INFO]   : :wave: Goodbye, thank you for using oc-mirror
2025/03/07 12:30:11  [ERROR]  : stat .: no such file or directory

Expected results:

m2d should fail when there are no related images found for the specified catalog (invalid operator netscaler-operator in the ImageSetConfiguration used above)

Additional info:

    

Description of problem:

documentationBaseURL still points to 4.18

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-01-16-064700 

How reproducible:

Always

Steps to Reproduce:

1. check documentationBaseURL on a 4.19 cluster
$ oc get cm console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.18/
2.
3.

Actual results:

documentationBaseURL still links to 4.18

Expected results:

documentationBaseURL should link to 4.19

Additional info:

 

Description of problem:

When I use the UI to provision a primary UDN into a namespace, I get the following error after indicating which project I want the UDN in and which subnet to use:

"""
Admission Webhook Warning
UserDefinedNetwork primary-udn violates policy 299 - "unknown field \"spec.layer2.ipamLifecycle\"
"""

Version-Release number of selected component (if applicable):

4.99

How reproducible:

Always

Steps to Reproduce:

1. create a project
2. create a UDN in said project; define a subnet
3. watch error

Actual results:

The UDN is created, but the ipam.lifecycle attribute is *not* set to persistent, which for virtualization means you'll have a useless network.

Expected results:

The UDN must be created with ipam.lifecycle set to Persistent for the VMs to have stable IPs across live-migration and restart / stop / start.

Additional info:

 

 

Description of problem:

    https://github.com/openshift/hypershift/blob/5d03cfa7e250e06cf408fd0b6c46d207ab915c1e/control-plane-operator/controllers/hostedcontrolplane/olm/catalogs.go#L126 We should check registry overrides here and retrive the images from the mirrored registry

We currently don't check this, which means it could prevent the cluster operators from coming up if the registry.redhat.io/redhat registry is being overridden. 
In addition, the releaseImageProvider we use in the HCCO and HCP reconciler doesn't use the registry overrides set at the HO level, which could cause confusion for users that expect those overrides to be propagated to the HCP.

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    100%

Steps to Reproduce:

    1. Set Registry override for registry.redhat.io/redhat 
    2. Notice the catalog operators still using registry.redhat.io/redhat images 
    3.
    

Actual results:

    catalog operators using registry.redhat.io/redhat images 

Expected results:

    catalog operators use the override set at the HO level or set through the OLM override flag on the HC

Additional info:

    

Description of problem:

    The apply-bootstrap container does not terminate within the `TerminationGracePeriodSeconds` window because it does not respect the SIGTERM signal. This issue is to address that bug and ensure the container respects SIGTERM when issued and thus terminates within the window.
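
For illustration, a minimal shell sketch of the general pattern; none of these names are the real apply-bootstrap code, and apply_bootstrap_once is a hypothetical stand-in for the container's work:

    #!/bin/bash
    # Trap SIGTERM so the loop exits promptly instead of being force-killed
    # after TerminationGracePeriodSeconds.
    trap 'echo "received SIGTERM, exiting"; exit 0' TERM
    while true; do
      apply_bootstrap_once      # hypothetical helper standing in for the real work
      sleep 10 &
      wait $!                   # waiting on a background sleep lets the trap fire immediately
    done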

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create hypershift enabled cluster
    2. Delete API server pods and observe the apply-bootstrap container be force killed instead of terminating gracefully.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    In integration, creating a rosa HostedCluster with a shared vpc will result in a VPC endpoint that is not available.

Version-Release number of selected component (if applicable):

    4.17.3

How reproducible:

    Sometimes (currently every time in integration, but could be due to timing)

Steps to Reproduce:

    1. Create a HostedCluster with shared VPC
    2. Wait for HostedCluster to come up
    

Actual results:

VPC endpoint never gets created due to errors like:
{"level":"error","ts":"2024-11-18T20:37:51Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","AWSEndpointService":{"name":"private-router","namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb"},"namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb","name":"private-router","reconcileID":"bc5d8a6c-c9ad-4fc8-8ead-6b6c161db097","error":"failed to create vpc endpoint: UnauthorizedOperation","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}
    

Expected results:

    VPC endpoint gets created

Additional info:

    Deleting the control plane operator pod will get things working. 
The theory is that if the control plane operator pod is delayed in obtaining a web identity token, then the client will not assume the role that was passed to it.

Currently the client is only created once at the start; we should create it on every reconcile.

Description of problem:

The story is to track the i18n upload/download routine tasks which are performed every sprint. 

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The user-provided edge subnets (BYO VPC), created in zones of type local-zone, are not tagged with the kubernetes cluster tag `kubernetes.io/cluster/<infraID>` and value `shared` at install time.

Subnets in regular/default zones are correctly tagged.

The edge subnets created by the installer in IPI are also tagged, with value `owned`, so we need to check whether there is an issue in the BYO VPC scenario or whether the implementation was just not replicated to edge subnets.

Version-Release number of selected component (if applicable):

    4.19 (or since edge subnets, 4.14+?)

How reproducible:

    always

Steps to Reproduce:

    1. create vpc
    2. create subnet in local zone
    3. create install-config with regular zones, and edge zones
    4. create the cluster
    5. check the tags of subnets in local-zones
    

Actual results:

    $ aws ec2 describe-subnets --subnet-ids $SUBNET_ID_PUB_WL | jq -r '.Subnets[] | [.AvailabilityZone, .Tags]'
[
  "us-east-1-nyc-1a",
  [
    {
      "Key": "openshift_creationDate",
      "Value": "2025-01-24T00:14:44.445494+00:00"
    },
    {
      "Key": "aws:cloudformation:stack-id",
      "Value": "arn:aws:cloudformation:us-east-1:[redacted]:stack/lzdemo-subnets-nyc-1a/10effe00-d9e0-11ef-b2ba-0ecca22ca195"
    },
    {
      "Key": "aws:cloudformation:logical-id",
      "Value": "PublicSubnet"
    },
    {
      "Key": "Name",
      "Value": "lzdemo-public-us-east-1-nyc-1a"
    },
    {
      "Key": "aws:cloudformation:stack-name",
      "Value": "lzdemo-subnets-nyc-1a"
    }
  ]
]

Expected results:

$ aws ec2 describe-subnets --subnet-ids $SUBNET_ID_PUB_WL | jq -r '.Subnets[] | [.AvailabilityZone, .Tags]'
[
  "us-east-1-nyc-1a",
  [
    {
      "Key": "openshift_creationDate",
      "Value": "2025-01-24T00:14:44.445494+00:00"
    },
    {
      "Key": "aws:cloudformation:stack-id",
      "Value": "arn:aws:cloudformation:us-east-1:[redacted]:stack/lzdemo-subnets-nyc-1a/10effe00-d9e0-11ef-b2ba-0ecca22ca195"
    },
    {
      "Key": "aws:cloudformation:logical-id",
      "Value": "PublicSubnet"
    },
    {
      "Key": "Name",
      "Value": "lzdemo-public-us-east-1-nyc-1a"
    },
    {
      "Key": "aws:cloudformation:stack-name",
      "Value": "lzdemo-subnets-nyc-1a"
    },
+    {
+      "Key": "kubernetes.io/cluster/lzdemo-4znjd",
+      "Value": "shared"
+    },
  ]
] 
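
As a workaround sketch (not the fix itself; the subnet variable and infra ID are taken from the example above), the missing tag can be added manually:

$ aws ec2 create-tags \
    --resources "$SUBNET_ID_PUB_WL" \
    --tags "Key=kubernetes.io/cluster/lzdemo-4znjd,Value=shared"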

Additional info:

- Example of result in IPI deployment with edge zone (fully created by installer)

```
$ aws ec2 describe-subnets --subnet-ids subnet-08d8d32c7ee4b629c | jq -r '.Subnets[] | [.AvailabilityZone, .Tags]'
[
  "us-east-1-nyc-1a",
  [
    {
      "Key": "kubernetes.io/role/elb",
      "Value": "1"
    },
    {
      "Key": "Name",
      "Value": "lzipi-ljgzl-subnet-public-us-east-1-nyc-1a"
    },
    {
      "Key": "sigs.k8s.io/cluster-api-provider-aws/role",
      "Value": "public"
    },
    {
      "Key": "openshift_creationDate",
      "Value": "2025-01-24T00:14:44.445494+00:00"
    },
    {
      "Key": "sigs.k8s.io/cluster-api-provider-aws/cluster/lzipi-ljgzl",
      "Value": "owned"
    },
    {
      "Key": "kubernetes.io/cluster/lzipi-ljgzl",
      "Value": "owned"
    }
  ]
]

```

Description of problem:

Applying a performance profile with an unsupported hugepages size (512M in this example) fails to create the PerformanceProfile components, and the profile becomes degraded.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

1. Label one of the worker nodes with worker-cnf
2. Create an mcp for worker-cnf
3. Apply this performanceprofile

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: 1-3,4-6
    reserved: 0,7
  hugepages:
    defaultHugepagesSize: 512M
    pages:
    - count: 1
      node: 0
      size: 512M
    - count: 128
      node: 1
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ''
  kernelPageSize: 64k
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true


 
Status:
  Conditions:
    Last Heartbeat Time:   2025-02-04T10:14:52Z
    Last Transition Time:  2025-02-04T10:14:52Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2025-02-04T10:14:52Z
    Last Transition Time:  2025-02-04T10:14:52Z
    Status:                False
    Type:                  Upgradeable
    Last Heartbeat Time:   2025-02-04T10:14:52Z
    Last Transition Time:  2025-02-04T10:14:52Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2025-02-04T10:14:52Z
    Last Transition Time:  2025-02-04T10:14:52Z
    Message:               can not convert size "512M" to kilobytes
    Reason:                ComponentCreationFailed
    Status:                True
    Type:                  Degraded
  Runtime Class:           performance-performance
  Tuned:                   openshift-cluster-node-tuning-operator/openshift-node-performance-performance
Events:
  Type     Reason           Age                 From                            Message
  ----     ------           ----                ----                            -------
  Warning  Creation failed  11m (x19 over 33m)  performance-profile-controller  Failed to create all components: can not convert size "512M" to kilobytes

Actual results:

    

Expected results:

    

Additional info:

    

We think that low disk space is likely the cause of https://issues.redhat.com/browse/OCPBUGS-37785

It's not immediately obvious that this happened during the run without digging into the events.

Could we create a new test to enforce that the kubelet never reports disk pressure during a run?
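
A minimal sketch of the kind of signal such a test could key off; the event reason and node condition used here are the standard kubelet ones:

$ oc get events -A --field-selector reason=NodeHasDiskPressure
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'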

 

Description of problem:

When we enable techpreview and we try to scale up a new node using a 4.5 base image, the node cannot join the cluster
    

Version-Release number of selected component (if applicable):

    
    IPI on AWS
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-19-165854   True        False         5h25m   Cluster version is 4.17.0-0.nightly-2024-08-19-165854

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a new machineset using a 4.5 base image and a 2.2.0 ignition version
    
    Detailed commands to create this machineset can be found here: [OCP-52822-Create new config resources with 2.2.0 ignition boot image nodes|https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-52822]
    
    
    2. Scale up this machineset to create a new worker node
    
    

Actual results:

    The node cannot join the cluster. We can find this message in the machine-config-daemon-pull.service on the failed node
    
    Wed 2024-08-21 13:02:19 UTC ip-10-0-29-231 machine-config-daemon-pull.service[1971]: time="2024-08-21T13:02:19Z" level=warning msg="skip_mount_home option is no longer supported, ignoring option"
Wed 2024-08-21 13:02:20 UTC ip-10-0-29-231 machine-config-daemon-pull.service[1971]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a0afcde0e240601cb4a761e95f8311984b02ee76f827527d425670be3a39797": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a0afcde0e240601cb4a761e95f8311984b02ee76f827527d425670be3a39797: invalid policy in "/etc/containers/policy.json": Unknown policy requirement type "sigstoreSigned"
    
    
    
    

Expected results:

    Nodes should join the cluster
    

Additional info:

    If techpreview is not enabled, the node can join the cluster without problems
    
    The podman version in a 4.5 base image is:
    
$ podman version
WARN[0000] skip_mount_home option is no longer supported, ignoring option 
Version:            1.9.3
RemoteAPI Version:  1
Go Version:         go1.13.4
OS/Arch:            linux/amd64
    

    
    

Description of problem:

The Observe section is not bound only to monitoring anymore. Users that do not have access to all the namespaces or to the Prometheus flags should still be able to see the menus under the Observe section.

Version-Release number of selected component (if applicable):

4.19,4.18,4.17,4.16,4.15,4.14,4.12

How reproducible:

Always

Steps to Reproduce:

1. Enter the admin console without admin access to all namespaces

Actual results:

The observe section is not visible

Expected results:

The Observe section should be visible if plugins contribute menu items

Additional info:

If there are no items the observe section should be hidden

Description of problem:

There's a mismatch between the .configuration API vendored for the HC controller and the one in those older versions of the CPO controller.
The HO computes the hash including "" for that field. The CPO doesn't see the field at all when computing the MCS hash. That would cause the mismatch.

Slack thread https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1739540592972569?thread_ts=1739527507.065809&cid=C04EUL1DRHC 

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. Create a HostedCluster on 4.17.15 with ImageConfig set
    2. Check NodePool rollout
    3. Ignition is failing
    

Actual results:


    

Expected results:


    

Additional info:


    

Cloned to OpenStack as we have the same issue.

Description of problem:

Customer is trying to install a self-managed OCP cluster on AWS. The customer uses an AWS VPC DHCP options set whose domain name has a trailing dot (.). Because of this setting, the master nodes' hostnames also have a trailing dot, which causes the OpenShift installation to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

1. Create an AWS VPC with a DHCP options set whose domain name has a trailing dot (see the sketch after these steps).
2. Try an IPI cluster installation.
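
A hedged sketch of step 1 with the AWS CLI (the domain name is an example; the trailing dot is the important part):

# Create a DHCP options set whose domain-name ends with a trailing dot
aws ec2 create-dhcp-options \
  --dhcp-configurations "Key=domain-name,Values=example.internal." "Key=domain-name-servers,Values=AmazonProvidedDNS"

# Associate it with the VPC the cluster will be installed into
aws ec2 associate-dhcp-options --dhcp-options-id <dopt-id> --vpc-id <vpc-id>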

Actual results:

    The installation fails because the master nodes inherit the trailing dot from the DHCP options set domain name.

Expected results:

    The OpenShift installer should allow AWS master nodes to be created even when the domain name has a trailing dot (.).

Additional info:

    

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-azure-ovn/1864410356567248896 as well as pull-ci-openshift-installer-master-e2e-azure-ovn/1864312373058211840

: Undiagnosed panic detected in pod
{  pods/openshift-cloud-controller-manager_azure-cloud-controller-manager-5788c6f7f9-n2mnh_cloud-controller-manager_previous.log.gz:E1204 22:27:54.558549       1 iface.go:262] "Observed a panic" panic="interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.EndpointSlice" panicGoValue="&runtime.TypeAssertionError{_interface:(*abi.Type)(0x291daa0), concrete:(*abi.Type)(0x2b73880), asserted:(*abi.Type)(0x2f5cc20), missingMethod:\"\"}" stacktrace=<}

Description of problem:

Based on what was discussed in OCPBUGS-46514, the OpenShift installer should not allow creating a cluster in which ClusterNetwork CIDRs of the same IP family use different hostPrefix values.
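
For illustration, this is the kind of configuration the installer should reject (a hedged snippet only: two clusterNetwork entries of the same IP family with different hostPrefix values; the CIDRs are examples):

cat <<'EOF' > clusternetwork-snippet.yaml
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: 10.132.0.0/14
    hostPrefix: 24   # same IP family, different hostPrefix -- should be rejected
EOF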

Version-Release number of selected component (if applicable):

all the supported releases.    

Description of problem:

The Create button on the MultiNetworkPolicies and NetworkPolicies list pages is in the wrong position; it should be at the top right.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Unit tests for openshift/builder permanently failing for v4.18
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run PR against openshift/builder
    

Actual results:

Test fails: 
--- FAIL: TestUnqualifiedClone (0.20s)
    source_test.go:171: unable to add submodule: "Cloning into '/tmp/test-unqualified335202210/sub'...\nfatal: transport 'file' not allowed\nfatal: clone of 'file:///tmp/test-submodule643317239' into submodule path '/tmp/test-unqualified335202210/sub' failed\n"
    source_test.go:195: unable to find submodule dir
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
    

Expected results:

Tests pass
    

Additional info:

Example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_builder/401/pull-ci-openshift-builder-master-unit/1853816128913018880
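
The submodule error ("transport 'file' not allowed") matches git's hardening of file:// submodule transports; a hedged local workaround for running the unit tests (not necessarily the fix the repo will take) is:

# Allow file:// submodule clones globally (use with care on shared machines)
git config --global protocol.file.allow always

# Or scope it to a single test invocation via git's environment-based config
GIT_CONFIG_COUNT=1 GIT_CONFIG_KEY_0=protocol.file.allow GIT_CONFIG_VALUE_0=always go test ./...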
    

Description of problem:

    It is the negative testing scenario of QE test case OCP-36887, i.e. specifying a non-existing encryption keyring/key name for compute and control-plane, with the expectation that instance creation fails because of the bad keyring. However, in testing the installer does not surface the error on stdout and keeps waiting until it times out.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-15-202509

How reproducible:

    Always

Steps to Reproduce:

1. "create install-config", and then insert the interested settings (see [1])
2. "create cluster" (see [2])

Actual results:

    The installer does not surface the real error or exit promptly; it eventually fails with the error below:

ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to provision control-plane machines: machines are not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

Expected results:

    The installer should report the keyring error and exit promptly.

Additional info:

    

Description of problem:

    openshift-install has no zsh completion

Version-Release number of selected component (if applicable):

    

How reproducible:

    every time

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    PR opened at https://github.com/openshift/installer/pull/9116
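
    Assuming the PR wires up the standard cobra completion subcommand, usage would presumably look like this (hypothetical until the PR merges):

    # Load zsh completions for the current shell only
    source <(openshift-install completion zsh)

    # Or install them permanently
    openshift-install completion zsh > "${fpath[1]}/_openshift-install"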

Description of problem:

The machine-os-builder deployment manifest does not set the openshift.io/required-scc annotation, which appears to be required for the upgrade conformance suite to pass. The rest of the MCO components currently set this annotation, and we can probably use the same setting as the Machine Config Controller (which is restricted-v2). What I'm unsure of is whether this also needs to be set on the builder pods and, if so, what the appropriate setting would be for that case.
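
For reference, what the manifest change amounts to can be sketched with a patch like the one below (illustrative only; the real fix belongs in the MCO-shipped deployment manifest, and the operator would revert a live patch):

oc patch deployment machine-os-builder -n openshift-machine-config-operator --type=merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"openshift.io/required-scc":"restricted-v2"}}}}}'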

Version-Release number of selected component (if applicable):

 

How reproducible:

This always occurs in the new CI jobs, e2e-aws-ovn-upgrade-ocb-techpreview and e2e-aws-ovn-upgrade-ocb-conformance-suite-techpreview. Here are two examples from rehearsal failures:

Steps to Reproduce:

Run either of the aforementioned CI jobs.

Actual results:

Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation fails.

Expected results:

Test {{[sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation}} should pass.

 

Additional info:

    

Description of problem:

Currently, related-image failures are treated as warnings in catalog_handler.go (handleRelatedImages func). If one related image of a bundle fails during the collection phase, the bundle should be removed from the list of collected images; otherwise the batch will copy bundles with missing related images.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Bundles are being added to the list of collected images even if one of its related images failed.

Expected results:

Bundles with related images that failed during the collection phase should be removed from the list of collected images in the collection phase.    

Additional info:

    

Description of problem:

    Sort function on Access review table is not working as expected

Version-Release number of selected component (if applicable):

4.19.0-0.test-2025-02-28-070949

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to Home -> API Explorer -> Resource details -> Access review tab
      eg: /api-resource/all-namespaces/core~v1~Binding/access     
    2. Click on column headers Subject/Type to sort
    3.
    

Actual results:

    the sort function does not work as expected

Expected results:

    clicking on the column headers correctly sorts the data in ascending or descending order, as expected

Additional info:

      when clicking on the column headers to sort, the table does not use the sorted data but renders the original filtered data instead

https://github.com/openshift/console/blob/e9d6ead6852600993cffef3d50cb9b122d64c068/frontend/public/components/api-explorer.tsx#L575

Because we disable SNAT when deploying a disconnected cluster in PowerVS, the private PowerVS endpoints are unreachable. This causes workers to not launch and breaks disconnected deploys.

Description of problem:

when running the oc-mirror command twice, the catalog rebuild fails with the error:
[ERROR]  : unable to rebuild catalog oci:///test/yinzhou/out20/working-dir/operator-catalogs/redhat-operator-index/33dd53f330f4518bd0427772debd3331aa4e21ef4ff4faeec0d9064f7e4f24a9/catalog-image: filtered declarative config not found

Version-Release number of selected component (if applicable):

 oc-mirror version 
W1120 10:40:11.056507    6751 mirror.go:102] ⚠️  oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-324-gbae91d5", GitCommit:"bae91d55", GitTreeState:"clean", BuildDate:"2024-11-20T02:06:04Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. Run mirrorToMirror twice with the same ImageSetConfiguration and the same workspace; the second command fails with the error:
cat config-20.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  operators:
  - catalog: oci:///test/redhat-operator-index
    packages:
    - name: aws-load-balancer-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator

`oc-mirror -c config-20.yaml docker://my-route-e2e-test-ocmirrorv2-pxbg4.apps.yinzhou11202.qe.devcluster.openshift.com --workspace file://out20 --v2 --dest-tls-verify=false`

 

Actual results:

oc-mirror -c config-20.yaml docker://my-route-e2e-test-ocmirrorv2-pxbg4.apps.yinzhou11202.qe.devcluster.openshift.com --workspace file://out20 --v2 --dest-tls-verify=false
2024/11/20 10:34:00  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/20 10:34:00  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/20 10:34:00  [INFO]   : ⚙️  setting up the environment for you...
2024/11/20 10:34:00  [INFO]   : 🔀 workflow mode: mirrorToMirror 
2024/11/20 10:34:00  [INFO]   : 🕵️  going to discover the necessary images...
2024/11/20 10:34:00  [INFO]   : 🔍 collecting release images...
2024/11/20 10:34:00  [INFO]   : 🔍 collecting operator images...
 ✓   () Collecting catalog oci:///test/redhat-operator-index 
 ✓   (2s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.15 
2024/11/20 10:34:02  [INFO]   : 🔍 collecting additional images...
2024/11/20 10:34:02  [INFO]   : 🔍 collecting helm images...
2024/11/20 10:34:02  [INFO]   : 🔂 rebuilding catalogs
2024/11/20 10:34:02  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/11/20 10:34:02  [ERROR]  : unable to rebuild catalog oci:///test/yinzhou/out20/working-dir/operator-catalogs/redhat-operator-index/33dd53f330f4518bd0427772debd3331aa4e21ef4ff4faeec0d9064f7e4f24a9/catalog-image: filtered declarative config not found 

Expected results:

no error

Additional info:

Deleting the workspace file and running again shows no issue.

 

Description of problem:

A console plugin can be enabled repeatedly even when it is already enabled.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-14-090045
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Go to the console operator's 'Console Plugins' tab ("/k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins"), choose one console plugin, enable it from the modal by clicking the 'Enabled/Disabled' edit button, and try several times even though the plugin has already been enabled.
    2.Check console operator yaml.
    3.
    

Actual results:

1. The console plugin can be enabled repeatedly.
2. The same console plugin is added to the console operator several times.
$ oc get consoles.operator.openshift.io cluster -ojson | jq '.spec.plugins'
[
  "monitoring-plugin",
  "monitoring-plugin",
  "networking-console-plugin",
  "networking-console-plugin"
]
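
A quick way to confirm the duplication from the CLI (a small sketch using jq; prints "false" when duplicates exist):

oc get consoles.operator.openshift.io cluster -o json | jq '.spec.plugins | length == (unique | length)'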

    

Expected results:

1. It should not be possible to enable a plugin repeatedly.
2. The same console plugin should not be added multiple times in the console operator.
    

Additional info:

We can even add the same plugin name in the console operator YAML directly; that is not correct.
    

Description of problem:

CAPI install got ImageReconciliationFailed when creating vpc custom image

Version-Release number of selected component (if applicable):

 4.19.0-0.nightly-2024-12-06-101930    

How reproducible:

always    

Steps to Reproduce:

1.add the following in install-config.yaml
featureSet: CustomNoUpgrade
featureGates: [ClusterAPIInstall=true]     
2. create IBMCloud cluster with IPI    

Actual results:

level=info msg=Done creating infra manifests
level=info msg=Creating kubeconfig entry for capi cluster ci-op-h3ykp5jn-32a54-xprzg
level=info msg=Waiting up to 30m0s (until 11:25AM UTC) for network infrastructure to become ready...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 30m0s: client rate limiter Wait returned an error: context deadline exceeded  

in IBMVPCCluster-openshift-cluster-api-guests log

reason: ImageReconciliationFailed
    message: 'error failure trying to create vpc custom image: error unknown failure
      creating vpc custom image: The IAM token that was specified in the request has
      expired or is invalid. The request is not authorized to access the Cloud Object
      Storage resource.'  

Expected results:

create cluster succeed    

Additional info:

the resources created when install failed: 
ci-op-h3ykp5jn-32a54-xprzg-cos  dff97f5c-bc5e-4455-b470-411c3edbe49c crn:v1:bluemix:public:cloud-object-storage:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:f648897a-2178-4f02-b948-b3cd53f07d85::
ci-op-h3ykp5jn-32a54-xprzg-vpc  is.vpc crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::vpc:r022-46c7932d-8f4d-4d53-a398-555405dfbf18
copier-resurrect-panzer-resistant  is.security-group crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::security-group:r022-2367a32b-41d1-4f07-b148-63485ca8437b
deceiving-unashamed-unwind-outward  is.network-acl crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::network-acl:r022-b50286f6-1052-479f-89bc-fc66cd9bf613
    

Description of problem:

    the contextId property, which allows users of HorizontalNav to receive dynamic tab contributions, is not available

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Always

Steps to Reproduce:

    1.Import HorizontalNav from the dynamic plugins SDK
    2.'contextId' property does not exist
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The helper doesn't have all the namespaces in it, and we're getting some flakes in CI like this:

 
 
{{batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources
does not have a cpu request (rule: "batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources/request[cpu]")}}

Description of problem:

    when installing a cluster with multiple NICs in failureDomains, the installer always reports a "Duplicate value" error

# ./openshift-install create cluster --dir cluster --log-level debug
...
INFO Creating infra manifests...
INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests
DEBUG {"level":"info","ts":"2025-01-01T11:28:56Z","msg":"Starting workers","controller":"nutanixcluster","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"NutanixCluster","worker count":10}
DEBUG {"level":"info","ts":"2025-01-01T11:28:57Z","msg":"Starting workers","controller":"nutanixmachine","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"NutanixMachine","worker count":10}
INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=sgao-nutanix-zonal-l96qg
DEBUG I0101 11:28:58.918576 2309035 recorder.go:104] "Cluster sgao-nutanix-zonal-l96qg is Provisioning" logger="events" type="Normal" object={"kind":"Cluster","namespace":"openshift-cluster-api-guests","name":"sgao-nutanix-zonal-l96qg","uid":"d86c6f80-0f60-431d-80fc-bddd7b1f2d7c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"257"} reason="Provisioning"
DEBUG Collecting applied cluster api manifests...
DEBUG I0101 11:28:58.924319 2309035 warning_handler.go:65] "metadata.finalizers: \"cluster.cluster.x-k8s.io\": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers" logger="KubeAPIWarningLogger"
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create infrastructure manifest: NutanixCluster.infrastructure.cluster.x-k8s.io "sgao-nutanix-zonal-l96qg" is invalid: spec.failureDomains[0].subnets[1]: Duplicate value: map[string]interface {}{"type":"uuid"}
INFO Shutting down local Cluster API controllers...
INFO Stopped controller: Cluster API
WARNING process cluster-api-provider-nutanix exited with error: signal: killed
INFO Stopped controller: nutanix infrastructure provider
INFO Shutting down local Cluster API control plane...
INFO Local Cluster API system has completed operations

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2025-01-04-101226

How reproducible:

    always

Steps to Reproduce:

    1. set multiple NICs in failureDomains and install cluster

e.g.
    failureDomains:
    ...
      subnetUUIDs:
      - 512c1d6f-c6e7-4746-8ae2-9c3e1db2aba6
      - a94cb75c-24ff-4ee2-85cf-c2f906ee9fe5
    - name: failure-domain-2
    ...
      subnetUUIDs:
      - d1b1b617-23de-4a9d-b53f-4b386fc27600
    - name: failure-domain-3
    ...
      subnetUUIDs:
      - 3624b067-61e2-4703-b8bf-3810de5cbac1

    2.
    3.
    

Actual results:

    Install failed

Expected results:

    Install should succeed with multiple NICs configured

Additional info:

For the Slack discussion, please refer to https://redhat-external.slack.com/archives/C0211848DBN/p1735790959497809

In the hypershift CLI we currently support creating a cluster with a proxy with the `--enable-proxy` flag. 

This is to add a new flag `--enable-secure-proxy` that will create a cluster configured to use a secure proxy.
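
A hypothetical invocation once the flag lands (the flag name comes from this story; the remaining flags and values are illustrative):

hypershift create cluster aws \
  --name example \
  --pull-secret ./pull-secret.json \
  --aws-creds ./aws-credentials \
  --base-domain example.com \
  --enable-secure-proxy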

Description of problem:

 Project dropdown is partially hidden due to web terminal

Version-Release number of selected component (if applicable):

    

How reproducible:

  Every time

Steps to Reproduce:

    1. Install and initialize web-terminal
    2. Open the Project bar
    3.
    

Actual results:

Attaching screenshot:
    

https://drive.google.com/file/d/1AaYXCZsEcBiXVCIBqXkavKvXbjFb1YlP/view?usp=sharing

Expected results:

Project namespace bar should be at the front
    

Additional info:

    

Tracker issue for bootimage bump in 4.19. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-44977.

Implementing RFE-3017.

As a bare Story, without a Feature or Epic, because I'm trying to limit the amount of MCO-side paperwork required to get my own RFE itch scratched. As a Story and not a NO-ISSUE pull, because OCP QE had a bit of trouble handling mco#4637 when I went NO-ISSUE on that one, and I think this might be worth a 4.19 release note.

Steps to Reproduce

  1. Create namespace "demo"
  2. Create a Shipwright build through the form 
  3. Choose source-to-image strategy
  4. Provide Git repo 
    https://github.com/sclorg/nodejs-ex
  1. Provide output image 
    image-registry.openshift-image-registry.svc:5000/demo/nodejs-app
  1. Provide builder image
image-registry.openshift-image-registry.svc:5000/openshift/nodejs
  1. Create the build
  2. Start the build

Actual results:

BuildRun fails with the following error:

[User error] invalid input params for task nodejs-app-6cf5j-gk6f9: param types don't match the user-specified type: [registries-block registries-insecure] 

Expected results:

BuildRun runs successfully

Build and BuildRun yaml.

https://gist.github.com/vikram-raj/fa67186f1860612b5ad378655085745e

 

Description of problem:

   The new `managed-trust-bundle` VolumeMount / `trusted-ca-bundle-managed` ConfigMap has recently been required given this latest change here: https://github.com/openshift/hypershift/pull/5667. However, this should be optional since folks that bring their own PKI shouldn't need this.

Version-Release number of selected component (if applicable):

    4.18.2

How reproducible:

    Every time.

Steps to Reproduce:

    1. Deploy ROKS (HyperShift) version 4.18.2 cluster.
    

Actual results:

    Cluster fails to deploy as the OpenShift API server fails to come up since it expects the `trusted-ca-bundle-managed` ConfigMap to exist.

Expected results:

    Cluster should deploy successfully.

Additional info:

    

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content.

Actual results:

- monitoring-plugin is still using the old cert content.
- If the cluster doesn’t show much activity, the hash may take time to be updated.

Expected results:

CMO should detect that exact change and run a sync to  recompute and set the new hash.

Additional info:

- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (its name isn't known in advance). 

Description of problem:

Reviewing some 4.17 cluster-ingress-operator logs, I found many (215) of these even when the GatewayAPI feature was disabled:

2024-09-03T08:20:03.726Z	INFO	operator.gatewayapi_controller	controller/controller.go:114	reconciling	{"request": {"name":"cluster"}}

This makes it look like the feature was enabled when it was not.

Also check for the same issue in the other gatewayapi controllers in the gateway-service-dns and gatewayclass folders. A search for r.config.GatewayAPIEnabled should show where we are checking whether the feature is enabled.

Version-Release number of selected component (if applicable):

    4.17, 4.18 should have the fix

How reproducible:

    This was showing up in the CI logs for the e2e-vsphere-ovn-upi test: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-upi/1830872105227390976/artifacts/e2e-vsphere-ovn-upi/gather-extra/artifacts/pods/openshift-ingress-operator_ingress-operator-64df7b9cd4-hqkmh_ingress-operator.log

It is probably showing up in all logs to varying degrees.

Steps to Reproduce:

    1. Deploy 4.17
    2. Review cluster-ingress-operator logs
   
    

Actual results:

    Seeing a log that makes it look like the GatewayAPI feature is enabled even when it is not.

Expected results:

    Only see the log when the GatewayAPI feature is enabled.

Additional info:

    The GatewayAPI feature is enabled in the e2e-aws-gatewayapi  PR test and any techpreview PR test, and can be manually enabled on a test cluster by running:

oc patch featuregates/cluster --type=merge --patch='{"spec":{"featureSet":"CustomNoUpgrade","customNoUpgrade":{"enabled":["GatewayAPI"]}}}'

 

Description of problem:

There are duplicate external link icon on Operator details modal for 'Purchase' button    

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-18-013707    

How reproducible:

Always    

Steps to Reproduce:

1. find one Marketplace operator and click on operator tile
    

Actual results:

1. on Operator details modal, there is `Purchase` button and there are two duplicate external link icon displayed beside Purchase button    

Expected results:

1. only one external link icon is required    

Additional info:

screenshot https://drive.google.com/file/d/1uGCXxXdR8ayXRafhabHepW5mVqwePzcq/view?usp=drive_link 

Description of problem:

    When installing ROSA/OSD operators, OLM "locks up" the Subscription object with "ConstraintsNotSatisfiable" 3-15% of the time, depending on the environment.

Version-Release number of selected component (if applicable):

Recently tested on:
- OSD 4.17.5
- 4.18 nightly (from cluster bot)

Though prevalence across the ROSA fleet suggests this is not a new issue.

How reproducible:

Very. This is very prevalent across the OSD/ROSA Classic cluster fleet. Any new OSD/ROSA Classic cluster has a good chance of at least one of its ~12 OSD-specific operators being affected at install time.

Steps to Reproduce:

    0. Set up a cluster using cluster bot.
    1. Label at least one worker node with node-role.kubernetes.io=infra
    2. Install must gather operator with "oc apply -f mgo.yaml" (file attached)
    3. Wait for the pods to come up.
    4. Start this loop:
for i in `seq -w 999`; do echo -ne ">>>>>>> $i\t\t"; date; oc get -n openshift-must-gather-operator subscription/must-gather-operator -o yaml >mgo-sub-$i.yaml; oc delete -f mgo.yaml; oc apply -f mgo.yaml; sleep 100; done
    5. Let it run for a few hours.

Actual results:

Run "grep ConstraintsNotSatisfiable *.yaml"
 
You should find that a few of the Subscriptions have ended up in a "locked" state from which there is no upgrade without manual intervention:

  - message: 'constraints not satisfiable: @existing/openshift-must-gather-operator//must-gather-operator.v4.17.281-gd5416c9
      and must-gather-operator-registry/openshift-must-gather-operator/stable/must-gather-operator.v4.17.281-gd5416c9
      originate from package must-gather-operator, subscription must-gather-operator
      requires must-gather-operator-registry/openshift-must-gather-operator/stable/must-gather-operator.v4.17.281-gd5416c9,
      subscription must-gather-operator exists, clusterserviceversion must-gather-operator.v4.17.281-gd5416c9
      exists and is not referenced by a subscription'
    reason: ConstraintsNotSatisfiable
    status: "True"
    type: ResolutionFailed

Expected results:

    Each installation attempt should've worked fine.

Additional info:

    

mgo.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-must-gather-operator
  annotations:
    package-operator.run/collision-protection: IfNoController
    package-operator.run/phase: namespaces
    openshift.io/node-selector: ""
  labels:
    openshift.io/cluster-logging: "true"
    openshift.io/cluster-monitoring: 'true'
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: must-gather-operator-registry
  namespace: openshift-must-gather-operator
  annotations:
    package-operator.run/collision-protection: IfNoController
    package-operator.run/phase: must-gather-operator
  labels:
    opsrc-datastore: "true"
    opsrc-provider: redhat
spec:
  image: quay.io/app-sre/must-gather-operator-registry@sha256:0a0610e37a016fb4eed1b000308d840795838c2306f305a151c64cf3b4fd6bb4
  displayName: must-gather-operator
  icon:
    base64data: ''
    mediatype: ''
  publisher: Red Hat
  sourceType: grpc
  grpcPodConfig:
    securityContextConfig: restricted
    nodeSelector:
      node-role.kubernetes.io: infra
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/infra
      operator: Exists
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: must-gather-operator
  namespace: openshift-must-gather-operator
  annotations:
    package-operator.run/collision-protection: IfNoController
    package-operator.run/phase: must-gather-operator
spec:
  channel: stable
  name: must-gather-operator
  source: must-gather-operator-registry
  sourceNamespace: openshift-must-gather-operator
---
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: must-gather-operator
  namespace: openshift-must-gather-operator
  annotations:
    package-operator.run/collision-protection: IfNoController
    package-operator.run/phase: must-gather-operator
    olm.operatorframework.io/exclude-global-namespace-resolution: 'true'
spec:
  targetNamespaces:
  - openshift-must-gather-operator
    

Description of problem:

The cluster-version-operator pod is crashing repeatedly. It is a disconnected cluster installed with ABI (agent-based installation) on bare metal.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.7    True        False         134d    Cluster version is 4.16.7

$ omc get pods
NAME                                        READY   STATUS   RESTARTS   AGE
cluster-version-operator-74688c8bc5-fbzlf   0/1     Error    1425       7d

====
2025-02-07T10:24:37.750428195+05:30 I0207 04:54:37.750394       1 start.go:565] FeatureGate found in cluster, using its feature set "" at startup
2025-02-07T10:24:37.750437148+05:30 I0207 04:54:37.750427       1 payload.go:307] Loading updatepayload from "/"
2025-02-07T10:24:37.750603865+05:30 I0207 04:54:37.750583       1 payload.go:403] Architecture from release-metadata (4.16.7) retrieved from runtime: "amd64"
2025-02-07T10:24:37.751094450+05:30 I0207 04:54:37.751078       1 metrics.go:154] Metrics port listening for HTTPS on 0.0.0.0:9099
2025-02-07T10:24:37.751105629+05:30 E0207 04:54:37.751091       1 metrics.go:166] Collected metrics HTTPS server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
2025-02-07T10:24:37.754776401+05:30 E0207 04:54:37.754748       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2025-02-07T10:24:37.754776401+05:30 goroutine 236 [running]:
2025-02-07T10:24:37.754776401+05:30 k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bdfc40?, 0x311dc90})
2025-02-07T10:24:37.754776401+05:30     /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
2025-02-07T10:24:37.754776401+05:30 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004fc370?})
2025-02-07T10:24:37.754776401+05:30     /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
2025-02-07T10:24:37.754776401+05:30 panic({0x1bdfc40?, 0x311dc90?})
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/runtime/panic.go:914 +0x21f
2025-02-07T10:24:37.754776401+05:30 crypto/tls.(*listener).Close(0x1e67240?)
2025-02-07T10:24:37.754776401+05:30     <autogenerated>:1 +0x1e
2025-02-07T10:24:37.754776401+05:30 net/http.(*onceCloseListener).close(...)
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/net/http/server.go:3502
2025-02-07T10:24:37.754776401+05:30 sync.(*Once).doSlow(0xc000701b68?, 0x414e32?)
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/sync/once.go:74 +0xbf
2025-02-07T10:24:37.754776401+05:30 sync.(*Once).Do(...)
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/sync/once.go:65
2025-02-07T10:24:37.754776401+05:30 net/http.(*onceCloseListener).Close(0xc000346ab0)
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/net/http/server.go:3498 +0x45
2025-02-07T10:24:37.754776401+05:30 net/http.(*Server).closeListenersLocked(0xa6?)
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/net/http/server.go:2864 +0x96
2025-02-07T10:24:37.754776401+05:30 net/http.(*Server).Shutdown(0xc0006c00f0, {0x21f3160, 0xc0002c85f0})
2025-02-07T10:24:37.754776401+05:30     /usr/lib/golang/src/net/http/server.go:2790 +0x96
2025-02-07T10:24:37.754776401+05:30 github.com/openshift/cluster-version-operator/pkg/cvo.RunMetrics({0x21f3160, 0xc0002c8640}, {0x21f3160, 0xc0002c85f0}, {0x7ffff38f18a7, 0xc}, {0x7ffff38f18c8, 0x1d}, {0x7ffff38f18f9, 0x1d})
2025-02-07T10:24:37.754776401+05:30     /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/metrics.go:292 +0x130c
2025-02-07T10:24:37.754776401+05:30 github.com/openshift/cluster-version-operator/pkg/start.(*Options).run.func3.1.1()
2025-02-07T10:24:37.754776401+05:30     /go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:248 +0x85
2025-02-07T10:24:37.754776401+05:30 created by github.com/openshift/cluster-version-operator/pkg/start.(*Options).run.func3.1 in goroutine 234
2025-02-07T10:24:37.754776401+05:30     /go/src/github.com/openshift/cluster-version-operator/pkg/start/start.go:246 +0x176
2025-02-07T10:24:37.757971014+05:30 panic: runtime error: invalid memory address or nil pointer dereference [recovered]
2025-02-07T10:24:37.757971014+05:30     panic: runtime error: invalid memory address or nil pointer dereference
2025-02-07T10:24:37.757971014+05:30 [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x7c5afe] 
====
 

Description of problem:

Bootstrap process failed because API_URL and API_INT_URL are not resolvable:

Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up

install logs:
...
time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
...


    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165

    

How reproducible:


Always.
    

Steps to Reproduce:

    1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade (see the snippet after these steps)
    2. Create cluster
    3.
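
A minimal sketch of the step 1 settings (only the relevant install-config fragment; the rest of the install-config is omitted):

cat <<'EOF' > userdns-snippet.yaml
featureSet: TechPreviewNoUpgrade
platform:
  gcp:
    userProvisionedDNS: Enabled
EOF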
    

Actual results:

Failed to complete bootstrap process.
    

Expected results:

See description.

    

Additional info:

I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969

Description of problem:

    A minor component regression - [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation is failing for arbiter templates. 

See https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&LayeredProduct=none&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Procedure=none&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2025-03-13%2023%3A59%3A59&baseRelease=4.18&baseStartTime=2025-02-11%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform%2CTopology&component=Machine%20Config%20Operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20unknown%20ha%20none&flakeAsFailure=false&ignoreDisruption=true&ignoreMissing=false&includeMultiReleaseAnalysis=true&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Acrun&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=FeatureSet%3Atechpreview&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=LayeredProduct%3Anone&includeVariant=Network%3Aovn&includeVariant=Owner%3Aeng&includeVariant=Owner%3Aservice-delivery&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Arosa&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&includeVariant=Topology%3Amicroshift&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2025-03-13%2023%3A59%3A59&sampleRelease=4.19&sampleStartTime=2025-03-06%2000%3A00%3A00&testBasisRelease=4.18&testId=openshift-tests%3Ac1e4aa5075ed2c622d44645fe531a7a1&testName=%5Bsig-auth%5D%20all%20workloads%20in%20ns%2Fopenshift-machine-config-operator%20must%20set%20the%20%27openshift.io%2Frequired-scc%27%20annotation

Version-Release number of selected component (if applicable):

    4.19 only

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After following the procedure described at https://github.com/openshift/installer/blob/main/docs/user/openstack/observability.md the PromQL query:

sum by (vm_instance) (
  group by (vm_instance, resource) (ceilometer_cpu)
    / on (resource) group_right(vm_instance) (
      group by (node, resource) (
        label_replace(kube_node_info, "resource", "$1", "system_uuid", "(.+)")
      )
    / on (node) group_left group by (node) (
      cluster:master_nodes
    )
  )
)

returns an empty result because the cluster:master_nodes metric is not being scraped from the shift-on-stack cluster.

Version-Release number of selected component (if applicable):

4.18.0-rc.6

How reproducible:

Always

Steps to Reproduce: https://github.com/openshift/installer/blob/main/docs/user/openstack/observability.md

Actual results:

Empty result

Expected results:

The number of OpenShift master nodes per OpenStack host

Additional info:

Just querying the 'cluster:master_nodes' metric returns empty

The procedure described in the docs needs to be updated to include the cluster:master_nodes metric retrieval in the scrapeconfig definition.

spec:
  params:
    'match[]':
    - '{__name__=~"kube_node_info|kube_persistentvolume_info|cluster:master_nodes"}'

Description of problem:

When the ImageStreamTag is updated, the pod does not reflect the new image automatically for Deployments created via the OpenShift Console. 
    

Version-Release number of selected component (if applicable):

Red Hat OpenShift Container Platform
    

Steps to Reproduce:

    1. Create a Deployment via the OpenShift console and check that the trigger annotation has a `paused` field as a string (e.g., `"paused":"false"`) instead of a boolean value (e.g., `"paused":false`); see the annotation sketch after these steps
    2. Make changes on the ImageStreamTag 
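
For reference, the trigger annotation in question is image.openshift.io/triggers; a hedged sketch of the working form (the Deployment, ImageStreamTag and container names are illustrative):

# Broken form written by the console: ..."paused":"false" (string)
# Working form: ..."paused":false (boolean)
oc annotate deployment/myapp --overwrite \
  image.openshift.io/triggers='[{"from":{"kind":"ImageStreamTag","name":"myapp:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"myapp\")].image","paused":false}]'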
    

Actual results:

No changes to the Deployment, and no new Pods are created.
    

Expected results:

The Deployment should pick up the new changes of the ImageStreamTag.
    

Description of problem:

    We have recently enabled a few endpoint overrides, but ResourceManager was accidentally excluded.

Beginning with payload 4.19.0-0.nightly-2025-02-21-125506 panics in aggregated-hypershift-ovn-conformance are causing failures due to

[sig-arch] events should not repeat pathologically for ns/openshift-console-operator

event happened 772 times, something is wrong: namespace/openshift-console-operator deployment/console-operator hmsg/ac4940667c - reason/Console-OperatorPanic Panic observed: runtime error: invalid memory address or nil pointer dereference (14:06:18Z) result=reject

That payload contains console-operator#950; investigation will begin there.

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/5200/pull-ci-openshift-hypershift-main-e2e-openstack/1862228917390151680

{Failed  === RUN   TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations
    util.go:1232: 
        the pod  openstack-manila-csi-controllerplugin-676cc65ffc-tnnkb is not in the audited list for safe-eviction and should not contain the safe-to-evict-local-volume annotation
        Expected
            <string>: socket-dir
        to be empty
        --- FAIL: TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations (0.02s)
} 

Description of problem:

    CGO_ENABLED=0 GO111MODULE=on GOWORK=off GOFLAGS=-mod=vendor go build -gcflags=all='-N -l' -ldflags '-extldflags "-static"' -o bin/hypershift .
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v5: /usr/lib/golang/pkg/tool/linux_amd64/compile: signal: killed
make: *** [Makefile:118: hypershift] Error 1

Version-Release number of selected component (if applicable):

    

How reproducible:

    25%

Steps to Reproduce:

    1.Send a PR
    2.Job builds the container
    

Actual results:

    The build aborts due to memory issues when processing AzureSDK imports

Expected results:

    The build completes

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

 

The openshift/origin test/extended/router/http2.go tests don't run on AWS. We disabled this some time ago. Let's enable this to see if it is still an issue. I have been running the http2 tests on AWS this week and I have not run into the original issue highlighted by the original bugzilla bug.

// platformHasHTTP2LoadBalancerService returns true where the default
// router is exposed by a load balancer service and can support http/2
// clients.
func platformHasHTTP2LoadBalancerService(platformType configv1.PlatformType) bool {
    switch platformType {
    case configv1.AzurePlatformType, configv1.GCPPlatformType:
        return true
    case configv1.AWSPlatformType:
        e2e.Logf("AWS support waiting on https://bugzilla.redhat.com/show_bug.cgi?id=1912413")
        fallthrough
    default:
        return false
    }
}

Description of problem:

    If the serverless function is not running, clicking the Test Serverless function button does nothing.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Install serverless operator
    2. Create a serverless function and make sure the status is False
    3. Click on Test Serverless function
    

Actual results:

    No response

Expected results:

    Maybe show an alert, or hide that option if the function is not ready?

Additional info:

    

Description of problem:

    After creating a self-signed TLS certificate and private key pair, add it to the trusted-ca-bundle by using the following cmd:

oc patch proxy/cluster \
     --type=merge \
     --patch='{"spec":{"trustedCA":{"name":"custom-ca"}}}'

The insights-runtime-extractor pod will then return a response with a 500 status code; these are the HTTPS flow details:

*   Trying 10.129.2.15:8000...
* Connected to exporter.openshift-insights.svc.cluster.local (10.129.2.15) port 8000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /var/run/configmaps/service-ca-bundle/service-ca.crt
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS header, Finished (20):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.2 (OUT), TLS header, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS header, Unknown (23):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS header, Unknown (23):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.exporter.openshift-insights.svc
*  start date: Jan  2 02:19:07 2025 GMT
*  expire date: Jan  2 02:19:08 2027 GMT
*  subjectAltName: host "exporter.openshift-insights.svc.cluster.local" matched cert's "exporter.openshift-insights.svc.cluster.local"
*  issuer: CN=openshift-service-serving-signer@1735784302
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.2 (OUT), TLS header, Unknown (23):
* TLSv1.2 (OUT), TLS header, Unknown (23):
* TLSv1.2 (OUT), TLS header, Unknown (23):
* Using Stream ID: 1 (easy handle 0x5577a19094a0)
* TLSv1.2 (OUT), TLS header, Unknown (23):
> GET /gather_runtime_info HTTP/2
> Host: exporter.openshift-insights.svc.cluster.local:8000
> accept: */*
> user-agent: insights-operator/one10time200gather184a34f6a168926d93c330 cluster/_f19625f5-ee5f-40c0-bc49-23a8ba1abe61_
> authorization: Bearer sha256~x9jj_SnjJf6LVlhhWFdUG8UqnPDHzZW0xMYa0WU05Gw
>
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.2 (OUT), TLS header, Unknown (23):
* TLSv1.2 (IN), TLS header, Unknown (23):
* TLSv1.2 (IN), TLS header, Unknown (23):
< HTTP/2 500
< content-type: text/plain; charset=utf-8
< date: Thu, 02 Jan 2025 08:18:59 GMT
< x-content-type-options: nosniff
< content-length: 33
<
* TLSv1.2 (IN), TLS header, Unknown (23):
stat : no such file or directory

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    True

Steps to Reproduce:

    1. Create a pair of self-signed tls cert and key
    2. Update trusted-ca-bundle by using following cmd:
      oc patch proxy/cluster \ --type=merge \ --patch='{"spec":{"trustedCA":{"name":"custom-ca"}}}'     
    3. Send a request to the insights-runtime-extractor pod via the following cmd:
    curl  -v --cacert  /var/run/configmaps/trusted-ca-bundle/ca-bundle.crt  -H "User-Agent: insights-operator/one10time200gather184a34f6a168926d93c330 cluster/_<cluster_id>_" -H "Authorization: <token>" -H 'Cache-Control: no-cache' https://api.openshift.com/api/accounts_mgmt/v1/certificates     

Actual results:

    3. The status code of response to this request is 500

Expected results:

3. The status code of response to this request should be 200 and return the runtime info as expected.

Additional info:

    

Description of problem:

The documentation in the MCO repo on supported extensions does not match the observed functionality in 4.17.

Version-Release number of selected component (if applicable):

4.17/4.18

How reproducible:

Always

Steps to Reproduce:

    1. Start a 4.17.x cluster.
    2. Apply a MC to install the sysstat extension.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 80-worker-extensions
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
    - sysstat
    3. Wait for the MCP targeted by the MC in step 2 to degrade & check the reason in the controller logs.
Degraded Machine: ip-10-0-21-15.us-west-2.compute.internal and Degraded Reason: invalid extensions found: [sysstat]

Actual results:

Applying an MC to install sysstat results in a degraded MCP.

Expected results:

Documentation in the MCO repo suggests that `sysstat` is supported for OCP version 4.17. Either the extension should be supported in 4.17 or the documentation should be updated to indicate the correct version where support was added.

Additional info:

Referenced MCO Repo Documentation can be found here.

Description of problem:

Initially, the clusters at version 4.16.9 were having issues with reconciling the IDP. The error which was found in Dynatrace was

 

  "error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: Service Unavailable",  

 

Initially it was assumed that the IDP service was unavailable, but the CU confirmed that they also had the GroupSync operator running inside all clusters, which can successfully connect to the customer IDP and sync User + Group information from the IDP into the cluster.

The CU was advised to upgrade to 4.16.18, keeping in mind a few of the other OCPBUGS that were related to the proxy and would be resolved by upgrading to 4.16.15+.

However, after the upgrade the IDP still seems to fail to apply. It looks like the IDP reconciler isn't considering the Additional Trust Bundle for the customer proxy.

Checking Dynatrace logs, it seems to fail to verify the certificate:

"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority",

  "error": "failed to update control plane: [failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority, failed to update status: Operation cannot be fulfilled on hostedcontrolplanes.hypershift.openshift.io \"rosa-staging\": the object has been modified; please apply your changes to the latest version and try again]", 

Version-Release number of selected component (if applicable):

4.16.18

How reproducible:

Customer has a few clusters deployed and each of them has the same issue.    

Steps to Reproduce:

    1. Create a HostedCluster with a proxy configuration that specifies an additionalTrustBundle, and an OpenID idp that can be publicly verified (ie. EntraID or Keycloak with LetsEncrypt certs)
    2. Wait for the cluster to come up and try to use the IDP
    3.
    

Actual results:

IDP is failing to work for HCP

Expected results:

IDP should be working for the clusters

Additional info:

    The issue will happen only if the IDP does not require a custom trust bundle to be verified.

Description of problem:

    As more systems have been added to Power VS, the assumption that every zone in a region has the same set of systypes has been broken. To properly represent what system types are available, the powervs_regions struct needed to be altered and parts of the installer referencing it needed to be updated.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try to deploy with s1022 in dal10
    2. SysType not available, even though it is a valid option in Power VS.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The HyperShift Control Plane Operator doesn't honor the set *_PROXY variables in some places in the code. Despite the proxy vars being set, HTTP requests for validating KAS health fail when all egress traffic is blocked except through a proxy.

I would like to check the complexity of replacing this library with archive/tar (from the standard library).

Outcome:

  • How long it takes
  • What impact on 4.17 payload and GA

Description of problem:

When listing the UserDefinedNetwork via the UI (or CLI), the MTU is not reported.

This makes the user flow cumbersome, since they won't know which MTU they're using unless they log into the VM, and actually check what's there.
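
Today the value only shows up in the spec, and only when it was set explicitly; a hedged check, assuming a namespace-scoped UDN named udn-primary in namespace demo with a layer2 topology:

# Nothing is reported in status; an empty result here means the default MTU is in use
oc get userdefinednetwork udn-primary -n demo -o jsonpath='{.spec.layer2.mtu}{"\n"}'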

Version-Release number of selected component (if applicable):

4.18.0-rc4

How reproducible:

Always

Steps to Reproduce:

1. Provision a primary UDN (namespace scoped)

2. Read the created UDN data.

 

Actual results: 

The MTU of the network is not available to the user.

Expected results:

The MTU of the network should be available in the UDN contents.

Additional info:


Affected Platforms:

All

The bootstrap API server should be terminated only after the API is HA; we should wait for the API to be available on at least 2 master nodes. These are the steps:

    1. API is HA (API is available on 2+ master nodes)
    2. delete the bootstrap kube-apiserver manifests
    3. wait for the bootstrap API to be down
    4. delete all other static manifests
    5. mark the bootstrap process done

We should note the difference between a) the bootstrap node itself existing, and b) API being available on the bootstrap node. Today inside the cluster bootstrap, we remove the bootstrap API (b) as soon as two master nodes appear. This is what happens today on the bootstrap node:
a) create the static assets
b) wait for 2 master nodes to appear
c) remove the kube-apiserver from the bootstrap node
d) mark the bootstrap process as completed

But we might already have a time window where the API is not available (starting from c, and until the API is available on a master node).

 

cluster bootstrap executable is invoked here:
https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L541
start --tear-down-early=false --asset-dir=/assets --required-pods="${REQUIRED_PODS}"
 

Then, cluster bootstrap removes the bootstrap API here: https://github.com/openshift/cluster-bootstrap/blob/bcd73a12a957ce3821bdfc0920751b8e3528dc98/pkg/start/start.go#L203-L209

https://github.com/openshift/cluster-bootstrap/blob/bcd73a12a957ce3821bdfc0920751b8e3528dc98/pkg/start/bootstrap.go#L124-L141

 

but the wait for API to be HA is done here: https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/report-progress.sh#L24

The wait should happen from within cluster-bootstrap; this PR moves the wait to before cluster-bootstrap tears down the bootstrap API/control plane.
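
A rough shell sketch of the intended gate (purely illustrative; the real change lives in cluster-bootstrap, and the kubeconfig path, label selector and node count here are assumptions):

# Wait until kube-apiserver is running on at least two control-plane nodes
# before tearing down the bootstrap API.
while true; do
  count=$(oc --kubeconfig=/opt/openshift/auth/kubeconfig \
    get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver \
    -o jsonpath='{range .items[?(@.status.phase=="Running")]}{.metadata.name}{"\n"}{end}' | wc -l)
  [ "${count:-0}" -ge 2 ] && break
  sleep 10
done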

 

util.go:153: InfrastructureReady=False: WaitingOnInfrastructureReady(private-router load balancer is not provisioned: Error syncing load balancer: failed to ensure load balancer: error creating load balancer: "Throttling: Rate exceeded\n\tstatus code: 400, request id: 3cdc703c-49f3-489b-b1bf-60278686d946")

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn/1899122456220667904

util.go:153: InfrastructureReady=False: WaitingOnInfrastructureReady(private-router load balancer is not provisioned: Error syncing load balancer: failed to ensure load balancer: error describing load balancer: "Throttling: Rate exceeded\n\tstatus code: 400, request id: 216792b1-767b-4388-a14b-be5b749a1434")

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1899123660770250752

`aws-cloud-controller-manager` is very aggressive with AWS API calls, and across all management clusters we are getting throttled.

Step 7: 

AZURE_DISK_MI_NAME, AZURE_FILE_MI_NAME, and IMAGE_REGISTRY_MI_NAME need unique names so they don't overlap with other resources.

Step 10:

The AKS_CLUSTER_NAME variable is never defined anywhere in the steps (add it to the variables at the start).

Step 19:

"$CLUSTER_NAME" should be replaced with "$HC_NAME" from previous steps

https://hypershift-docs.netlify.app/how-to/azure/create-azure-cluster_on_aks/ 
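A possible docs fix, sketched below; all values are placeholders and only need to be unique per install:

    export PREFIX="hcp-$(date +%s)"
    export AZURE_DISK_MI_NAME="${PREFIX}-disk-mi"
    export AZURE_FILE_MI_NAME="${PREFIX}-file-mi"
    export IMAGE_REGISTRY_MI_NAME="${PREFIX}-image-registry-mi"
    export AKS_CLUSTER_NAME="${PREFIX}-aks"   # referenced later in the steps but never defined today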

Description of problem:

    The ingress-to-route controller does not provide any information about failed conversions from Ingress to Route. This is a big issue in environments heavily dependent on Ingress objects. The only way to find out why a route is not created is trial and error, as the only information available is that the route is not created.

Version-Release number of selected component (if applicable):

    OCP 4.14

How reproducible:

    100%

Steps to Reproduce:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    route.openshift.io/termination: passthrough
  name: hello-openshift-class
  namespace: test
spec:
  ingressClassName: openshift-default
  rules:
  - host: ingress01-rhodain-test01.apps.rhodain03.sbr-virt.gsslab.brq2.redhat.com
    http:
      paths:
      - backend:
          service:
            name: myapp02
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - {}  
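After applying the Ingress above, the failure can only be inferred indirectly; neither of these standard commands points at the cause today:

    $ oc get routes -n test
    $ oc get events -n test --field-selector involvedObject.kind=Ingress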

Actual results:

    Route is not created and no error is logged

Expected results:

    An error is provided in the events or at least in the controller's logs. Events are preferred, as Ingress objects are mainly created by users without cluster-admin privileges.

Additional info:

    

 

Description of problem:

As admin, I can configure a defaultCertificate for the cluster domain (e.g. name.example.com) with SANs for a custom domain (e.g. name.example.org). Cluster users can create application routes exposed on the custom domain (myapp.apps.example.org) without including a certificate in the route definition.

As an admin, I cannot expose the console over the custom domain and rely on the defaultCertificate without specifying an `ingress.spec.componentRoutes.servingCertKeyPairSecret`.

Version-Release number of selected component (if applicable):

    

How reproducible:

100%

Steps to Reproduce:

    1. Configure the defaultCertificate with SANs for both the .net and .org domains

    $ openssl x509 -in *.apps.name.example.net.crt -ext subjectAltName
    X509v3 Subject Alternative Name: 
        DNS:*.apps.name.example.net, DNS:*.apps.name.example.org
    $ oc create configmap custom-ca --from-file=ca-bundle.crt=rootCA.crt -n openshift-config
    $ oc patch proxy/cluster --type=merge --patch='{"spec":{"trustedCA":{"name":"custom-ca"}}}'
    $ oc create secret tls custom-ingress-cert --cert=*.apps.name.example.net.crt --key=*.apps.name.example.net.key -n openshift-ingress
    $ oc patch ingresscontroller.operator default --type=merge -p '{"spec":{"defaultCertificate": {"name": "custom-ingress-cert"}}}' -n openshift-ingress-operator

    2. create and expose user routes on default and custom domain, without specifying server certificate

    $ oc new-project san-multidom-wildcard
    $ kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
    $ oc expose deployment/hello-node --port 9376
    $ oc create route edge --service=hello-node hello-node-default
    $ oc create route edge --service=hello-node hello-node-custom --hostname=hello-node-custom-san-multidom-wildcard.apps.name.example.org 
    $ curl --cacert rootCA.crt https://$(oc get route hello-node-default -ojsonpath='{.spec.host}')
    hello-node-8dd54cb99-27j5h
    $ curl --cacert rootCA.crt https://$(oc get route hello-node-custom -ojsonpath='{.spec.host}')
    hello-node-8dd54cb99-27j5h

    3. Expose the console on a custom route but default domain, test and undo again:

    $ oc patch ingress.config.openshift.io/cluster --type=merge -p '{"spec":{"componentRoutes": [{"name": "console", "namespace": "openshift-console", "hostname": "console.apps.name.example.net"}]}}' -n openshift-ingress-operator
    ingress.config.openshift.io/cluster patched
    $ curl --cacert rootCA.crt -Lv console.apps.name.example.net >/dev/null
    <...>
    * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
    * ALPN: server did not agree on a protocol. Uses default.
    * Server certificate:
    *  subject: CN=*.apps.name.example.net
    *  start date: Feb 24 10:34:18 2025 GMT
    *  expire date: Feb 24 10:34:18 2026 GMT
    *  subjectAltName: host "console.apps.name.example.net" matched cert's "*.apps.name.example.net"
    *  issuer: CN=MyRootCa
    *  SSL certificate verify ok.
    * using HTTP/1.x
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: console.apps.name.example.net
    <...>
    $ oc patch ingress.config.openshift.io/cluster --type=merge -p '{"spec":{"componentRoutes": []}}' -n openshift-ingress-operator
    $ oc get route -n openshift-console
    NAME        HOST/PORT                                               PATH   SERVICES    PORT    TERMINATION          WILDCARD
    console     console-openshift-console.apps.name.example.net            console     https   reencrypt/Redirect   None
    downloads   downloads-openshift-console.apps.name.example.net          downloads   http    edge/Redirect        None

    4. Expose the console on the custom domain without specifying servingCertKeyPairSecret

    $ oc patch ingress.config.openshift.io/cluster --type=merge -p '{"spec":{"componentRoutes": [{"name": "console", "namespace": "openshift-console", "hostname": "console.apps.name.example.org"}]}}' -n openshift-ingress-operator
    ingress.config.openshift.io/cluster patched
    $ oc get route -n openshift-console
    NAME        HOST/PORT                                               PATH   SERVICES    PORT    TERMINATION          WILDCARD
    console     console-openshift-console.apps.name.example.net            console     https   reencrypt/Redirect   None
    downloads   downloads-openshift-console.apps.name.example.net          downloads   http    edge/Redirect        None
    $ oc logs -n openshift-console-operator deployment/console-operator
    <...>
    E0224 15:45:30.836226       1 base_controller.go:268] ConsoleRouteController reconciliation failed: secret reference for custom route TLS secret is not defined

Actual results:

console-operator rejects the route:
ConsoleRouteController reconciliation failed: secret reference for custom route TLS secret is not defined

Expected results:

As an admin I expect the defaultCertificate with valid SANs to be used for a custom console route. Currently I need to maintain the certificate in 2 different secrets (namespaces openshift-config & openshift-ingress).
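For reference, a sketch of the workaround alluded to above (keeping the certificate in a second secret and referencing it explicitly); the secret name is illustrative and the certificate files are the ones from step 1:

    $ oc create secret tls console-custom-cert --cert=*.apps.name.example.net.crt --key=*.apps.name.example.net.key -n openshift-config
    $ oc patch ingress.config.openshift.io/cluster --type=merge -p '{"spec":{"componentRoutes":[{"name":"console","namespace":"openshift-console","hostname":"console.apps.name.example.org","servingCertKeyPairSecret":{"name":"console-custom-cert"}}]}}'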

Additional info:

    

Description of problem:

OpenShift Virtualization allows hotplugging block volumes into its pods, which relies on the assumption that changing the cgroup corresponding to the container's PID suffices.

crun is test-driving some changes it integrated recently: it now configures two cgroups, the `*.scope` parent and a sub-cgroup called `container`, whereas before the parent existed as a sort of no-op (it wasn't configured, so all devices were allowed, for example). This breaks volume hotplug, since applying the device filter to the sub-cgroup alone is no longer enough.

Version-Release number of selected component (if applicable):

4.18.0 RC2

How reproducible:

100%    

Steps to Reproduce:

    1. Block volume hotplug to VM
    2.
    3.
    

Actual results:

    Failure

Expected results:

    Success

Additional info:

https://kubevirt.io/user-guide/storage/hotplug_volumes/

Description of problem:

Installing a private cluster with CAPI fails with MissingNodeRef.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-13-083421     

How reproducible:

Always    

Steps to Reproduce:

1. Install a private cluster from a bastion host
2. Prepare the install-config.yaml:
publish: Internal
featureSet: CustomNoUpgrade
featureGates: ["ClusterAPIInstall=true"]
baseDomain: private-ibmcloud-2.qe.devcluster.openshift.com
credentialsMode: Manual
platform:
  ibmcloud:
    region: jp-tok
    networkResourceGroupName: ci-op-0yrhzx7l-ac1a3-rg
    vpcName: ci-op-0yrhzx7l-ac1a3-vpc
    controlPlaneSubnets:
    - ci-op-0yrhzx7l-ac1a3-control-plane-jp-tok-3-0
    - ci-op-0yrhzx7l-ac1a3-control-plane-jp-tok-2-0
    - ci-op-0yrhzx7l-ac1a3-control-plane-jp-tok-1-0
    computeSubnets:
    - ci-op-0yrhzx7l-ac1a3-compute-jp-tok-3-0
    - ci-op-0yrhzx7l-ac1a3-compute-jp-tok-2-0
    - ci-op-0yrhzx7l-ac1a3-compute-jp-tok-1-0
    resourceGroupName: ci-op-0yrhzx7l-ac1a3-rg
controlPlane:
  name: master
  platform:
    ibmcloud:
      type: bx2-4x16
      zones: [jp-tok-1, jp-tok-2, jp-tok-3]
  replicas: 3
compute:
- name: worker
  platform:
    ibmcloud:
      type: bx2-4x16
      zones: [jp-tok-1, jp-tok-2, jp-tok-3]
  replicas: 3
proxy:
  httpProxy: http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@10.244.128.4:3128
  httpsProxy: http://XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX@10.244.128.4:3128
networking:
  machineNetwork:
  - cidr: 10.244.0.0/16
 3. Install the cluster; the installation fails
    

Actual results:

Install failed. In bootkube.json from the log bundle:
    "stage": "resolve-api-url",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "76 resolve_url /usr/local/bin/bootstrap-verify-api-server-urls.sh",
    "errorMessage": "Checking if api.maxu-capi9p.private-ibmcloud.qe.devcluster.openshift.com of type API_URL is resolvable\nStarting stage resolve-api-url\nUnable to resolve API_URL api.maxu-capi9p.private-ibmcloud.qe.devcluster.openshift.com"

in Cluster-openshift-cluster-api-guests-ci-op-0yrhzx7l-ac1a3-2vc6j.yaml
  - type: ControlPlaneInitialized
    status: "False"
    severity: Info
    lasttransitiontime: "2024-12-17T09:39:31Z"
    reason: MissingNodeRef
    message: Waiting for the first control plane machine to have its status.nodeRef
      set   

Expected results:

The install should succeed.

Additional info:

Installing an External cluster with proxy succeeds.

install Internal cluster failed, ref Prow Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59392/rehearse-59392-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.19-installer-rehearse-ibmcloud-proxy-private-capi/1868948420899639296  
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59392/rehearse-59392-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.19-cucushift-installer-rehearse-ibmcloud-ipi-private-capi/1871051791659962368  

Description of problem:

    When building console, it is possible to include exposedModules that are not used by console-extensions.json

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In the agent-based installer, the user must provide an apiVIP and ingressVIP on the baremetal and vsphere platforms.
In IPI, the VIPs must be provided for baremetal, but they are optional for vsphere.

There is currently no validation that checks that the VIPs are provided on vsphere, and if an install-config is provided that does not set platform.vsphere.apiVIP or platform.vsphere.ingressVIP, the installer crashes when trying to generate the agent ISO:

panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/agent/manifests.(*AgentClusterInstall).Generate(0xc0004ad350, 0x5?)
	/home/zbitter/openshift/installer/pkg/asset/agent/manifests/agentclusterinstall.go:182 +0xd79
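For reference, a hedged install-config fragment showing the vSphere fields whose absence triggers the panic (addresses are placeholders; validation should reject their absence instead of crashing):

platform:
  vsphere:
    apiVIP: 192.168.111.5
    ingressVIP: 192.168.111.7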

 The following test is failing with the updated 1.32 Kubernetes in OCP 4.19:

 

[It] [sig-node] PodRejectionStatus Kubelet should reject pod when the node didn't have enough resource

 

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_kubernetes/2148/pull-ci-openshift-kubernetes-master-k8s-e2e-gcp-ovn/1862475873232359424

This test will be disabled temporarily to not block the rebase progress. This bug ticket is used to track the work to enable this test in OCP 4.19 again.

Description of problem:

When PowerVS deletes a cluster, it does so via pattern matching on the name. Limit the searches by resource group ID to prevent collisions.
    

Description of problem:

DEBUG Creating ServiceAccount for control plane nodes 
DEBUG Service account created for XXXXX-gcp-r4ncs-m 
DEBUG Getting policy for openshift-dev-installer   
DEBUG adding roles/compute.instanceAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.networkAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.securityAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/storage.admin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add master roles: failed to set IAM policy, unexpected error: googleapi: Error 400: Service account XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com does not exist., badRequest

It appears that the Service account was created correctly. The roles are assigned to the service account. It is possible that there needs to be a "wait for action to complete" on the server side to ensure that this will all be ok.
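A hedged sketch of the kind of client-side wait that would cover the propagation delay (the gcloud call is standard; the account email is the redacted one from the log above):

    SA="XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com"
    until gcloud iam service-accounts describe "$SA" >/dev/null 2>&1; do
      echo "waiting for $SA to become visible before adding IAM bindings..."
      sleep 5
    done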

Version-Release number of selected component (if applicable):

    

How reproducible:

Random. Appears to be a sync issue    

Steps to Reproduce:

    1. Run the installer for a normal GCP basic install
    2.
    3.
    

Actual results:

    Installer fails saying that the Service Account that the installer created does not have the permissions to perform an action. Sometimes it takes numerous tries for this to happen (very intermittent). 

Expected results:

    Successful install

Additional info:

    
// NOTE: The nested `Context` containers inside the following `Describe` container are used to group certain tests based on the environments they demand.
// NOTE: When adding a test-case, ensure that the test-case is placed in the appropriate `Context` container.
// NOTE: The containers themselves are guaranteed to run in the order in which they appear.
var _ = g.Describe("[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set", g.Ordered, func() {
    defer g.GinkgoRecover()
    o.SetDefaultEventuallyTimeout(15 * time.Minute)
    o.SetDefaultEventuallyPollingInterval(5 * time.Second)
    r := &runner{}

In OCP Origin, the above test plays with the global variables for poll interval and poll timeout, which causes flakes in all other tests in origin.

Example in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview/1877755411424088064/artifacts/e2e-aws-ovn-techpreview/openshift-e2e-test/build-log.txt

A networking test is failing because we are not polling correctly: the test above overrode the default poll interval of 10ms and made it 5s, which caused the networking test to fail because our poll timeout was itself only 5 seconds.

Please don't use the global variables, or at least unset them after the test run is over.

Please note that this causes flakes that are hard to debug; we didn't know what was causing the poll interval to be 5 seconds instead of the default 10ms.

3 payloads in a row have failed due to "service-load-balancer-with-pdb-reused-connections mean disruption should be less than historical plus five standard deviations", beginning with 4.19.0-0.nightly-2025-02-07-133207.

Tested the payload before the issue appeared as well as the noted payload via gangway; the issue was reproduced only in the jobs using the noted payload (started between 7:45 and 7:55 on 2/8/25).

Noted that cloud-provider-aws#98 is present in the payload where the issue appeared to start. Will test a revert to see if the issue clears or continues to reproduce.

Description of problem:

oc adm node-image create --pxe does not generate only PXE artifacts, but copies everything from the node-joiner pod. Also, the names of the PXE artifacts are not correct (prefixed with agent instead of node).

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. oc adm node-image create --pxe

Actual results:

    Everything from the node-joiner pod is copied. The PXE artifact names are wrong.

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz

Additional info:

    

Description of problem:

BYO-KMS install failed with CAPI.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-24-213048     

How reproducible:

always

Steps to Reproduce:

    1. Create a key
    2. Use the key in install-config.yaml:
publish: External
featureSet: CustomNoUpgrade
featureGates: ["ClusterAPIInstall=true"]
baseDomain: ibmcloud.qe.devcluster.openshift.com
credentialsMode: Manual
platform:
  ibmcloud:
    region: jp-tok
    networkResourceGroupName: ci-op-7hcfbzfy-142dd-rg
    vpcName: ci-op-7hcfbzfy-142dd-vpc
    controlPlaneSubnets:
    - ci-op-7hcfbzfy-142dd-control-plane-jp-tok-3-0
    - ci-op-7hcfbzfy-142dd-control-plane-jp-tok-2-0
    - ci-op-7hcfbzfy-142dd-control-plane-jp-tok-1-0
    computeSubnets:
    - ci-op-7hcfbzfy-142dd-compute-jp-tok-3-0
    - ci-op-7hcfbzfy-142dd-compute-jp-tok-2-0
    - ci-op-7hcfbzfy-142dd-compute-jp-tok-1-0
    resourceGroupName: ci-op-7hcfbzfy-142dd
    defaultMachinePlatform:
      bootVolume:
        encryptionKey: "crn:v1:bluemix:public:kms:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861:4a6c67ca-7708-44c0-87fe-9eff2c111c00:key:4cf691f0-9cb1-4011-80b5-02aed0bbae60"

or 
publish: External
featureSet: CustomNoUpgrade
featureGates: ["ClusterAPIInstall=true"]
baseDomain: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
credentialsMode: Manual
platform:
  ibmcloud:
    region: jp-tok
    networkResourceGroupName: ci-op-3py46711-142dd-rg
    vpcName: ci-op-3py46711-142dd-vpc
    controlPlaneSubnets:
    - ci-op-3py46711-142dd-control-plane-jp-tok-3-0
    - ci-op-3py46711-142dd-control-plane-jp-tok-2-0
    - ci-op-3py46711-142dd-control-plane-jp-tok-1-0
    computeSubnets:
    - ci-op-3py46711-142dd-compute-jp-tok-3-0
    - ci-op-3py46711-142dd-compute-jp-tok-2-0
    - ci-op-3py46711-142dd-compute-jp-tok-1-0
    resourceGroupName: ci-op-3py46711-142dd
controlPlane:
  name: master
  platform:
    ibmcloud:
      type: bx2-4x16
      zones: [jp-tok-1, jp-tok-2, jp-tok-3]
      bootVolume:
        encryptionKey: "crn:v1:bluemix:public:kms:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861:2aa5aefd-1168-4191-a525-c9dce0da520e:key:a95a2abe-c566-43f9-b523-b06698465601"
  replicas: 3
compute:
- name: worker
  platform:
    ibmcloud:
      type: bx2-4x16
      zones: [jp-tok-1, jp-tok-2, jp-tok-3]
      bootVolume:
        encryptionKey: "crn:v1:bluemix:public:kms:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861:95afa81f-7486-49ce-a84f-f7b491d85c8c:key:2efdff99-4fdf-46ab-b424-b30f171094df"
  replicas: 3

 3. Install the cluster with CAPI
    

Actual results:

Install failed. In kube-apiserver.log:
rejected by webhook "vibmvpcmachine.kb.io": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"vibmvpcmachine.kb.io\" denied the request: IBMVPCMachine.infrastructure.cluster.x-k8s.io \"ci-op-7hcfbzfy-142dd-8vxvb-bootstrap\" is invalid: spec.bootVolume.sizeGiB: Invalid value: v1beta2.IBMVPCMachineSpec{Name:\"ci-op-7hcfbzfy-142dd-8vxvb-master-0\", CatalogOffering:(*v1beta2.IBMCloudCatalogOffering)(nil), PlacementTarget:(*v1beta2.VPCMachinePlacementTarget)(nil), Image:(*v1beta2.IBMVPCResourceReference)(0xc000d8a2c0), LoadBalancerPoolMembers:[]v1beta2.VPCLoadBalancerBackendPoolMember{v1beta2.VPCLoadBalancerBackendPoolMember{LoadBalancer:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a2e0)}, Pool:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a2f0)}, Port:6443, Weight:(*int64)(nil)}, v1beta2.VPCLoadBalancerBackendPoolMember{LoadBalancer:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a300)}, Pool:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a310)}, Port:22623, Weight:(*int64)(nil)}, v1beta2.VPCLoadBalancerBackendPoolMember{LoadBalancer:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a320)}, Pool:v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a330)}, Port:6443, Weight:(*int64)(nil)}}, Zone:\"jp-tok-1\", Profile:\"bx2-4x16\", BootVolume:(*v1beta2.VPCVolume)(0xc0014f20f0), ProviderID:(*string)(nil), PrimaryNetworkInterface:v1beta2.NetworkInterface{SecurityGroups:[]v1beta2.VPCResource{v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a340)}, v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a350)}, v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a360)}, v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a370)}, v1beta2.VPCResource{ID:(*string)(nil), Name:(*string)(0xc000d8a380)}}, Subnet:\"ci-op-7hcfbzfy-142dd-control-plane-jp-tok-1-0\"}, SSHKeys:[]*v1beta2.IBMVPCResourceReference(nil)}: valid Boot VPCVolume size is 10 - 250 GB", Reason:"Invalid", Details:(*v1.StatusDetails)(0xc0044719e0), Code:422}}    

Expected results:

Install succeeds with CAPI using BYO-KMS.

Additional info:

ref: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59392/rehearse-59392-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.19-cucushift-installer-rehearse-ibmcloud-ipi-byo-kms-capi/1871830778820694016    

Description of problem:

    Currently we create HFC resources for all BMHs, regardless of whether they use IPMI or Redfish; this can lead to misunderstanding.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    4.18 HyperShift operator's NodePool controller fails to serialize NodePool ConfigMaps that contain ImageDigestMirrorSet. Inspecting the code, it fails on NTO reconciliation logic, where only machineconfiguration API schemas are loaded into the YAML serializer: https://github.com/openshift/hypershift/blob/f7ba5a14e5d0cf658cf83a13a10917bee1168011/hypershift-operator/controllers/nodepool/nto.go#L415-L421

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Install 4.18 HyperShift operator
    2. Create NodePool with configuration ConfigMap that includes ImageDigestMirrorSet
    3. HyperShift operator fails to reconcile NodePool

Actual results:

    HyperShift operator fails to reconcile NodePool

Expected results:

    HyperShift operator to successfully reconcile NodePool

Additional info:

    Regression introduced by https://github.com/openshift/hypershift/pull/4717
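A hedged example of the kind of NodePool configuration ConfigMap that triggers this (names, namespace, and registries are placeholders, and it assumes the usual `config` data key for NodePool configuration):

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-idms
  namespace: clusters
data:
  config: |
    apiVersion: config.openshift.io/v1
    kind: ImageDigestMirrorSet
    metadata:
      name: my-mirrors
    spec:
      imageDigestMirrors:
      - source: registry.example.com/ubi8
        mirrors:
        - mirror.example.com/ubi8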

Description of problem:

A new format library was introduced for CEL in 4.18, but, it is not usable in 4.18 due to upgrade checks put in place (to allow version skew between API servers and rollbacks).

This means that the library is actually only presently usable in 4.19 once 1.32 ships. However, there are some issues we may face.

We have a number of APIs in flight currently that would like to use this new library, we cannot get started on those features until this library is enabled.

Some of those features would also like to be backported to 4.18.

We also have risks on upgrades. If we decide to use this format library in any API that is upgraded prior to KAS, then during an upgrade, the CRD will be applied to the older version of the API server, blocking the upgrade as it will fail.

By backporting the library (pretending it was introduced earlier, and then introducing it directly into 4.17), we can enable anything that installs post KAS upgrade to leverage this from 4.18 (solving those features asking for backports), and enable anything that upgrades pre-kas to actually leverage this in 4.19.

API approvers will be responsible for making sure the libraries and upgrade compatibility are considered as new APIs are introduced.

Presently, the library has had no bug fixes applied to the release-1.31 or release-1.32 branches upstream. The backport from 4.18 to 4.17 was clean bar some conflicts in the imports that were easily resolved, so I'm confident that if we do need to backport any bug fixes, this should be straightforward.

Any bugs in these libraries can be assigned to me (jspeed)

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When Swift is not available (in any manner: 403, 404, etc.), Cinder is the backend for the Cluster Image Registry Operator via a PVC. The problem here is that we see an error that Swift is not available, but then no PVC is created.

Version-Release number of selected component (if applicable):

4.18.0

How reproducible:

Disable the swiftoperator role for your user; no PVC will be created.

Actual results:

E1122 15:37:26.301213       1 controller.go:379] unable to sync: unable to sync storage configuration: persistentvolumeclaims "image-registry-storage" not found, requeuing
E1122 15:37:50.851275       1 swift.go:84] error listing swift containers: Expected HTTP response code [200 204 300[] when accessing [GET https://10.8.1.135:13808/v1/AUTH_6640775c6b5d4e5fa997fb9b85254da1/], but got 403 instead: <html><h1>Forbidden</h1><p>Access was denied to this resource.</p></html>
I1122 15:37:50.858381       1 controller.go:294] object changed: *v1.Config, Name=cluster (metadata=false, spec=true): added:spec.storage.pvc.claim="image-registry-storage", changed:status.conditions.2.lastTransitionTime={"2024-11-22T15:37:26Z" -> "2024-11-22T15:37:50Z"}
I1122 15:37:50.873526       1 controller.go:340] object changed: *v1.Config, Name=cluster (status=true): changed:metadata.generation={"12.000000" -> "11.000000"}, removed:metadata.managedFields.2.apiVersion="imageregistry.operator.openshift.io/v1", removed:metadata.managedFields.2.fieldsType="FieldsV1", removed:metadata.managedFields.2.manager="cluster-image-registry-operator", removed:metadata.managedFields.2.operation="Update", removed:metadata.managedFields.2.time="2024-11-22T15:37:50Z", changed:status.conditions.2.lastTransitionTime={"2024-11-22T15:37:26Z" -> "2024-11-22T15:37:50Z"}, changed:status.observedGeneration={"10.000000" -> "11.000000"}
E1122 15:37:50.885488       1 controller.go:379] unable to sync: unable to sync storage configuration: persistentvolumeclaims "image-registry-storage" not found, requeuing

Expected results:

The PVC should be created and the operator should therefore become healthy.
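A quick way to see the current state (the claim name is taken from the error above):

    $ oc get configs.imageregistry.operator.openshift.io/cluster -o jsonpath='{.spec.storage}'
    $ oc get pvc image-registry-storage -n openshift-image-registry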

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

    After the SMCP is automatically created by the controller, it can be manually deleted, but it is not automatically recreated even if the default gatewayclass exists. If you also manually delete the default gatewayclass, both can be recreated automatically by recreating the default gatewayclass.

Perhaps if SMCP is deleted, the gatewayclass should also be automatically deleted?  Or, if the SMCP is deleted, check into why it doesn't get reconciled and recreated if the default gatewayclass exists.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Create the default gatewayclass, which automatically creates the SMCP
    2. Delete the SMCP
    3. Try to create a httpRoute
    

Actual results:

    SMCP is not recreated and gatewayAPI function gets broken, e.g. httpRoute never attaches to gatewayClass and never works.

Expected results:

    SMCP gets recreated if it is missing when the gatewayAPI controller reconciles.

Additional info:

    If you delete the SMCP and default gatewayClass at the same time, and then create a new gatewayClass, it will also recreate the SMCP at that time.
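A sketch of that recovery path (the gatewayclass name is illustrative, and the controllerName is assumed to be the OpenShift gateway controller; adjust to the values actually used in the cluster):

    $ oc delete gatewayclass openshift-default
    $ cat <<'EOF' | oc apply -f -
    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: openshift-default
    spec:
      controllerName: openshift.io/gateway-controller
    EOF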

Description of the problem:

ImageClusterInstall is timing out due to the ibi-monitor-cm ConfigMap missing. This seems to be a result of the installation-configuration.service failing on the spoke cluster when attempting to unmarshal the image-digest-sources.json file containing IDMS information for the spoke.

How reproducible:

100%

Steps to reproduce:

1. Configure a spoke with IBIO using IDMS

 

Additional information:

Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Post pivot operation has started"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="waiting for block device with label cluster-config or for configuration folder /opt/openshift/cluster-configuration"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Reading seed image info"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Reading seed reconfiguration info"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/setSSHKey.done already exists, skipping"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/pull-secret.done already exists, skipping"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Copying nmconnection files if they were provided"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/apply-static-network.done already exists, skipping"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Running systemctl restart [NetworkManager.service]"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing systemctl with args [restart NetworkManager.service]"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Setting new hostname target-0-0"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing hostnamectl with args [set-hostname target-0-0]"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Writing machine network cidr 192.168.126.0 into /etc/default/nodeip-configuration"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Setting new dnsmasq and forcedns dispatcher script configuration"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Running systemctl restart [dnsmasq.service]"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing systemctl with args [restart dnsmasq.service]"
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing bash with args [-c update-ca-trust]"
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="/opt/openshift/recert.done already exists, skipping"
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="No server ssh keys were provided, fresh keys already regenerated by recert, skipping"
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: 2025-01-22T10:32:07Z        INFO        post-pivot-dynamic-client        Setting up retry middleware
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: 2025-01-22T10:32:07Z        INFO        post-pivot-dynamic-client        Successfully created dynamic client
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="Running systemctl enable [kubelet --now]"
Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="Executing systemctl with args [enable kubelet --now]"
Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Start waiting for api"
Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="waiting for api"
Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Deleting ImageContentSourcePolicy and ImageDigestMirrorSet if they exist"
Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Deleting default catalog sources"
Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Applying manifests from /opt/openshift/cluster-configuration/manifests"
Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=info msg="manifest applied: /opt/openshift/cluster-configuration/manifests/99-master-ssh.json"
Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=info msg="manifest applied: /opt/openshift/cluster-configuration/manifests/99-worker-ssh.json"
Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=error msg="failed apply manifests: failed to decode manifest image-digest-sources.json: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go value of type map[string]interface {}"
Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=fatal msg="Post pivot operation failed"
Jan 22 10:32:09 target-0-0 systemd[1]: installation-configuration.service: Main process exited, code=exited, status=1/FAILURE 

 

This seems to stem from the way imageDigestSources are handled in the installer.

https://github.com/openshift/installer/blob/release-4.18/pkg/asset/imagebased/configimage/imagedigestsources.go#L24

vs

https://github.com/openshift/installer/blob/release-4.18/pkg/asset/manifests/imagedigestmirrorset.go#L44
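A hedged reconstruction of the mismatch from the error message above (not copied from the actual files): the config-image asset appears to write image-digest-sources.json as a bare JSON array of entries such as

    [{"source": "registry.example.com/foo", "mirrors": ["mirror.example.com/foo"]}]

while the post-pivot manifest loader expects a full manifest object (apiVersion/kind/metadata/spec) like the ImageDigestMirrorSet that imagedigestmirrorset.go produces, hence "cannot unmarshal array into Go value of type map[string]interface {}".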

 

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/86

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Previously, failed task runs did not emit results; now they do, but the UI still shows "No TaskRun results available due to failure" even though the task run's status contains a result.
    

Version-Release number of selected component (if applicable):

4.14.3
    

How reproducible:

Always with a task run producing a result but failing afterwards
    

Steps to Reproduce:

    1. Create the pipeline below and run it
    2. Have a look at its task run
    
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
  - name: hello
    taskSpec:
      results:
      - name: greeting1
      steps:
      - name: greet
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          #!/usr/bin/env bash
          set -e
          echo -n "Hello World!" | tee $(results.greeting1.path)
          exit 1
  results:
  - name: greeting2
    value: $(tasks.hello.results.greeting1)
    

Actual results:

No results in UI
    

Expected results:

One result should be displayed even though the task run failed.
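The result is in fact present on the failed TaskRun and can be confirmed from the CLI (the TaskRun name below is a placeholder for whatever name the PipelineRun generated):

    $ oc get taskrun <taskrun-name> -o jsonpath='{.status.results}'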
    

Additional info:

Pipelines 1.13.0
    

Description of problem:

    When running 4.18 installer QE full-function tests, the following arm64 instance types were detected and passed testing, so append them to the installer doc [1]:
* StandardDpdsv6Family
* StandardDpldsv6Family
* StandardDplsv6Family
* StandardDpsv6Family
* StandardEpdsv6Family
* StandardEpsv6Family  
[1] https://github.com/openshift/installer/blob/main/docs/user/azure/tested_instance_types_aarch64.md

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

"Cannot read properties of undefined (reading 'state')" Error in search tool when filtering Subscriptions while adding new Subscriptions

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. As an Administrator, go to Home -> Search and filter by Subscription component
    2. Start creating subscriptions (bulk)
 
    

Actual results:

    The filtered results turn into the "Oh no! Something went wrong" view.

Expected results:

    Get updated results every few seconds

Additional info:

Reloading the view fixes it.

 

Stack Trace:

TypeError: Cannot read properties of undefined (reading 'state')
    at L (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/subscriptions-chunk-89fe3c19814d1f6cdc84.min.js:1:3915)
    at na (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:58879)
    at Hs (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:111315)
    at Sc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98327)
    at Cc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98255)
    at _c (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98118)
    at pc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:95105)
    at https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44774
    at t.unstable_runWithPriority (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:289:3768)
    at Uo (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44551) 

 

Description of problem:

    CNO doesn't propagate HCP labels to 2nd level operands 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create hostedCluster with .spec.Labels 
    

Actual results:

   cloud-network-config-controller, multus-admission-controller, network-node-identity, ovnkube-control-plane-6dd8775f97-75f89 pods don't have the specified labels.

Expected results:

   cloud-network-config-controller, multus-admission-controller, network-node-identity, ovnkube-control-plane-6dd8775f97-75f89 pods have the specified labels.    

Additional info:

    

Description of problem:

Improve the tests to remove the issue in the following Helm test case:
Perform the helm chart upgrade for already upgraded helm chart : HR-08-TC02 (Helm Release), which fails after 37s with:
The following error originated from your application code, not from Cypress. It was caused by an unhandled promise rejection.

  > Cannot read properties of undefined (reading 'repoName')

When Cypress detects uncaught errors originating from your application it will automatically fail the current test.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

    Workloads - DeploymentConfigs - Add storage: 'Container from image' is in English.

Version-Release number of selected component (if applicable):

    4.18.0-rc.6

How reproducible:

    always

Steps to Reproduce:

    1. Change the web console UI to a non en_US locale
    2. Navigate to Workloads - DeploymentConfigs - (click on a deployment config) - Actions - Add storage - (click on 'select specific container')
    3. 'Container from image' is in English
    

Actual results:

    content is in English

Expected results:

    content should be in selected language

Additional info:

    Reference screenshot added

Description of problem:

1. When there are no UDNs, there is just a button to create a UDN from a form.
2. When there are UDNs, there are two options: create a Cluster UDN or a UDN.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem

From our docs:

Due to fundamental Kubernetes design, all OpenShift Container Platform updates between minor versions must be serialized. You must update from OpenShift Container Platform <4.y> to <4.y+1>, and then to <4.y+2>. You cannot update from OpenShift Container Platform <4.y> to <4.y+2> directly. However, administrators who want to update between two even-numbered minor versions can do so incurring only a single reboot of non-control plane hosts.

We should add a new precondition that enforces that policy, so cluster admins who run --to-image ... don't hop straight from 4.y.z to 4.(y+2).z' or similar without realizing that they were outpacing testing and policy.

Version-Release number of selected component

The policy and the current lack of a guard both date back to all OCP 4 releases, and since they're Kube-side constraints, they may date back to the start of Kube.

How reproducible

Every time.

Steps to Reproduce

1. Install a 4.y.z cluster.
2. Use --to-image to request an update to a 4.(y+2).z release.
3. Wait a few minutes for the cluster-version operator to consider the request.
4. Check with oc adm upgrade.

Actual results

Update accepted.

Expected results

Update rejected (unless it was forced), complaining about the excessively long hop.

Description of problem:

Upgrade to 4.18 is not working, because the machine-config update is stuck:

$ oc get co/machine-config
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.17.0-0.nightly-2025-01-13-120007   True        True          True       133m    Unable to apply 4.18.0-rc.4: error during syncRequiredMachineConfigPools: [context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-ef1c06aa9aeedcebfa50569c3aa9472a expected a964f19a214946f0e5f1197c545d3805393d0705 has 3594c4b2eb42d8c9e56a146baea52d9c147721b0: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-826ddf793cf0a677228234437446740f, retrying]

The machine-config-controller shows the responsible for that:

$ oc logs -n openshift-machine-config-operator                  machine-config-controller-69f59598f7-57lkv
[...]
I0116 13:54:16.605692       1 drain_controller.go:183] node ostest-xgjnz-master-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"openstack-manila-csi-controllerplugin-6754c7589f-dwjtm" -n "openshift-manila-csi-driver": This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.

There are 2 PDBs on the manila namespace:

$ oc get pdb -n openshift-manila-csi-driver
NAME                                        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
manila-csi-driver-controller-pdb            N/A             1                 1                     80m
openstack-manila-csi-controllerplugin-pdb   N/A             1                 1                     134m

So a workaround is to remove the pdb openstack-manila-csi-controllerplugin-pdb.
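Workaround command (the PDB name and namespace are taken from the listing above):

$ oc delete pdb openstack-manila-csi-controllerplugin-pdb -n openshift-manila-csi-driver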

Version-Release number of selected component (if applicable):

From 4.17.0-0.nightly-2025-01-13-120007 to 4.18.0-rc.4
on top of RHOS-17.1-RHEL-9-20241030.n.1
    

How reproducible:

Always

Steps to Reproduce:
1. Install latest 4.17
2. Update to 4.18, for example:

oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.18.0-rc.4 --allow-explicit-upgrade --force

Additional info: must-gather in a private comment.

Description of problem:

All OCP components should be included in OCP's monitoring.
OLMV1 does not have this configuration. 

References

Acceptance Criteria:

PS: We should also check whether we can add an e2e downstream test that covers it.

Description of problem:

    Common cli flags are redefined for the copy and delete commands, causing inconsistencies. For example, `--log-level` for copy and `--loglevel` for delete.
    We should make it so common flags are shared and set only once.

Version-Release number of selected component (if applicable):

    4.18+

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    delete accepts a `--log-level` flag. Other than that, no change in the available flags for each command

Additional info:

    

Description of problem:

    During the EUS-to-EUS upgrade of an MNO cluster from 4.14.16 to 4.16.11 on bare metal, we have seen that, depending on the custom configuration (e.g. a performance profile or container runtime config), one or more control plane nodes are rebooted multiple times.

This seems to be a race condition. When the first rendered MachineConfig is generated, the first control plane node starts rebooting (maxUnavailable is set to 1 on the master MCP), and at that moment a new MachineConfig render is generated, which means a second reboot. Once this first node has been rebooted the second time, the rest of the control plane nodes are rebooted just once, because no more new MachineConfig renders are generated.

Version-Release number of selected component (if applicable):

    OCP 4.14.16 > 4.15.31  > 4.16.11

How reproducible:

    Perform the upgrade of a multi-node OCP cluster with a custom configuration such as a performance profile or container runtime configuration (for example, force cgroups v1, or change runc to crun).

Steps to Reproduce:

    1. Deploy on bare metal an MNO OCP 4.14 with a custom manifest, like the one below:

---
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: v1

    2. Upgrade the cluster to the next minor version available, for instance 4.15.31; make it a partial upgrade by pausing the worker MachineConfigPool.

    3. Monitor the upgrade process (cluster operators, MachineConfigs, MachineConfigPools and nodes)
    

Actual results:

    You will see that once almost all the cluster operators are at 4.15.31 (except the Machine Config Operator), a new MachineConfig render for the master MachineConfigPool is generated only after the first control plane node has already rebooted; monitor the MachineConfig renders and the nodes to observe this.

Expected results:

  What is expected is that during an upgrade only one MachineConfig render is generated per MachineConfigPool, and only one reboot per node is needed to finish the upgrade.

Additional info:

    

Impact assessment of OCPBUGS-24009

Which 4.y.z to 4.y'.z' updates increase vulnerability?

Any upgrade up to 4.15.{current-z}

Which types of clusters?

Any non-Microshift cluster with an operator installed via OLM before upgrade to 4.15. After upgrading to 4.15, re-installing a previously uninstalled operator may also cause this issue. 

What is the impact? Is it serious enough to warrant removing update recommendations?

OLM Operators can't be upgraded and may incorrectly report failed status.

How involved is remediation?

Delete the resources associated with the OLM installation related to the failure message in the olm-operator.

A failure message similar to this may appear on the CSV:

InstallComponentFailed install strategy failed: rolebindings.rbac.authorization.k8s.io "openshift-gitops-operator-controller-manager-service-auth-reader" already exists

The following resource types have been observed to encounter this issue and should be safe to delete:

  • ClusterRoleBinding suffixed with "-system:auth-delegator"
  • Service
  • RoleBinding suffixed with "-auth-reader"

Under no circumstances should a user delete a CustomResourceDefinition (CRD) if the same error occurs and names such a resource, as data loss may occur. Note that we have not seen this type of resource named in the error from any of our users so far.

Labeling the problematic resources with olm.managed: "true" then restarting the olm-operator pod in the openshift-operator-lifecycle-manager namespace may also resolve the issue if the resource appears risky to delete.
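A sketch of that labeling approach (the RoleBinding name follows the failure message pattern above; adjust the namespace to wherever the resource named in your error lives, typically kube-system for the "-auth-reader" bindings):

$ oc label rolebinding openshift-gitops-operator-controller-manager-service-auth-reader -n kube-system olm.managed=true --overwrite
$ oc delete pod -n openshift-operator-lifecycle-manager -l app=olm-operator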

Is this a regression?

Yes, functionality which worked in 4.14 may break after upgrading to 4.15. Not a regression; this is a new issue related to performance improvements added to OLM in 4.15.

https://issues.redhat.com/browse/OCPBUGS-24009

https://issues.redhat.com/browse/OCPBUGS-31080

https://issues.redhat.com/browse/OCPBUGS-28845

Description of problem:

Missing translation for ""Read write once pod (RWOP)" ja and zh

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    We can input an invalid value into the zone field in the GCP providerSpec.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-16-094159

How reproducible:

    Always

Steps to Reproduce:

    1. Edit a MachineSet with an invalid zone value, then scale the MachineSet.
   
    

Actual results:

    The MachineSet is edited successfully.

Machines are stuck with a blank status and do not fail.

miyadav@miyadav-thinkpadx1carbongen8:~/multifieldsgcp$ oc get machines
NAME                                 PHASE     TYPE            REGION        ZONE            AGE
miyadav-1809g-7bdh4-master-0         Running   n2-standard-4   us-central1   us-central1-a   62m
miyadav-1809g-7bdh4-master-1         Running   n2-standard-4   us-central1   us-central1-b   62m
miyadav-1809g-7bdh4-master-2         Running   n2-standard-4   us-central1   us-central1-c   62m
miyadav-1809g-7bdh4-worker-a-9kmdv   Running   n2-standard-4   us-central1   us-central1-a   57m
miyadav-1809g-7bdh4-worker-b-srj28   Running   n2-standard-4   us-central1   us-central1-b   57m
miyadav-1809g-7bdh4-worker-c-828v9   Running   n2-standard-4   us-central1   us-central1-c   57m
miyadav-1809g-7bdh4-worker-f-7d9bx                                                           11m
miyadav-1809g-7bdh4-worker-f-bcr7v   Running   n2-standard-4   us-central1   us-central1-f   20m
miyadav-1809g-7bdh4-worker-f-tjfjk                                                           7m3s

 

Expected results:

    The machine status should report a failed phase and the reason, possibly after a timeout, instead of waiting continuously.

Additional info:

    Logs are present in the machine-controller:
"E1018 03:55:39.735293       1 controller.go:316] miyadav-1809g-7bdh4-worker-f-7d9bx: failed to check if machine exists: unable to verify project/zone exists: openshift-qe/us-central1-in; err: googleapi: Error 400: Invalid value for field 'zone': 'us-central1-in'. Unknown zone., invalid"

The machines will also be stuck in deletion because they have no status.


For an invalid ProjectID, errors in the logs:
googleapi: Error 403: Permission denied on resource project OPENSHIFT-QE.
Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.Help",
    "links": [
      {
        "description": "Google developers console",
        "url": "https://console.developers.google.com"
      }
    ]
  },
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadatas": {
      "consumer": "projects/OPENSHIFT-QE",
      "service": "compute.googleapis.com"
    },
    "reason": "CONSUMER_INVALID"
  }
]
, forbidden
E1018 08:59:40.405238       1 controller.go:316] "msg"="Reconciler error" "error"="unable to verify project/zone exists: OPENSHIFT-QE/us-central1-f; err: googleapi: Error 403: Permission denied on resource project OPENSHIFT-QE.\nDetails:\n[\n  {\n    \"@type\": \"type.googleapis.com/google.rpc.Help\",\n    \"links\": [\n      {\n        \"description\": \"Google developers console\",\n        \"url\": \"https://console.developers.google.com\"\n      }\n    ]\n  },\n  {\n    \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n    \"domain\": \"googleapis.com\",\n    \"metadatas\": {\n      \"consumer\": \"projects/OPENSHIFT-QE\",\n      \"service\": \"compute.googleapis.com\"\n    },\n    \"reason\": \"CONSUMER_INVALID\"\n  }\n]\n, forbidden" "controller"="machine-controller" "name"="miyadav-1809g-7bdh4-worker-f-dcnf5" "namespace"="openshift-machine-api" "object"={"name":"miyadav-1809g-7bdh4-worker-f-dcnf5","namespace":"openshift-machine-api"} "reconcileID"="293f9d09-1387-4702-8b67-2d209316585e"


must-gather- https://drive.google.com/file/d/1N--U8V3EfdEYgQUvK-fcrGxBYRDnzK1G/view?usp=sharing

ProjectID issue must-gather -https://drive.google.com/file/d/1lKNOu4eVmJJbo23gbieD5uVNtw_qF7p6/view?usp=sharing

 

Description of problem:

Konnectivity introduced a smarter readiness check with kubernetes-sigs/apiserver-network-proxy#485. It would be nice to do some better readiness and liveness checks on startup.

Version-Release number of selected component (if applicable):

    

How reproducible:

Steps to Reproduce:

-
    

Actual results:

Expected results:

    

Additional info: Implementation in https://github.com/openshift/hypershift/pull/4829

Description of problem:

    If hostDevices.deviceName has multiple types, the generated hostDevices.name may have duplicates.

Version-Release number of selected component (if applicable):

    4.19 4.18 4.17

How reproducible:

    100%

Steps to Reproduce:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
spec:
 platform:
  kubevirt:
   hostDevices:
   - count: 8
     deviceName: nvidia.com/H20
   - count: 4
     deviceName: nvidia.com/NVSwitch

Actual results:

kubevirtmachines yaml
hostDevices:
 - deviceName: nvidia.com/H20
   name: hostdevice-1
 - deviceName: nvidia.com/H20
   name: hostdevice-2
 - deviceName: nvidia.com/H20
   name: hostdevice-3
 - deviceName: nvidia.com/H20
   name: hostdevice-4
 - deviceName: nvidia.com/NVSwitch
   name: hostdevice-1
 - deviceName: nvidia.com/NVSwitch
   name: hostdevice-2

Expected results:

kubevirtmachines yaml
hostDevices:
 - deviceName: nvidia.com/H20
   name: hostdevice-1
 - deviceName: nvidia.com/H20
   name: hostdevice-2
 - deviceName: nvidia.com/H20
   name: hostdevice-3
 - deviceName: nvidia.com/H20
   name: hostdevice-4
 - deviceName: nvidia.com/NVSwitch
   name: hostdevice-5
 - deviceName: nvidia.com/NVSwitch
   name: hostdevice-6
    

Additional info:

    

Description of problem:


Tests are failing because `oc image info quay.io/coreos/etcd:latest` does not work anymore.

e.g. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1897807117629263872

[sig-imageregistry][Feature:ImageInfo] Image info should display information about images [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

When executing a delete, the logs are wrong and still say that mirroring is ongoing:
Mirroring is ongoing. No errors

Version-Release number of selected component (if applicable):

./oc-mirror.rhel8 version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411090338.p0.g0a7dbc9.assembly.stream.el9-0a7dbc9", GitCommit:"0a7dbc90746a26ddff3bd438c7db16214dcda1c3", GitTreeState:"clean", BuildDate:"2024-11-09T08:33:46Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

When executing the delete, the logs still say that mirroring is ongoing:

oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2  --dest-tls-verify=false --force-cache-delete=true
envar TEST_E2E detected - bypassing unshare
2024/11/12 03:10:04  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/12 03:10:04  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/12 03:10:04  [INFO]   : ⚙️  setting up the environment for you...
2024/11/12 03:10:04  [INFO]   : 🔀 workflow mode: diskToMirror / delete
2024/11/12 03:10:04  [INFO]   : 👀 Reading delete file...
2024/11/12 03:10:04  [INFO]   : 🚀 Start deleting the images...
2024/11/12 03:10:04  [INFO]   : images to delete 396 
 ✓ 1/396 : (0s) docker://registry.redhat.io/devworkspace/devworkspace-operator-bundle@sha256:5689ad3d80dea99cd842992523debcb1aea17b6db8dbd80e412cb2e…
2024/11/12 03:10:04  [INFO]   : Mirroring is ongoing. No errors.
  

 

Actual results:

oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2  --dest-tls-verify=false --force-cache-delete=true
envar TEST_E2E detected - bypassing unshare
2024/11/12 03:10:04  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/12 03:10:04  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/12 03:10:04  [INFO]   : ⚙️  setting up the environment for you...
2024/11/12 03:10:04  [INFO]   : 🔀 workflow mode: diskToMirror / delete
2024/11/12 03:10:04  [INFO]   : 👀 Reading delete file...
2024/11/12 03:10:04  [INFO]   : 🚀 Start deleting the images...
2024/11/12 03:10:04  [INFO]   : images to delete 396 
 ✓ 1/396 : (0s) docker://registry.redhat.io/devworkspace/devworkspace-operator-bundle@sha256:5689ad3d80dea99cd842992523debcb1aea17b6db8dbd80e412cb2e…
2024/11/12 03:10:04  [INFO]   : Mirroring is ongoing. No errors.

Expected results:

The logs should show delete progress rather than reporting that mirroring is ongoing.

Additional info:

 

 

Description of problem:

On clusters running on OpenShift Virt (Agent Based Install), the `metal3-ramdisk-logs` container is eating up a core, but its logs are empty:
oc adm top pod --sort-by=cpu --sum -n openshift-machine-api --containers 
POD                                           NAME                                     CPU(cores)   MEMORY(bytes)   
metal3-55c9bc8ff4-nh792                       metal3-ramdisk-logs                      988m         1Mi             
metal3-55c9bc8ff4-nh792                       metal3-httpd                             1m           20Mi            
metal3-55c9bc8ff4-nh792                       metal3-ironic                            0m           121Mi           
cluster-baremetal-operator-5bf8bcbbdd-jvhq7   cluster-baremetal-operator               1m           25Mi                

Version-Release number of selected component (if applicable):

4.17.12    

How reproducible:

always    

Steps to Reproduce:

Cluster is reachable on Red Hat VPN - reach out on slack to get access 
    

Actual results:

logs are empty, but a core is consumed    

Expected results:

container should be more or less idle

Additional info:

    

Description of problem:

When deploying MetalLB, the validatingwebhookconfigurations.admissionregistration resource is deleted. It can take several minutes to come back up. During this time it is possible to configure invalid frrconfigurations.

Version-Release number of selected component (if applicable):

How reproducible: Easily

Steps to Reproduce:
1. Verify validatingwebhookconfigurations.admissionregistration is deployed
2. Deploy MetalLB
3. Verify that validatingwebhookconfigurations.admissionregistration has been removed.
Actual results:

validatingwebhookconfigurations.admissionregistration is being removed

Expected results:

When MetalLB is deployed, it should not remove the validatingwebhookconfigurations.admissionregistration resource

Additional info:

Description of problem:

    The installation failed in the disconnected environment due to a failure to get controlPlaneOperatorImageLabels: failed to look up image metadata.

Version-Release number of selected component (if applicable):

    4.19 4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Disconnected env
    2. Create an agent hosted cluster

Actual results:

       - lastTransitionTime: "2025-01-05T13:55:14Z"
      message: 'failed to get controlPlaneOperatorImageLabels: failed to look up image
        metadata for registry.ci.openshift.org/ocp/4.18-2025-01-04-031500@sha256:ba93b7791accfb38e76634edbc815d596ebf39c3d4683a001f8286b3e122ae69:
        failed to obtain root manifest for registry.ci.openshift.org/ocp/4.18-2025-01-04-031500@sha256:ba93b7791accfb38e76634edbc815d596ebf39c3d4683a001f8286b3e122ae69:
        manifest unknown: manifest unknown'
      observedGeneration: 2
      reason: ReconciliationError
      status: "False"
      type: ReconciliationSucceeded 

Expected results:

    The cluster becomes ready.

Additional info:

- mirrors:
    - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
  source: registry.build01.ci.openshift.org/ci-op-p2mqdwjp/release
- mirrors:
    - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
  source: registry.ci.openshift.org/ocp/4.18-2025-01-04-031500
- mirrors:
    - virthost.ostest.test.metalkube.org:6001/openshifttest
  source: quay.io/openshifttest
- mirrors:
    - virthost.ostest.test.metalkube.org:6001/openshift-qe-optional-operators
  source: quay.io/openshift-qe-optional-operators
- mirrors:
    - virthost.ostest.test.metalkube.org:6001/olmqe
  source: quay.io/olmqe
- mirrors:
    - virthost.ostest.test.metalkube.org:6002
  source: registry.redhat.io
- mirrors:
    - virthost.ostest.test.metalkube.org:6002
  source: brew.registry.redhat.io
- mirrors:
    - virthost.ostest.test.metalkube.org:6002
  source: registry.stage.redhat.io
- mirrors:
    - virthost.ostest.test.metalkube.org:6002
  source: registry-proxy.engineering.redhat.com


Console reports its internal version back in segment.io telemetry. This version is opaque and cannot easily be correlated back to a particular OpenShift version. We should use an OpenShift version like 4.17.4 instead in segment.io events.

Ali Mobrem 

Description of problem:

A node was created today with the worker label and was then labeled as a loadbalancer to match the MCP selector. The MCP saw the selector and moved to Updating, but the machine-config-daemon pod isn't responding. We tried deleting the pod and it still didn't pick up that it needed a new config. Manually editing the desired config appears to work around the issue, but that shouldn't be necessary.

Node created today:

[dasmall@supportshell-1 03803880]$ oc get nodes worker-048.kub3.sttlwazu.vzwops.com -o yaml | yq .metadata.creationTimestamp
'2024-04-30T17:17:56Z'

Node has worker and loadbalancer roles:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com
NAME                                  STATUS   ROLES                 AGE   VERSION
worker-048.kub3.sttlwazu.vzwops.com   Ready    loadbalancer,worker   1h    v1.25.14+a52e8df


MCP shows a loadbalancer needing Update and 0 nodes in worker pool:

[dasmall@supportshell-1 03803880]$ oc get mcp
NAME           CONFIG                                                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
loadbalancer   rendered-loadbalancer-1486d925cac5a9366d6345552af26c89   False     True       False      4              3                   3                     0                      87d
master         rendered-master-47f6fa5afe8ce8f156d80a104f8bacae         True      False      False      3              3                   3                     0                      87d
worker         rendered-worker-a6be9fb3f667b76a611ce51811434cf9         True      False      False      0              0                   0                     0                      87d
workerperf     rendered-workerperf-477d3621fe19f1f980d1557a02276b4e     True      False      False      38             38                  38                    0                      87d


Status shows mcp updating:

[dasmall@supportshell-1 03803880]$ oc get mcp loadbalancer -o yaml | yq .status.conditions[4]
lastTransitionTime: '2024-04-30T17:33:21Z'
message: All nodes are updating to rendered-loadbalancer-1486d925cac5a9366d6345552af26c89
reason: ''
status: 'True'
type: Updating


Node still appears happy with worker MC:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com -o yaml | grep rendered-
    machineconfiguration.openshift.io/currentConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9


machine-config-daemon pod appears idle:

[dasmall@supportshell-1 03803880]$ oc logs -n openshift-machine-config-operator machine-config-daemon-wx2b8 -c machine-config-daemon
2024-04-30T17:48:29.868191425Z I0430 17:48:29.868156   19112 start.go:112] Version: v4.12.0-202311220908.p0.gef25c81.assembly.stream-dirty (ef25c81205a65d5361cfc464e16fd5d47c0c6f17)
2024-04-30T17:48:29.871340319Z I0430 17:48:29.871328   19112 start.go:125] Calling chroot("/rootfs")
2024-04-30T17:48:29.871602466Z I0430 17:48:29.871593   19112 update.go:2110] Running: systemctl daemon-reload
2024-04-30T17:48:30.066554346Z I0430 17:48:30.066006   19112 rpm-ostree.go:85] Enabled workaround for bug 2111817
2024-04-30T17:48:30.297743470Z I0430 17:48:30.297706   19112 daemon.go:241] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 (412.86.202311271639-0) 828584d351fcb58e4d799cebf271094d5d9b5c1a515d491ee5607b1dcf6ebf6b
2024-04-30T17:48:30.324852197Z I0430 17:48:30.324543   19112 start.go:101] Copied self to /run/bin/machine-config-daemon on host
2024-04-30T17:48:30.325677959Z I0430 17:48:30.325666   19112 start.go:188] overriding kubernetes api to https://api-int.kub3.sttlwazu.vzwops.com:6443
2024-04-30T17:48:30.326381479Z I0430 17:48:30.326368   19112 metrics.go:106] Registering Prometheus metrics
2024-04-30T17:48:30.326447815Z I0430 17:48:30.326440   19112 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
2024-04-30T17:48:30.327835814Z I0430 17:48:30.327811   19112 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
2024-04-30T17:48:30.327932144Z I0430 17:48:30.327923   19112 update.go:2125] Starting to manage node: worker-048.kub3.sttlwazu.vzwops.com
2024-04-30T17:48:30.332123862Z I0430 17:48:30.332097   19112 rpm-ostree.go:394] Running captured: rpm-ostree status
2024-04-30T17:48:30.332928272Z I0430 17:48:30.332909   19112 daemon.go:1049] Detected a new login session: New session 1 of user core.
2024-04-30T17:48:30.332935796Z I0430 17:48:30.332926   19112 daemon.go:1050] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:30.368619942Z I0430 17:48:30.368598   19112 daemon.go:1298] State: idle
2024-04-30T17:48:30.368619942Z Deployments:
2024-04-30T17:48:30.368619942Z * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368619942Z
2024-04-30T17:48:30.368619942Z   ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368907860Z I0430 17:48:30.368884   19112 coreos.go:54] CoreOS aleph version: mtime=2023-08-08 11:20:41.285 +0000 UTC build=412.86.202308081039-0 imgid=rhcos-412.86.202308081039-0-metal.x86_64.raw
2024-04-30T17:48:30.368932886Z I0430 17:48:30.368926   19112 coreos.go:71] Ignition provisioning: time=2024-04-30T17:03:44Z
2024-04-30T17:48:30.368938120Z I0430 17:48:30.368931   19112 rpm-ostree.go:394] Running captured: journalctl --list-boots
2024-04-30T17:48:30.372893750Z I0430 17:48:30.372884   19112 daemon.go:1307] journalctl --list-boots:
2024-04-30T17:48:30.372893750Z -2 847e119666d9498da2ae1bd89aa4c4d0 Tue 2024-04-30 17:03:13 UTC—Tue 2024-04-30 17:06:32 UTC
2024-04-30T17:48:30.372893750Z -1 9617b204b8b8412fb31438787f56a62f Tue 2024-04-30 17:09:06 UTC—Tue 2024-04-30 17:36:39 UTC
2024-04-30T17:48:30.372893750Z  0 3cbf6edcacde408b8979692c16e3d01b Tue 2024-04-30 17:39:20 UTC—Tue 2024-04-30 17:48:30 UTC
2024-04-30T17:48:30.372912686Z I0430 17:48:30.372891   19112 rpm-ostree.go:394] Running captured: systemctl list-units --state=failed --no-legend
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378059   19112 daemon.go:1322] systemd service state: OK
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378066   19112 daemon.go:987] Starting MachineConfigDaemon
2024-04-30T17:48:30.378121340Z I0430 17:48:30.378106   19112 daemon.go:994] Enabling Kubelet Healthz Monitor
2024-04-30T17:48:31.486786667Z I0430 17:48:31.486747   19112 daemon.go:457] Node worker-048.kub3.sttlwazu.vzwops.com is not labeled node-role.kubernetes.io/master
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491594   19112 daemon.go:1243] Current+desired config: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491603   19112 daemon.go:1253] state: Done
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495617   19112 daemon.go:617] Detected a login session before the daemon took over on first boot
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495624   19112 daemon.go:618] Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:31.503165515Z I0430 17:48:31.503052   19112 update.go:2110] Running: rpm-ostree cleanup -r
2024-04-30T17:48:32.232728843Z Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
2024-04-30T17:48:35.755815139Z Freed: 92.3 MB (pkgcache branches: 0)
2024-04-30T17:48:35.764568364Z I0430 17:48:35.764548   19112 daemon.go:1563] Validating against current config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.120148982Z I0430 17:48:36.120119   19112 rpm-ostree.go:394] Running captured: rpm-ostree kargs
2024-04-30T17:48:36.179660790Z I0430 17:48:36.179631   19112 update.go:2125] Validated on-disk state
2024-04-30T17:48:36.182434142Z I0430 17:48:36.182406   19112 daemon.go:1646] In desired config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.196911084Z I0430 17:48:36.196879   19112 config_drift_monitor.go:246] Config Drift Monitor started

Version-Release number of selected component (if applicable):

    4.12.45

How reproducible:

    They can reproduce in multiple clusters

Actual results:

    Node stays with rendered-worker config

Expected results:

    The machineconfigpool update should prompt a change to the node's desired config, which the machine-config-daemon pod then applies to the node

Additional info:

    Here is the latest must-gather where this issue is occurring:
https://attachments.access.redhat.com/hydra/rest/cases/03803880/attachments/3fd0cf52-a770-4525-aecd-3a437ea70c9b?usePresignedUrl=true

Description of problem:

The Console plugin details page is throwing an error on some specific YAML

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716    

How reproducible:

Always    

Steps to Reproduce:

1. Create a ConsolePlugin with the minimum required fields:
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  backend:
    type: Service
  displayName: OpenShift Console Demo Plugin

2. Visit consoleplugin details page at /k8s/cluster/console.openshift.io~v1~ConsolePlugin/console-demo-plugin

Actual results:

2. We will see an error page    

Expected results:

2. We should not show an error page, since the ConsolePlugin YAML has every required field even though it is not complete

Additional info:

    

Description of problem:

On OCP 4.18.0-0.nightly-2024-12-14-152515, we tried adding the spec, but it was removed on reconcile with an "Admission Webhook" warning popup:
ConsolePlugin odf-console violates policy 299 - "unknown field \"spec.contentSecurityPolicy\""

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try adding "contentSecurityPolicy" to any ConsolePlugin CR.
    2.
    3.
    

Actual results:

    Spec is getting removed.

Expected results:

    Spec should be supported by ConsolePlugin "v1".

Additional info:

    Refer https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1734339650501379 for more details.
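For reference, a hedged sketch of the kind of spec addition attempted above. The directive/values shape is assumed from the ConsolePlugin v1 API extension and the service details are placeholders; verify against the CRD actually installed on the cluster:

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: odf-console
spec:
  displayName: OpenShift Data Foundation
  backend:
    type: Service
    service:
      name: odf-console-service        # placeholder service details
      namespace: openshift-storage
      port: 9001
      basePath: /
  # Field under test; the admission webhook in this report strips it because
  # the installed CRD schema does not yet include it.
  contentSecurityPolicy:
  - directive: ScriptSrc
    values:
    - https://example.com
  - directive: ImgSrc
    values:
    - https://example.com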

Security Tracking Issue

Do not make this issue public.

Flaw:


Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122

An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/26

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Some permissions are missing when edge zones are specified in the install-config.yaml, probably those related to Carrier Gateways (but maybe more)

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always with minimal permissions

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    time="2024-11-20T22:40:58Z" level=debug msg="\tfailed to describe carrier gateways in vpc \"vpc-0bdb2ab5d111dfe52\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-girt7h2j-4515a-minimal-perm is not authorized to perform: ec2:DescribeCarrierGateways because no identity-based policy allows the ec2:DescribeCarrierGateways action"

Expected results:

    All required permissions are listed in pkg/asset/installconfig/aws/permissions.go

Additional info:

    See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9222/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1859351015715770368 for a failed min-perms install

Description of problem:

The A-06-TC02, A-06-TC05 and A-06-TC10 test cases are failing for the create-from-git.feature file. The file requires an update.

Version-Release number of selected component (if applicable):

    

How reproducible:

   

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Tests are failing with timeout error

Expected results:

    Test should run green

Additional info:

    

Description of problem:

    Alongside users that update resources often, the audit log analyzer should also find resources that are updated often. The existing tests don't trigger when a resource is being updated by different users or is not namespaced.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In the Assisted Installer, discovery hosts use reverse DNS (rDNS) to set their hostname (in case the hostname is not provided by DHCP). If the hostname cannot be resolved via DNS, the default hostname localhost.localdomain is assigned.

During the discovery phase, there is a validation check that prevents installation if the hostname is localhost. To bypass this check, the system replaces unresolved hostnames with the MAC address of the NIC associated with the IP address.

However, this replacement does not work for NICs configured with bonding. As a result, users must manually change the hostname to proceed with the installation.
    

Part of DEI efforts, may need input from testplatform and CRT.

Justin mentions there is a transparent way to do this with GitHub and he has permissions to do it. Update the branch config in the release repo.

Description of problem:

    When a developer sandbox user logs in for the first time, OpenShift Console ignores the path in the URL the user clicked on (e.g. "/add/ns/krana-dev") and navigates the user to the 'all projects' view instead of the namespace from the URL.
 This also happens in any other OCP instance; the behavior is apparently tied to the user console settings ConfigMap. When a user logs in for the first time, the console user settings ConfigMap is not created yet. When the CM is not created (and doesn't have the console.lastNamespace field populated yet), the UI overrides the namespace navigation from "/add/ns/krana-dev" to "add/all-namespaces".
 For the second login, when the CM with the field is already present, the URL navigation works as expected.

Version-Release number of selected component (if applicable):

    Not sure if previous or future versions are impacted. We are currently on 4.17, so we can confirm that OpenShift 4.17 is impacted.

How reproducible:

    Reproducible for a first-time user.

Steps to Reproduce:

In any OCP cluster
    1. Provision new cluster (or use any cluster you haven't logged in before)
    2. Navigate to <cluster-url>/add/ns/default (you can also use any other namespace already present in the cluster)
    3. You will land in the "all-namespaces" view and the URL will be changed to /add/all-namespaces
In Developer Sandbox
    1. make sure you're not active user on https://developers.redhat.com/developer-sandbox
    2. if you are already active, you can either create a new user or ask anyone in #forum-dev-sandbox to deactivate your user.
    3. Go to https://console.redhat.com/openshift/sandbox click on "Get Started" and then click on "launch" button in the Red Hat OpenShift tile. This redirects you to your namespace /add/ns/<username>-dev
    4. OpenShift console doesn't navigate you to the "*-dev" namespace, but navigates you to all-namespaces view. The change is also visible in the url.   

Actual results:

    User is navigated to all-namespaces.
This behavior has a negative impact on the UX in the Developer Sandbox. A good percentage of DevSandbox users are trying to learn more about OpenShift and often don't know what to do or where to go. Not landing in the right namespace can discourage users from experimenting with OpenShift and can also break any redirection from learning paths or other Red Hat materials.

Expected results:

  The path in the URL is not ignored and not overwritten, and the user is navigated to the namespace from the URL

Additional info:

    

 

Description of problem:

It appears that during cluster creation, when MAPI starts up and begins to manage Machines in an IPI-deployed cluster on IBM Cloud, it can detect an unhealthy CP node and attempt a one-time replacement of that node, effectively destroying the cluster.

Version-Release number of selected component (if applicable):

4.19

How reproducible:

< 10%

Steps to Reproduce:

A potential way to reproduce, but is a relatively small timing window to meet
1. Create a new IPI cluster on IBM Cloud
2. Attempt to Stop a CP node once MAPI starts deploying, to allow MAPI to believe a CP node needs replacement. (This is an extremely tight window)

Replication may not be possible manually, and only just by luck.

Actual results:

One or more CP nodes get replaced during cluster creation, destroying etcd and other deployment of critical CP workloads, effectively breaking the cluster.

Expected results:

Successful cluster deployment.

Additional info:

Back when OCP was using RHEL 8 (RHCOS base), a known bug with NetworkManager caused the loss of the assigned IP on a new IBM Cloud Instance (VSI), resulting in the new Instance never being able to start up with dracut and Ignition to work with the MCO.
Because of this bug with NetworkManager, a fix was created to force a one-time replacement of that VSI by MAPI, to try to resolve this issue and allow the VSI to complete bringup and report into the cluster.

https://issues.redhat.com/browse/OCPBUGS-1327


Unfortunately at that time, this appeared to only affect worker nodes, but in a few cases, it appears it is now affecting CP nodes as well, which was not the intention. I will add some logs and details with what I think is proof that MAPI is performing this same replacement on CP nodes.

Description of problem:

The MCC pod reports a panic when we install a cluster using the PR that makes OCL v1 GA.

The panic doesn't seem related to OCL itself, but to the ManagedBootImages functionality. Nevertheless, we have only been able to reproduce it when we use the mentioned PR to build the cluster.
    

Version-Release number of selected component (if applicable):

We have only been able to reproduce the issue while using this PR to create the cluster:
https://github.com/openshift/api/pull/2192


    

How reproducible:

2 out of 2. Not sure if always.
    

Steps to Reproduce:

    1. Use https://github.com/openshift/api/pull/2192 to create payload image
    2. Install the cluster using the image generated in step 1
    
    

Actual results:

MCC reports a panic in its 'previous' logs

$ oc logs -p machine-config-controller-5b4b8d7d94-bmhdh
......
I0206 09:55:53.678676       1 kubelet_config_controller.go:222] Re-syncing all kubelet config controller generated MachineConfigs due to apiServer cluster change
E0206 09:55:53.678746       1 template_controller.go:245] "Unhandled Error" err="couldn't get ControllerConfig on dependency callback &%!w(errors.StatusError=errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ListMeta:v1.ListMeta{SelfLink:\"\", ResourceVersion:\"\", Continue:\"\", RemainingItemCount:(*int64)(nil)}, Status:\"Failure\", Message:\"controllerconfig.machineconfiguration.openshift.io \\\"machine-config-controller\\\" not found\", Reason:\"NotFound\", Details:(*v1.StatusDetails)(0xc0009a0de0), Code:404}})"
I0206 09:55:53.679532       1 reflector.go:368] Caches populated for *v1.ClusterVersion from github.com/openshift/client-go/config/informers/externalversions/factory.go:125
I0206 09:55:53.680703       1 template_controller.go:198] Re-syncing ControllerConfig due to apiServer cluster change
I0206 09:55:53.680747       1 reflector.go:368] Caches populated for *v1.ConfigMap from k8s.io/client-go/informers/factory.go:160
I0206 09:55:53.686268       1 machine_set_boot_image_controller.go:221] configMap coreos-bootimages added, reconciling enrolled machine resources
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x3a87c1a]

goroutine 372 [running]:
github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image.(*Controller).syncMAPIMachineSets(0xc000dc8900, {0x43aeec5, 0x17})
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go:329 +0xba
github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image.(*Controller).addConfigMap.func1()
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go:225 +0x25
created by github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image.(*Controller).addConfigMap in goroutine 347
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/machine-set-boot-image/machine_set_boot_image_controller.go:225 +0x145


    

Expected results:

No panic should be reported
    

Additional info:


    

Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/51

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

operator conditions kube-apiserver

Significant regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 99.46% to 90.48%.

Sample (being evaluated) Release: 4.19
Start Time: 2025-02-27T00:00:00Z
End Time: 2025-03-06T12:00:00Z
Success Rate: 90.48%
Successes: 57
Failures: 6
Flakes: 0

Base (historical) Release: 4.18
Start Time: 2025-02-04T00:00:00Z
End Time: 2025-03-06T12:00:00Z
Success Rate: 99.46%
Successes: 183
Failures: 1
Flakes: 0

View the test details report for additional context.

A selection of baremetal jobs now fail due to the following error:

 time="2025-03-06T06:08:48Z" level=info msg="Extracted /usr/bin/k8s-tests-ext.gz for tag hyperkube from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c21b9b37064cdf3297b44e747ea86ca92269e1e133f5042f8bd3ef9aaeda9e6e (disk size 119671560, extraction duration 7.038773908s)"
time="2025-03-06T06:08:48Z" level=info msg="Listing images for \"k8s-tests-ext\""
error: encountered errors while listing tests: failed running '/alabama/.cache/openshift-tests/registry_ci_openshift_org_ocp_release_4_19_0-0_nightly-2025-03-06-043021_f1a1994f6536/k8s-tests-ext list': exit status 1
Output:   E0306 06:08:48.426861     116 test_context.go:584] Unknown provider "baremetal". The following providers are known: aws azure gce kubemark local openstack skeleton vsphere
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2025-03-06T06:08:48Z"}
error: failed to execute wrapped command: exit status 1 
INFO[2025-03-06T06:08:50Z] Step e2e-metal-ipi-serial-ovn-ipv6-baremetalds-e2e-test failed after 22s. 

This surfaces in the operator conditions tests for kube and openshift APIs as we fail fast, before these can stabilize, and thus the cluster does not appear ready when we expect it to be.

It is believed to have been caused by https://github.com/openshift/kubernetes/pull/2229

Description of problem:

Installing a 4.17 agent-based hosted cluster on bare metal with IPv6 stack in a disconnected environment. We cannot install the MetalLB operator on the hosted cluster to expose the OpenShift router and handle ingress, because the openshift-marketplace pods that extract the operator bundle, and the related pods, are in Error state. They try to execute the following command but cannot reach the cluster apiserver:

opm alpha bundle extract -m /bundle/ -n openshift-marketplace -c b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1 -z

INFO[0000] Using in-cluster kube client config          
Error: error loading manifests from directory: Get "https://[fd02::1]:443/api/v1/namespaces/openshift-marketplace/configmaps/b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1": dial tcp [fd02::1]:443: connect: connection refused



In our hosted cluster fd02::1 is the clusterIP of the kubernetes service and the endpoint associated to the service is [fd00::1]:6443. By debugging the pods we see that connection to clusterIP is refused but if we try to connect to its endpoint the connection is established and we get 403 Forbidden:

sh-5.1$ curl -k https://[fd02::1]:443
curl: (7) Failed to connect to fd02::1 port 443: Connection refused


sh-5.1$ curl -k https://[fd00::1]:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403

This issue is also happening in other pods in the hosted cluster, which are in Error or CrashLoopBackOff; we see a similar error in their logs, e.g.:

F1011 09:11:54.129077       1 cmd.go:162] failed checking apiserver connectivity: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca-operator/leases/service-ca-operator-lock": dial tcp [fd02::1]:443: connect: connection refused


An IPv6 disconnected 4.16 hosted cluster with the same configuration was installed successfully and didn't show this issue, and neither did an IPv4 disconnected 4.17 cluster. So the issue is with the IPv6 stack only.

Version-Release number of selected component (if applicable):

Hub cluster: 4.17.0-0.nightly-2024-10-10-004834

MCE 2.7.0-DOWNANDBACK-2024-09-27-14-52-56

Hosted cluster: version 4.17.1
image: registry.ci.openshift.org/ocp/release@sha256:e16ac60ac6971e5b6f89c1d818f5ae711c0d63ad6a6a26ffe795c738e8cc4dde

How reproducible:

100%

Steps to Reproduce:

    1. Install MCE 2.7 on 4.17 IPv6 disconnected BM hub cluster
    2. Install 4.17 agent-based hosted cluster and scale up the nodepool 
    3. After worker nodes are installed, attempt to install the MetalLB operator to handle ingress
    

Actual results:

MetalLB operator cannot be installed because pods cannot connect to the cluster apiserver.

Expected results:

Pods in the cluster can connect to apiserver. 

Additional info:

 

 

Description of problem:

     In an Agent-Based Installation, a storage network is also configured on all nodes. If both VLAN interfaces on the same L3 switch, used as gateways for the compute cluster management and storage networks of the OCP cluster, have arp-proxy enabled, then the IP collision validation reports errors. The IP collision validation fails because it appears to send ARP requests from both interfaces, bond0.4082 and bond1.2716, for all addresses used for the compute cluster management and storage networks.

Version-Release number of selected component (if applicable):

4.14    

How reproducible:

    Always

Steps to Reproduce:

    1. Install a cluster using the Agent-based installer with proxy enabled.
    2. Configure both VLAN interfaces on the same L3 switch, used as gateways for the compute cluster management and storage networks of the OCP cluster, with arp-proxy enabled.
    

Actual results:

    IP collision validation fails; the validation appears to send ARP requests from both interfaces, bond0.4082 and bond1.2716, for all addresses used for the compute cluster management and storage networks. The validation seems to trigger ARP requests sent out from all NICs of the nodes. If the gateways connected to the different NICs of the node have arp-proxy configured, then the IP collision failure will be observed.

Expected results:

    Validation should pass in the ARP proxy scenario.

Additional info:

    In the arp-proxy scenario, an ARP reply is not needed from a NIC whose IP address is not in the same subnet as the destination IP address of the ARP request. Generally speaking, when the node tries to communicate with any destination IP address in the same subnet as the IP address of one NIC, it will only send the ARP request out of that NIC. So even with arp-proxy configured, this should not cause an issue in that case.

Description of problem:

The AWS installation fails when the SCP has the value for AssociatePublicIpAddress set to False. The IAM user is not able to create new EC2 instances, i.e. the worker nodes are not getting created.
However, the bootstrap and master nodes do get created.

The below logs can be observed in the machine-api controller logs :

2024/10/31 16:05:28 failed to create instance: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts::<account-id>:assumed-role/<role-name> is not authorized to perform: ec2:RunInstances on resource: arn:aws:ec2:ap-southeast-1:<account-id>:network-interface/* with an explicit deny in a service control policy. Encoded authorization failure message: <encoded-message>


Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Set the value of AssociatePublicIpAddress: False inside SCP.
    2. Perform a normal IPI aws installation with IAM user which has the above SCP applied.
    3. Observe that the workers are not getting created.
    

Actual results:

    

Expected results:

    

Additional info:
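For illustration only, a hedged fragment of the MAPI providerSpec field that maps to the EC2 AssociatePublicIpAddress behavior the SCP condition targets. The surrounding MachineSet fields are placeholders and this is not presented as a confirmed fix:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: example-worker                 # placeholder
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          # Controls whether RunInstances requests a public IP on the primary
          # network interface, which is what the SCP deny keys on.
          publicIp: false
          # ...remaining provider fields omitted...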

    

Description of problem:

 When switching the container runtime using a ContainerRuntimeConfig, conmon uses the wrong, un-updated --root path for the runtime.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-20-221830

Steps to Reproduce:

1. Apply a ContainerRuntimeConfig requesting runc as the default runtime:
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-runc-worker
spec:
  containerRuntimeConfig:
    defaultRuntime: runc
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker

2. Wait for MCP to turn updated
3. oc debug the worker node
4. ps -ef | grep -E 'crun|runc'  

 Actual results:

The last command will yield a list of processes like:
root     1230048       1  0 08:33 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/9917f380f2e7a88cd8cf09023f13a7500014cd81cfd97213b29bcd35258f886d/userdata -c ........../userdata -r /usr/bin/runc --runtime-arg --root=/run/crun --socket-dir-path /var/run/crio --syslog -u 9917f380f2e7a88cd8cf09023f13a7500014cd81cfd97213b29bcd35258f886d -s -t  


Upon closer inspection, the --root=/run/crun value does not align with the -r /usr/bin/runc runtime specification.

Expected results:

All processes should be using runc with correct --root value:
root     1230048       1  0 08:33 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/9917f380f2e7a88cd8cf09023f13a7500014cd81cfd97213b29bcd35258f886d/userdata -c ........../userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio --syslog -u 9917f380f2e7a88cd8cf09023f13a7500014cd81cfd97213b29bcd35258f886d -s -t 

Additional info:

    

Description of problem:

node-joiner --pxe does not rename pxe artifacts

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. node-joiner --pxe

Actual results:

   agent*.* artifacts are generated in the working dir

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz
* node.x86_64.ipxe (if required)

Additional info:

    

Description of problem:

The CUDN creation view doesn't prevent a namespace-selector with no rules

Version-Release number of selected component (if applicable):

4.18

How reproducible:

100%

Steps to Reproduce:

1. In the UI, go to the CUDN creation view and create a CUDN with an empty namespace-selector.
2.
3.

Actual results:

The CUDN will select all namespaces that exist in the cluster, including openshift-* namespaces, affecting cluster system components including the api-server and etcd.

Expected results:

I expect the UI to block creating a CUDN with a namespace-selector that has zero rules.

Additional info:
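For reference, a hedged sketch of a ClusterUserDefinedNetwork with a non-empty namespace selector, i.e. the shape the creation view should require. Field names follow the k8s.ovn.org/v1 API as we understand it; the labels and subnets are placeholders:

apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
  name: tenant-blue
spec:
  # At least one selector rule, so the network does not match every namespace
  # (including openshift-*) in the cluster.
  namespaceSelector:
    matchLabels:
      tenant: blue
  network:
    topology: Layer2
    layer2:
      role: Secondary
      subnets:
      - 10.100.0.0/16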

 

Description of problem:

We’re unable to find a stable and accessible OAuth-proxy image, which is causing a bug that we haven’t fully resolved yet. Krzys made a PR to address this, but it’s not a complete solution since the image path doesn’t seem consistently available. Krzys tried referencing the OAuth-proxy image from the OpenShift openshift namespace, but it didn’t work reliably. There’s an imagestream for OAuth-proxy in the openshift namespace, which we might be able to reference in tests, but we are not certain of the correct Docker URL format for it. Also, it’s possible that there are permission issues, which could be why the image isn’t accessible when referenced this way.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Background 

The monitoring-plugin is still using Patternfly v4; it needs to be upgraded to Patternfly v5. This major version release deprecates components in the monitoring-plugin. These components will need to be replaced/removed to accommodate the version update. 

We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124

Work to be done: 

  • upgrade monitoring-plugin > package.json > Patternfly v5
  • Remove/replace any deprecated components after upgrading to Patternfly v5. 

Outcome 

  • The monitoring-plugin > package.json will be upgraded to use Patternfly v5
  • Any deprecated components from Patternfly v4 will be removed or replaced by similar Patternfly v5 components

Description of problem:

Finding the Console plugins list can be challenging as it is not in the primary nav.  We should add it to the primary nav so it is easier to find.

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run.

This card captures the machine-config operator, which blips Degraded=True during some CI job runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843561357304139776
  
Reasons associated with the blip: MachineConfigDaemonFailed or MachineConfigurationFailed

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fixes go in.

Exception is defined here: https://github.com/openshift/origin/blob/e5e76d7ca739b5699639dd4c500f6c076c697da6/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L109


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    If the install is performed with an AWS user missing the `ec2:DescribeInstanceTypeOfferings` permission, the installer will use a hardcoded instance type from the set of non-edge machine pools. This can potentially cause the edge node to fail during provisioning, since the instance type doesn't take edge/Wavelength zone support into account.

Because edge nodes are not needed for the installation to complete, the issue is not noticed by the installer, only by inspecting the status of the edge nodes.

Version-Release number of selected component (if applicable):

    4.16+ (since edge nodes support was added)

How reproducible:

    always

Steps to Reproduce:

    1. Specify an edge machine pool in the install-config without an instance type
    2. Run the install with an user without `ec2:DescribeInstanceTypeOfferings`
    3.
    

Actual results:

    In CI the `node-readiness` test step will fail and the edge nodes will show

                    errorMessage: 'error launching instance: The requested configuration is currently not supported. Please check the documentation for supported configurations.'         
                    errorReason: InvalidConfiguration
              

Expected results:

    Either
1. the permission is always required when instance type is not set for an edge pool; or
2.  a better instance type default is used

Additional info:

    Example CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1862140149505200128
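For reference, a hedged install-config fragment showing an edge compute pool with an explicit instance type, which sidesteps the hardcoded default described above. The zone and instance type are placeholders and must be valid for the Local Zone in use:

compute:
- name: edge
  replicas: 1
  platform:
    aws:
      # An explicit type avoids falling back to the hardcoded non-edge default
      # when ec2:DescribeInstanceTypeOfferings is not available.
      type: m5.2xlarge                 # placeholder; must be offered in the zone
      zones:
      - us-east-1-nyc-1a               # placeholder Local Zone
- name: worker
  replicas: 3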

Description of problem:

When trying to mirror a full catalog, it fails with the error:
2024/11/21 02:55:48  [ERROR]  : unable to rebuild catalog docker://registry.redhat.io/redhat/redhat-operator-index:v4.17: filtered declarative config not found

Version-Release number of selected component (if applicable):

oc-mirror version 
W1121 02:59:58.748933   61010 mirror.go:102] ⚠️  oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-324-gbae91d5", GitCommit:"bae91d55", GitTreeState:"clean", BuildDate:"2024-11-20T02:06:04Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. Mirror OCP with full: true for the catalog: cat config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
    full: true

oc-mirror -c config.yaml docker://localhost:5000 --workspace file://full-catalog --v2

Actual results:

oc-mirror -c config.yaml docker://localhost:5000 --workspace file://full-catalog --v2
2024/11/21 02:55:27  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/21 02:55:27  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/21 02:55:27  [INFO]   : ⚙️  setting up the environment for you...
2024/11/21 02:55:27  [INFO]   : 🔀 workflow mode: mirrorToMirror 
2024/11/21 02:55:27  [INFO]   : 🕵️  going to discover the necessary images...
2024/11/21 02:55:27  [INFO]   : 🔍 collecting release images...
2024/11/21 02:55:27  [INFO]   : 🔍 collecting operator images...
 ⠦   (20s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.17 
2024/11/21 02:55:48  [WARN]   : error parsing image registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 : registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 unable to parse image co ✓   (20s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.17 
2024/11/21 02:55:48  [WARN]   : registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 unable to parse image correctly : tag and digest are empty : SKIPPING
2024/11/21 02:55:48  [WARN]   : [OperatorImageCollector] gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 has both tag and digest : using digest to pull, but tag only for mirroring
2024/11/21 02:55:48  [INFO]   : 🔍 collecting additional images...
2024/11/21 02:55:48  [INFO]   : 🔍 collecting helm images...
2024/11/21 02:55:48  [INFO]   : 🔂 rebuilding catalogs
2024/11/21 02:55:48  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/11/21 02:55:48  [ERROR]  : unable to rebuild catalog docker://registry.redhat.io/redhat/redhat-operator-index:v4.17: filtered declarative config not found

Expected results:

no error

Additional info:

 
    

 

Description of problem:

    Need to bump k8s to v0.31.1 in 4.18

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Due to internal conversion of network name, which replaces '-' with '.', non-conflicting networks get conflicting internal names, e.g. for node switches. https://github.com/ovn-kubernetes/ovn-kubernetes/blob/cb682053aafcbe8e35dd8bb705c3c9d2bd72b821/go-controller/pkg/util/multi_network.go#L725-L728

For example, NADs with network name 'tenant-blue' and 'tenant.blue' will have such conflict. In UDN case, network name is built as <namespace>.<name>, therefore 2 UDNs with namespace+name => network will have a conflict:

'test' + 'tenant-blue' => test.tenant-blue

'test-tenant' + 'blue' => test-tenant.blue
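To make the collision concrete, here is a hedged pair of UserDefinedNetwork objects matching the combinations above; the spec fields are illustrative and only the metadata matters for the conflicting internal name (both normalize to test.tenant.blue after the '-' to '.' replacement):

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  namespace: test                      # internal name: test.tenant-blue
  name: tenant-blue
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
    - 10.10.0.0/24
---
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  namespace: test-tenant               # internal name: test-tenant.blue
  name: blue
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
    - 10.20.0.0/24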

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

Installations on Google Cloud require the constraints/compute.vmCanIpForward to not be enforced.
Error:

time=\"2024-12-16T10:20:27Z\" level=debug msg=\"E1216 10:20:27.538990 97 reconcile.go:155] \\"Error creating an instance\\" err=\\"googleapi: Error 412: Constraint constraints/compute.vmCanIpForward violated for projects/ino-paas-tst. Enabling IP forwarding is not allowed for projects/ino-paas-tst/zones/europe-west1-b/instances/paas-osd-tst2-68r4m-master-0., conditionNotMet\\" controller=\\"gcpmachine\\" controllerGroup=\\"infrastructure.cluster.x-k8s.io\\" controllerKind=\\"GCPMachine\\" GCPMachine=\\"openshift-cluster-api-guests/paas-osd-tst2-68r4m-master-0\\" namespace=\\"openshift-cluster-api-guests\\" reconcileID=\\"3af74f44-96fe-408a-a0ad-9d63f023d2ee\\" name=\\"paas-osd-tst2-68r4m-master-0\\" zone=\\"europe-west1-b\\"\"

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Every Time

Steps to Reproduce:

    1. Enable constraints/compute.vmCanIpForward on a project
    2. Install OSD 4.17 on that project 
    3. Installation fails
    

Actual results:

    Installation fails

Expected results:

    Installation does not fail

Additional info:

    More info in the attachments

Description of problem:

The current "description" annotation of the CoreDNSErrorsHigh alert doesn't provide much context about what's happening and what to do next.

Version-Release number of selected component (if applicable):

4.14.39    

How reproducible:

always

Steps to Reproduce:

Deploy OCP and use unhealthy nameservers as upstreams for CoreDNS, or control DNS pod placement without complying with the standard architecture...

Actual results:

CoreDNSErrorsHigh fires, but the description annotation doesn't provide much information about what to do next.

Expected results:

Detailed instructions.

Additional info:

https://access.redhat.com/solutions/5917331
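For illustration, a hedged sketch of the kind of richer description annotation being asked for. The alert expression and wording are placeholders, not the shipped rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: coredns-errors-example          # placeholder rule, for wording only
  namespace: openshift-dns
spec:
  groups:
  - name: dns.alerts
    rules:
    - alert: CoreDNSErrorsHigh
      # Placeholder expression; the shipped rule may differ.
      expr: |
        (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
          / sum(rate(coredns_dns_responses_total[5m]))) > 0.01
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: CoreDNS is returning SERVFAIL for a noticeable share of requests.
        description: >-
          CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests.
          Check the health and reachability of the upstream resolvers configured on the
          dns.operator/default object and the placement of the dns-default pods.
          See https://access.redhat.com/solutions/5917331 for triage steps.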

Description of problem:
egress-router-binary-copy, an init container of multus-additional-cni-plugins, runs on the egress-router-cni container image. This image is still based on the 4.16 image.

I'm not sure about other possible implications, but when run on confidential TDX instances, this causes a segfault that 4.19 images do not cause. This is possibly due to a known glibc issue.

Version-Release number of selected component (if applicable): 4.19

How reproducible: 100%

Steps to Reproduce:

1. Install a cluster on TDX instances (being developed).

2. egress-router-binary-copy segfaults.

3.

Actual results:

egress-router-binary-copy container image is based on 4.16

It segfaults when running on a TDX confidential cluster node.

Expected results:

egress-router-binary-copy container image is based on 4.19.

It runs normally on a TDX confidential cluster node.

 

Additional info:

Fix proposal available in https://github.com/openshift/egress-router-cni/pull/89

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
  • Don't presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

Some control plane pods are not receiving the tolerations specified using the hypershift create cluster azure --toleration command.

Steps to Reproduce:

1. Create Azure HC with hypershift create cluster azure --toleration key=foo-bar.baz/quux,operator=Exists --toleration=key=fred,operator=Equal,value=foo,effect=NoSchedule --toleration key=waldo,operator=Equal,value=bar,effect=NoExecute,tolerationSeconds=3600 ... 
2. Run the following script against the management cluster

# Namespace of the hosted control plane on the management cluster (placeholder)
NAMESPACE="clusters-XXX"
PODS="$(oc get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')"

for POD in $PODS; do
  echo "Checking pod: $POD"  
  tolerations="$(oc get po -n $NAMESPACE $POD -o jsonpath='{.spec.tolerations}' | jq -c --sort-keys)"
  failed="false"
  
  if ! grep -q '"key":"foo-bar.baz/quux","operator":"Exists"' <<< "$tolerations"; then
    echo "No foo-bar.baz/quux key found" >&2
    failed="true"
  fi
  
  if ! grep -q '"effect":"NoSchedule","key":"fred","operator":"Equal","value":"foo"' <<< "$tolerations"; then
    echo "No fred key found" >&2
    failed="true"
  fi
  
  if ! grep -q '"effect":"NoExecute","key":"waldo","operator":"Equal","tolerationSeconds":3600,"value":"bar"' <<< "$tolerations"; then
    echo "No waldo key found" >&2
    failed="true"
  fi
  
  if [[ $failed == "true" ]]; then
    echo "Tolerations: "
    echo "$tolerations" | jq --sort-keys
  fi
  echo 
done 
3. Take note of the results 

Actual results (and dump files):

https://drive.google.com/drive/folders/1MQYihLSaK_9WDq3b-H7vx-LheSX69d2O?usp=sharing

Expected results:

All specified tolerations are propagated to all control plane pods. 

Description of problem:

  OWNERS file updated to include prabhakar and Moe as owners and reviewers

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    This is to facilitate easy backports via automation

See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/1997/pull-ci-openshift-api-master-e2e-aws-ovn-hypershift/1886643435440443392 and https://redhat-internal.slack.com/archives/C01CQA76KMX/p1738645154113239?thread_ts=1738642163.306719&cid=C01CQA76KMX for example

: TestCreateCluster/Main/EnsureCustomLabels expand_less	0s
{Failed  === RUN   TestCreateCluster/Main/EnsureCustomLabels
    util.go:1954: expected pods [aws-cloud-controller-manager-67bfd8cbc5-8dvwp, aws-ebs-csi-driver-controller-6f8786f899-9mjgz, aws-ebs-csi-driver-operator-6c5b795565-gv8dh, capi-provider-76b766f99-2gjhh, catalog-operator-5b478677d5-thwqt, certified-operators-catalog-84cb448c4d-zn4h9, cloud-credential-operator-68bb55c657-2fxm8, cloud-network-config-controller-589ff54f97-7sbp6, cluster-api-75855d4758-9pnq9, cluster-image-registry-operator-58475b676c-6xsrf, cluster-network-operator-56c777dbc4-26gng, cluster-node-tuning-operator-5697596c7d-z7gt8, cluster-policy-controller-c8d6d6b6c-lllbn, cluster-storage-operator-75ddcdd454-94sb7, cluster-version-operator-5c5677754d-hqxxq, community-operators-catalog-75cb5645bd-5zz89, control-plane-operator-7886f776d4-sth5w, control-plane-pki-operator-5f6fb9f6fd-pqv9z, csi-snapshot-controller-7479d75445-wchhd, csi-snapshot-controller-operator-f657b76f8-9g7fl, dns-operator-7bbfbd7568-jj8zt, etcd-0, hosted-cluster-config-operator-866bc4b498-4m7nc, ignition-server-54bd9b464-5vwsb, ignition-server-proxy-7966cfbcf-vx4qc, ingress-operator-74cdbc59f8-xvmvs, konnectivity-agent-5775fbfd6f-kwx85, kube-apiserver-6f4f79b98c-qj7vs, kube-controller-manager-5999d8597b-hbnnp, kube-scheduler-745d45554b-6gn6c, machine-approver-566b56d5cc-9xflj, multus-admission-controller-698f686986-2dj4r, network-node-identity-5b995c6748-pnc7f, oauth-openshift-649d7467c5-nszbt, olm-operator-6cfc78d86f-xspl9, openshift-apiserver-6678b9d68-rt77m, openshift-controller-manager-59dc766b95-nm8xs, openshift-oauth-apiserver-68b8f7fbdc-54sj4, openshift-route-controller-manager-fdc875484-bqnlf, ovnkube-control-plane-654b979d95-54zgf, packageserver-57dfb7b586-mmdpt, redhat-marketplace-catalog-79f86885f-lxdt5, redhat-operators-catalog-5764567c54-mhfg9, router-7799588d9b-jrnq4] to have label hypershift-e2e-test-label=test
        --- FAIL: TestCreateCluster/Main/EnsureCustomLabels (0.02s) 

this test seems to be failing across all PRs?

https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-api-master-e2e-aws-ovn-hypershift 

Description of problem:

    When using oc-mirror version 2, it creates idms-oc-mirror.yaml and itms-oc-mirror.yaml after a successful mirroring process (mirror to mirror). However, on a second attempt, after adding an operator to the imageset-config.yaml and re-running the same command with --dry-run, I noticed that these resource files got deleted.


Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Create a imageset-config.yaml
    2. Run oc-mirror --v2, mirror images
    3. Run with a --dry-run option
    

Actual results:

    The existing YAML files in the ocp_mirror/working-dir/cluster-resources/ directory are deleted when running with the --dry-run option

Expected results:

    --dry-run is not supposed to delete files

Additional info:

    tested with 4.17 oc-mirror version

Description of problem:

    When hosted clusters are delayed in deleting, their dedicated request serving nodes may have already been removed, but the configmap indicating that the node pair label is in use remains. Placeholder pods are currently getting scheduled on new nodes that have these pair labels. When the scheduler tries to use these new nodes, it says it can't because there is a configmap associating the pair label with a cluster that is in the process of deleting.

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    sometimes

Steps to Reproduce:

    1. In a size tagging dedicated request serving architecture, create hosted cluster(s).
    2. Place an arbitrary finalizer on the hosted cluster(s) so it cannot be deleted.
    3. Delete the hosted clusters
    4. Look at placeholder pods in hypershift-request-serving-node-placeholders      
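
A minimal sketch of steps 2-4, assuming the HostedCluster name/namespace shown are placeholders and using an arbitrary, purely illustrative finalizer key (appending to the finalizer list assumes one already exists on the object):

```
# Add an arbitrary finalizer so the HostedCluster cannot finish deleting
oc patch hostedcluster my-cluster -n clusters --type=json \
  -p='[{"op":"add","path":"/metadata/finalizers/-","value":"example.com/block-deletion"}]'

# Delete the hosted cluster; it will stay in a Deleting state because of the finalizer
oc delete hostedcluster my-cluster -n clusters --wait=false

# Inspect where the placeholder pods got scheduled
oc get pods -n hypershift-request-serving-node-placeholders -o wide
```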

Actual results:

some placeholder pods are scheduled on nodes that correspond to fleet manager pairs taken up by the deleting clusters    

Expected results:

    no placeholder pods are scheduled on nodes that correspond to hosted clusters.

Additional info:

    

We are aiming to find containers that are restarting more than 3 times during the course of a test.

https://search.dptools.openshift.org/?search=restarted+.*+times+at&maxAge=336h&context=1&type=junit&name=4.18&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

If this link rots, you just need to search for "restarted .* times" in the OpenShift CI search for 4.18.

 

PS: I took a guess at who owns openshift/ingress-operator, so please reassign once you find the correct owner.

We are adding an exclusion for this container but we ask that you look into fixing this.

Description of problem:

"kubernetes.io/cluster/${infra_id}:shared" should be attached to the private subnets, but it's missing now.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2025-02-11-060301
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a cluster with byo-VPC
    2.
    3.
    

Actual results:

"kubernetes.io/cluster/${infra_id}:shared" tag was not attached to private subnets.
    

Expected results:

"kubernetes.io/cluster/${infra_id}:shared" tag presents in private subnets.
    

Additional info:

Checking the recent changes, this looks like it was caused by the typo [1] introduced in:
https://github.com/openshift/installer/pull/9430 (4.18.0-0.nightly-2025-02-07-182732)
https://github.com/openshift/installer/pull/9445 (4.17.0-0.nightly-2025-02-11-131208)


[1] https://github.com/openshift/installer/pull/9430/files#diff-703b72d7af46ab11b2fd79c7073598468fdf038db0666628521f9e6923dc78daR72
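
As a quick check or manual workaround (not a substitute for the installer fix), the tag can be inspected and re-added with the AWS CLI; the subnet ID and infra ID below are placeholders:

```
# Inspect the tags currently attached to a private subnet
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].Tags' --output table

# Re-add the missing shared-cluster tag
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/<infra_id>,Value=shared
```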

This issue blocks C2S/SC2S cluster testing:
level=info msg=Credentials loaded from the "default" profile in file "/tmp/secret/aws_temp_creds"
level=info msg=Creating infrastructure resources...
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": could not add tags to subnets: MissingParameter: The request must contain the parameter resourceIdSet
level=fatal msg=	status code: 400, request id: d6ffad31-f2a7-4a1d-883f-c44bbc1ee1c7 

    

Description of problem:


In some cases, installer may need to call ReplaceRoute action, but with minimum permission, this is not allowed:

...
time="2025-02-22T06:44:35Z" level=debug msg="E0222 06:44:35.976720 	218 awscluster_controller.go:319] \"failed to reconcile network\" err=<"
time="2025-02-22T06:44:35Z" level=debug msg="\tfailed to replace outdated route on route table \"rtb-0f3322786d2a7a9fc\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::301721915996:user/ci-op-n3z38rfl-21543-minimal-perm-installer is not authorized to perform: ec2:ReplaceRoute on resource: arn:aws:ec2:us-east-1:301721915996:route-table/rtb-0f3322786d2a7a9fc because no identity-based policy allows the ec2:ReplaceRoute action. Encoded authorization failure message: HIDDEN"
time="2025-02-22T06:44:35Z" level=debug msg="\t\tstatus code: 403, request id: 405cded7-daae-49b6-aa38-ed2d2fbceb75"
time="2025-02-22T06:44:35Z" level=debug msg=" > controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-n3z38rfl-21543-6h797\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-n3z38rfl-21543-6h797\" reconcileID=\"24aae75e-bd3e-4705-a88a-e69bfa0b4974\" cluster=\"openshift-cluster-api-guests/ci-op-n3z38rfl-21543-6h797\""
time="2025-02-22T06:44:35Z" level=debug msg="I0222 06:44:35.976749 	218 recorder.go:104] \"Operation ReplaceRoute failed with a credentials or permission issue\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AWSCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"ci-op-n3z38rfl-21543-6h797\",\"uid\":\"dfdcd50e-f0d5-4456-b5d6-c26de8f2c2ce\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta2\",\"resourceVersion\":\"448\"} reason=\"UnauthorizedOperation\""
time="2025-02-22T06:44:35Z" level=debug msg="I0222 06:44:35.976773 	218 recorder.go:104] \"Failed to replace outdated route on managed RouteTable \\\"rtb-0f3322786d2a7a9fc\\\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::301721915996:user/ci-op-n3z38rfl-21543-minimal-perm-installer is not authorized to perform: ec2:ReplaceRoute on resource: arn:aws:ec2:us-east-1:301721915996:route-table/rtb-0f3322786d2a7a9fc because no identity-based policy allows the ec2:ReplaceRoute action. Encoded authorization failure message: [HIDDEN]\\n\\tstatus code: 403, request id: 405cded7-daae-49b6-aa38-ed2d2fbceb75\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AWSCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"ci-op-n3z38rfl-21543-6h797\",\"uid\":\"dfdcd50e-f0d5-4456-b5d6-c26de8f2c2ce\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta2\",\"resourceVersion\":\"448\"} reason=\"FailedReplaceRoute\""
time="2025-02-22T06:44:36Z" level=debug msg="E0222 06:44:36.035951 	218 controller.go:324] \"Reconciler error\" err=<"
...


    

Version-Release number of selected component (if applicable):

4.18.1
    

How reproducible:

Occasionally
    

Steps to Reproduce:

It's not always reproducible; in this case, the install-config looks like:

...
fips: true
controlPlane:
  platform:
    aws:
      zones:
      - us-east-1c
      - us-east-1b
      type: m6i.xlarge
  architecture: amd64
  name: master
  replicas: 3
compute:
- platform:
    aws:
      zones:
      - us-east-1c
      - us-east-1b
      type: m5.xlarge
  architecture: amd64
  name: worker
  replicas: 3
- name: edge
  architecture: amd64
  hyperthreading: Enabled
  replicas: 1
  platform:
    aws:
      zones: [us-east-1-atl-1a]
baseDomain: qe.devcluster.openshift.com
platform:
  aws:
    region: us-east-1
...

    

Actual results:

Install failed.
    

Expected results:

   Install succeeds.
    

Additional info:

  It looks like the issue comes from the upstream CAPA [1], so all CAPI installs (4.16+) might be affected.

[1] https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/4e912b4e4d1f855abf9b5194acaf9f31b5763c57/pkg/cloud/services/network/routetables.go#L160
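
Until the minimal-permission list is updated, a hedged workaround is to grant the missing action to the installer IAM user; the user and policy names below are placeholders:

```
aws iam put-user-policy \
  --user-name <installer-user> \
  --policy-name allow-replace-route \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Action": "ec2:ReplaceRoute", "Resource": "*"}
    ]
  }'
```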

    

Description of problem:

During bootstrapping we're running into the following scenario:

4 members: masters 0, 1 and 2 (full voting) plus the bootstrap member (already torn down/dead). A revision rollout then causes master 0 to restart, leaving 2/4 healthy members, which means quorum loss.

This causes apiserver unavailability during the installation and should be avoided.
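
Once the apiserver is reachable again, the quorum state can be confirmed from one of the etcd pods; the pod name below is a placeholder:

```
# List etcd members and their voting status
oc rsh -n openshift-etcd -c etcdctl etcd-master-0 etcdctl member list -w table

# Check endpoint health across the cluster
oc rsh -n openshift-etcd -c etcdctl etcd-master-0 etcdctl endpoint health --cluster
```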
    

Version-Release number of selected component (if applicable):

4.17, 4.18 but is likely a longer standing issue

How reproducible:

rarely    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

apiserver should not return any errors

Additional info:

    

Description of problem:

Missing translations for "PodDisruptionBudget violated" string

Code:

"count PodDisruptionBudget violated_one": "count PodDisruptionBudget violated_one", "count PodDisruptionBudget violated_other": "count PodDisruptionBudget violated_other",
   


Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

ConsolePlugin CRD is missing connect-src CSP directives, which need to be added to its API and ported into both console-operator and console itself.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Set connect-src CSP directive in the ConsolePlugin CR.
    2. Save changes
    3.
    

Actual results:

The API server errors out with an unknown DirectiveType value

Expected results:

Added CSP directives should be saved as part of the updated ConsolePlugin CR, and aggregated CSP directives should be set as part of the bridge server response header, containing the added CSP directives

Additional info:

    

Description of problem:

As noted in https://github.com/openshift/api/pull/1963#discussion_r1910598226, we are currently ignoring tags set on a port in a MAPO Machine or MachineSet. This appears to be a mistake that we should correct.

Version-Release number of selected component (if applicable):

All versions of MAPO that use CAPO under the hood are affected.

How reproducible:

n/a

Steps to Reproduce:

n/a

Actual results:

n/a

Expected results:

n/a

Additional info:

See https://github.com/openshift/api/pull/1963#discussion_r1910598226

Description of problem:

Once the Machine Config Pool goes into a degraded state due to an incorrect MachineConfig, the pool doesn't recover from this state even after the MachineConfig is updated with a correct configuration.

Version-Release number of selected component (if applicable):

  4.16.0, applicable for previous versions as well.  

How reproducible:

    Always

Steps to Reproduce:

    1. Create a MachineConfig with an invalid extension name (sketched below).
    2. Wait for the Machine Config Pool to go into a degraded state.
    3. Update the MachineConfig with a correct extension name, or delete the MachineConfig.
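
A minimal sketch of step 1 and of watching the NodeDegraded condition afterwards; the MachineConfig name is a placeholder, and "ipsec11" is the invalid extension name from this report (a valid name would be "ipsec"):

```
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-bad-extension
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
  - ipsec11   # invalid extension name
EOF

# Watch the pool degrade, then fix or delete the MachineConfig and observe whether it recovers
oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'
```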
    

Actual results:

    The Machine Config Pool doesn't recover and stays in a degraded state.

Expected results:

    The Machine Config Pool must recover and the Degraded condition must be set to False.

Additional info:

      conditions:
  - lastTransitionTime: "2024-05-16T11:15:51Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2024-05-27T15:05:50Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2024-05-27T15:07:41Z"
    message: 'Node worker-1 is reporting: "invalid extensions found: [ipsec11]"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2024-05-27T15:07:41Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  - lastTransitionTime: "2024-05-27T15:05:50Z"
    message: All nodes are updating to MachineConfig rendered-worker-c585a5140738aa0a2792cf5f25b4eb20
    reason: ""
    status: "True"
    type: Updating

Description of problem:

Improve the OpenShift installer for Azure deployments to comply with PCI-DSS/BaFin regulations.

The OpenShift installer utilizes the github.com/hashicorp/terraform-provider-azurerm module, which in versions < 4 has the cross_tenant_replication_enabled parameter set to true. Two options to fix this are:
1. adjust the OpenShift installer to create the resourceStorageAccount [1] with the default set to FALSE
2. upgrade the terraform-provider-azurerm module version used by the OpenShift installer to 4.x, where this parameter now defaults to FALSE

[1] https://github.com/hashicorp/terraform-provider-azurerm/blob/57cd1c81d557a49e18b2f49651a4c741b465937b/internal/services/storage/storage_account_resource.go#L212

This security violation blocks using and scaling clusters in public cloud environments for the banking and financial industry, which needs to comply with BaFin and PCI-DSS regulations. Affected packages or components: OpenShift Installer 4.x. Azure compliance policy: https://learn.microsoft.com/en-us/azure/storage/common/security-controls-policy.
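
As a hedged way to verify the setting the regulation cares about on an existing cluster's storage account (resource group and account names are placeholders, and the flattened property name is an assumption about current Azure CLI output):

```
az storage account show \
  -g <resource-group> -n <storage-account> \
  --query allowCrossTenantReplication
```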

 

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Automate Configuration related Test cases of AMD Last Level Cache Feature.

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:
As an OpenShift engineer, I want to create a PR for the machine-api refactoring of feature gate parameters, so that we can pull the logic out of Neil's PR that removes individual feature gate parameters in favor of the new FeatureGate mutable map.

Description:
< Record any background information >

Acceptance Criteria:
< Record how we'll know we're done >

Other Information:
< Record anything else that may be helpful to someone else picking up the card >

issue created by splat-bot

Description of problem:

In a 4.19 HyperShift hosted cluster, kubeadmin login always fails.
A 4.18 HyperShift hosted cluster (MGMT cluster and hosted cluster both 4.18) doesn't have the issue.

Version-Release number of selected component (if applicable):

MGMT cluster version and hosted cluster version both are 4.19.0-0.nightly-2025-02-11-161912

How reproducible:

Always

Steps to Reproduce:

1. Launch 4.19 HyperShift management cluster and a hosted cluster on it.
2. Run kubeadmin login against HCP:
$ export KUBECONFIG=/path/to/mgmt/kubeconfig
$ oc get secret kubeadmin-password -n clusters-hypershift-ci-334742 -o 'jsonpath={ .data.password }' | base64 -d
WJt9r-xxxxx-xxxxx-fpAMT

$ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig
$ oc login -u kubeadmin -p "WJt9r-xxxxx-xxxxx-fpAMT"
Login failed (401 Unauthorized)
Verify you have provided the correct credentials.

Actual results:

HyperShift hosted cluster kubeadmin login always fails.

Expected results:

Success.

Additional info:

If I then configured htpasswd IDP for the hosted cluster, htpasswd user can login successfully.

Description of problem:

The LB name should be yunjiang-ap55-sk6jl-ext-a6aae262b13b0580, rather than ending with the ELB service endpoint (elb.ap-southeast-5.amazonaws.com):

	failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to find HostedZone ID for NLB: failed to list load balancers: ValidationError: The load balancer name 'yunjiang-ap55-sk6jl-ext-a6aae262b13b0580.elb.ap-southeast-5.amazonaws.com' cannot be longer than '32' characters\n\tstatus code: 400, request id: f8adce67-d844-4088-9289-4950ce4d0c83

Checking the tag value, the value of Name key is correct: yunjiang-ap55-sk6jl-ext


    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716
    

How reproducible:

always
    

Steps to Reproduce:

    1. Deploy a cluster on ap-southeast-5
    2.
    3.
    

Actual results:

The LB can not be created
    

Expected results:

Create a cluster successfully.
    

Additional info:

No such issues on other AWS regions.
    

Description of problem:

This is a "clone" of https://issues.redhat.com/browse/OCPBUGS-38647

Multus CNI's delete doesn't delete Pods if the API server is not up.

Version-Release number of selected component (if applicable):

MicroShift ships Multus in 4.16+

How reproducible:

100%

Steps to Reproduce:

1. Start MicroShift
2. Cleanup microshift (sudo microshift-cleanup-data --all)
3. Run `sudo crictl pods`
    

Actual results:

There are Pods running

Expected results:

There should be no Pods running

Additional info:

Primary problem seems to be that func GetPod() [https://github.com/k8snetworkplumbingwg/multus-cni/blob/master/pkg/multus/multus.go#L510] doesn't return - it takes too long and perhaps CRI-O doesn't wait that long:

2024-11-26T08:45:51Z [debug] CmdDel: &{8cc9b938cc29474eeca8593c1c22a2f258a4794bdc1e2bfa2ffb2572a1d2671a /var/run/netns/c5786f46-2faf-43b7-84d5-8f1e1cf5d3b4 eth0 IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-service-ca;K8S_POD_NAME=service-ca-7cf6f558c6-7b4t7;K8S_POD_INFRA_CONTAINER_ID=8cc9b938cc29474eeca8593c1c22a2f258a4794bdc1e2bfa2ffb2572a1d2671a;K8S_POD_UID=1f986d36-414b-4fb6-892c-06ebb86e6f19 /run/cni/bin:/usr/libexec/cni  [...]}, <nil>, <nil>
2024-11-26T08:45:51Z [debug] GetPod for [openshift-service-ca/service-ca-7cf6f558c6-7b4t7] starting    

 

Full log: https://drive.google.com/file/d/1zUyJ7DXdwV0sogSxMZ66BgGiMgQYSmMp/view?usp=sharing 

Description of problem:

    When the "users in ns/openshift-... must not produce too many applies" test flakes, it doesn't produce useful output: just `{  details in audit log}`.
Instead it should be
```
{user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller had 43897 applies, check the audit log and operator log to figure out why
    user system:serviceaccount:openshift-infra:podsecurity-admission-label-syncer-controller had 1034 applies, check the audit log and operator log to figure out why  details in audit log}
```

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When doing deployments on baremetal with assisted installer, it is not possible to use nmstate-configuration because it is only enabled for platform baremetal, and AI uses platform none. Since we have many baremetal users on assisted we should enable it there as well.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Missing metrics - example: cluster_autoscaler_failed_scale_ups_total 

Version-Release number of selected component (if applicable):

    

How reproducible:

Always 

Steps to Reproduce:

#curl the autoscalers metrics endpoint: 

$ oc exec deployment/cluster-autoscaler-default -- curl -s http://localhost:8085/metrics | grep cluster_autoscaler_failed_scale_ups_total 
    

Actual results:

The metric does not return a value until an event has happened

Expected results:

The metric counters should be initialized at startup, providing a zero value

Additional info:

I have been through the file: 

https://raw.githubusercontent.com/openshift/kubernetes-autoscaler/master/cluster-autoscaler/metrics/metrics.go 

and checked off the metrics that do not appear when scraping the metrics endpoint straight after deployment. 

the following metrics are in metrics.go but are missing from the scrape

~~~
node_group_min_count
node_group_max_count
pending_node_deletions
errors_total
scaled_up_gpu_nodes_total
failed_scale_ups_total
failed_gpu_scale_ups_total
scaled_down_nodes_total
scaled_down_gpu_nodes_total
unremovable_nodes_count 
skipped_scale_events_count
~~~
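
A quick way to confirm which counters are missing right after deployment, assuming the default autoscaler deployment lives in the openshift-machine-api namespace (as in the command above):

```
for m in node_group_min_count node_group_max_count pending_node_deletions errors_total \
         scaled_up_gpu_nodes_total failed_scale_ups_total failed_gpu_scale_ups_total \
         scaled_down_nodes_total scaled_down_gpu_nodes_total unremovable_nodes_count \
         skipped_scale_events_count; do
  oc exec -n openshift-machine-api deployment/cluster-autoscaler-default -- \
    curl -s http://localhost:8085/metrics | grep -q "cluster_autoscaler_${m}" \
    || echo "missing: cluster_autoscaler_${m}"
done
```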

 

Description of problem:

checked in 4.18.0-0.nightly-2024-12-05-103644/4.19.0-0.nightly-2024-12-04-031229, OCPBUGS-34533 is reproduced on 4.18+, no such issue with 4.17 and below.

Steps: log in to the admin console or developer console (admin console: go to the "Observe -> Alerting -> Silences" tab; developer console: go to the "Observe -> Silences" tab), create a silence, and edit the "Until" option. Whether a valid or invalid timestamp is entered, the error "[object Object]" appears in the "Until" field. See screen recording: https://drive.google.com/file/d/14JYcNyslSVYP10jFmsTaOvPFZSky1eg_/view?usp=drive_link

checked 4.17 fix for OCPBUGS-34533 is already in 4.18+ code

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always

Steps to Reproduce:

1. see the descriptions

Actual results:

Unable to edit the "Until" field in silences

Expected results:

Able to edit the "Until" field in silences

Description of problem:

    Currently, if we try to create a cluster with assigServicePrincipals set to false, it won't return the marshalled data plane identities and cluster creation will fail. We should still return these without assigning service principal roles.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When a user starts a Pipeline, the Pipeline visualization briefly shows the tasks as Failed before switching to the Running state.

Version-Release number of selected component (if applicable):

    4.17.z

How reproducible:

    Not always, but frequently

Steps to Reproduce:

    1. Create a Pipeline and start it
    2. Observe Pipeline visualization in details page
    

Actual results:

    The Pipeline visualization shows all tasks as Failed before they go to the Running state

Expected results:

    The Pipeline visualization should not show tasks as Failed before they go to the Running state

Additional info:

    

Description of problem:

The proposed name for Services in the UI has an extra 'asd' appended after 'example': `exampleasd`.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-multi-2025-01-15-030049

How reproducible:

Always

    

Steps to Reproduce:

    1. Go to the UI -> Networking -> Services 
    2. Click create a new service
    

Actual results:

---
apiVersion: v1
kind: Service
metadata:
  name: exampleasd
  namespace: test
spec:
  selector:
    app: name
spec:
...
    

Expected results:

---
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: test
spec:
....
    

Additional info:


    

Description of problem:

   The initial set of default endpoint overrides we specified in the installer is missing a v1 at the end of the DNS services override.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Starting with OpenShift Container Platform 4.16, it was observed that cluster-network-operator is stuck in CrashLoopBackOff state because of the below error reported.

2024-09-17T16:32:46.503056041Z I0917 16:32:46.503016       1 controller.go:242] "All workers finished" controller="pod-watcher"
2024-09-17T16:32:46.503056041Z I0917 16:32:46.503045       1 internal.go:526] "Stopping and waiting for caches"
2024-09-17T16:32:46.503209536Z I0917 16:32:46.503189       1 internal.go:530] "Stopping and waiting for webhooks"
2024-09-17T16:32:46.503209536Z I0917 16:32:46.503206       1 internal.go:533] "Stopping and waiting for HTTP servers"
2024-09-17T16:32:46.503217413Z I0917 16:32:46.503212       1 internal.go:537] "Wait completed, proceeding to shutdown the manager"
2024-09-17T16:32:46.503231142Z F0917 16:32:46.503221       1 operator.go:130] Failed to start controller-runtime manager: failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use

That problem seems to be related to the change done in https://github.com/openshift/cluster-network-operator/pull/2274/commits/acd67b432be4ef2efb470710aebba2e3551bc00d#diff-99c0290799daf9abc6240df64063e20bfaf67b371577b67ac7eec6f4725622ff, where passing BindAddress with 0 (https://github.com/openshift/cluster-network-operator/blob/master/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go#L70) was missed, which would have kept the previous functionality.
With the current code in place, cluster-network-operator exposes a metrics server on port 8080, which was not previously the case and can create conflicts with custom applications.

This is especially true in environments where compact OpenShift Container Platform 4 clusters (three-node clusters) are running.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.16

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.15 (three-node cluster) and create a service that is listening on HostNetwork with port 8080
2. Update to OpenShift Container Platform 4.16
3. Watch cluster-network-operator being stuck in CrashLoopBackOff state because port 8080 is already bound
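
To confirm the conflict on an affected node, a hedged diagnostic that checks what already holds port 8080 where the operator pod is scheduled:

```
# Find the node where the network-operator pod is scheduled
NODE="$(oc get pods -n openshift-network-operator -o jsonpath='{.items[0].spec.nodeName}')"

# List listeners bound to :8080 on that node
oc debug node/"$NODE" -- chroot /host ss -tlnp | grep ':8080'
```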

Actual results:

2024-09-17T16:32:46.503056041Z I0917 16:32:46.503016       1 controller.go:242] "All workers finished" controller="pod-watcher"
2024-09-17T16:32:46.503056041Z I0917 16:32:46.503045       1 internal.go:526] "Stopping and waiting for caches"
2024-09-17T16:32:46.503209536Z I0917 16:32:46.503189       1 internal.go:530] "Stopping and waiting for webhooks"
2024-09-17T16:32:46.503209536Z I0917 16:32:46.503206       1 internal.go:533] "Stopping and waiting for HTTP servers"
2024-09-17T16:32:46.503217413Z I0917 16:32:46.503212       1 internal.go:537] "Wait completed, proceeding to shutdown the manager"
2024-09-17T16:32:46.503231142Z F0917 16:32:46.503221       1 operator.go:130] Failed to start controller-runtime manager: failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use

Expected results:

In previous versions, BindAddress was set to 0 for the metrics server, meaning it would not start or expose anything on port 8080. The same should be done in OpenShift Container Platform 4.16 to keep backward compatibility and prevent port conflicts.

Additional info:


see upstream issue https://github.com/metal3-io/ironic-image/issues/630

It looks like it's not getting the correct values from the read command; the colorized output is also interfering with the parsing.

Description of problem:

Reviving rhbz#1948087, the kube-storage-version-migrator ClusterOperator occasionally goes Available=False with reason=KubeStorageVersionMigrator_Deploying. For example, this run includes:

: [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available expand_less	1h34m30s
{  1 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 22:09:07.933 - 33s   E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment

But that is just a node rebooting into newer RHCOS and does not warrant immediate admin intervention. Teaching the KSVM operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and immediate administrator intervention is required, would make it easier for admins and SREs operating clusters to identify when intervention is actually needed.

Version-Release number of selected component (if applicable):

4.8 and 4.15. Possibly all supported versions of the KSVM operator have this exposure.

How reproducible:

Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with KSVM going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/kube-storage-version-migrator+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-kubevirt-conformance (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 163% of failures match = 68% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 118% of failures match = 72% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 189% of failures match = 89% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 86% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 114% of failures match = 73% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 65 runs, 45% failed, 169% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 6 runs, 50% failed, 133% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 75 runs, 24% failed, 361% of failures match = 87% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 75 runs, 29% failed, 277% of failures match = 81% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 74 runs, 36% failed, 185% of failures match = 68% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 156% of failures match = 77% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 60 runs, 38% failed, 187% of failures match = 72% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 7 runs, 29% failed, 300% of failures match = 86% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 80% of failures match = 57% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 83% of failures match = 83% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 71% of failures match = 38% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 70% of failures match = 44% impact

Expected results:

KSVM goes Available=False if and only if immediate admin intervention is appropriate.

Component Readiness has found a potential regression in the following test:

[sig-arch][Late] clients should not use APIs that are removed in upcoming releases [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]

Probability of significant regression: 99.98%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-01T00:00:00Z
End Time: 2024-08-07T23:59:59Z
Success Rate: 94.59%
Successes: 105
Failures: 6
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 36
Failures: 0
Flakes: 302

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=aws&Platform=aws&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Unknown&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-07%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-01%2000%3A00%3A00&testId=openshift-tests%3A91672aad25cfdd6f79c4f18b04208e88&testName=%5Bsig-arch%5D%5BLate%5D%20clients%20should%20not%20use%20APIs%20that%20are%20removed%20in%20upcoming%20releases%20%5Bapigroup%3Aapiserver.openshift.io%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

Description of problem:

   This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446.
Although both node topologies are equivalent, the PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.TBD
    2.
    3.
    

Actual results:

    Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Expected results:

    The topologies match; the PPC should work fine

Additional info:

    

Description of problem:

    Testing PXE boot in an ABI day2 install, the day2 host does not reboot from the disk properly but boots from PXE again. This is not reproduced on all hosts and reboots.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-26-075648

How reproducible:

    Not always, about 70% in amd64 and 100% in arm64

Steps to Reproduce:

    1. Run ABI day1 install booting from pxe
    2. After day1 cluster is installed, run ABI day2 install booting from pxe
    3. The day2 host doesn't reboot from disk as expected, but boots from PXE again. From the agent.service log, we can see the error:

level=info msg=\"SetBootOrder, runtime.GOARCH: amd64, device: /dev/sda\"\ntime=\"2024-11-27T06:48:15Z\" level=info msg=\"Setting efibootmgr to boot from disk\"\ntime=\"2024-11-27T06:48:15Z\" level=error msg=\"failed to find EFI directory\" error=\"failed to mount efi device: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- mount /dev/sda2 /mnt], Error exit status 32, LastOutput \\\"mount: /var/mnt: special device /dev/sda2 does not exist.\\\"\"\ntime=\"2024-11-27T06:48:15Z\" level=warning msg=\"Failed to set boot order\" error=\"failed to mount efi device: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- mount /dev/sda2 /mnt], Error exit status 32     
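
For debugging on the day2 host itself, a hedged check of the install disk's partitions and the current EFI boot entries (the disk path matches the error above):

```
# Confirm the expected install disk and its partitions are visible (the ESP is expected at /dev/sda2 here)
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT /dev/sda

# Show the EFI boot order and entries that SetBootOrder tried to adjust
sudo efibootmgr -v
```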

Actual results:

    The day2 host doesn't reboot from disk as expected

Expected results:

    Day2 host should reboot from disk to complete the installation

Additional info:

    

Description of problem:

When a clusterNetwork entry has an invalid hostPrefix (<= the CIDR mask) and a custom IPv4 join subnet is provided in the install-config, the installer panics at runtime with "negative shift amount". An error is expected to occur since this is an invalid configuration, but a more descriptive error should be returned instead of a panic.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

Create a cluster with the following install-config (i.e. some fields are intentionally not included):

```
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: devcluster.openshift.com
compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform: {}
    replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: thvo-dev
networking:
  clusterNetwork:
    - cidr: 10.128.0.0/19
      hostPrefix: 18 # Bad because hostPrefix must be >= 19
    - cidr: 10.128.32.0/19
      hostPrefix: 23
  machineNetwork:
    - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
    - 172.30.0.0/16
  ovnKubernetesConfig:
    ipv4:
      internalJoinSubnet: 101.64.0.0/16
platform:
  aws:
    region: us-east-1
publish: External
```

Actual results:

panic: runtime error: negative shift amount

Expected results:

A descriptive user-friendly error that points out why it is invalid

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Application of PerformanceProfile with invalid cpuset in one of the reserved/isolated/shared/offlined cpu fields causing webhook validation to panic instead of returning an informant error.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-22-231049

How reproducible:

Apply a PerformanceProfile with invalid cpu values

Steps to Reproduce:

Apply the following PerformanceProfile with invalid cpu values:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: pp
spec:
  cpu:
    isolated: 'garbage'
    reserved: 0-3
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker-cnf: ""
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""     
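
Assuming the profile above is saved as pp.yaml, a server-side dry-run should exercise the validating webhook without persisting the object, which makes the panic (or, once fixed, the expected validation error) easy to reproduce:

```
oc apply --dry-run=server -f pp.yaml
```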

Actual results:

On OCP >= 4.18 the error is:
Error from server: error when creating "pp.yaml": admission webhook "vwb.performance.openshift.io" denied the request: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  
On OCP <= 4.17 the error is:
Validation webhook passes without any errors. Invalid configuration propogates to the cluster and breaks it.

Expected results:

We expect the webhook to return an informative error when an invalid cpuset is entered, without panicking or accepting it.

Description of problem:

    When internal serving certificates expire (and are renewed), the new certificates are not picked up automatically by control plane components, resulting in an unstable control plane.

Version-Release number of selected component (if applicable):

  All  

How reproducible:

  Always

Steps to Reproduce:

    1. Create a HostedCluster with annotations for a short certificate expiration time:
    
hypershift.openshift.io/certificate-validity: "1h"    
hypershift.openshift.io/certificate-renewal: "0.3"
    2. Wait for initial certificates to expire
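
A minimal sketch of step 1, applying the annotations from this report; the HostedCluster name and namespace are placeholders, and whether the annotations can be added after creation (rather than at creation time) is an assumption:

```
oc annotate hostedcluster my-cluster -n clusters \
  hypershift.openshift.io/certificate-validity="1h" \
  hypershift.openshift.io/certificate-renewal="0.3"

# After the certificates expire, check aggregated API availability from inside the hosted cluster
oc --kubeconfig /path/to/hosted-cluster/kubeconfig get apiservices
```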
    

Actual results:

    Cluster becomes degraded, apiservices in hosted cluster API become unavailable. To test this, obtain a kubeconfig for the hosted cluster and list apiservices:
$ oc get apiservices

API services that are external to the kube-apiserver appear as unavailable.

Expected results:

    Cluster continues to function as expected

Additional info:

    

The aks-e2e test keeps failing on the CreateClusterV2 test because the `ValidReleaseInfo` condition is not set. The patch that sets this status keeps failing. Investigate why & provide a fix.

Description of problem:

    The Azure installer panics if there are insufficient permissions to check for IP availability when creating a load balancer

Version-Release number of selected component (if applicable):

    

How reproducible:

    Frequent if there's a permissions issue to get available IPs

Steps to Reproduce:

    1. Create cluster on Azure
    

Actual results:

    Full on panic
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3c11064]
goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/manifests/azure.getNextAvailableIP({0x22548bc0, 0x272aab80}, 0xc002056840)
    /go/src/github.com/openshift/installer/pkg/asset/manifests/azure/cluster.go:270 +0x124
github.com/openshift/installer/pkg/asset/manifests/azure.GenerateClusterAssets(0xc002056840, 0xc0010b7e40)
    /go/src/github.com/openshift/installer/pkg/asset/manifests/azure/cluster.go:120 +0xbcb
github.com/openshift/installer/pkg/asset/manifests/clusterapi.(*Cluster).Generate(0x27263570, {0x5?, 0x8aa87ee?}, 0xc001fcf530)
    /go/src/github.com/openshift/installer/pkg/asset/manifests/clusterapi/cluster.go:103 +0x57b
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000eed200, {0x22548c70, 0xc001182000}, {0x2251ea10, 0x27263570}, {0x0, 0x0})
    /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x6e2
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0xc000eed200, {0x22548c70?, 0xc001182000?}, {0x2251ea10, 0x27263570}, {0x272259c0, 0x6, 0x6})
    /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x4e
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc00119a3e0, {0x22548c70, 0xc001182000}, {0x272259c0, 0x6, 0x6})
    /go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x16b
main.newCreateCmd.runTargetCmd.func3({0x7ffcc595a72a?, 0xe?})
    /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:321 +0x6a
main.newCreateCmd.runTargetCmd.func4(0x27231000, {0xc00119a1d0?, 0x4?, 0x8a5d385?})
    /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:335 +0x102
github.com/spf13/cobra.(*Command).execute(0x27231000, {0xc00119a1b0, 0x1, 0x1})
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:989 +0xa91
github.com/spf13/cobra.(*Command).ExecuteC(0xc000ee0f08)
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1117 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1041
main.installerMain()
    /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:67 +0x390
main.main()
    /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:39 +0x168 

 

Expected results:

    No panic and provide error messages.

Additional info:

    Slack link: https://redhat-internal.slack.com/archives/C01V1DP387R/p1738888488097589

Description of problem:

   When instance types are not specified in the machine pool, the installer checks which instance types (from a list) are available in a given az. If the ec2:DescribeInstanceType permission is not present, the check will fail gracefully and default to using the m6i instance type. This instance type is not available in all regions (e.g. ap-southeast-4 and eu-south-2), so those installs will fail.

OCPBUGS-45218 describes a similar issue with edge nodes.

ec2:DescribeInstanceTypeOfferings is not a controversial permission and should be required by default for all installs to avoid this type of issue.

Version-Release number of selected component (if applicable):

    Affects all versions, but we will just fix in main (4.19)

How reproducible:

    Always

Steps to Reproduce:

See OCPBUGS-45218 for one example.

Another example (unverified)
    1. Use permissions without ec2:DescribeInstanceTypeOfferings
    2. Install config: set region to eu-south-2 or ap-southeast-4. Do not set instance types
    3. Installer should default to m6i instance type (can be confirmed from machine manifests).
    4.  Install will fail as m6i instances are not available in those regions: https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-regions.html     
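
The availability check the installer needs can be reproduced by hand, which also shows why the permission matters; the region is one of the examples above:

```
# Lists the AZs (if any) where m6i.xlarge is offered in eu-south-2
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=m6i.xlarge \
  --region eu-south-2 --output table
```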

Actual results:

    Install fails due to unavailable m6i instance

Expected results:

    Installer should select different instance type, m5

Additional info:

    

 

Description of problem:

Create a cluster with custom DNS enabled; the coredns.yaml file was created on the bootstrap node but there is no coredns pod:


[core@ip-10-0-98-49 ~]$ sudo crictl ps
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                             ATTEMPT             POD ID              POD
50264b0c68ffd       registry.ci.openshift.org/ocp/release@sha256:c7b642a1fdc7c2bd99d33695c57f7c45401f422a08996640a1eb8b7a9a50a983            14 minutes ago      Running             cluster-version-operator         2                   afafcb955fef5       bootstrap-cluster-version-operator-ip-10-0-98-49
b1410c782f7aa       7691fe6c8036ecba066c3cfa9865455cb714dacf10bb6bef6b0aca10dc37f96e                                                         15 minutes ago      Running             kube-controller-manager          1                   485370aaad3b5       bootstrap-kube-controller-manager-ip-10-0-98-49
369341e2b5bdc       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e5d868208c8ee95b2973406833be802bca7ef220eea6ad4273384990abe10f3   15 minutes ago      Running             cluster-policy-controller        0                   485370aaad3b5       bootstrap-kube-controller-manager-ip-10-0-98-49
a390d13a3847b       934889bfd49afe7a91680889af526bc5722c813d8fa76c23b7544951a0300a66                                                         15 minutes ago      Running             kube-apiserver-insecure-readyz   0                   2a3112486283a       bootstrap-kube-apiserver-ip-10-0-98-49
cd370437856c6       7691fe6c8036ecba066c3cfa9865455cb714dacf10bb6bef6b0aca10dc37f96e                                                         15 minutes ago      Running             kube-apiserver                   0                   2a3112486283a       bootstrap-kube-apiserver-ip-10-0-98-49
4c7b7d343c38f       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1dbd4ef2dddd9a124a128754580bfcbe50de106b7a503f3df59da58bd88bf0c9   15 minutes ago      Running             kube-scheduler                   0                   f26b53a65bd94       bootstrap-kube-scheduler-ip-10-0-98-49
b88105f79dfe1       936d4cb44a2471bb928caf54c80597686160b3fff5c49a793cf7677e9dbb3e48                                                         15 minutes ago      Running             cloud-credential-operator        0                   33371da90c541       cloud-credential-operator-ip-10-0-98-49
d66c2126bf646       64f5aaf4e366de0d2e3301def8a9c38e4db99d7bcc578f0a50bfba0d73918e4f                                                         15 minutes ago      Running             machine-config-server            0                   050647bcc3ee3       bootstrap-machine-config-operator-ip-10-0-98-49
75ced844a6524       c35e3479a3b694513582d6ede502b1472c908ee1b7ae87a5f705d09570ed21ac                                                         16 minutes ago      Running             etcd                             0                   970037306ed33       etcd-bootstrap-member-ip-10-0-98-49
eb50696dda431       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:598847ee4f002b98d9251e34d91041881837464aca652acc8952f2c647cf49af   16 minutes ago      Running             etcdctl                          0                   970037306ed33       etcd-bootstrap-member-ip-10-0-98-49


install-config:

platform:
  aws:
    userProvisionedDNS: Enabled
featureSet: TechPreviewNoUpgrade


    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-16-065305
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Config install-config.yaml as described in the description
    2. Create cluster.
    3.
    

Actual results:

No coredns pod on bootstrap machine.
    

Expected results:

Like on GCP, coredns pod should be created:
sudo crictl ps | grep coredns
a6122cfb07b2f       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8677b12826bd9d33e8f6bb96d70bbcdaf3c5dec9e28a8b72af2c2e60620b6b19   33 minutes ago      Running             coredns                          0                   09622bf07804a       coredns-yunjiang-dnsgcp9-4xfj4-bootstrap

    

Additional info:

    

Description of problem:

Observing these e2e failures consistently
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 7 base images [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 8 base images [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 9 base images [Suite:openshift/conformance/parallel]
    

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Fails consistently, and also fails when the steps are run manually

Steps to Reproduce:

    1. Setup 4.18 cluster
    2. Run e2e 
    3. Test cases in file - origin/test/extended/builds/subscription_content.go   

raw - https://raw.githubusercontent.com/openshift/origin/f7e4413793877efb24be86de05319dad00d05897/test/extended/builds/subscription_content.go 

Actual results:

    Test case fails 

Expected results:

    

Additional info:

Failures were observed in both OCP-4.17 as well as OCP-4.18. Following are the logs.

4.18-failure-log

4.17-failure-log

 

Description of problem:

The font size of `BuildSpec details` on the BuildRun details page is larger than the other titles on the page

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to Shipwright BuildRun details page
    

Actual results:

    The font size of `BuildSpec details` on the BuildRun details page is larger than the other titles on the page

Expected results:

    All titles should be the same size

Additional info:
Screenshots
https://github.com/user-attachments/assets/74853838-1fff-46d5-9ed6-5b605caebbf0

 

Description of problem:

We're seeing the following error in the ovnkube-controller container log on RHEL-8 workers, which leaves the node's network not ready

F0206 03:40:21.953369   12091 ovnkube.go:137] failed to run ovnkube: [failed to start network controller: failed to start default network controller - while waiting for any node to have zone: "ip-10-0-75-250.ec2.internal", error: context canceled, failed to start node network controller: failed to start default node network controller: failed to find kubelet cgroup path: %!w(<nil>)]

The full log of the ovnkube-controller container:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-workers-rhel8/1887322975150018560/artifacts/e2e-aws-ovn-workers-rhel8/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-js6vn_ovnkube-controller.log

Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2025-02-05-033447/4.18.0-0.nightly-2025-02-04-192134

How reproducible:
Always

Steps to Reproduce:
1. Add a RHEL-8 worker to a 4.18 OCP cluster. The RHEL workers can't become Ready, and the following error about ovnkube-controller appears in the kubelet.log

Feb 06 11:38:34 ip-10-0-50-48.us-east-2.compute.internal kubenswrapper[15267]: E0206 11:38:34.798490   15267 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"ovnkube-controller\" with CrashLoopBackOff: \"back-off 20s restarting failed container=ovnkube-controller pod=ovnkube-node-txkkp_openshift-ovn-kubernetes(c22474ab-6f0b-4403-93a6-eb80766934e6)\"" pod="openshift-ovn-kubernetes/ovnkube-node-txkkp" podUID="c22474ab-6f0b-4403-93a6-eb80766934e6"

An example failure job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-workers-rhel8/1887322975150018560

2.

3.

Actual results:

Expected results:

Additional info:
Based on the test history, it's working for 4.18.0-0.nightly-2025-02-04-114552, but start failing for 4.18.0-0.nightly-2025-02-05-033447.
(Update: confirmed it's also failed on 4.18.0-0.nightly-2025-02-04-192134)

Here's the daily 4.18 rhel8 job history link:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-workers-rhel8


Description of problem:

PF "-v5-" classes are breaking for the plugin on console's latest main branch.

Ref: https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1740475225829469

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Run console latest "main" locally.
    2. Run odf-console plugin latest "master" locally (https://github.com/red-hat-storage/odf-console), or any demo/actual plugin using PatternFly version 5.
    3. Navigate to pages/components using classes like: "pf-v5-u-mt-md", "pf-v5-u-pt-sm" etc. 

Actual results:

    pf-v5-* classes are not getting applied.

Expected results:

    PF v5 classes should work as expected.

Additional info:

    

Description of problem:


In order to test OCL we run e2e automated test cases in a cluster that has OCL enabled in master and worker pools.

We have seen that rarely a new machineconfig is rendered but no MOSB resource is created.




    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Rare
    

Steps to Reproduce:

We don't have any steps to reproduce it. It happens occasionally when we run a regression in a cluster with OCL enabled in the master and worker pools.


    

Actual results:

We see that in some scenarios a new MC is created, then a new rendered MC is created too, but no MOSB is created and the pool is stuck forever.

    

Expected results:

Whenever a new rendered MC is created, a new MOSB should be created too to build the new image.

    

Additional info:

In the comments section we will add all the must-gather files that are related to this issue.


In some scenarios we can see this error reported by the os-builder pod:


2024-12-03T16:44:14.874310241Z I1203 16:44:14.874268       1 request.go:632] Waited for 596.269343ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/secrets?labelSelector=machineconfiguration.openshift.io%2Fephemeral-build-object%2Cmachineconfiguration.openshift.io%2Fmachine-os-build%3Dmosc-worker-5fc70e666518756a629ac4823fc35690%2Cmachineconfiguration.openshift.io%2Fon-cluster-layering%2Cmachineconfiguration.openshift.io%2Frendered-machine-config%3Drendered-worker-7c0a57dfe9cd7674b26bc5c030732b35%2Cmachineconfiguration.openshift.io%2Ftarget-machine-config-pool%3Dworker


Nevertheless, we only see this error in some of them, not in all of them.

    

Description of problem:

When CCO does not provide credentials for CSI driver operators, the CSI driver operators are Progressing=true forever, with a message: 

Operator unavailable (GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying): GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment  Operator unavailable (GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying): GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment

(full job run).

This will be further emphasized by CCO being an optional component in OCP 4.15. CSO / CSI driver operators should provide a more useful error, and even degrade the cluster, when the Secret is not available within X minutes. The message should point to CCO, so users know where to look for details.

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-vsphere-ovn-upi-serial/1861922894817267712

Undiagnosed panic detected in pod
{  pods/openshift-machine-config-operator_machine-config-daemon-4mzxf_machine-config-daemon_previous.log.gz:E1128 00:28:30.700325    4480 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}

Description of problem:

    openshift-install always raises a WARNING when installing a cluster in AWS region us-east-1 with the default configuration (no zones set).

~~~
WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint 
WARNING failed to find default instance type: no instance type found for the zone constraint 
~~~

The process to discover zones lists all zones in the region, then tries to describe instance type offerings across all zones for the installer's list of supported instance types.

The problem is that there is a "dead" zone in this region, us-east-1e (ID use1-az3), which does not support any instance type we support. This leads to infra resources being created in a zone which isn't useful, as supported instance types may not be able to launch there; see the sketch below.
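
A small illustrative sketch (not the installer's code) of the kind of filtering that would avoid the dead zone: drop any zone whose DescribeInstanceTypeOfferings results contain none of the installer-supported types. The zone and instance-type data below are made up for the example.

```
package main

import "fmt"

// filterUsableZones drops zones that offer none of the installer-supported
// instance types, so infra resources are not created in a "dead" zone such as
// us-east-1e (use1-az3). offeringsByZone would come from
// ec2:DescribeInstanceTypeOfferings results.
func filterUsableZones(zones []string, offeringsByZone map[string][]string, supported map[string]bool) []string {
	usable := make([]string, 0, len(zones))
	for _, z := range zones {
		for _, it := range offeringsByZone[z] {
			if supported[it] {
				usable = append(usable, z)
				break
			}
		}
	}
	return usable
}

func main() {
	zones := []string{"us-east-1a", "us-east-1e"}
	offerings := map[string][]string{
		"us-east-1a": {"m6i.xlarge", "m5.xlarge"},
		"us-east-1e": {"c1.medium"}, // illustrative: nothing the installer supports
	}
	supported := map[string]bool{"m6i.xlarge": true, "m5.xlarge": true}
	fmt.Println(filterUsableZones(zones, offerings, supported)) // [us-east-1a]
}
```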

Version-Release number of selected component (if applicable):

* (?)    

How reproducible:

always    

Steps to Reproduce:

    1. create install-config targeting AWS region us-east-1, without setting zones (default)
    2. create manifests, or create cluster
  
    

Actual results:

~~~
WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint  WARNING failed to find default instance type: no instance type found for the zone constraint 
~~~

The WARNING is raised, but the install does not fail because the fallback instance type is supported across the zones used for the control plane and worker nodes

Expected results:

No WARNINGS/failures    

Additional info:

 

Description of problem:

OpenShift cluster upgrade from 4.12.10 to 4.12.30 is failing because the pod version-4.12.30-xxx is in CreateContainerConfigError. Also tested on 4.14

Steps to Reproduce:

  • Deploy new 4.12.10 cluster
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.10   True        False         12m     Cluster version is 4.12.10
  • Create the following SCC
---
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
kind: SecurityContextConstraints
metadata:
  name: scc-hostpath-cnf-cat-1
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- hostPath
- persistentVolumeClaim
- projected
- secret
  • Upgrade to 4.12.30
$ oc adm upgrade --to=4.12.30
$ oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS                            RESTARTS   AGE
cluster-version-operator-85db98885c-jt25z   1/1     Running                           0          41m
version-4.12.30-vw4pm-l2nng                 0/1     Init:CreateContainerConfigError   0          42s

$ oc get events | grep Failed
10s         Warning   Failed                  pod/version-4.12.30-p6k4r-nmn6m                  Error: container has runAsNonRoot and image will run as root (pod: "version-4.12.30-p6k4r-nmn6m_openshift-cluster-version(4d1704d9-ca34-4aa3-86e1-1742e8cead0c)", container: cleanup)

$ oc get pod version-4.12.30-97nbr-88mxp -o yaml  |grep scc
    openshift.io/scc: scc-hostpath-cnf-cat-1

As a workaround, we can remove the SCC "scc-hostpath-cnf-cat-1" and the version-xxx pod, after which the upgrade works. The customer created the custom SCC for use by their applications.

$ oc get pod version-4.12.30-nmskz-d5x2c -o yaml | grep scc
    openshift.io/scc: node-exporter

$ oc get pod
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6cb5557f8f-v65vb   1/1     Running     0          54s
version-4.12.30-nmskz-d5x2c                 0/1     Completed   0          67s

There's an old bug https://issues.redhat.com/browse/OCPBUGSM-47192 which was fixed by setting readOnlyRootFilesystem to false, but in this case the SCC is still causing the failure.

https://github.com/openshift/cluster-version-operator/blob/release-4.12/pkg/cvo/updatepayload.go#L206

---
container.SecurityContext = &corev1.SecurityContext{
	Privileged:             pointer.BoolPtr(true),
	ReadOnlyRootFilesystem: pointer.BoolPtr(false),
}
---
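
Not the actual fix, but a hedged mitigation sketch: pinning the version pod to a specific SCC with the openshift.io/required-scc annotation (available in recent OpenShift releases; treating its applicability here as an assumption) so that a higher-priority custom SCC such as scc-hostpath-cnf-cat-1 is not selected during admission.

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinRequiredSCC asks SCC admission to use a specific SCC for this pod rather
// than picking the highest-priority SCC the service account can use, which is
// how the custom scc-hostpath-cnf-cat-1 ends up on the version pod.
func pinRequiredSCC(pod *corev1.Pod, scc string) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["openshift.io/required-scc"] = scc
}

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "version-4.12.30-example"}}
	pinRequiredSCC(pod, "privileged")
	fmt.Println(pod.Annotations)
}
```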

Component Readiness has found a potential regression in the following test:

install should succeed: infrastructure

installer fails with:

time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded" 

Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=azure&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Installer%20%2F%20openshift-installer&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-10-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-14%2000%3A00%3A00&testId=cluster%20install%3A3e14279ba2c202608dd9a041e5023c4c&testName=install%20should%20succeed%3A%20infrastructure

Description of problem: [EIP UDN Layer 3 pre-merge testing] In SGW and LGW modes, after restarting the ovnkube-node pod of the client host of the local EIP pod, EIP traffic from the remote EIP pod cannot be captured on the egress node's ovs-if-phys0

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. labeled a node to be egress node, created an egressIP object

2. Created a namespace, applied layer3 UDN CRD to it

3. Created two test pods, one local to egress node, the other one is remote to egress node

4. Restarted the ovnkube-node pod of the local EIP pod's client host (or egress node), waited till the ovnkube-node pod recreated and ovnkube-node ds rollout succeeded

5. Curl external from both test pods

Actual results: egressing packets from remote EIP pod can not be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node 

Expected results:  egressing packets from either EIP pod can be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node

Additional info:

egressing packets from local EIP pod can not be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node 

 

must-gather: https://drive.google.com/file/d/12aonBDHMPsmoGKmM47yGTBFzqyl1IRlx/view?usp=drive_link

 


Description of problem:

When a primary UDN or CUDN is created, it creates what is known as a secondary zone network controller that handles configuring OVN and getting the network created so that pods can be attached. The time it takes for this to happen can be up to a minute if namespaces are being deleted on the cluster while the UDN controller is starting.

 

This is because if the namespace is deleted, GetActiveNetworkForNamespace will fail, and the pod will be retried for up to a minute.
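
A minimal sketch of the idea (not OVN-Kubernetes' actual retry code): treat a NotFound error for the namespace as terminal instead of retrying for up to a minute, so secondary network controller startup is not delayed by namespaces that are already gone.

```
package main

import (
	"errors"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// shouldRetryPodAdd decides whether a failed GetActiveNetworkForNamespace call
// is worth retrying. If the namespace is already gone, retrying only delays
// controller startup, so the error is treated as terminal and the work item
// is dropped.
func shouldRetryPodAdd(err error) bool {
	if err == nil {
		return false
	}
	if apierrors.IsNotFound(err) {
		// The namespace (and therefore its pods) is being deleted.
		return false
	}
	return true
}

func main() {
	nsGone := apierrors.NewNotFound(schema.GroupResource{Resource: "namespaces"}, "blue")
	fmt.Println(shouldRetryPodAdd(nsGone))             // false: skip retry
	fmt.Println(shouldRetryPodAdd(errors.New("boom"))) // true: transient error, retry
}
```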

 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Create a bunch of namespaces with pods

2. Create a primary CUDN or UDN

3. Very quickly start deleting namespaces that have pods in them while OVNK is starting up its network controller

Actual results:

 

Logs show time to start the controller taking a minute:
I0131 16:15:30.383221 5583 secondary_layer2_network_controller.go:365] Starting controller for secondary network e2e-network-segmentation-e2e-8086.blue
I0131 16:16:30.390813 5583 secondary_layer2_network_controller.go:369] Starting controller for secondary network e2e-network-segmentation-e2e-8086.blue took 1m0.007579788s

Expected results:

Once started the controller should only take a few seconds (depending on cluster size and load) to finish starting.

 

 

Description of problem:

There is a "Learn more about Route" doc link on the Routes page when there is no route in the project; it points to the 4.16 doc link "https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/networking/configuring-routes". It should point to the 4.17 docs for a 4.17 cluster and the 4.18 docs for a 4.18 cluster.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-18-003538
4.18.0-0.nightly-2024-09-17-060032
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check "Learn more about Route" doc link on routes page.
    2.
    3.
    

Actual results:

1. It points to 4.16 link.
    

Expected results:

1. It should point to 4.17 for 4.17 cluster and 4.18 for 4.18 cluster.
    

Additional info:


    

Description of problem:

    Pull support from upstream kubernetes (see KEP 4800: https://github.com/kubernetes/enhancements/issues/4800) for LLC alignment support in cpumanager

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    100%

Steps to Reproduce:

    1. try to schedule a pod which requires exclusive CPU allocation and whose CPUs should be affine to the same LLC block
    2. observe random and likely wrong (not LLC-aligned) allocation
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    


Description of problem:

Issue can be observed on Hypershift e2e-powervs CI (reference : https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-powervs-ovn/1879800904798965760)

The HostedCluster is deployed but still reports an incorrect status condition: "HostedCluster is deploying, upgrading, or reconfiguring".

This is happening because of the following issue observed with the cluster version:

Logs reference : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-powervs-ovn/1879800904798965760/artifacts/e2e-powervs-ovn/run-e2e/build-log.txt

```
 eventually.go:226:  - incorrect condition: wanted Progressing=False, got Progressing=True: Progressing(HostedCluster is deploying, upgrading, or reconfiguring)
    eventually.go:226:  - wanted HostedCluster to desire image registry.build01.ci.openshift.org/ci-op-cxr9zifq/release@sha256:7e40dc5dace8cb816ce91829517309e3609c7f4f6de061bf12a8b21ee97bb713, got registry.build01.ci.openshift.org/ci-op-cxr9zifq/release@sha256:e17cb3eab53be67097dc9866734202cb0f882afc04b2972c02997d9bc1a6e96b
    eventually.go:103: Failed to get *v1beta1.HostedCluster: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
```

Note : Issue oberved on 4.19.0 Hypershift e2e-aws-multi CI as well (reference : https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-multi/1880072687829651456)

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    100%

Steps to Reproduce:

    1. Create PowerVS hypershift cluster with 4.18.0 release
    2.
    3.
    

Actual results:

    The HostedCluster gets deployed but still reports an incorrect cluster-version condition

Expected results:

    HostedCluster should get deployed successfully with all conditions met

Additional info:

    The issue was first observed on Dec 25, 2024. Currently reproducible on 4.19.0 (reference : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-multi/1880072687829651456/artifacts/e2e-aws-multi/hypershift-aws-run-e2e-external/build-log.txt)

During a review of ARO MiWi permissions, the MAPI CredentialsRequest for Azure was found to be missing some permissions that are identified through linked actions of its existing permissions.

A linked access check is an action performed by Azure Resource Manager during an incoming request. For example, when you issue a create operation for a network interface (Microsoft.Network/networkInterfaces/write) you specify a subnet in the payload. ARM parses the payload, sees you're setting a subnet property, and as a result requires the linked access check Microsoft.Network/virtualNetworks/subnets/join/action against the subnet resource specified in the network interface. If you update a resource but don't include the property in the payload, ARM will not perform that permission check.

The following permissions were identified as possibly needed in the MAPI CredentialsRequest, as they are specified as linked actions of one of MAPI's existing permissions:

Microsoft.Compute/disks/beginGetAccess/action
Microsoft.KeyVault/vaults/deploy/action
Microsoft.ManagedIdentity/userAssignedIdentities/assign/action
Microsoft.Network/applicationGateways/backendAddressPools/join/action
Microsoft.Network/applicationSecurityGroups/joinIpConfiguration/action
Microsoft.Network/applicationSecurityGroups/joinNetworkSecurityRule/action
Microsoft.Network/ddosProtectionPlans/join/action
Microsoft.Network/gatewayLoadBalancerAliases/join/action
Microsoft.Network/loadBalancers/backendAddressPools/join/action
Microsoft.Network/loadBalancers/frontendIPConfigurations/join/action
Microsoft.Network/loadBalancers/inboundNatPools/join/action
Microsoft.Network/loadBalancers/inboundNatRules/join/action
Microsoft.Network/networkInterfaces/join/action
Microsoft.Network/networkSecurityGroups/join/action
Microsoft.Network/publicIPAddresses/join/action
Microsoft.Network/publicIPPrefixes/join/action
Microsoft.Network/virtualNetworks/subnets/join/action

Each permission needs to be validated as to whether it is needed by MAPI through any of its code paths.

Description of problem:

    Registry storage alerts do not link to a runbook

Version-Release number of selected component (if applicable):

   4.18 

How reproducible:

    always

Steps to Reproduce:

According to the doc https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required, a runbook link should be added to the registry storage alert introduced in https://github.com/openshift/cluster-image-registry-operator/pull/1147/files.

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In Azure, there are 2 regions that don't have availability zones or availability set fault domains (centraluseuap, eastusstg). They are test regions, one of which is in-use by the ARO team.

The Machine API provider seems to hardcode an availability set fault domain count of 2 when creating the availability set: https://github.com/openshift/machine-api-provider-azure/blob/main/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L32, so if the target region does not support a fault domain count of at least 2, the install will fail because worker nodes get a Failed status.

This is the error from Azure, reported by the machine API:

`The specified fault domain count 2 must fall in the range 1 to 1.`

Because of this, the regions are not able to support OCP clusters.

Version-Release number of selected component (if applicable):

    Observed in 4.15

How reproducible:

    Very

Steps to Reproduce:

    1. Attempt creation of an OCP cluster in centraluseuap or eastusstg regions
    2. Observe worker machine failures
    

Actual results:

    Worker machines get a failed state

Expected results:

    Worker machines are able to start. I am guessing that this would happen via dynamic setting of the availability set fault domain count rather than hardcoding it to 2, which right now just happens to work in most regions in Azure because the fault domain counts are typically at least 2.

In upstream, it looks like we're dynamically setting this by querying the amount of fault domains in a region: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/40f0fabc264388de02a88de7fbe400c21d22e7e2/azure/services/availabilitysets/spec.go#L70
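
A minimal sketch of the expected behaviour, assuming the region's maximum fault domain count is queried from the compute SKU capabilities as the upstream provider does: clamp the desired count instead of hardcoding 2.

```
package main

import "fmt"

// faultDomainCount clamps the desired availability-set fault domain count to
// what the target region supports, instead of hardcoding 2. regionMax would
// come from querying the region's compute SKU capabilities, as the upstream
// CAPZ provider does.
func faultDomainCount(desired, regionMax int32) int32 {
	if regionMax < 1 {
		regionMax = 1
	}
	if desired > regionMax {
		return regionMax
	}
	return desired
}

func main() {
	fmt.Println(faultDomainCount(2, 3)) // 2: typical Azure region
	fmt.Println(faultDomainCount(2, 1)) // 1: centraluseuap / eastusstg
}
```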

Additional info:

    

The image tests are currently failing on presubmits because they cannot build our hypershift-tests image. This is caused by the fact that in 4.19 the dnf used in CI is really a wrapper, and we now need to install the Microsoft repositories in /etc/yum/repos.art/ci/ so that the azure-cli can be found when attempting to install it with dnf.

Slack Thread: https://redhat-internal.slack.com/archives/CB95J6R4N/p1737060704212009

Description of problem:

We have been asked by the PowerVS cloud team to change how we destroy DHCP networks, because after they have been deleted there are still sub-resources in the process of being deleted. So we also check to make sure there are no subnets still open, as they will be deleted as well.
    

Description of problem:

[azure-disk-csi-driver] ARO HCP using UserAssignedIdentityCredentials cannot provision volumes

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-02-26-050226    

How reproducible:

  Always 

Steps to Reproduce:

    1. Install an ARO hypershift cluster using UserAssignedIdentityCredentials mode.
    2. Create a PVC using the managed-csi (azure disk csi provisioner) storageclass, and create a pod that consumes the PVC.
    3. Check that the PVC provisions successfully and the pod starts running.
    

Actual results:

  In step 3, the PVC provisioning fails with ->
I0225 13:44:52.851851       1 controllerserver.go:281] begin to create azure disk(pvc-99f969ab-3629-4729-87c7-e796e081f27e) account type(Premium_LRS) rg(generic-managed-rg) location(eastus) size(5) diskZone() maxShares(0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x68 pc=0x21ced59]

goroutine 516 [running]:
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*ManagedDiskController).CreateManagedDisk(0xc000624668, {0x2e5c030, 0xc0004fef30}, 0xc00030d040)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/azure_managedDiskController.go:271 +0x1759
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc00033c800, {0x2e5c030, 0xc0004fef30}, 0xc00029ec80)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:332 +0x36c5
...  

Expected results:

 In step 3, the PVC provisions successfully and the pod starts running.

Additional info:

 From the CSI driver controller logs we can see ->
azure_auth.go:175] "No valid auth method found" logger="GetServicePrincipalToken"

It seems the current cloud-provider-azure dependency (https://github.com/openshift/azure-disk-csi-driver/blob/master/go.mod#L46) does not contain support for UserAssignedIdentityCredentials.
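
The types below are hypothetical stand-ins, not the driver's real structures; the sketch only illustrates the kind of guard that would turn the SIGSEGV into an actionable error pointing at the auth problem.

```
package main

import (
	"errors"
	"fmt"
)

// diskClients is a hypothetical stand-in for the ARM clients the driver builds
// from its cloud config. When the vendored cloud-provider-azure cannot handle
// UserAssignedIdentityCredentials, auth setup fails and the client stays nil.
type diskClients struct {
	disks interface{ CreateOrUpdate() error }
}

// createManagedDisk guards the nil client instead of dereferencing it, turning
// the observed panic into an error that points at the auth problem.
func createManagedDisk(c *diskClients) error {
	if c == nil || c.disks == nil {
		return errors.New("azure disks client not initialized: no valid auth method found (UserAssignedIdentityCredentials unsupported by the vendored cloud provider)")
	}
	return c.disks.CreateOrUpdate()
}

func main() {
	fmt.Println(createManagedDisk(&diskClients{})) // clear error instead of SIGSEGV
}
```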

Description of problem:

    Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers:

https://issues.redhat.com/browse/OCPBUGS-45924
https://issues.redhat.com/browse/OCPBUGS-46372
https://issues.redhat.com/browse/OCPBUGS-48276

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

While working on ARO-13685 I (accidentally) crashed the CVO payload init containers. 

I found that the removal logic based on plain "rm" is not idempotent, so if any of the init containers crash mid-way, the restart will never be able to succeed.

The fix is to use "rm -f" in all places instead.
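
The actual fix is switching the shell commands to "rm -f"; the Go sketch below (with an illustrative path) just shows the idempotency property the cleanup needs: a missing file is not an error, so a restarted init container can safely re-run it.

```
package main

import (
	"fmt"
	"os"
)

// removeIfExists mirrors "rm -f" semantics: a missing file is not an error,
// so a restarted init container can re-run the same cleanup safely.
func removeIfExists(paths ...string) error {
	for _, p := range paths {
		if err := os.Remove(p); err != nil && !os.IsNotExist(err) {
			return fmt.Errorf("removing %s: %w", p, err)
		}
	}
	return nil
}

func main() {
	// Illustrative path; the second call is a no-op rather than a failure.
	_ = removeIfExists("/tmp/payload/manifest.yaml")
	fmt.Println(removeIfExists("/tmp/payload/manifest.yaml")) // <nil>
}
```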

Version-Release number of selected component (if applicable):

4.18 / main, but existed in prior versions    

How reproducible:

always    

Steps to Reproduce:

    1. inject a crash in the bootstrap init container https://github.com/openshift/hypershift/blob/99c34c1b6904448fb065cd65c7c12545f04fb7c9/control-plane-operator/controllers/hostedcontrolplane/cvo/reconcile.go#L353 

    2. on restart, the earlier init container "prepare-payload" will crash loop because "rm" fails, as the previous invocation already deleted all the manifests
  
    

Actual results:

the prepare-payload init container will crash loop forever, preventing the container CVO from running

Expected results:

a crashing init container should be able to restart gracefully without getting stuck on file removal and eventually run the CVO container

Additional info:

based off the work in https://github.com/openshift/hypershift/pull/5315    

Description of problem:

when deleting platform images, oc-mirror fails with the error:
Unable to delete my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com/openshift/release:4.15.37-s390x-alibaba-cloud-csi-driver. Image may not exist or is not stored with a v2 Schema in a v2 registry

Version-Release number of selected component (if applicable):

./oc-mirror.rhel8 version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411090338.p0.g0a7dbc9.assembly.stream.el9-0a7dbc9", GitCommit:"0a7dbc90746a26ddff3bd438c7db16214dcda1c3", GitTreeState:"clean", BuildDate:"2024-11-09T08:33:46Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. imagesetconfig as follow :  

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest                        
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator  
  platform:
    architectures:
      - "s390x"
    channels:
    - name: stable-4.15
      type: ocp

  2. run the mirror2disk and disk2mirror command : 

`oc mirror -c /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/config-72708.yaml file://test/yinzhou/debug72708   --v2`

`oc mirror -c /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/config-72708.yaml --from file://test/yinzhou/debug72708   --v2 docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --dest-tls-verify=false`

3. generate delete image list:

`oc mirror delete --config /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/delete-config-72708.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2 --workspace file://test/yinzhou/debug72708  --generate`

4. execute the delete command :

`oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2  --dest-tls-verify=false --force-cache-delete=true`


 

Actual results:

4. The delete command hits an error:

⠋   21/396 : (0s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd62cc631a6bf6e13366d29da5ae64088d3b42410f9b52579077cc82d2ea2ab9 
2024/11/12 03:10:07  [ERROR]  : Unable to delete my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com/openshift/release:4.15.37-s390x-alibaba-cloud-csi-driver. Image may not exist or is not stored with a v2 Schema in a v2 registry 

Expected results:

4. no error

Additional info:

 

 

Description of problem:

network-tools pod-run-netns-command fails with "ERROR: Can't get netns pid".
It seems the container runtime changed from runc to crun, so we need to update the network-tools utils:
https://github.com/openshift/network-tools/blob/1df82dfade80ce31b325dab703b37bf7e8924e99/debug-scripts/utils#L108 
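
The real fix belongs in the shell utils script, but as a sketch of the runtime-agnostic direction (assuming crictl's JSON inspect output exposes info.pid, as it does with CRI-O), the netns PID can be looked up via crictl instead of shelling out to runc:

```
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// containerPID looks up a container's PID via crictl, which works whether the
// node runtime is runc or crun; the current script shells out to runc
// directly, which finds nothing on crun nodes.
func containerPID(containerID string) (int, error) {
	out, err := exec.Command("crictl", "inspect", "--output", "json", containerID).Output()
	if err != nil {
		return 0, fmt.Errorf("crictl inspect: %w", err)
	}
	var inspect struct {
		Info struct {
			PID int `json:"pid"`
		} `json:"info"`
	}
	if err := json.Unmarshal(out, &inspect); err != nil {
		return 0, err
	}
	if inspect.Info.PID == 0 {
		return 0, fmt.Errorf("no pid found for container %s", containerID)
	}
	return inspect.Info.PID, nil
}

func main() {
	pid, err := containerPID("<container-id>") // placeholder ID for the sketch
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("netns pid:", pid)
}
```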

Version-Release number of selected component (if applicable):

4.18.0-0.test-2024-11-27-013900-ci-ln-s87rfh2-latest

How reproducible:

always

Steps to Reproduce:

1. create test pod in namespace test
$ oc get pod -n test
NAME         READY   STATUS    RESTARTS   AGE
hello-pod2   1/1     Running   0          22s

2.run command "ip a" with network-tools script pod-run-netns-command       

Actual results:

$ ./network-tools pod-run-netns-command test hello-pod2 "ip route show"
Temporary namespace openshift-debug-btzwc is created for debugging node...
Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ...
To use host binaries, run `chroot /host`


Removing debug pod ...
Temporary namespace openshift-debug-btzwc was removed.
error: non-zero exit code from debug container
ERROR: Can't get netns pid   <--- Failed


INFO: Running ip route show in the netns of pod hello-pod2
Temporary namespace openshift-debug-l7xv4 is created for debugging node...
Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ...
To use host binaries, run `chroot /host`
nsenter: failed to parse pid: 'parse'


Removing debug pod ...
Temporary namespace openshift-debug-l7xv4 was removed.
error: non-zero exit code from debug container
ERROR: Command returned non-zero exit code, check output or logs.

Expected results:

The command runs successfully via the network-tools pod-run-netns-command script.

Additional info:

`runc list` shows no containers (the containers are managed by crun):
$ oc debug node/qiowang-120303-zb568-worker-0-5phll
Temporary namespace openshift-debug-hrr94 is created for debugging node...
Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.2.190
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# runc list
ID          PID         STATUS      BUNDLE      CREATED     OWNER
sh-5.1#

Description of problem:

The HorizontalNav component of @openshift-console/dynamic-plugin-sdk does not have the customData prop, which is available in the console repo.

This prop is needed to pass values between tabs on a details page.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Create a UDN network and check the NAD list; the UDN network also appears there.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:

The UDN network only appears under UDN, not in the NAD list.

Expected results:


Additional info:


Description of problem:

On a HyperShift guest cluster, the 'Filter by Node type' list on the Cluster utilization card is empty.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-11-221222    

How reproducible:

Always    

Steps to Reproduce:

1. click on 'Filter by Node type' dropdown in Cluster utilization card on Home -> Overview

Actual results:

1. The dropdown list is empty    

Expected results:

1. A HyperShift guest cluster only has worker nodes, so we probably only need to present the `worker` option

$ oc get node --kubeconfig ./kubeconfig-guest 
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-141-150.us-east-2.compute.internal   Ready    worker   4h21m   v1.31.3
ip-10-0-148-36.us-east-2.compute.internal    Ready    worker   4h21m   v1.31.3
ip-10-0-167-238.us-east-2.compute.internal   Ready    worker   4h22m   v1.31.3
    

Additional info:

    

This is a clone of issue OCPBUGS-40772. The following is the description of the original issue:

Please review the following PR: https://github.com/openshift/service-ca-operator/pull/246

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

I was going through the OperatorHub and searching for "apache zookeeper operator" and "stackable common operator." When I clicked on those operators, I got an `Oh no, something went wrong` error message.

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    100%

Steps to Reproduce:

    1. Go to the web console and then go to operator hub.
    2. Search for "apache zookeeper operator" and  "stackable common operator"
    3. And then click on that operator to install, and you will see that error message.

Actual results:

    Getting Oh no something went wrong message

Expected results:

    Should show the next page to install the operator

Additional info:

    

Description of problem:

    After the hostpath-provisioner installation, the csi-provisioner should create the CSIStorageCapacity objects, but the creation fails with the following error.

I0120 04:14:26.724089 1 controller.go:873] "Started provisioner controller" component="kubevirt.io.hostpath-provisioner_hostpath-provisioner-csi-smrh4_6264f065-eaed-47aa-90a9-3a2563c965c2"
W0120 04:14:28.020372 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list v1.CSIStorageCapacity: unable to parse requirement: values[0][csi.storage.k8s.io/managed-by]: Invalid value: "-53f40b57-worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com": a valid label must be an empty string or consist of alphanumeric characters, '-', '' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9.])?[A-Za-z0-9])?')
E0120 04:14:28.020406 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSIStorageCapacity: failed to list v1.CSIStorageCapacity: unable to parse requirement: values[0][csi.storage.k8s.io/managed-by]: Invalid value: "-53f40b57-worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com": a valid label must be an empty string or consist of alphanumeric characters, '-', '' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9.])?[A-Za-z0-9])?')


This issue occurs when the worker nodes have long node names, e.g.:
worker0.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com
worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com

It occurs when the worker node name is exactly 53 characters long.

For more details, refer to:

https://github.com/kubernetes-csi/external-provisioner/issues/1333
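
A small runnable illustration using the upstream apimachinery validator: with a 53-character node name, the truncated csi.storage.k8s.io/managed-by value from the log above is 63 characters long and starts with '-', so label validation rejects it.

```
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

func main() {
	// Truncated csi.storage.k8s.io/managed-by value from the error above: with a
	// 53-character node name it is 63 characters long and starts with '-'.
	value := "-53f40b57-worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com"
	fmt.Println(len(value)) // 63
	for _, msg := range validation.IsValidLabelValue(value) {
		fmt.Println(msg) // explains why the label value is rejected
	}
}
```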

Version-Release number of selected component (if applicable):

    

How reproducible:

corner case.

Steps to Reproduce:

    1. Create the OpenShift cluster with long worker node names. The worker node name should be exactly 53 characters.
    2. Install the hostpath provisioner in WaitForFirstConsumer mode.
    3. Try to create a PVC using the hostpath-provisioner storage class.
    

Actual results:

    The csi-provisioner sidecar is throwing these errors:
I0120 04:14:26.724089 1 controller.go:873] "Started provisioner controller" component="kubevirt.io.hostpath-provisioner_hostpath-provisioner-csi-smrh4_6264f065-eaed-47aa-90a9-3a2563c965c2"
W0120 04:14:28.020372 1 reflector.go:547] k8s.io/client-go/informers/factory.go:160: failed to list v1.CSIStorageCapacity: unable to parse requirement: values[0][csi.storage.k8s.io/managed-by]: Invalid value: "-53f40b57-worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com": a valid label must be an empty string or consist of alphanumeric characters, '-', '' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9.])?[A-Za-z0-9])?')
E0120 04:14:28.020406 1 reflector.go:150] k8s.io/client-go/informers/factory.go:160: Failed to watch *v1.CSIStorageCapacity: failed to list v1.CSIStorageCapacity: unable to parse requirement: values[0][csi.storage.k8s.io/managed-by]: Invalid value: "-53f40b57-worker1.ocp-virt1-s390x.s390g.lab.eng.rdu2.redhat.com": a valid label must be an empty string or consist of alphanumeric characters, '-', '' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9.])?[A-Za-z0-9])?')

Expected results:

    It should successfully create the CSI storage capacities without any errors.

Additional info:

    

Description of problem:

OpenShift - the log in button can be clicked repeatedly

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

 1. Login to OpenShift
 2. Enter a valid Username and Password
 3. Click the 'Log in' button repeatedly

Actual results:

 The login button can be clicked repeatedly 

 

Expected results:

The login button can currently be clicked several times, which could have negative consequences elsewhere in the application. Ideally, the button should become unclickable once clicked to prevent this behaviour.

Additional info:

    

Description of problem:

    When an empty value is passed to the cloud.google.com/network-tier annotation, the service remains stuck in Pending.

Version-Release number of selected component (if applicable):

    4.19.0-0.nightly-2025-01-15-060507

How reproducible:

    Always

Steps to Reproduce:

    1.Create service use below yaml 
apiVersion: v1
kind: Service
metadata:
  labels:
    name: lb-service-unsecure
  name: lb-service-unsecure
  annotations:
     cloud.google.com/network-tier: ""
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    name: web-server-rc
  type: LoadBalancer     

2. Service gets created successfully 
3. miyadav@miyadav-thinkpadx1carbongen8:~/annotations$ oc get svc lb-service1
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
lb-service1   LoadBalancer   172.30.233.126   <pending>     80:30502/TCP   9m36s
miyadav@miyadav-thinkpadx1carbongen8:~/annotations$      

Actual results:

service stuck in pending state

Expected results:

     If possible it should produce an error (for example, an admission webhook could reject an empty or invalid annotation value); see the sketch below.
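
A hedged sketch of the kind of validation an admission webhook (or the service controller) could apply, assuming the tiers GCP accepts are Premium and Standard:

```
package main

import (
	"fmt"
	"strings"
)

// validateNetworkTier sketches the check a webhook or the service controller
// could run on the cloud.google.com/network-tier annotation, assuming GCP
// accepts the tiers "Premium" and "Standard".
func validateNetworkTier(value string) error {
	switch strings.ToLower(value) {
	case "premium", "standard":
		return nil
	default:
		return fmt.Errorf("unsupported network tier: %q (expected Premium or Standard)", value)
	}
}

func main() {
	fmt.Println(validateNetworkTier("Premium")) // <nil>
	fmt.Println(validateNetworkTier(""))        // rejected up front instead of leaving the LB pending
}
```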

Additional info:
`oc describe svc` does give info like the below -

Events:
  Type     Reason                  Age                From                Message
  ----     ------                  ----               ----                -------
  Normal   EnsuringLoadBalancer    33s (x8 over 10m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  32s (x8 over 10m)  service-controller  Error syncing load balancer: failed to ensure load balancer: unsupported network tier: ""
 
    

Description of problem:

Note this is without UDN

While tracking nightly runs we see a big jump in ovnkube-controller CPU utilization between the 4.18.0-0.nightly-2025-01-28-114040 and 4.18.0-0.nightly-2025-01-28-165333 nightlies, and between the 4.19.0-0.nightly-2025-01-25-021606 and 4.19.0-0.nightly-2025-01-27-130640 nightlies, which corresponds to the https://github.com/openshift/ovn-kubernetes/pull/2426 and https://github.com/openshift/ovn-kubernetes/pull/2420 PRs being merged.

Version-Release number of selected component (if applicable): 4.18 and 4.19

How reproducible:

Always

Steps to Reproduce:

1. Run node-density and check ovnkube-controller cpu usage

Actual results:

Worse CPU utilization

Expected results:

Similar or better CPU utilization

Description of problem:

Non admin users cannot create UserDefinedNetwork instances.

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Create UDN instance as non-admin users.

2.

3.

Actual results:

In the UI, opening the UserDefinedNetworks page fails with the following error:

```
userdefinednetworks.k8s.ovn.org is forbidden: User "test" cannot list resource "userdefinednetworks" in API group "k8s.ovn.org" at the cluster scope
```
We get a similar error when trying to create one.
Expected results:

As a non-admin user I want to be able to create UDN CR w/o cluster-admin intervention.
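
As an illustration only (not the product fix, which may instead aggregate rules into the built-in admin/edit roles, and noting that the console error above is for a cluster-scoped list), a namespace-scoped Role a cluster admin could grant today:

```
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// udnRole builds a namespace-scoped Role allowing a non-admin user to manage
// UserDefinedNetwork CRs in their own namespace.
func udnRole(namespace string) *rbacv1.Role {
	return &rbacv1.Role{
		TypeMeta:   metav1.TypeMeta{APIVersion: "rbac.authorization.k8s.io/v1", Kind: "Role"},
		ObjectMeta: metav1.ObjectMeta{Name: "udn-editor", Namespace: namespace},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{"k8s.ovn.org"},
			Resources: []string{"userdefinednetworks"},
			Verbs:     []string{"get", "list", "watch", "create", "update", "delete"},
		}},
	}
}

func main() {
	out, err := yaml.Marshal(udnRole("test"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```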

Additional info:


Description of problem:

    See https://github.com/kubernetes/kubernetes/issues/127352. The affected versions, reproducibility, steps to reproduce, and actual/expected results are all described in that upstream issue.

Description of problem:

https://github.com/openshift/api/pull/1997 is the PR where we plan to promote the feature gate for UDNs in early January (the week of the 6th).

We need a 95% pass rate for the overlapping-IPs tests, which are currently at 92%.

See the verify jobs that are failing on that PR: https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_api/1997/pull-ci-openshift-api-master-verify/1864369939800920064/build-log.txt

Work with TRT to understand the flakes and improve the tests so they pass reliably.

This is a blocker for GA.

Fix:

INSUFFICIENT CI testing for "NetworkSegmentation". F1204 18:09:40.736814 181286 root.go:64] Error running codegen:

  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions is isolated from the default network with L2 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork is isolated from the default network with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {metal amd64 ha ipv4}
  • error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {metal amd64 ha ipv4}

Description of problem:

The 'Plus' button in the 'Edit Pod Count' popup window overlaps the input field, which is incorrect.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-05-103644

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Workloads -> ReplicaSets page, choose one resource, click the kebab list button, and choose 'Edit Pod count'
    2.
    3.
    

Actual results:

    The Layout is incorrect

Expected results:

    The 'Plus' button in the 'Edit Pod Count' popup window should not overlap the input field

Additional info:

 Snapshot: https://drive.google.com/file/d/1mL7xeT7FzkdsM1TZlqGdgCP5BG6XA8uh/view?usp=drive_link
https://drive.google.com/file/d/1qmcal_4hypEPjmG6PTG11AJPwdgt65py/view?usp=drive_link

Description of problem

When updating a 4.13 cluster to 4.14, the new-in-4.14 ImageRegistry capability will always be enabled, because capabilities cannot be uninstalled.

Version-Release number of selected component (if applicable)

4.14 oc should learn about this, so it will appropriately extract registry CredentialsRequests when connecting to 4.13 clusters for 4.14 manifests. 4.15 oc will get OTA-1010 to handle this kind of issue automatically, but there's no problem with getting an ImageRegistry hack into 4.15 engineering candidates in the meantime.

How reproducible

100%

Steps to Reproduce

1. Connect your oc to a 4.13 cluster.
2. Extract manifests for a 4.14 release.
3. Check for ImageRegistry CredentialsRequests.

Actual results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
...no hits...

Expected results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
credentials-requests/0000_50_cluster-image-registry-operator_01-registry-credentials-request.yaml:    capability.openshift.io/name: ImageRegistry

Additional info

We already do this for MachineAPI. The ImageRegistry capability landed later, and this is us catching the oc-extract hack up with that change.

Description of problem:

When a HostedCluster is initially deployed with a NodePool that does not autoscale, but then the NodePool is later changed to using autoscaling, the autoscaler deployment on the HCP is not scaled up and thus the NodePool does not autoscale.

We observe this as failures in the e2e TestAutoscaling test; an illustrative NodePool change is sketched below.
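For illustration, the NodePool change involved looks roughly like this (a sketch; only the relevant fields are shown and the names/values are placeholders). The spec moves from a fixed replica count to autoscaling bounds, which should also cause the autoscaler deployment on the HCP to be scaled up:

# before: fixed size
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example
  namespace: clusters
spec:
  replicas: 2

# after: autoscaling enabled (replicas is unset)
spec:
  autoScaling:
    min: 2
    max: 4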

Version-Release number of selected component (if applicable):

    

How reproducible:

    Occasionally

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    When NodePool spec changes to enabling autoscaling, the annotation disabling the autoscaler deployment is not being removed from the HCP

Expected results:

    When NodePool spec changes to enabling autoscaling, the annotation disabling the autoscaler deployment in the HCP should be removed

Additional info:

    

Description of problem:

The Reason for the Available condition should propagate from HostedControlPlane to HostedCluster together with Status and Message. Currently, only Status and Message are propagated: link. In this case we end up with KASLoadBalancerNotReachable in the HCP and WaitingForAvailable in the HC, but we propagate the detailed message:

ᐅ oc get hc $CLUSTER_NAME -n $NAMESPACE -oyaml

  - lastTransitionTime: "2025-02-13T15:13:55Z"
    message: 'Get "https://ad470e4971ffe4f24bb0085802628868-46f6d7fdaaca476a.elb.us-east-1.amazonaws.com:6443/healthz":
      dial tcp: lookup ad470e4971ffe4f24bb0085802628868-46f6d7fdaaca476a.elb.us-east-1.amazonaws.com
      on 172.31.0.10:53: no such host'
    observedGeneration: 3
    reason: WaitingForAvailable
    status: "False"
    type: Available

ᐅ oc get hcp hc1 -n clusters-hc1 -oyaml

  - lastTransitionTime: "2025-02-13T15:14:09Z"
    message: 'Get "https://ad470e4971ffe4f24bb0085802628868-46f6d7fdaaca476a.elb.us-east-1.amazonaws.com:6443/healthz":
      dial tcp: lookup ad470e4971ffe4f24bb0085802628868-46f6d7fdaaca476a.elb.us-east-1.amazonaws.com
      on 172.31.0.10:53: no such host'
    observedGeneration: 1
    reason: KASLoadBalancerNotReachable
    status: "False"
    type: Available
    

Version-Release number of selected component (if applicable):

4.19.0

How reproducible:

Reproduced as part of https://issues.redhat.com/browse/OCPBUGS-49913 which uses cluster-wide proxy for the management cluster. In this case, the HCP and HC do not become available and show the errors above.

Steps to Reproduce:

Steps described in this JIRA comment

Actual results:

The Reason for HCP is KASLoadBalancerNotReachable while the reason for HC is WaitingForAvailable, but the Message is same in both cases.

Expected results:

The reason KASLoadBalancerNotReachable is propagated to HostedCluster.

Additional info:

    

Description of problem:


Increase the verbosity of logging for the CVO when applying the manifests in the "/etc/cvo/" directory.
We have a case where the error is "Permission Denied", but this is very vague.

The error in the CVO logs is:
~~~
failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/7WNaprXJNWTsPAepCHJ00Q/release-manifests/release-metadata: permission denied.
~~~

At this moment we are unable to figure out the reason why it fails.
The CVO mounts the "/etc/cvo" directory as a hostPath in read-only mode, and the files have mode 444 with the container_file_t SELinux label, but the CVO runs as spc_t, so SELinux should not be the cause.

With that, it would be good to add additional context for the problem, i.e. why the CVO cannot read the manifests.
    

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15
    

How reproducible:

n/a
    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

If resolv-prepender is triggered by the path unit before the dispatcher script has populated the env file, it fails because the env file is mandatory. We should make it optional by using EnvironmentFile=- (see the sketch below).
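A minimal sketch of that change as a systemd drop-in (the drop-in path and name are illustrative; the leading "-" tells systemd to ignore a missing file):

# /etc/systemd/system/on-prem-resolv-prepender.service.d/10-optional-env.conf
[Service]
# Reset the previous mandatory entry, then re-add it as optional.
EnvironmentFile=
EnvironmentFile=-/run/resolv-prepender/env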

Version-Release number of selected component (if applicable):

4.16

How reproducible:


$ systemctl cat on-prem-resolv-prepender.service
# /etc/systemd/system/on-prem-resolv-prepender.service
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
# Would prefer to do Restart=on-failure instead of this bash retry loop, but
# the version of systemd we have right now doesn't support it. It should be
# available in systemd v244 and higher.
ExecStart=/bin/bash -c " \
  until \
  /usr/local/bin/resolv-prepender.sh; \
  do \
  sleep 10; \
  done"
EnvironmentFile=/run/resolv-prepender/env
    
$ systemctl cat crio-wipe.service
No files found for crio-wipe.service.
    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The "now" constant defined at the beginning of the Timestamp component is not a valid Date value.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Create security group rules in batches to reduce the number of calls to the OpenStack API. This is a performance improvement that is not expected to result in any functional change.

Description of problem:

The additional network is not correctly configured on the secondary interface of the masters and the workers.

With install-config.yaml with this section:

# This file is autogenerated by infrared openshift plugin                                                                                                                                                                                                                                                                    
apiVersion: v1                                                                                                                                                                                                                                                                                                               
baseDomain: "shiftstack.local"
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "worker"
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "master"
  replicas: 3
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr: fd01::/48
    hostPrefix: 64
  serviceNetwork:
    - fd02::/112
  machineNetwork:
    - cidr: "fd2e:6f44:5dd8:c956::/64"
  networkType: "OVNKubernetes"
platform:
  openstack:
    cloud:            "shiftstack"
    region:           "regionOne"
    defaultMachinePlatform:
      type: "master"
    apiVIPs: ["fd2e:6f44:5dd8:c956::5"]
    ingressVIPs: ["fd2e:6f44:5dd8:c956::7"]
    controlPlanePort:
      fixedIPs:
        - subnet:
            name: "subnet-ssipv6"
pullSecret: |
  {"auths": {"installer-host.example.com:8443": {"auth": "ZHVtbXkxMjM6ZHVtbXkxMjM="}}}
sshKey: <hidden>
additionalTrustBundle: <hidden>
imageContentSources:
- mirrors:
  - installer-host.example.com:8443/registry
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - installer-host.example.com:8443/registry
  source: registry.ci.openshift.org/ocp/release

The installation works. However, the additional network is not configured on the masters or the workers, which in our case leads to faulty Manila integration.

In the journal of all OCP nodes, logs like the one below from master-0 are observed repeatedly:

Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9667] device (enp4s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <warn>  [1731590504.9672] device (enp4s0): Activation: failed for connection 'Wired connection 1'
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9674] device (enp4s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): canceled DHCP transaction
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): activation: beginning transaction (timeout in 45 seconds)
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): state changed no lease

That server specifically has an interface connected to the "StorageNFSSubnet" subnet:

$ openstack server list | grep master-0
| da23da4a-4af8-4e54-ac60-88d6db2627b6 | ostest-kmmtt-master-0       | ACTIVE | StorageNFS=fd00:fd00:fd00:5000::fb:d8; network-ssipv6=fd2e:6f44:5dd8:c956::2e4            | ostest-kmmtt-rhcos                            | master    |

That subnet is defined in openstack as dhcpv6-stateful:

$ openstack subnet show StorageNFSSubnet
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| allocation_pools     | fd00:fd00:fd00:5000::fb:10-fd00:fd00:fd00:5000::fb:fe |
| cidr                 | fd00:fd00:fd00:5000::/64                              |
| created_at           | 2024-11-13T12:34:41Z                                  |
| description          |                                                       |
| dns_nameservers      |                                                       |
| dns_publish_fixed_ip | None                                                  |
| enable_dhcp          | True                                                  |
| gateway_ip           | None                                                  |
| host_routes          |                                                       |
| id                   | 480d7b2a-915f-4f0c-9717-90c55b48f912                  |
| ip_version           | 6                                                     |
| ipv6_address_mode    | dhcpv6-stateful                                       |
| ipv6_ra_mode         | dhcpv6-stateful                                       |
| name                 | StorageNFSSubnet                                      |
| network_id           | 26a751c3-c316-483c-91ed-615702bcbba9                  |
| prefix_length        | None                                                  |
| project_id           | 4566c393806c43b9b4e9455ebae1cbb6                      |
| revision_number      | 0                                                     |
| segment_id           | None                                                  |
| service_types        | None                                                  |
| subnetpool_id        | None                                                  |
| tags                 |                                                       |
| updated_at           | 2024-11-13T12:34:41Z                                  |
+----------------------+-------------------------------------------------------+

I also compared with an IPv4 installation, where the StorageNFSSubnet IP is successfully configured on enp4s0.
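A quick way to confirm whether an address was assigned on a node (an illustrative command, run over SSH or via oc debug node):

$ nmcli -f GENERAL.STATE,IP4.ADDRESS,IP6.ADDRESS device show enp4s0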

Version-Release number of selected component (if applicable):

  • 4.18.0-0.nightly-2024-11-12-201730,
  • RHOS-17.1-RHEL-9-20240701.n.1

How reproducible: Always
Additional info: must-gather and journal of the OCP nodes provided in private comment.

Description of problem:

    When doing firmware updates we saw cases where the update is successful but the newer information isn't stored in the HFC; the root cause was that Ironic didn't save the newer information in its DB.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

We are in the process of supporting the use of tags on PowerVS resources.  So the destroy code needs the ability to query either by name or tag.
    

Description of problem:

Bootstrapping currently waits to observe 2 endpoints in the "kubernetes" service in HA topologies. The bootstrap kube-apiserver instance itself appears to be included in that number. Soon after observing 2 (bootstrap instance plus one permanent instance), the bootstrap instance is torn down and leaves the cluster with only 1 instance. Each rollout to that instance causes disruption to kube-apiserver availability until the second permanent instance is started for the first time, easily totaling multiple minutes of 0% kube-apiserver availability.
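For reference, the endpoint count being waited on can be inspected with (illustrative):

$ oc -n default get endpoints kubernetes -o yaml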

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

We identified a discrepancy in the cluster-reader ClusterRole between OpenShift 4.14 and OpenShift 4.16. Specifically, the cluster-reader role in OpenShift 4.16 includes permissions for delete, create, update, and patch verbs, which are unexpected for this role.

We identified that the cluster-reader ClusterRole in OpenShift 4.16 uses an aggregationRule to pull rules from other ClusterRoles matching the following labels:

rbac.authorization.k8s.io/aggregate-to-cluster-reader: "true"
rbac.authorization.k8s.io/aggregate-to-view: "true"

Further investigation revealed that the system:openshift:machine-config-operator:cluster-reader ClusterRole contributes specific rules under the machineconfiguration.openshift.io API group. These permissions include:

Resources: machineconfignodes, machineconfignodes/status, machineosconfigs, machineosconfigs/status, machineosbuilds, machineosbuilds/status
Verbs: get, list, watch, delete, create, update, patch

The identified permissions appear to originate from the MCO and are linked to the following pull requests:

PR 4062 (OCPBUGS-24416)
PR 4327 (MCO-1131)

Request:

Can the MCO team confirm if these additional permissions are intentional? If not, adjustments may be required as the cluster-reader role should not include delete, create, update, or patch verbs.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%    

Steps to Reproduce:

1. Deploy a fresh OpenShift 4.16 environment. 
2. Inspect the rules under the cluster-reader ClusterRole. 
3. Observe the inclusion of delete, create, update, and patch verbs for resources under the machineconfiguration.openshift.io API group.
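For example, the aggregated rules can be inspected with commands like these (illustrative):

$ oc get clusterrole cluster-reader -o yaml | grep -A 10 machineconfiguration.openshift.io
$ oc get clusterrole system:openshift:machine-config-operator:cluster-reader -o yaml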

Actual results:

The cluster-reader ClusterRole in OpenShift 4.16 includes unexpected permissions for the above-mentioned verbs. 

Expected results:

The cluster-reader ClusterRole in OpenShift 4.16 should not have delete, create, update, and patch verbs. 

Additional info:

This behavior deviates from the expected permissions in earlier versions (e.g., OpenShift 4.14) and could lead to potential security or operational concerns.

Description of problem:

In 4.8's installer#4760, the installer began passing oc adm release new ... a manifest so the cluster-version operator would manage a coreos-bootimages ConfigMap in the openshift-machine-config-operator namespace. installer#4797 reported issues with the 0.0.1-snapshot placeholder not getting substituted, and installer#4814 attempted to fix that issue by converting the manifest from JSON to YAML to align with the replacement regexp. But for reasons I don't understand, that manifest still doesn't seem to be getting replaced.

Version-Release number of selected component (if applicable):

From 4.8 through 4.15.

How reproducible:

100%

Steps to Reproduce:

With 4.8.0:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml

Actual results:

  releaseVersion: 0.0.1-snapshot

Expected results:

  releaseVersion: 4.8.0

or other output that matches the extracted release. We just don't want the 0.0.1-snapshot placeholder.

Additional info:

Reproducing in the latest 4.14 RC:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
  releaseVersion: 0.0.1-snapshot

Description of problem:

When the user projects monitoring feature is turned off, the operator cleans up resources for user project monitoring, issuing multiple DELETE requests to the API server.
This has several drawbacks:
* The API server can't cache DELETE requests, so it has to go to etcd every time
* The audit log is flooded with "delete failed: object 'foo' not found" records

The function should first check that the object exists (GET requests are cacheable) before issuing a DELETE request.
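A minimal sketch of that get-then-delete pattern with client-go (not the operator's actual code; the package and helper names are made up):

package cleanup

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteConfigMapIfExists issues a cacheable GET first and only sends a DELETE
// when the object actually exists, avoiding repeated failed DELETEs in the audit log.
func deleteConfigMapIfExists(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	if _, err := c.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to delete, no DELETE request issued
		}
		return err
	}
	if err := c.CoreV1().ConfigMaps(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}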

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    Always

Steps to Reproduce:

    1. Start 4.16.0
    2. Check audit log
    3.
    

Actual results:

    Audit log has messages like:
configmaps "serving-certs-ca-bundle" not found
servicemonitors.monitoring.coreos.com "thanos-ruler" not found

printed every few minutes

Expected results:

    No failed delete requests

Additional info:

    

The OpenShift installer includes an autoscaling:DescribeAutoScalingGroups IAM permission which I believe is not used and was carried over from something that may have existed in Hive a long time ago, or maybe not at all.

Reference installer commit

I don't see references to API calls for it in the Hive code, and chatting with the Hive team in Slack, they don't see it either.

Done criteria:

  • Test creating a cluster with ROSA (Hive)
  • Create it with autoscaling enabled
  • Destroy cluster and ensure everything was cleaned up
  • Validate with CloudTrail that the IAM call was not used

To do

  • Change Over Dockerfile base images
  • Double-check image versions in new e2e configs, e.g. inital-4.17, n1minor, n2minor, etc.
  • Do we still need hypershift-aws-e2e-4.17 on newer branches (Seth)
  • MCE config file in release repo
  • Add n-1 e2e test on e2e test file change

Description of problem:

Due to the workaround / solution of https://issues.redhat.com/browse/OCPBUGS-42609, namespaces must be created with a specific label to allow the use of primary UDN. This label must be added by the cluster admin - making it impossible for regular users to self-provision their network.

With this, the dialog we introduced in the UI, where a UDN can be created while defining a Project, no longer functions (labels cannot be set through Projects). Until a different solution is introduced, primary UDNs will not be self-service, and therefore we should remove the Network tab from the Create Project dialog.

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Always

Steps to Reproduce:

1. Open the new project dialog

Actual results:

The "Network" tab is there

Expected results:

It should be removed for now

Additional info:

I'm marking this as critical, since this UI element is very visible and would easily confuse users. 

Description of problem:

While generating delete-images.yaml for pruning images using oc-mirror v2, the manifests generated under working-dir/cluster-resources (IDMS, ITMS, etc.) get deleted automatically.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100% reproducible

Steps to Reproduce:

1- Create a DeleteImageSetConfiguration file like below

apiVersion: mirror.openshift.io/v2alpha1
kind: DeleteImageSetConfiguration
delete:
  platform:
    channels:
    - name: stable-4.17
      minVersion: 4.17.3
      maxVersion: 4.17.3
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
      packages:
       - name: aws-load-balancer-operator
       - name: node-observability-operator
       - name: 3scale-operator
  additionalImages:
   - name: registry.redhat.io/ubi8/ubi:latest
   - name: registry.redhat.io/ubi9/ubi@sha256:20f695d2a91352d4eaa25107535126727b5945bff38ed36a3e59590f495046f0
 

2- Ensure that the manifests generated by oc-mirror are present in working-dir/cluster-resources

 

 ls -lrth /opt/417ocmirror/working-dir/cluster-resources/
total 16K

-rw-r--r--. 1 root root 491 Nov 18 21:57 itms-oc-mirror.yaml
-rw-r--r--. 1 root root 958 Nov 18 21:57 idms-oc-mirror.yaml
-rw-r--r--. 1 root root 322 Nov 18 21:57 updateService.yaml
-rw-r--r--. 1 root root 268 Nov 18 21:57 cs-redhat-operator-index-v4-17.yaml
 

3- Generate the delete-images.yaml using below command

 

./oc-mirror delete --config ./deleteimageset.yaml --workspace file:///opt/417ocmirror --v2 --generate docker://bastionmirror.amuhamme.upi:8443/417images
2024/11/18 23:53:12  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/18 23:53:12  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/11/18 23:53:12  [INFO]   : ⚙️  setting up the environment for you...
2024/11/18 23:53:12  [INFO]   : 🔀 workflow mode: diskToMirror / delete
2024/11/18 23:53:12  [INFO]   : 🕵️  going to discover the necessary images...
2024/11/18 23:53:12  [INFO]   : 🔍 collecting release images...
2024/11/18 23:53:12  [INFO]   : 🔍 collecting operator images...
2024/11/18 23:53:13  [INFO]   : 🔍 collecting additional images...
2024/11/18 23:53:13  [INFO]   : 📄 Generating delete file...
2024/11/18 23:53:13  [INFO]   : /opt/417ocmirror/working-dir/delete file created
2024/11/18 23:53:13  [INFO]   : delete time     : 712.42082ms
2024/11/18 23:53:13  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
 

4- Verify that, after generating delete-images.yaml, the manifests present in working-dir/cluster-resources/ got deleted.

 

# ls -lrth /opt/417ocmirror/working-dir/cluster-resources/
total 0

# ls -lrth /opt/417ocmirror/working-dir/delete
total 72K
-rwxr-xr-x. 1 root root 65K Nov 18 23:53 delete-images.yaml
-rwxr-xr-x. 1 root root 617 Nov 18 23:53 delete-imageset-config.yaml
 

Actual results:

Generating delete-images.yaml deletes the manifests under working-dir/cluster-resources/

Expected results:

Generating delete-images.yaml should not delete the manifests under working-dir/cluster-resources/

Additional info:

    
// This is [Serial] because it modifies ClusterCSIDriver.
var _ = g.Describe("[sig-storage][FeatureGate:VSphereDriverConfiguration][Serial][apigroup:operator.openshift.io] vSphere CSI Driver Configuration", func() {
    defer g.GinkgoRecover()
    var (
        ctx                      = context.Background()
        oc                       = exutil.NewCLI(projectName)
        originalDriverConfigSpec *opv1.CSIDriverConfigSpec
    )

    o.SetDefaultEventuallyTimeout(5 * time.Minute)
    o.SetDefaultEventuallyPollingInterval(5 * time.Second)

In OCP origin, the above test sets the global variables for the poll interval and poll timeout, which causes other tests in origin to flake.

Example in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview/1877755411424088064/artifacts/e2e-aws-ovn-techpreview/openshift-e2e-test/build-log.txt

a networking test is failing because we are not polling correctly: the test above overrode the default poll interval of 10ms and instead made it poll every 5 seconds, which caused the test to fail because our poll timeout was itself only 5 seconds.

Please don't use the global variables, or alternatively unset them after the test run is over (see the sketch below).

Please note that this causes flakes that are hard to debug; we didn't know what was causing the poll interval to be 5 seconds instead of the default 10ms.
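One possible approach (a sketch using the same g/o aliases as the snippet above, assuming Ginkgo v2 and Gomega's documented defaults of a 1s timeout and 10ms polling interval) is to restore the defaults when the spec is done:

    o.SetDefaultEventuallyTimeout(5 * time.Minute)
    o.SetDefaultEventuallyPollingInterval(5 * time.Second)
    g.DeferCleanup(func() {
        // Restore Gomega's documented defaults so other suites are unaffected.
        o.SetDefaultEventuallyTimeout(1 * time.Second)
        o.SetDefaultEventuallyPollingInterval(10 * time.Millisecond)
    })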

Description of problem:

Checked in 4.18.0-0.nightly-2024-12-05-103644 / 4.19.0-0.nightly-2024-12-04-03122: in the admin console, go to "Observe -> Metrics" and execute a query that returns results, for example "cluster_version", then click the kebab menu. "Show all series" is shown under the list, which is wrong; it should be "Hide all series". Clicking "Show all series" unselects all series, and afterwards "Hide all series" always shows under the menu; clicking it toggles the series between selected and unselected, but the label always stays "Hide all series". See recording: https://drive.google.com/file/d/1kfwAH7FuhcloCFdRK--l01JYabtzcG6e/view?usp=drive_link

The same issue exists in the developer console on 4.18+; 4.17 and below do not have this issue.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always with 4.18+

Steps to Reproduce:

see the description

Actual results:

The Hide/Show all series state under the "Observe -> Metrics" kebab menu is wrong

Expected results:

The Hide/Show all series label should reflect the current selection state

Description of problem:

On an Azure HCP cluster, when creating an internal ingress controller, we get an authorization error.

Version-Release number of selected component (if applicable):

    4.19 and may be further versions

How reproducible:

    Create an internal ingress controller in a cluster-bot- or Prow-CI-created Azure HCP cluster

Steps to Reproduce:

    1. Create an internal ingress controller
mjoseph@mjoseph-mac Downloads % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      107m    
csi-snapshot-controller                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      120m    
dns                                        4.19.0-0.nightly-2025-01-21-163021   True        False         False      107m    
image-registry                             4.19.0-0.nightly-2025-01-21-163021   True        False         False      107m    
ingress                                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      108m    
insights                                   4.19.0-0.nightly-2025-01-21-163021   True        False         False      109m    
kube-apiserver                             4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
kube-controller-manager                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
kube-scheduler                             4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
kube-storage-version-migrator              4.19.0-0.nightly-2025-01-21-163021   True        False         False      109m    
monitoring                                 4.19.0-0.nightly-2025-01-21-163021   True        False         False      102m    
network                                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      120m    
node-tuning                                4.19.0-0.nightly-2025-01-21-163021   True        False         False      112m    
openshift-apiserver                        4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
openshift-controller-manager               4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
openshift-samples                          4.19.0-0.nightly-2025-01-21-163021   True        False         False      107m    
operator-lifecycle-manager                 4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
operator-lifecycle-manager-catalog         4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
operator-lifecycle-manager-packageserver   4.19.0-0.nightly-2025-01-21-163021   True        False         False      121m    
service-ca                                 4.19.0-0.nightly-2025-01-21-163021   True        False         False      109m    
storage                                    4.19.0-0.nightly-2025-01-21-163021   True        False         False      109m    
mjoseph@mjoseph-mac Downloads % oc get ingresses.config/cluster -o jsonpath={.spec.domain}
apps.93499d233a19644b81ad.qe.azure.devcluster.openshift.com%  

mjoseph@mjoseph-mac Downloads %  oc create -f New\ Folder\ With\ Items/internal_ingress_controller.yaml 
ingresscontroller.operator.openshift.io/internal created
mjoseph@mjoseph-mac Downloads % 
mjoseph@mjoseph-mac Downloads % 
mjoseph@mjoseph-mac Downloads % 
mjoseph@mjoseph-mac Downloads % cat New\ Folder\ With\ Items/internal_ingress_controller.yaml 
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: internal
  namespace: openshift-ingress-operator
spec:
  domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      scope: Internal
    type: LoadBalancerService

    2. Check the controller status
mjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller 
NAME       AGE
default    139m
internal   29s
mjoseph@mjoseph-mac Downloads % oc get po -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5c4db6659b-7cq46   1/1     Running   0          128m
router-internal-6b6547cb9-hhtzq   1/1     Running   0          39s
mjoseph@mjoseph-mac Downloads % 
mjoseph@mjoseph-mac Downloads % 
mjoseph@mjoseph-mac Downloads % oc get co/ingress                                                      
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.19.0-0.nightly-2025-01-21-163021   True        True          False      127m    Not all ingress controllers are available.

     3. Check the internal ingress controller status
mjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller  internal -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2025-01-23T07:46:15Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 2
  name: internal
  namespace: openshift-ingress-operator
  resourceVersion: "29755"
  uid: 29244558-4d19-4ea4-a5b8-e98b9c07edb3
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  replicas: 1
  tuningOptions:
    reloadInterval: 0s
  unsupportedConfigOverrides: null
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2025-01-23T07:46:15Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2025-01-23T07:46:50Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2025-01-23T07:46:50Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2025-01-23T07:46:50Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2025-01-23T07:46:50Z"
    message: Deployment is not actively rolling out
    reason: DeploymentNotRollingOut
    status: "False"
    type: DeploymentRollingOut
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: |-
      The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}}
      The cloud-controller-manager logs may contain more details.
    reason: SyncLoadBalancerFailed
    status: "False"
    type: LoadBalancerReady
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: LoadBalancer is not progressing
    reason: LoadBalancerNotProgressing
    status: "False"
    type: LoadBalancerProgressing
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: DNS management is supported and zones are specified in the cluster DNS
      config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: The wildcard record resource was not found.
    reason: RecordNotFound
    status: "False"
    type: DNSReady
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: |-
      One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}}
      The cloud-controller-manager logs may contain more details.)
    reason: IngressControllerUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2025-01-23T07:46:50Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-01-23T07:47:46Z"
    message: |-
      One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}}
      The cloud-controller-manager logs may contain more details.)
    reason: DegradedConditions
    status: "True"
    type: Degraded
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: IngressController is upgradeable.
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2025-01-23T07:46:16Z"
    message: No evaluation condition is detected.
    reason: NoEvaluationCondition
    status: "False"
    type: EvaluationConditionsDetected
  domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
  observedGeneration: 2
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal
  tlsProfile:
    ciphers:
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    minTLSVersion: VersionTLS12
mjoseph@mjoseph-mac Downloads %      

Actual results:

mjoseph@mjoseph-mac Downloads % oc get co/ingress                                                      
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.19.0-0.nightly-2025-01-21-163021   True        True          False      127m    Not all ingress controllers are available.    

Expected results:

    the internal controller should come up

Additional info:

 One more test scenario that causes a similar error in the HCP cluster with an internal LB:

1. Create a web server with two services
mjoseph@mjoseph-mac Downloads % oc create -f New\ Folder\ With\ Items/webrc.yaml 
replicationcontroller/web-server-rc created
service/service-secure created
service/service-unsecure created
mjoseph@mjoseph-mac Downloads % oc get po
NAME                  READY   STATUS    RESTARTS   AGE
web-server-rc-q87rv   1/1     Running   0          40s
mjoseph@mjoseph-mac Downloads % oc get svc
NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)     AGE
kubernetes                  ClusterIP      172.31.0.1       <none>                                 443/TCP     152m
openshift                   ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>      147m
openshift-apiserver         ClusterIP      172.31.165.239   <none>                                 443/TCP     150m
openshift-oauth-apiserver   ClusterIP      172.31.254.44    <none>                                 443/TCP     150m
packageserver               ClusterIP      172.31.131.10    <none>                                 443/TCP     150m
service-secure              ClusterIP      172.31.6.17      <none>                                 27443/TCP   46s
service-unsecure            ClusterIP      172.31.199.11    <none>                                 27017/TCP   46s

2. Add two lb services
mjoseph@mjoseph-mac Downloads % oc create -f ../Git/openshift-tests-private/test/extended/testdata/router/bug2013004-lb-services.yaml 
service/external-lb-57089 created
service/internal-lb-57089 created
mjoseph@mjoseph-mac Downloads % cat ../Git/openshift-tests-private/test/extended/testdata/router/bug2013004-lb-services.yaml
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: Service
  metadata:
    name: external-lb-57089
  spec:
    ports:
    - name: https
      port: 28443
      protocol: TCP
      targetPort: 8443
    selector:
      name: web-server-rc
    type: LoadBalancer
- apiVersion: v1
  kind: Service
  metadata:
    name: internal-lb-57089
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  spec:
    ports:
    - name: https
      port: 29443
      protocol: TCP
      targetPort: 8443
    selector:
      name: web-server-rc
    type: LoadBalancer


3. Check the external ip of the internal service, which is not yet assigned
mjoseph@mjoseph-mac Downloads % oc get svc -owide                                                                                    
NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)           AGE    SELECTOR
external-lb-57089           LoadBalancer   172.31.248.177   20.83.73.54                            28443:30437/TCP   44s    name=web-server-rc
internal-lb-57089           LoadBalancer   172.31.156.88    <pending>                              29443:31885/TCP   44s    name=web-server-rc
kubernetes                  ClusterIP      172.31.0.1       <none>                                 443/TCP           153m   <none>
openshift                   ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>            148m   <none>
openshift-apiserver         ClusterIP      172.31.165.239   <none>                                 443/TCP           151m   <none>
openshift-oauth-apiserver   ClusterIP      172.31.254.44    <none>                                 443/TCP           151m   <none>
packageserver               ClusterIP      172.31.131.10    <none>                                 443/TCP           151m   <none>
service-secure              ClusterIP      172.31.6.17      <none>                                 27443/TCP         112s   name=web-server-rc
service-unsecure            ClusterIP      172.31.199.11    <none>                                 27017/TCP         112s   name=web-server-rc

Description of problem:

https://access.redhat.com/errata/RHSA-2024:5422 seemingly did not fix https://issues.redhat.com/browse/OCPBUGS-37060 in ROSA HCP, so I am opening a new bug.

Builds in the hosted clusters have issues git-cloning repositories from external URLs whose CAs are configured in the ca-bundle.crt from the trustedCA section:

 spec:
    configuration:
      apiServer:
       [...]
      proxy:
        trustedCA:
          name: user-ca-bundle <---

In traditional OCP installations, the *-global-ca configmap is created in the same namespace as the build and the ca-bundle.crt is injected into it. In hosted clusters the configmap is created empty:

$ oc get cm -n <app-namespace> <build-name>-global-ca  -oyaml
apiVersion: v1
data:
  ca-bundle.crt: ""


As mentioned, the user-ca-bundle has the certificates configured:

$ oc get cm -n openshift-config user-ca-bundle -oyaml
apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE----- <---

Version-Release number of selected component (if applicable):

    4.16.17

How reproducible:

    

Steps to Reproduce:

1. Install hosted cluster with trustedCA configmap
2. Run a build in the hosted cluster
3. Check the global-ca configmap     

Actual results:

    global-ca is empty

Expected results:

    global-ca injects the ca-bundle.crt properly

Additional info:

Created a new ROSA HCP cluster behind a transparent proxy at version 4.16.8 as it was mentioned as fixed in the above errata and the issue still exists.
The transparent proxy certificate provided at cluster installation time is referenced in proxy/cluster as "user-ca-bundle-abcdefgh" and both "user-ca-bundle" and "user-ca-bundle-abcdefgh" configmaps in the "openshift-config" contain the certificate.

However starting a template build for example such as "oc new-app cakephp-mysql-persistent" still results in the certificate not being injected into the "cakephp-mysql-persistent-1-global-ca" configmap and the build failing unlike the same scenario in an OCP cluster.

oc logs build.build.openshift.io/cakephp-mysql-persistent-1
Cloning "https://github.com/sclorg/cakephp-ex.git" ...
error: fatal: unable to access 'https://github.com/sclorg/cakephp-ex.git/': SSL certificate problem: unable to get local issuer certificate

Also upgraded the cluster to 4.16.17 and still the issue persists.

Description of problem:

    Textarea resizing causes it to overflow outside the window.

Version-Release number of selected component (if applicable):

    4.19.0-0.test-2025-02-14-032806

How reproducible:

    Always

Steps to Reproduce:

    1. Find a resizable textarea, for example on Create Project page
    2. Resize the textarea for the description
    3.
    

Actual results:

    The textarea overflows outside the window

Expected results:

    The textarea should not overflow outside the window

Additional info:

    https://drive.google.com/file/d/1m_UCHju1FXYGbMJBa4P4R1SJDoOUsIkG/view?usp=drive_link

Security Tracking Issue

Do not make this issue public.

Flaw:


Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122

An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.

Original bug title:

cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env

Description of problem:

    When using Route53 as the dns01 solver to create certificates, it fails in both automated and manual tests. For the full log, please refer to the "Actual results" section.

Version-Release number of selected component (if applicable):

    cert-manager operator v1.15.0 staging build

How reproducible:

    Always

Steps to Reproduce (also documented in a gist):

    1. Install the cert-manager operator 1.15.0
    2. Follow the doc to authenticate the operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate
     3. Create an ACME issuer with the Route53 dns01 solver
     4. Create a cert using the created issuer
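An illustrative issuer of the kind referenced in step 3 (the names, ACME server, and region are placeholders, not the exact manifest used in the test):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: acme-route53-example
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: acme-route53-example-account-key
    solvers:
    - dns01:
        route53:
          region: us-east-2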

OR:

Refer by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568 

Actual results:

1. The certificate is not Ready.
2. The challenge of the cert is stuck in the pending status:

PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region  

Expected results:

The certificate should be Ready. The challenge should succeed.

Additional info:

The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:

I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.

oc patch deployment cert-manager -n cert-manager \
--patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}' 

Creating clusters in which machines are created in a public subnet and use a public IP makes it possible to avoid creating NAT gateways (or proxies) for AWS clusters. While not applicable for every test, this configuration will save us money and cloud resources.

Description of problem:
Bonded network configurations with mode=active-backup and fail_over_mac=follow are not functioning due to a race in /var/usrlocal/bin/configure-ovs.sh

This race condition results in flapping.

The customer who encountered the issue, in July, worked with the IBM LTC Power team to track the issue through the Linux Kernel to OVN-Kube and into the MCO configuration. The customer details can be shared in slack.

The corresponding BZ https://bugzilla.linux.ibm.com/show_bug.cgi?id=210291 could not be mirrored.

The GH issue is in https://github.com/openshift/machine-config-operator/issues/4605
The fix is in https://github.com/openshift/machine-config-operator/pull/4609

From Dave Wilder... the interfaces are setup as described in the issue...

At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure the MACs will not be the same, so this is preventing the bond from functioning. This initially appeared to be a problem with the bonding driver; after tracing the calls NetworkManager makes to the bonding driver, I discovered the root of the problem is in configure-ovs.sh.

The function activate_nm_connections() attempts to activate all its generated profiles that are not currently in the "active" state. In my case the following profiles are activated one at a time in this order:
br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex

However, the generated profiles have autoconnect-slaves set, so when br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone, and enP49154p1s0-slave-ovs-clone changes to "activating"; because we only check for the "activated" state, these profiles may be activated again. As the list is walked, some of the profiles' states automatically go from activating to activated. Those interfaces are not activated a second time, leaving the bond in an unpredictable state. I am able to see in the bonding traces why both slave interfaces have the same MAC.

My fix is to check for either activating or active states.

--- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@

   # But set the entry in master_interfaces to true if this is a slave
   # Also set autoconnect to yes
     local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
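
A minimal sketch of the state check described above (illustrative only; the actual change is in the MCO pull request linked below):

# Sketch: treat a connection as already active if it is "activated" or "activating",
# so profiles brought up via autoconnect-slaves are not activated a second time.
local active_state
active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
if [ "$active_state" != "activated" ] && [ "$active_state" != "activating" ]; then
  nmcli conn up "$conn"
fi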

Version-Release number of selected component (if applicable): First seen in 4.14 OVN-Kube

How reproducible: Specific OVN-Kube configuration with network bonding set for fail_over_mac=follow. This is the ideal setting for the SR-IOV/Network setup at the customer site where they rely on high availability.

Steps to Reproduce:
1. Setup the interfaces as described.

Actual results: Failed Bonding

Expected results: No flapping, and the failover works

Additional info:
https://github.com/openshift/machine-config-operator/issues/4605
https://github.com/openshift/machine-config-operator/pull/4609
#rhel-netorking-subsystem https://redhat-internal.slack.com/archives/C04NN96F1S4/p1719943109040989

Description of problem:

ClusterCatalog is a cluster-scoped resource and does not need a namespace in its yaml file. Currently, this is how it is generated:

{code:java}
apiVersion: olm.operatorframework.io/v1
kind: ClusterCatalog
metadata:
  name: cc-redhat-operator-index-v4-17
  namespace: openshift-marketplace
spec:
  priority: 0
  source:
    image:
      ref: ec2-3-137-182-27.us-east-2.compute.amazonaws.com:5000/cc/redhat/redhat-operator-index:v4.17
    type: Image
status: {}
{code}

Version-Release number of selected component (if applicable):

     4.18


How reproducible:

     Always
    

Steps to Reproduce:

    1. Mirror an ImageSetConfiguration as shown below, using the PR which has the fix for ClusterCatalog:
    
{code:java}
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
    packages:
    - name: security-profiles-operator

{code}

2. Run command: oc-mirror -c isc.yaml --workspace file://test docker://localhost:5000 --v2 --dest-tls-verify=false
3.

Actual results:

     The generated ClusterCatalog yaml has a namespace, as shown in the description.
    

Expected results:

     There should not be any namespace in the yaml file, since ClusterCatalog is a cluster-scoped resource.
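
For reference, a sketch of the expected output with the namespace field dropped (same content as the example above):

{code:yaml}
apiVersion: olm.operatorframework.io/v1
kind: ClusterCatalog
metadata:
  name: cc-redhat-operator-index-v4-17
spec:
  priority: 0
  source:
    image:
      ref: ec2-3-137-182-27.us-east-2.compute.amazonaws.com:5000/cc/redhat/redhat-operator-index:v4.17
    type: Image
status: {}
{code}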
    

Additional info:

    https://redhat-internal.slack.com/archives/C050P27C71S/p1738324252074669?thread_ts=1738322462.210089&cid=C050P27C71S
    

Description of problem:

the "go to" arrow and the new doc link icon are no longer aligned with their text

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-12-144418    

How reproducible:

Always    

Steps to Reproduce:

    1. Go to the Home -> Overview page
    2.
    3.
    

Actual results:

the "go to" arrow and the new doc link icon are not horizontally aligned with their text anymore

Expected results:

icon and text should be aligned

Additional info:

    screenshot https://drive.google.com/file/d/1S61XY-lqmmJgGbwB5hcR2YU_O1JSJPtI/view?usp=drive_link 

Description of problem:

 

After KubeVirt VM live migration, the VM's egress traffic still goes over the previous node where the VM was running before the live migration.

 

Version-Release number of selected component (if applicable): 4.19

How reproducible: Always

Steps to Reproduce:

 

1. Create a VM with a primary UDN, L2 topology, and IPv4
2. Send egress traffic
3. Live-migrate the VM
4. Send egress traffic again

Actual results:

IPv4 egress traffic goes over the node where the VM was running before the live migration

 

Expected results:

IPv4 Egress traffic should go over the node where the VM is running after the live migration

 

Additional info:

The problem is that the VM's IPv4 neighbor cache entries are not updated; to fix this, ovn-kubernetes should send a GARP (gratuitous ARP) after live migration to refresh that cache.
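
One way to confirm which node the egress traffic actually leaves from (a diagnostic sketch; the node name and VM IP below are placeholders):

# Capture egress traffic from the VM's IP on br-ex of each candidate node
oc debug node/<node-name> -- chroot /host timeout 30 tcpdump -nn -i br-ex host <vm-ip>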

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a regression in the following test:

install should succeed: overall

Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 98.88% to 57.89%.
Overrode base stats using release 4.17

Sample (being evaluated) Release: 4.19
Start Time: 2025-03-03T00:00:00Z
End Time: 2025-03-10T08:00:00Z
Success Rate: 57.89%
Successes: 22
Failures: 16
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T00:00:00Z
Success Rate: 98.88%
Successes: 88
Failures: 1
Flakes: 0

View the test details report for additional context.

gcp installs seem to be failing frequently with the error:

These cluster operators were not stable: [openshift-samples]

From: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-techpreview/1898814955482779648

The samples operator reports:

status:
  conditions:
    - lastTransitionTime: "2025-03-09T19:56:05Z"
      status: "False"
      type: Degraded
    - lastTransitionTime: "2025-03-09T19:56:17Z"
      message: Samples installation successful at 4.19.0-0.nightly-2025-03-09-190956
      status: "True"
      type: Available
    - lastTransitionTime: "2025-03-09T20:43:02Z"
      message: "Samples installed at 4.19.0-0.nightly-2025-03-09-190956, with image import failures for these imagestreams: java,kube-root-ca.crt,openshift-service-ca.crt,nodejs; last import attempt 2025-03-09 19:57:39 +0000 UTC"
      reason: FailedImageImports
      status: "False"
      type: Progressing

I'm confused how this is failing the install given available=true and degraded=false, and yet there does appear to be a problem reported in the message. It is possible this artifact was collected a few minutes after the install failed; does the operator stabilize (ignore these errors) in that time? Note that not all installs are failing this way, but a good chunk are.

The problem appears limited to 4.19 GCP, though I do see one hit for vSphere.

https://search.dptools.openshift.org/?search=These+cluster+operators+were+not+stable%3A.*openshift-samples&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Description of problem:

Disable delete button for UDN if the UDN cannot be deleted

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Create a UDN and then delete it
2.
3.

Actual results:

The UDN is not deleted

Expected results:

Disable the delete button if UDN cannot be removed

Additional info:


Description of problem:

when deleting operator images with v2 for operator images mirrored by v1 from an oci catalog, oc-mirror doesn't find the same tags to delete and fails to delete the images
    

Version-Release number of selected component (if applicable):

GitCommit:"affa0177"
    

How reproducible:

always
    

Steps to Reproduce:

    1. Mirror to mirror with v1:  ./bin/oc-mirror -c config_logs/bugs.yaml docker://localhost:5000/437311 --dest-skip-tls --dest-use-http

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
    - catalog: oci:///home/skhoury/redhat-index-all
      targetCatalog: "ocicatalog73452"
      targetTag: "v16"
      packages:
        - name: cluster-kube-descheduler-operator


    2. Mirror to disk with v2, using almost the same ISC (but apiVersion v2alpha1): ./bin/oc-mirror --v2 -c config_logs/bugs.yaml file:///home/skhoury/43774v2
    3. delete with ./bin/oc-mirror delete --generate --delete-v1-images --v2 -c config_logs/bugs.yaml --workspace file:///home/skhoury/43774v2 docker://sherinefedora:5000/437311
kind: DeleteImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
delete:
  operators:
    - catalog: oci:///home/skhoury/redhat-index-all
      targetCatalog: "ocicatalog73452"
      targetTag: "v16"
      packages:
        - name: cluster-kube-descheduler-operator
    

Actual results:

mapping.txt of v1:
registry.redhat.io/openshift-sandboxed-containers/osc-cloud-api-adaptor-webhook-rhel9@sha256:4da2fe27ef0235afcac1a1b5e90522d072426f58c0349702093ea59c40e5ca68=localhost:5000/437311/openshift-sandboxed-containers/osc-cloud-api-adaptor-webhook-rhel9:491be520
registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:cb836456974e510eb4bccbffadbc6d99d5f57c36caec54c767a158ffd8a025d5=localhost:5000/437311/openshift4/ose-kube-rbac-proxy:d07492b2
registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:7465f4e228cfc44a3389f042f7d7b68d75cbb03f2adca1134a7ec417bbd89663=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:a2f35fa7
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:c6c589d5e47ba9564c66c84fc2bc7e5e046dae1d56a3dc99d7343f01e42e4d31=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:d7b79dea
registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:ff2bb666c2696fed365df55de78141a02e372044647b8031e6d06e7583478af4=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:695e2e19
registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:5d2b03721043e5221dfb0cf164cf59eba396ba3aae40a56c53aa3496c625eea0=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:204cb113
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:b1e824e126c579db0f56d04c3d1796d82ed033110c6bc923de66d95b67099611=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:1957f330
registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:a660f0b54b9139bed9a3aeef3408001c0d50ba60648364a98a09059b466fbcc1=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:ab38b9d5
registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:8da62ba1c19c905bc1b87a6233ead475b047a766dc2acb7569149ac5cfe7f0f1=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:1adce9f
registry.redhat.io/redhat/redhat-operator-index:v4.15=localhost:5000/437311/redhat/redhat-operator-index:v4.15
registry.redhat.io/openshift-sandboxed-containers/osc-monitor-rhel9@sha256:03381ad7a468abc1350b229a8a7f9375fcb315e59786fdacac8e5539af4a3cdc=localhost:5000/437311/openshift-sandboxed-containers/osc-monitor-rhel9:53bbc3cb
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-operator-bundle@sha256:2808a0397495982b4ea0001ede078803a043d5c9b0285662b08044fe4c11f243=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-operator-bundle:c30c7861
registry.redhat.io/openshift-sandboxed-containers/osc-podvm-payload-rhel9@sha256:4bca24d469a41be77db7450e02fa01660a14f4c68e829cba4a8ae253d427bbfd=localhost:5000/437311/openshift-sandboxed-containers/osc-podvm-payload-rhel9:d25beb31
registry.redhat.io/openshift-sandboxed-containers/osc-cloud-api-adaptor-rhel9@sha256:7185c1b6658147e2cfbb0326e6b5f59899f14f5de73148ef9a07aa5c7b9ead74=localhost:5000/437311/openshift-sandboxed-containers/osc-cloud-api-adaptor-rhel9:18ba6d86
registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:8f30a9129d817c3f4e404d2c43fb47e196d8c8da3badba4c48f65d440a4d7584=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:17b81cfd
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:051bd7f1dad8cc3251430fee32184be8d64077aba78580184cef0255d267bdcf=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:6a87f996
registry.redhat.io/openshift-sandboxed-containers/osc-rhel9-operator@sha256:3e3b8849f8a0c8cd750815e6bde7eb2006e5a2b4ea898c9d3ea27f2bfed635d9=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel9-operator:4c46a1f7
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-operator-bundle@sha256:a91cee14f47824ce49759628d06bf4e48276e67dae00b50123d3233d78531720=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-operator-bundle:d22b8cff
registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:7efeeb8b29872a6f0271f651d7ae02c91daea16d853c50e374c310f044d8c76c=localhost:5000/437311/openshift4/ose-kube-rbac-proxy:5574585a
registry.redhat.io/openshift-sandboxed-containers/osc-podvm-builder-rhel9@sha256:a4099ea5ad907ad1daee3dc2c9d659b5a751adf2da65f8425212e82577b227e7=localhost:5000/437311/openshift-sandboxed-containers/osc-podvm-builder-rhel9:36a60f3f


delete-images.yaml of v2
apiVersion: mirror.openshift.io/v2alpha1
items:
- imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator-bundle@sha256:b473fba287414d3ccb09aaabc64f463af2c912c322ca2c41723020b216d98d14
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator-bundle:52836815
  type: operatorBundle
- imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator-bundle@sha256:b148d5cf4943d0341781a0f7c6f2a7116d315c617f8beb65c9e7a24ac99304ff
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator-bundle:bd7c9abe
  type: operatorBundle
- imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:c7b198e686dc7117994d71027710ebc6ac0bf21afa436a79794d2e64970c8003
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator:223f8a32
  type: operatorRelatedImage
- imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:ba0b71ff2a30a069b4a8a8f3c1e0898aaadc6db112e4cc12aff7c77ced7a0405
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator:b0b2a0ab
  type: operatorRelatedImage
- imageName: docker://registry.redhat.io/openshift4/ose-descheduler@sha256:257b69180cc667f2b8c1ce32c60fcd23a119195ad9ba2fdd6a6155ec5290f8cf
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-descheduler:6585e5e1
  type: operatorRelatedImage
- imageName: docker://registry.redhat.io/openshift4/ose-descheduler@sha256:45dc69ad93ab50bdf9ce1bb79f6d98f849e320db68af30475b10b7f5497a1b13
  imageReference: docker://sherinefedora:5000/437311/openshift4/ose-descheduler:7ac5ce2
  type: operatorRelatedImage
kind: DeleteImageList

    

Expected results:

The same tags should be found for the destination images.
    

Additional info:


    

Description of problem:

  • In the context of the ListVolumes optimizations #2249 delivered with the RHOCP 4.16 rebase on vSphere CSI Driver v3.1.2, ListVolumes() is being called every minute in RHOCP 4.16 clusters.
    • EDIT, for a brief correction: the PR #2249 above seems to be the Workload Control Plane (WCP) implementation, and PR #2276 is the vanilla controller equivalent change that concerns this bug.
  • Bug priority set to Critical as this issue is a blocker for updating over 55 RHOCP clusters from 4.14 to 4.16.

Version-Release number of selected component (if applicable):

4.16 and newer

How reproducible:

Always

Steps to Reproduce:

  • Deploy a 4.16.z-stream cluster with thin-csi storage class and watch vmware-vsphere-csi-driver-controller -c csi-driver logs for recurrent ListVolumes() operations on every vSphere CSI Driver CNS volume.

Actual results:

  • ListVolumes() is being called every minute in RHOCP 4.16 clusters
  • In a context where the customer has over 3000 CNS volumes provisioned and aims to upgrade a fleet of over 55 RHOCP clusters to 4.16, more than 3000 API calls are sent every minute to the vCenter API, overloading it and impacting core operations (i.e. stalling volume provisioning, deletion, updates, etc.)

Expected results:

  • A fix for this has seemingly already been made upstream as part of kubernetes-sigs#3015, but it has yet to land in an upstream driver 3.y.z release
  • Therefore, the expectation of this bug is for kubernetes-sigs#3015 to be merged into the latest RHOCP 4 branch and backported to a 4.16 z-stream

Additional info:

  • Tentative workaround has been shared with the customer:
    $ oc --context sharedocp416-sbr patch clustercsidriver csi.vsphere.vmware.com --type merge -p "{\"spec\":{\"managementState\":\"Unmanaged\"}}"
    $ oc --context sharedocp416-sbr -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json | jq -r '.spec.template.spec.containers[] | select(.name == "csi-attacher").args'
    [
      "--csi-address=$(ADDRESS)",
      "--timeout=300s",
      "--http-endpoint=localhost:8203",
      "--leader-election",
      "--leader-election-lease-duration=137s",
      "--leader-election-renew-deadline=107s",
      "--leader-election-retry-period=26s",
      "--v=2"
      "--reconcile-sync=10m"   <<----------------- ADD THE INCREASED RSYNC INTERVAL
    ]
    

  Security Tracking Issue

Do not make this issue public.

Flaw:


Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122

An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.


Description of problem:

Ratcheting validation was implemented and made beta in 1.30.

Validation ratcheting works for changes to the main resource, but does not work when applying updates to a status subresource.

Details in https://github.com/kubernetes/kubernetes/issues/129503   

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Install 4.17
    2. Set powervs serviceEndpoints in the platformStatus to a valid lowercase string
    3. Upgrade to 4.18 - validation has changed
    4. Attempt to update an adjacent status field   

Actual results:

    Validation fails and rejects the update

Expected results:

    Ratcheting should kick in and accept the object

Additional info:

    

Description of problem:

Cluster-reader is not able to view controlplanemachineset resources

Version-Release number of selected component (if applicable):

4.19.0-0.ci-2024-12-15-181719

How reproducible:

Always   

Steps to Reproduce:

1. Add the cluster-reader role to a common user
$ oc adm policy add-cluster-role-to-user cluster-reader testuser-48 --as system:admin
2. Log in to the cluster as that user
$ oc login -u testuser-48
Authentication required for https://api.zhsungcp58.qe.gcp.devcluster.openshift.com:6443 (openshift)
Username: testuser-48
Password: 
Login successful.
3. Check whether cluster-reader can view controlplanemachineset resources.

Actual results:

cluster-reader cannot view controlplanemachineset resources
$ oc get controlplanemachineset      
Error from server (Forbidden): controlplanemachinesets.machine.openshift.io is forbidden: User "testuser-48" cannot list resource "controlplanemachinesets" in API group "machine.openshift.io" in the namespace "openshift-machine-api"
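
A quick way to confirm the missing permission from the admin session (sketch); this should print "no" while the bug is present and "yes" once cluster-reader aggregates the permission:

$ oc auth can-i list controlplanemachinesets.machine.openshift.io -n openshift-machine-api --as=testuser-48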

Expected results:

cluster-reader should be able to view controlplanemachineset resources

Additional info:

    

Description of problem:

    A similar testing scenario to OCPBUGS-38719, but here the pre-existing dns private zone is not a peering zone; instead it is a normal dns zone bound to another VPC network. The installation ultimately fails, because the dns record-set "*.apps.<cluster name>.<base domain>" is added to that pre-existing dns private zone rather than to the cluster's own dns private zone.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-24-093933

How reproducible:

    Always

Steps to Reproduce:

    Please refer to the steps told in https://issues.redhat.com/browse/OCPBUGS-38719?focusedId=25944076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25944076

Actual results:

    The installation failed, due to the cluster operator "ingress" degraded

Expected results:

    The installation should succeed.

Additional info:

    

 The following tests are failing with the updated 1.32 Kubernetes in OCP 4.19:

[sig-storage] CSI Mock volume expansion Expansion with recovery [Feature:RecoverVolumeExpansionFailure] recovery should be possible for node-only expanded volumes with final error
[sig-storage] CSI Mock volume expansion Expansion with recovery [Feature:RecoverVolumeExpansionFailure] should record target size in allocated resources
[sig-storage] CSI Mock volume expansion Expansion with recovery [Feature:RecoverVolumeExpansionFailure] should allow recovery if controller expansion fails with infeasible error
[sig-storage] CSI Mock volume expansion Expansion with recovery [Feature:RecoverVolumeExpansionFailure] recovery should not be possible in partially expanded volumes
[sig-storage] CSI Mock volume expansion Expansion with recovery [Feature:RecoverVolumeExpansionFailure] recovery should be possible for node-only expanded volumes with infeasible error

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_kubernetes/2148/pull-ci-openshift-kubernetes-master-okd-scos-e2e-aws-ovn/1863682525520465920

They will be disabled temporarily to not block the rebase progress. This bug ticket is used to track the work to re-enable these tests in OCP 4.19.

Description of problem:

    When applying a profile whose isolated field contains a huge CPU list, the profile doesn't apply and no error is reported

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-26-075648

How reproducible:

    Everytime.

Steps to Reproduce:

    1. Create a profile as specified below:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}'
  creationTimestamp: "2024-11-27T10:25:13Z"
  finalizers:
  - foreground-deletion
  generation: 61
  name: performance
  resourceVersion: "3001998"
  uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745
spec:
  cpu:
    isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480
    reserved: 0,256,1,257
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true

    2. The worker-cnf node doesn't contain any kernel args associated with the above profile.
    3.
    

Actual results:

    System doesn't boot with kernel args associated with above profile

Expected results:

    System should boot with Kernel args presented from Performance Profile.

Additional info:

We can see MCO gets the details and creates the mc:

Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]"
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot

Actual Kernel args:
BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on

    

Description of problem:

  The deprovision CI step[1] e2e-aws-ovn-shared-vpc-edge-zones-ipi-deprovision-deprovision is missing the permission ec2:ReleaseAddress in the installer user, which is needed to remove the custom IPv4 address (EIP) allocated during cluster creation. BYO IPv4 is the default on CI jobs and is enabled when the pool has available IP addresses.

Error:
level=warning msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-rxxt8srv-bf840-minimal-perm-installer is not authorized to perform: ec2:ReleaseAddress on resource: arn:aws:ec2:us-east-1:[redacted]:elastic-ip/eipalloc-0f4b652b702e73204 because no identity-based policy allows the ec2:ReleaseAddress action.

Job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9413/pull-ci-openshift-installer-main-e2e-aws-ovn-shared-vpc-edge-zones/1884340831955980288
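
For illustration, a minimal IAM policy statement that would grant the missing action (a sketch only; the CI installer policy is managed elsewhere):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:ReleaseAddress"],
      "Resource": "*"
    }
  ]
}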

Version-Release number of selected component (if applicable):

4.19    

How reproducible:

    always when BYO Public IPv4 pool is activated in the install-config

Steps to Reproduce:

    1. install a cluster with byo IPv4 pool set on install-config
    2.
    3.
    

Actual results:

level=warning msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-rxxt8srv-bf840-minimal-perm-installer is not authorized to perform: ec2:ReleaseAddress on resource: arn:aws:ec2:us-east-1:[Redacted]:elastic-ip/eipalloc-0f4b652b702e73204 because no identity-based policy allows the ec2:ReleaseAddress action.    

Expected results:

    Permissions granted, EIP released.

Additional info:

    

Description of problem:

[Azure-Disk-CSI-Driver] allocatable volumes count incorrect in csinode for D family v6 instance types    

Version-Release number of selected component (if applicable):

 4.19.0-0.nightly-2025-02-20-131651   

How reproducible:

  Always  

Steps to Reproduce:

    1. Install openshift cluster with instance type "Standard_D8lds_v6".
    2. Check the csinode object max allocatable volumes count should be consistent with cloud supports.
    

Actual results:

  In step 2 the csinode max allocatable volumes count is 16 (the correct value is 24), which is not consistent with what the cloud supports.
$ oc get no/pewang-dxv6-a-qf9ck-worker-eastus2-l49ft -oyaml|grep 'instance'
    beta.kubernetes.io/instance-type: Standard_D8lds_v6
    node.kubernetes.io/instance-type: Standard_D8lds_v6
$ oc get csinode pewang-dxv6-a-qf9ck-worker-eastus2-l49ft -oyaml|yq .spec
drivers:
  - name: file.csi.azure.com
    nodeID: pewang-dxv6-a-qf9ck-worker-eastus2-l49ft
    topologyKeys: null
  - allocatable:
      count: 16
    name: disk.csi.azure.com
    nodeID: pewang-dxv6-a-qf9ck-worker-eastus2-l49ft
    topologyKeys:
      - topology.disk.csi.azure.com/zone
      - topology.kubernetes.io/zone

$ az vm list-sizes -l eastus -o tsv | awk -F '\t' '{print "\""$3"\":"$1","}' | sort | uniq|grep 'Standard_D8lds_v6'
"Standard_D8lds_v6":24,   

Expected results:

 In step2 the csi node max allocatable volumes count is 24   

Additional info:

  Dlsv6, Dldsv6, Dsv6, Ddsv6 are not set in:
  https://github.com/openshift/azure-disk-csi-driver/blob/master/pkg/azuredisk/azure_dd_max_disk_count.go
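
For reference, once the D v6 sizes are added to that table, the csinode for the Standard_D8lds_v6 node would be expected to report the cloud limit from the az output above (sketch):

drivers:
  - allocatable:
      count: 24
    name: disk.csi.azure.com
    nodeID: pewang-dxv6-a-qf9ck-worker-eastus2-l49ft
    topologyKeys:
      - topology.disk.csi.azure.com/zone
      - topology.kubernetes.io/zone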

Description of problem:

Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.

Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?

I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:

prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done 

It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:

####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
<       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
>       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
<       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
>       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
#### 

The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.

 

Version-Release number of selected component (if applicable):

4.18.0

How reproducible:

~20% failure rate in 4.18 vsphere-ovn-serial jobs

Steps to Reproduce:

    

Actual results:

operator rolls out unnecessary daemonset / deployment changes

Expected results:

don't roll out changes unless there is a spec change

Additional info:

    

Description of problem:

HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms
Can't add a Bare Metal worker node to the hosted cluster. 
This was discussed on #project-hypershift Slack channel.

Version-Release number of selected component (if applicable):

MultiClusterEngine v2.7.2 
HyperShift Operator image: 
registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a   

How reproducible:

Always

Steps to Reproduce:

1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform
2. Try to create a NodePool with ARM64 architecture specification

Actual results:

- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms
- Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure"
- Additional validation in NodePool spec also blocks arm64 architecture

Expected results:

- Allow ARM64 architecture specification for NodePools on BareMetal platform 
- Remove or update the CEL validation to support this use case

Additional info:

NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1    

Description of problem:

    If a customer populates the serviceEndpoints for powervs via the install config in 4.17, the validation is incorrect and persists lowercase values.

status:
  platformStatus:
    type: PowerVS
    powervs:
      serviceEndpoints:
      - name: dnsservices
        url: ...

On upgrade, the schema is currently updated to an enum, courtesy of https://github.com/openshift/api/pull/2076

The validation upgrade and ratcheting was tested, but only for the `spec` version of the field. It was assumed that spec and status validation behaved the same.

However, https://issues.redhat.com/browse/OCPBUGS-48077 has recently been found, which means that on upgrade all writes to the status subresource of the infrastructure object fail until the serviceEndpoints are fixed.

In a steady state this may not cause general cluster degradation, since writing to the status of the infrastructure object is not common.

However, any controller that does attempt to write to it will fail and keep erroring until the value has been fixed.

There are several possible approaches to resolve this:
1. Revert https://github.com/openshift/api/pull/2076 and anything else that depended on it
2. Merge and backport the fix for https://issues.redhat.com/browse/OCPBUGS-48077
3. Introduce something in 4.18 to fix invalid values in the status (e.g. convert dnsservices to DNSServices); see the sketch after the next paragraph

Until one of these three (or perhaps other fixes) is taken, I think this needs to be considered a PowerVS upgrade blocker, and then management can decide if this is enough to block 4.18
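
For approach 3 above, a hedged sketch of correcting the stale value in place (the field path comes from the status snippet above; the exact enum casing must be checked against the 4.18 API, and an oc new enough to support --subresource is assumed):

oc patch infrastructure cluster --subresource=status --type=json \
  -p '[{"op": "replace", "path": "/status/platformStatus/powervs/serviceEndpoints/0/name", "value": "DNSServices"}]'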

Version-Release number of selected component (if applicable):

    4.17 to 4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The CEL for AWSNetworkLoadBalancerParameters that ensures Subnets and EIPs are equal, should be "feature gated" by both SetEIPForNLBIngressController and IngressControllerLBSubnetsAWS. Meaning, the CEL should only be present/executed if both feature gates are enabled.
 
At the time we released this feature, there wasn't a way to do "AND" for the FeatureGateAwareXValidation marker, but recently https://github.com/openshift/kubernetes-sigs-controller-tools/pull/21 has been merged which now supports that. 
 
However, nothing is currently broken since both feature gates are now enabled by default, but if the IngressControllerLBSubnetsAWS feature gate were disabled for any reason, the IngressController CRD would become invalid and fail to install. You'd get an error message similar to:

ERROR: <input>:1:157: undefined field 'subnets' 

Version-Release number of selected component (if applicable):

    4.17 and 4.18

How reproducible:

    100%?

Steps to Reproduce:

    1. Disable IngressControllerLBSubnetsAWS feature gate    

Actual results:

    IngressController CRD is now broken

Expected results:

IngressController shouldn't be broken.  

Additional info:

    To be clear, this is not a bug with an active impact, but this is more of an inconsistency that could cause problems in the future.

Description of problem:

There is no clipValue function for the annotation router.openshift.io/haproxy.health.check.interval. Once an abnormal value is set, the router-default pods start reporting the following messages:

[ALERT]    (50) : config : [/var/lib/haproxy/conf/haproxy.config:13791] : 'server be_secure:xxx:httpd-gateway-route/pod:xxx:xxx-gateway-service:pass-through-https:10.129.xx.xx:8243' : timer overflow in argument <50000d> to <inter> of server pod:xxx:xxx:pass-through-https:10.129.xx.xx:8243, maximum value is 2147483647 ms (~24.8 days)..

In the above case, the value 50000d was passed to the route annotation router.openshift.io/haproxy.health.check.interval accidentally

Version-Release number of selected component (if applicable):

    

How reproducible:

Easily

Steps to Reproduce:

1. Run the following script and this will break the cluster

oc get routes -A | awk '{print $1 " " $2}' | tail -n+2 | while read line; do
  read -r namespace routename <<<$(echo $line)
  echo -n "NS: $namespace | "
  echo "ROUTENAME: $routename"
  CMD="oc annotate route -n $namespace $routename --overwrite router.openshift.io/haproxy.health.check.interval=50000d"
  echo "Annotating route with:"
  echo $CMD ; eval "$CMD"
  echo "---"
done

Actual results:

    The alert messages are reported and the router-default pod never reaches the ready state.

Expected results:

    Clip the value in order to prevent the issue

Additional info:

    

Description of problem:

Related to OCPBUGS-33891, sometimes degraded nodes are accompanied by a message like this:

failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +00
00 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacit
y:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeI
nfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage
{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1215.qe.devcluster.openshift.com:6443/api/v1/nodes/<node>": read tcp 10.0.26.198:41196
->10.0.15.142:6443: read: connection reset by peer

The reason annotation is potentially user-facing (or am I wrong?), so dumping the full &Node... structure in there is probably not useful; I would expect such info only in the log or in a clearly non-user-facing location.

Version-Release number of selected component (if applicable):

4.16+

How reproducible:

Like OCPBUGS-33891

Steps to Reproduce:

Like OCPBUGS-33891

Actual results:

Go struct dump in the message

Expected results:

No Go struct dump in the message

Description of problem:

For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.

With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. The outcome is that all subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer is seeing an affected CatalogSource "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace, pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j"

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.16.21

How reproducible:

Sometimes, when Pods are evicted on the cluster

Steps to Reproduce:

1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition that a Node will evict Pods (for example by creating DiskPressure on the Node)
3. Observe if any Pods owned by CatalogSources are being evicted

Actual results:

If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.

Expected results:

When Pods owned by CatalogSources are evicted, they are recreated / rescheduled.

Additional info:
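
As a manual mitigation until the catalog pods are recreated automatically, the evicted (Failed) pods can be cleaned up so the catalog operator spins up replacements (a sketch, assuming the affected namespace is openshift-marketplace):

# List evicted/failed pods in the marketplace namespace
oc get pods -n openshift-marketplace --field-selector=status.phase=Failed
# Delete them; the CatalogSource registry pods should then be recreated
oc delete pods -n openshift-marketplace --field-selector=status.phase=Failed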

Description of problem:

=== RUN   TestNodePool/HostedCluster2/EnsureHostedCluster/EnsureSATokenNotMountedUnlessNecessary
    util.go:1943: 
        Expected
            <string>: kube-api-access-5jlcn
        not to have prefix
            <string>: kube-api-access-

 

Pod spec:

    name: manila-csi-driver-operator
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      runAsUser: 1000690000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/guest-kubeconfig
      name: guest-kubeconfig
    - mountPath: /etc/openstack-ca/
      name: cacert
    - mountPath: /etc/openstack/
      name: cloud-credentials
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5jlcn
      readOnly: true 

Description of problem:

     For upgrades and migrations we introduced a check to avoid them when overlapping CIDRs exist - https://github.com/openshift/ovn-kubernetes/pull/2313.

However, the installation allows it, which leaves the installation broken with no information anywhere that easily leads to the issue. For example, if I create an install-config with a clusterNetwork CIDR overlapping with the OVN-K transit switch, everything proceeds and at a certain point the install just gets stuck; only by accessing one of the masters and reading the ovnkube-controller logs can we see what the issue is. And that is only because I knew what I was doing wrong, since I configured this on purpose.

Our installation documentation has no mention of this CIDR (only the join subnet), and we can't expect customers to read all the other documents prior to installing their clusters, as we can see in the example below: https://docs.openshift.com/container-platform/4.15/installing/installing_bare_metal/installing-bare-metal-network-customizations.html
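
For illustration, a networking stanza that would trigger the described overlap, assuming the default OVN-Kubernetes transit switch subnet of 100.88.0.0/16 (verify the default for your version):

networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 100.88.0.0/14   # overlaps the default transit switch subnet 100.88.0.0/16
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16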

Version-Release number of selected component (if applicable):

    OCP 4.14 and 4.15

How reproducible:

    every time

Steps to Reproduce:

    1. Install a cluster with clusternetwork CIDR that overlaps with the OVN internal transit switch subnet    
 2.
    3.
    

Actual results:

    

Expected results:

    Get an error or warning about the overlapping CIDRs, asking the user to change them.

Additional info:

    

I had extracted this code, but I broke the case where we're getting the pull secret from the cluster – because it was immediately getting deleted on return. In-line the code instead to prevent that, so it gets deleted when we're done using it.

 

The error you get when this happens looks like: error running options:

 

could not create external binary provider: couldn't extract release payload image stream: failed extracting image-references from "quay.io/openshift-release-dev/ocp-release-nightly@sha256:856d044a4f97813fb31bc4edda39b05b2b7c02de1327b9b297bdf93edc08fa95": error during image extract: exit status 1 (error: unable to load --registry-config: stat /tmp/external-binary2166428580/.dockerconfigjson: no such file or directory

Description of problem:

On the page /command-line-tools, the oc and virtctl sort the links differently    

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Always    

Steps to Reproduce:

    1.Login in and browse page /command-line-tools
    2.Check the download links in oc and virtctl
    3.
    

Actual results:

oc and virtctl sort the links differently:
oc sorts them by x86_64, ARM 64, IBM, etc.,
while virtctl sorts them by Linux, Mac, Windows.

Both sort orders are fine, but putting them together is a little weird.

Expected results:

All download links on this page sort in the same way.    

Additional info:

    

Description of problem:

 

Incorrect capitalization: `Lightspeed` is rendered as the capitalized `LightSpeed` in the ja and zh languages

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When attempting to install a specific version of an operator from the web console, the install plan of the latest version of that operator is created if the operator version had a + in it. 

Version-Release number of selected component (if applicable):

4.17.6 (Tested version)    

How reproducible:

Easily reproducible   

Steps to Reproduce:

1. Under Operators > Operator Hub, install an operator with a + character in the version.
2. On the next screen, note that the + in the version text box is missing.
3. Make no changes to the default options and proceed to install the operator.
4. An install plan is created to install the operator with the latest version from the channel. 

Actual results:

The install plan is created for the latest version from the channel. 

Expected results:

The install plan is created for the requested version. 

Additional info:

Notes on the reproducer:
- For step 1: the selected version shouldn't be the latest version from the channel for the purposes of this bug. 
- For step 1: The version will need to be selected from the version dropdown to reproduce the bug. If the default version that appears in the dropdown is used, then the bug won't reproduce. 
 
Other Notes: 
- This might also happen with other special characters in the version string other than +, but this is not something that I tested.    

Description of problem:

    When creating a kubevirt hosted cluster with the following apiserver publishing configuration

- service: APIServer
    servicePublishingStrategy:
      type: NodePort
      nodePort:
        address: my.hostna.me
        port: 305030

Shows following error:

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"

And the network policies are not properly deployed in the virtual machine namespaces.

Version-Release number of selected component (if applicable):

 4.17

How reproducible:

    Always

Steps to Reproduce:

    1.Create a kubevirt hosted cluster with apiserver nodeport publish with a hostname
    2. Wait for hosted cluster creation.
    

Actual results:

Following error pops up and network policies are not created

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"    

Expected results:

    No error pops up and the network policies are created.

Additional info:

    This is where the error originates -> https://github.com/openshift/hypershift/blob/ef8596d4d69a53eb60838ae45ffce2bca0bfa3b2/hypershift-operator/controllers/hostedcluster/network_policies.go#L644

    That error is what prevents the network policies from being created.

There's a possible flake in openshift-tests because the upstream test "Should recreate evicted statefulset" occasionally causes kubelet to emit a "failed to bind hostport" event because it tries to recreate a deleted pod too quickly, and this gets flagged by openshift-tests as a bad thing, even though it's not (because it retries and succeeds).

I filed a PR to fix this a long time ago, it just needs review.

Description of problem:

[control-plane-operator] azure-file-csi using the nfs protocol fails to provision volumes with "vnetName or location is empty"

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2025-01-21-070749    

How reproducible:

Always    

Steps to Reproduce:

    1. Create aro hosted cluster on azure.
    2. Create a new storageclass using the azure file csi provisioner and the nfs protocol, create a pvc with the created storageclass, and create a pod consuming the pvc.

$ oc apply -f - <<EOF
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
parameters:
  protocol: nfs
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mypvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: azurefile-csi-nfs
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: hello-app
      image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
      volumeMounts:
        - mountPath: /mnt/storage
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mypvc
EOF

     3. Check the volume should be provisioned and pod could read and write inside the file volume.
    

Actual results:

  In step 3: the volume provisioning fails with "vnetName or location is empty"
$  oc describe pvc mypvc
Name:          mypvc
Namespace:     default
StorageClass:  azurefile-csi-nfs
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com
               volume.kubernetes.io/storage-provisioner: file.csi.azure.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       mypod
Events:
  Type     Reason                Age               From                                                                                                       Message
  ----     ------                ----              ----                                                                                                       -------
  Normal   ExternalProvisioning  7s (x3 over 10s)  persistentvolume-controller                                                                                Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          3s (x4 over 10s)  file.csi.azure.com_azure-file-csi-driver-controller-7cb9b5f788-n9ztr_85399802-95c4-468e-814d-2c4df5140069  External provisioner is provisioning volume for claim "default/mypvc"
  Warning  ProvisioningFailed    3s (x4 over 10s)  file.csi.azure.com_azure-file-csi-driver-controller-7cb9b5f788-n9ztr_85399802-95c4-468e-814d-2c4df5140069  failed to provision volume with StorageClass "azurefile-csi-nfs": rpc error: code = Internal desc = update service endpoints failed with error: vnetName or location is empty  

Expected results:

    In step 3: the volume should be provisioned and pod could read and write inside the file volume. 

Additional info:

 # ARO HCP is missing the vnetName/vnetResourceGroup, which causes volume provisioning with the nfs protocol to fail
oc extract secret/azure-file-csi-config --to=-
# cloud.conf
{
  "cloud": "AzurePublicCloud",
  "tenantId": "XXXXXXXXXX",
  "useManagedIdentityExtension": false,
  "subscriptionId": "XXXXXXXXXX",
  "aadClientId": "XXXXXXXXXX",
  "aadClientSecret": "",
  "aadClientCertPath": "/mnt/certs/ci-op-gcprj1wl-0a358-azure-file",
  "resourceGroup": "ci-op-gcprj1wl-0a358-rg",
  "location": "centralus",
  "vnetName": "",
  "vnetResourceGroup": "",
  "subnetName": "",
  "securityGroupName": "",
  "securityGroupResourceGroup": "",
  "routeTableName": "",
  "cloudProviderBackoff": false,
  "cloudProviderBackoffDuration": 0,
  "useInstanceMetadata": false,
  "loadBalancerSku": "",
  "disableOutboundSNAT": false,
  "loadBalancerName": ""
}
   

Description of problem:

Dialog creating the primary namespaced UDN does not need "name" field. Users can only use one primary UDN per namespace. We can make the flow smoother by generating (or hardcoding) the name on the UI. This should be static (not random). A side effect of this would be, that it would prevent users from creating multiple primary UDNs by mistake.

Version-Release number of selected component (if applicable):

rc.4

How reproducible:

Always

Steps to Reproduce:

1. Go to the create UDN dialog
2.
3.

Actual results:

It asks for a name

Expected results:

It should not ask for a name, using "primary-udn" as the hardcoded value

OR

It should still give the option to set it, but use "primary-udn" as the default pre-filled in the textbox

Additional info:

 

Description of problem:

    Table layout is missing on the Metrics page. After the change in PR https://github.com/openshift/console/pull/14615, the PatternFly 4 shared modules have been removed

Version-Release number of selected component (if applicable):

    pre-merge

How reproducible:

    Always 

Steps to Reproduce:

    1. Navigate to Observe -> Metrics page
    2. Click 'Insert example query' button
    3. Check the layout for Query results table 
    

Actual results:

    The results table has a layout issue

Expected results:

    Layout should be the same as in OCP 4.18

Additional info:

    More information: see PR https://github.com/openshift/console/pull/14615

Tracking https://github.com/distribution/distribution/issues/4112 and/or our own fixes.

The user specified TLS skip-verify, which triggers a bug that does not respect proxy values.
Short-term fix: if a self-signed cert is used, specify the CA cert accordingly instead of skipping verification.
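
A sketch of that short-term workaround using the standard containers certs.d layout (the registry host and certificate file name are placeholders):

sudo mkdir -p /etc/containers/certs.d/registry.example.com:5000
sudo cp my-selfsigned-ca.crt /etc/containers/certs.d/registry.example.com:5000/ca.crt
# then run oc-mirror without the TLS skip-verify flags so the proxy settings are honored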

Description of problem:

    when I run the e2e locally and specify the image to be registry.ci.openshift.org/ocp/release:4.19.0-0.nightly-2025-01-30-091858, the e2e never validates the cluster due to:
    eventually.go:226:  - wanted HostedCluster to desire image registry.ci.openshift.org/ocp/release:4.19.0-0.nightly-2025-01-30-091858, got registry.ci.openshift.org/ocp/release@sha256:daccaa3c0223e23bfc6d9890d7f8e52faa8a5071b2e80d5f753900f16584e3f0
even though the hc.status is completed and the desired image matches the history:
 desired:
      image: registry.ci.openshift.org/ocp/release@sha256:daccaa3c0223e23bfc6d9890d7f8e52faa8a5071b2e80d5f753900f16584e3f0
      version: 4.19.0-0.nightly-2025-01-30-091858
    history:
    - completionTime: "2025-01-31T12:45:37Z"
      image: registry.ci.openshift.org/ocp/release@sha256:daccaa3c0223e23bfc6d9890d7f8e52faa8a5071b2e80d5f753900f16584e3f0
      startedTime: "2025-01-31T12:35:07Z"
      state: Completed
      verified: false
      version: 4.19.0-0.nightly-2025-01-30-091858
    observedGeneration: 1

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    During collection, the KubeVirt image and the OSUS (graph) image are skipped when they cannot be found.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

 Check local_store_collector.go in the release package

Actual results:

    Images are skipped

Expected results:

    If the image was requested in the image set config, it should not be skipped since these are release images; the workflow should fail instead.

Additional info:
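
A minimal Go sketch, not the actual oc-mirror collector, of the expected behavior: when a release-payload image (the resolver below is a hypothetical stand-in) cannot be found, return an error so the workflow fails instead of logging and skipping.

package example

import "fmt"

// collectReleaseImage resolves a release-payload image (e.g. the KubeVirt or
// OSUS/graph image). Because these images were requested via the image set
// config, a lookup failure is fatal rather than something to skip.
func collectReleaseImage(name string, resolve func(string) (string, error)) (string, error) {
	ref, err := resolve(name)
	if err != nil {
		return "", fmt.Errorf("required release image %q not found: %w", name, err)
	}
	return ref, nil
}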

    

Backporting https://github.com/prometheus/prometheus/pull/16174.

A release note entry is to be added.

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

  Day2 monitoring does not handle temporary API server disconnections

Version-Release number of selected component (if applicable):

    4.17.0-0.ci-2024-08-26-170911

How reproducible:

    Always, in manual libvirt runs

Steps to Reproduce:

    1. Run the agent install in a libvirt env manually
    2. Run the day2 install after the cluster installation succeeds
    3. Run 'oc adm node-image monitor' to track the day2 install; when the API server is temporarily disconnected, the monitoring program runs into an error/EOF.
    4. Only reproduced in the libvirt env; the baremetal platform works fine.

Actual results:

    Day2 monitoring runs into an error/EOF.

Expected results:

    Day2 monitoring should run without interruption to track the day2 install in libvirt.

Additional info:

    Monitoring output link: https://docs.google.com/spreadsheets/d/17cOCfYvqxLHlhzBHkwCnFZDUatDRcG1Ej-HQDTDin0c/edit?gid=0#gid=0

Description of problem:

When more than one release is added to ImageSetConfig.yaml, the number of images is doubled and incorrect; checking the log, we can see duplications.

ImageSetConfig.yaml:
=================
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-232.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.0
      maxVersion: 4.16.0
    - name: stable-4.15
      minVersion: 4.15.0
      maxVersion: 4.15.0


images to copy 958
cat /tmp/sss |grep 1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99
 ✓   178/958 : (8s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99 
 ✓   945/958 : (8s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99



 cat /tmp/sss |grep x86_64
 ✓   191/958 : (3s) quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64 
 ✓   383/958 : (2s) quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64 
 ✓   575/958 : (1s) quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64 
 ✓   767/958 : (11s) quay.io/openshift-release-dev/ocp-release:4.15.35-x86_64 
 ✓   958/958 : (5s) quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64
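
As an illustration of the expected behavior (not the oc-mirror implementation), a minimal Go sketch that deduplicates the collected image references before copying, so images shared by the overlapping 4.15/4.16 payloads are mirrored only once:

package main

import "fmt"

// dedupe keeps the first occurrence of each image reference (tag or digest).
func dedupe(refs []string) []string {
	seen := make(map[string]struct{}, len(refs))
	out := make([]string, 0, len(refs))
	for _, r := range refs {
		if _, ok := seen[r]; ok {
			continue // already scheduled for copying
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	refs := []string{
		"quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64",
		"quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64",
		"quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64", // duplicate from overlapping channels
	}
	fmt.Println(dedupe(refs)) // each reference printed once
}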

      

Version-Release number of selected component (if applicable):

    ./oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-unknown-68e608e2", GitCommit:"68e608e2", GitTreeState:"clean", BuildDate:"2024-10-14T05:57:17Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}      

How reproducible:

    Always
    

Steps to Reproduce:

    1.  clone oc-mirror repo, cd oc-mirror, run make build
    2.  Now use the imageSetConfig.yaml present above and run mirror2disk & disk2mirror commands
    3. oc-mirror -c /tmp/clid-232.yaml file://CLID-232 --v2 ; oc-mirror -c /tmp/clid-232.yaml --from file://CLID-232 docker://localhost:5000/clid-232 --dest-tls-verify=false --v2
    

Actual results:

 1. see mirror duplication 

Expected results:

no dup.
    

Additional info:

    

Description of problem:

Setting userTags in the install-config file for AWS does not support all characters that AWS considers valid, per [1].
platform:
  aws:
    region: us-east-1
    propagateUserTags: true
    userTags:
      key1: "Test Space" 
      key2: value2

ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

The documentation at: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/installation-config-parameters-aws.html#installation-configuration-parameters-optional-aws_installation-config-parameters-aws does not refer to any restrictions.

However:

Validation is done here:

https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L106

Which in turn refers to a regex here:

https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L17

Which allows these characters: `^[0-9A-Za-z_.:/=+-@]*$`

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions).
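
A small Go reproduction of that validation, using the regex quoted above: a value containing a space does not match the character class, which is why "Test Space" is rejected.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same pattern referenced by the installer validation above.
	valid := regexp.MustCompile(`^[0-9A-Za-z_.:/=+-@]*$`)
	fmt.Println(valid.MatchString("value2"))     // true
	fmt.Println(valid.MatchString("Test Space")) // false: space is not allowed by the pattern
}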

Version-Release number of selected component (if applicable):

    

How reproducible:

    100 %

Steps to Reproduce:

    1. Create an install-config with a userTags value as mentioned in the description.
    2. Run the installer.


   
    

Actual results:

Command failed with below error:

ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

 

Expected results:

    Installer should run successfully.

Additional info:

    When a userTags value contains a space, the installer fails to validate the install-config.

Description of problem:

When there are no images to copy, or the size of the images to copy is zero, oc-mirror enters an infinite loop as shown below

[fedora@knarra-fedora knarra]$ ./oc-mirror -c /tmp/iscoci.yaml file://test --v2 --cache-dir /test/yinzhou

2025/01/31 17:25:34  [INFO]   : :wave: Hello, welcome to oc-mirror
2025/01/31 17:25:34  [INFO]   : :gear:  setting up the environment for you...
2025/01/31 17:25:34  [INFO]   : :twisted_rightwards_arrows: workflow mode: mirrorToDisk 
2025/01/31 17:25:34  [INFO]   : 🕵  going to discover the necessary images...
2025/01/31 17:25:34  [INFO]   : :mag: collecting release images...
2025/01/31 17:25:34  [INFO]   : :mag: collecting operator images...
 ✗   () Collecting catalog oci:///home/fedora/knarra/openshift-tests-private/redhat-operator-index 
2025/01/31 17:25:34  [WARN]   : [OperatorImageCollector] catalog invalid source name oci:///home/fedora/knarra/openshift-tests-private/redhat-operator-index: lstat /home/fedora/knarra/openshift-tests-private: no such file or directory : SKIPPING
2025/01/31 17:25:34  [INFO]   : :mag: collecting additional images...
2025/01/31 17:25:34  [INFO]   : :mag: collecting helm images...
2025/01/31 17:25:34  [INFO]   : :repeat_one: rebuilding catalogs
2025/01/31 17:25:34  [INFO]   : :rocket: Start copying the images...
207712134 / 0 (7m56s) [------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------] 0 %
Version-Release number of selected component (if applicable):

     4.18
    

How reproducible:

     Always
    

Steps to Reproduce:

    1. Copy redhat-operator-index as an OCI layout using: skopeo copy --all docker://registry.redhat.io/redhat/redhat-operator-index:v4.16 --remove-signatures --insecure-policy oci:///home/fedora/knarra/redhat-operator-index
    2. Now create imageSetConfig.yaml as shown below

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
    - catalog: oci:///home/fedora/knarra/openshift-tests-private/redhat-operator-index
      targetCatalog: "ocicatalog73452"
      targetTag: "v16"
      packages:
        - name: cluster-kube-descheduler-operator

3. Run command oc-mirror -c /tmp/iscoci.yaml file://test --v2 --cache-dir /test/yinzhou

Actual results:

    You can see that mirroring enters an infinite loop.
    

Expected results:

     Mirroring should fail and exit, since no such directory exists.
    

Additional info:

    https://redhat-internal.slack.com/archives/C050P27C71S/p1738344713804659
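
A minimal Go sketch (not the oc-mirror code) of the expected guard: if collection produced nothing to copy, fail fast instead of starting a copy loop whose total size is zero.

package example

import "errors"

// startCopy refuses to begin the copy phase when the collectors returned no
// work, e.g. because the catalog path in the image set config does not exist.
func startCopy(images []string, totalBytes int64) error {
	if len(images) == 0 || totalBytes == 0 {
		return errors.New("nothing to mirror: image collection returned no images (check the catalog path)")
	}
	// ... proceed with copying and progress reporting ...
	return nil
}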
    

Description of problem:

The close button on the Pipeline repository overview page always shows a loading state

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to create Repository form
    2. Fill all details and submit the form
  
    

Actual results:

    The close button on the Repository overview page has a loading icon

Expected results:

    Loading icon should not be there.

Additional info:

    

Description of problem:

oc-mirror failed with error: the manifest type *ocischema.DeserializedImageIndex is not supported

Version-Release number of selected component (if applicable):

./oc-mirror.rhel8 version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410251041.p0.g95f0611.assembly.stream.el9-95f0611", GitCommit:"95f0611c1dc9584a4a9e857912b9eaa539234bbc", GitTreeState:"clean", BuildDate:"2024-10-25T11:28:19Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. imageSetConfig as follows:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.16
    packages:
    - name: nginx-ingress-operator 
2. run the mirror2mirror command :
 ./oc-mirror.rhel8 -c config-66870.yaml docker://localhost:5000/18 --dest-skip-tls --dest-use-http

 

Actual results:

2. hit error :  ./oc-mirror.rhel8 -c config-66870.yaml docker://localhost:5000/18 --dest-skip-tls --dest-use-http
Checking push permissions for localhost:5000
Creating directory: oc-mirror-workspace/src/publish
Creating directory: oc-mirror-workspace/src/v2
Creating directory: oc-mirror-workspace/src/charts
Creating directory: oc-mirror-workspace/src/release-signatures
backend is not configured in config-66870.yaml, using stateless mode
backend is not configured in config-66870.yaml, using stateless mode
No metadata detected, creating new workspace
....
    manifests:
      sha256:dea36b1dde70a17369d775cbabe292e7173abcef426dfc21b8a44896544a30da -> ae3ddd14
  stats: shared=0 unique=27 size=139.1MiB ratio=1.00error: the manifest type *ocischema.DeserializedImageIndex is not supported
error: an error occurred during planning

Expected results:

3. no error

Additional info:

Compared with the 4.17 oc-mirror, there is no such issue.

 

Background

The virtualization perspective wants to have the Observe section so that it can be a fully independent perspective.

Outcomes

The prerequisite functionality is added to the monitoring-plugin without showing regressions in the admin and developer perspectives.

  • Previous changes to the reducers caused issues when refreshing the page/opening in a new tab. When testing please refresh the tabs to ensure proper behavior.

Description of problem:

The OCP cluster upgrade is stuck with the image registry in a degraded state.


The image-registry CO shows the error message below.

- lastTransitionTime: "2024-09-13T03:15:05Z"
    message: "Progressing: All registry resources are removed\nNodeCADaemonProgressing:
      The daemon set node-ca is deployed\nAzurePathFixProgressing: Migration failed:
      I0912 18:18:02.117077       1 main.go:233] Azure Stack Hub environment variables
      not present in current environment, skipping setup...\nAzurePathFixProgressing:
      panic: Get \"https://xxxxximageregistry.blob.core.windows.net/xxxxcontainer?comp=list&prefix=docker&restype=container\":
      dial tcp: lookup xxxximageregistry.blob.core.windows.net on 192.168.xx.xx.
      no such host\nAzurePathFixProgressing: \nAzurePathFixProgressing: goroutine
      1 [running]:\nAzurePathFixProgressing: main.main()\nAzurePathFixProgressing:
      \t/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:53
      +0x12a\nAzurePathFixProgressing: "
    reason: AzurePathFixFailed::Removed
    status: "False"
    type: Progressing    

Version-Release number of selected component (if applicable):

4.14.33    

How reproducible:

    

Steps to Reproduce:

    1. configure azure storage in configs.imageregistry.operator.openshift.io/cluster     
    2. then mark the managementState as Removed 
    3. check the operator status 
    

Actual results:

The image-registry CO remains in a degraded state

Expected results:

The operator should not be in a degraded state

Additional info:

    

Description of problem:

    CSI Operator doesn't propagate HCP labels to 2nd level operands 

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

100%    

Steps to Reproduce:

    1. Create hostedCluster with .spec.Labels 
    

Actual results:

   aws-ebs-csi-driver-controller, aws-ebs-csi-driver-operator, csi-snapshot-controller, csi-snapshot-webhook pods don't have the specified labels.

Expected results:

       aws-ebs-csi-driver-controller, aws-ebs-csi-driver-operator, csi-snapshot-controller, csi-snapshot-webhook pods have the specified labels.

Additional info:
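
For illustration only, a minimal Go sketch (the map names are assumptions, not HyperShift's actual code) of the expected behavior: HCP-level labels are merged into each operand's pod labels without overriding the operand's own labels.

package example

// mergeLabels returns the operand's existing labels plus the HCP-level labels.
// HCP labels fill in missing keys; operand-specific labels win on conflict.
func mergeLabels(operandLabels, hcpLabels map[string]string) map[string]string {
	out := make(map[string]string, len(operandLabels)+len(hcpLabels))
	for k, v := range hcpLabels {
		out[k] = v
	}
	for k, v := range operandLabels {
		out[k] = v
	}
	return out
}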

    

Description of problem:

[Azure disk CSI driver] on ARO HCP cannot provision volumes successfully

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-13-083421    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an AKS cluster on Azure.
    2. Install the hypershift operator on the AKS cluster.
    3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
    4. Check that the Azure disk/file CSI driver works well on the hosted cluster.

Actual results:

    In step 4: the Azure disk CSI driver fails to provision volumes on the hosted cluster

# azure disk pvc provision failed
$ oc describe pvc mypvc
...
  Normal   WaitForFirstConsumer  74m                    persistentvolume-controller                                                                                waiting for first consumer to be created before binding
  Normal   Provisioning          74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  External provisioner is provisioning volume for claim "default/mypvc"
  Warning  ProvisioningFailed    74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Warning  ProvisioningFailed    71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   Provisioning          71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  External provisioner is provisioning volume for claim "default/mypvc"
...

$ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver
W1216 08:07:04.282922       1 main.go:89] nodeid is empty
I1216 08:07:04.290689       1 main.go:165] set up prometheus server on 127.0.0.1:8201
I1216 08:07:04.291073       1 azuredisk.go:213]
DRIVER INFORMATION:
-------------------
Build Date: "2024-12-13T02:45:35Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.29.11
Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c
Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

I1216 08:09:36.814776       1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume
I1216 08:09:36.814803       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]}
I1216 08:09:36.815338       1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d]

goroutine 153 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?})
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8
google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125


Expected results:

    In step 4: the Azure disk CSI driver should provision volumes successfully on the hosted cluster

Additional info:

    

Description of problem:

  OWNERS file updated to include prabhakar and Moe as owners and reviewers

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    This is to facilitate easy backports via automation

Description of problem:

CO olm Degraded.

    jiazha-mac:~ jiazha$ omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version           False      True         1h7m   Unable to apply 4.19.0-0.nightly-multi-2025-02-26-050012: the cluster operator olm is not available

jiazha-mac:~ jiazha$ omg get co olm -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
...
spec: {}
status:
  conditions:
  - lastTransitionTime: '2025-02-26T16:25:34Z'
    message: 'CatalogdDeploymentCatalogdControllerManagerDegraded: Deployment was
      progressing too long


      OperatorcontrollerDeploymentOperatorControllerControllerManagerDegraded: Deployment
      was progressing too long'
    reason: CatalogdDeploymentCatalogdControllerManager_SyncError::OperatorcontrollerDeploymentOperatorControllerControllerManager_SyncError
    status: 'True'
    type: Degraded
  - lastTransitionTime: '2025-02-26T16:08:34Z'
    message: 'CatalogdDeploymentCatalogdControllerManagerProgressing: Waiting for
      Deployment to deploy pods


      OperatorcontrollerDeploymentOperatorControllerControllerManagerProgressing:
      Waiting for Deployment to deploy pods'
    reason: CatalogdDeploymentCatalogdControllerManager_Deploying::OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying
    status: 'True'
    type: Progressing
  - lastTransitionTime: '2025-02-26T16:08:34Z'
    message: 'CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment


      OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting
      for Deployment'
    reason: CatalogdDeploymentCatalogdControllerManager_Deploying::OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying
    status: 'False'
    type: Available

However, the `catalogd` and `operator-controller` deployment worked well at that time.

jiazha-mac:~ jiazha$ omg get deploy 
NAME                         READY  UP-TO-DATE  AVAILABLE  AGE
catalogd-controller-manager  1/1    1           1          1h1m
jiazha-mac:~ jiazha$ omg get deploy -n openshift-operator-controller 
NAME                                    READY  UP-TO-DATE  AVAILABLE  AGE
operator-controller-controller-manager  1/1    1           1          1h1m

jiazha-mac:~ jiazha$ omg get deploy catalogd-controller-manager -o yaml
apiVersion: apps/v1
kind: Deployment
...
status:
  availableReplicas: '1'
  conditions:
  - lastTransitionTime: '2025-02-26T16:24:35Z'
    lastUpdateTime: '2025-02-26T16:24:35Z'
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: 'True'
    type: Available
  - lastTransitionTime: '2025-02-26T16:22:42Z'
    lastUpdateTime: '2025-02-26T16:24:35Z'
    message: ReplicaSet "catalogd-controller-manager-7f855d8d48" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: 'True'
    type: Progressing
  observedGeneration: '1'
  readyReplicas: '1'
  replicas: '1'
  updatedReplicas: '1'

jiazha-mac:~ jiazha$ omg get deploy -n openshift-operator-controller  operator-controller-controller-manager -o yaml
apiVersion: apps/v1
kind: Deployment
...
status:
  availableReplicas: '1'
  conditions:
  - lastTransitionTime: '2025-02-26T16:23:49Z'
    lastUpdateTime: '2025-02-26T16:23:49Z'
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: 'True'
    type: Available
  - lastTransitionTime: '2025-02-26T16:22:54Z'
    lastUpdateTime: '2025-02-26T16:23:49Z'
    message: ReplicaSet "operator-controller-controller-manager-57f648fb64" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: 'True'
    type: Progressing
  observedGeneration: '1'
  readyReplicas: '1'
  replicas: '1'
  updatedReplicas: '1'

Version-Release number of selected component (if applicable):

    

How reproducible:

    Not always

Steps to Reproduce:

Encountered this issue twice:

1, https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-gcp-ipi-ovn-ipsec-arm-mixarch-f14/1894774434611335168 

2, https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-gcp-ipi-ovn-ipsec-amd-mixarch-f28-destructive/1894774064451424256 

    1.
    2.
    3.
    

Actual results:

    CO olm Degraded.

Expected results:

    CO olm available.

Additional info:

    jiazha-mac:~ jiazha$ omg project openshift-cluster-olm-operator
Now using project openshift-cluster-olm-operator
jiazha-mac:~ jiazha$ omg get pods 
NAME                                   READY  STATUS   RESTARTS  AGE
cluster-olm-operator-5c6b8c4959-swxtt  0/1    Running  0         38m
jiazha-mac:~ jiazha$ omg logs cluster-olm-operator-5c6b8c4959-swxtt -c cluster-olm-operator
2025-02-26T16:31:52.648371813Z I0226 16:31:52.643085       1 cmd.go:253] Using service-serving-cert provided certificates
2025-02-26T16:31:52.648662533Z I0226 16:31:52.648619       1 leaderelection.go:121] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
...
2025-02-26T16:32:05.467351366Z E0226 16:32:05.467298       1 base_controller.go:279] "Unhandled Error" err="CatalogdDeploymentCatalogdControllerManager reconciliation failed: Deployment was progressing too long"
2025-02-26T16:32:06.059681614Z I0226 16:32:06.059629       1 builder.go:224] "ProxyHook updating environment" logger="builder" deployment="operator-controller-controller-manager"
2025-02-26T16:32:06.059769494Z I0226 16:32:06.059758       1 featuregates_hook.go:33] "updating environment" logger="feature_gates_hook" deployment="operator-controller-controller-manager"
2025-02-26T16:32:06.066149493Z E0226 16:32:06.066095       1 base_controller.go:279] "Unhandled Error" err="OperatorcontrollerDeploymentOperatorControllerControllerManager reconciliation failed: Deployment was progressing too long"

Payload 4.19.0-0.nightly-2025-02-27-081354 sees console operator regression reintroduced

The original revert included steps to validate the fix

`/payload-aggregate periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance 10`

Unfortunately this did not appear to happen and payloads are blocked again.

This also impacts metal-ipi-ovn-bm

and could be tested via
`/payload-job periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm`

New revert is up.

Description of problem:

    Intermittently, when the MOSC resource is deleted, MOSB resources aren't removed.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-14-090045

How reproducible:

    Intermittent

Steps to Reproduce:

    1. Enable techpreview
    2. Create a MOSC resource to enable OCL in a pool
    3. Wait until the build pod finishes and the first node starts updating
    4. Remove the MOSC resource
    

Actual results:

    The MOSB resource is not removed when the MOSC is deleted. It is leaked.

Expected results:

    When a MOSC resource is removed, all its MOSBs should be removed too.    

Additional info:

    

 

Description of problem:

The release signature ConfigMap file is invalid: no name is defined

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) With this ISC:
cat /test/yinzhou/config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
2) Do mirror2disk + disk2mirror
3) Use the signature configmap to create the resource

Actual results:

3) Failed to create the resource with the error:
oc create -f signature-configmap.json 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required

oc create -f signature-configmap.yaml 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required  

 

 

Expected results:

No error 
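
For illustration, a minimal Go sketch (using client-go types; the name and namespace below are assumptions) of what the generated signature ConfigMap is missing: metadata.name (or generateName) must be non-empty for the API server to accept it.

package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newSignatureConfigMap builds a ConfigMap with a non-empty name, the field
// the generated signature-configmap.json/yaml currently leaves blank.
func newSignatureConfigMap(data map[string]string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "release-signature-config", // hypothetical; must not be empty
			Namespace: "openshift-config-managed", // assumed target namespace
		},
		Data: data,
	}
}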
 

 

Description of problem:

Machine scale failed for GCP Marketplace cluster after upgrade from 4.12 to 4.13

Version-Release number of selected component (if applicable):

Upgrade from 4.12.26 to 4.13.0-0.nightly-2023-07-27-013427

How reproducible:

Always

Steps to Reproduce:

1.Install a 4.12 GCP Marketplace cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion    
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.26   True        False         24m     Cluster version is 4.12.26
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE            REGION        ZONE            AGE
huliu-41142-4cd9z-master-0         Running   n2-standard-4   us-central1   us-central1-a   48m
huliu-41142-4cd9z-master-1         Running   n2-standard-4   us-central1   us-central1-b   48m
huliu-41142-4cd9z-master-2         Running   n2-standard-4   us-central1   us-central1-c   48m
huliu-41142-4cd9z-worker-a-z772h   Running   n2-standard-4   us-central1   us-central1-a   46m
huliu-41142-4cd9z-worker-b-7vb9n   Running   n2-standard-4   us-central1   us-central1-b   46m 

2.Upgrade to 4.13
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-07-27-013427   True        False         15m     Cluster version is 4.13.0-0.nightly-2023-07-27-013427
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE            REGION        ZONE            AGE
huliu-41142-4cd9z-master-0         Running   n2-standard-4   us-central1   us-central1-a   175m
huliu-41142-4cd9z-master-1         Running   n2-standard-4   us-central1   us-central1-b   175m
huliu-41142-4cd9z-master-2         Running   n2-standard-4   us-central1   us-central1-c   175m
huliu-41142-4cd9z-worker-a-z772h   Running   n2-standard-4   us-central1   us-central1-a   172m
huliu-41142-4cd9z-worker-b-7vb9n   Running   n2-standard-4   us-central1   us-central1-b   172m 

3.Scale a machineset
liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset huliu-41142-4cd9z-worker-a --replicas=2
machineset.machine.openshift.io/huliu-41142-4cd9z-worker-a scaled
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE            REGION        ZONE            AGE
huliu-41142-4cd9z-master-0         Running   n2-standard-4   us-central1   us-central1-a   5h35m
huliu-41142-4cd9z-master-1         Running   n2-standard-4   us-central1   us-central1-b   5h35m
huliu-41142-4cd9z-master-2         Running   n2-standard-4   us-central1   us-central1-c   5h35m
huliu-41142-4cd9z-worker-a-pdzg2   Failed                                                  113s
huliu-41142-4cd9z-worker-a-z772h   Running   n2-standard-4   us-central1   us-central1-a   5h33m
huliu-41142-4cd9z-worker-b-7vb9n   Running   n2-standard-4   us-central1   us-central1-b   5h33m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-41142-4cd9z-worker-a-pdzg2  -oyaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: Unknown
  creationTimestamp: "2023-07-31T07:42:44Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: huliu-41142-4cd9z-worker-a-
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-41142-4cd9z
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: huliu-41142-4cd9z-worker-a
  name: huliu-41142-4cd9z-worker-a-pdzg2
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: huliu-41142-4cd9z-worker-a
    uid: 43046eac-5ff5-4810-8e20-f0332128410f
  resourceVersion: "163107"
  uid: 1cd7d4d2-f231-457c-b21b-4ebc2d27363e
spec:
  lifecycleHooks: {}
  metadata: {}
  providerSpec:
    value:
      apiVersion: machine.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: projects/redhat-marketplace-public/global/images/redhat-coreos-ocp-48-x86-64-202210040145
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n2-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: huliu-41142-4cd9z-network
        subnetwork: huliu-41142-4cd9z-worker-subnet
      projectID: openshift-qe
      region: us-central1
      serviceAccounts:
      - email: huliu-41142-4cd9z-w@openshift-qe.iam.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      shieldedInstanceConfig: {}
      tags:
      - huliu-41142-4cd9z-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a
status:
  conditions:
  - lastTransitionTime: "2023-07-31T07:42:44Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2023-07-31T07:42:44Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2023-07-31T07:42:44Z"
    status: "True"
    type: Terminable
  errorMessage: 'error launching instance: googleapi: Error 400: Invalid value for
    field ''resource.shieldedInstanceConfig'': ''{  "enableVtpm": true,  "enableIntegrityMonitoring":
    true}''. Shielded VM Config can only be set when using a UEFI-compatible disk.,
    invalid'
  errorReason: InvalidConfiguration
  lastUpdated: "2023-07-31T07:42:50Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastTransitionTime: "2023-07-31T07:42:50Z"
      message: 'googleapi: Error 400: Invalid value for field ''resource.shieldedInstanceConfig'':
        ''{  "enableVtpm": true,  "enableIntegrityMonitoring": true}''. Shielded VM
        Config can only be set when using a UEFI-compatible disk., invalid'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreated
    metadata: {}

liuhuali@Lius-MacBook-Pro huali-test % oc get machineset huliu-41142-4cd9z-worker-a -oyaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-07-31T02:09:14Z"
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-41142-4cd9z
  name: huliu-41142-4cd9z-worker-a
  namespace: openshift-machine-api
  resourceVersion: "163067"
  uid: 43046eac-5ff5-4810-8e20-f0332128410f
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-41142-4cd9z
      machine.openshift.io/cluster-api-machineset: huliu-41142-4cd9z-worker-a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: huliu-41142-4cd9z
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: huliu-41142-4cd9z-worker-a
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          apiVersion: machine.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: projects/redhat-marketplace-public/global/images/redhat-coreos-ocp-48-x86-64-202210040145
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n2-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: huliu-41142-4cd9z-network
            subnetwork: huliu-41142-4cd9z-worker-subnet
          projectID: openshift-qe
          region: us-central1
          serviceAccounts:
          - email: huliu-41142-4cd9z-w@openshift-qe.iam.gserviceaccount.com
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - huliu-41142-4cd9z-worker
          userDataSecret:
            name: worker-user-data
          zone: us-central1-a
status:
  availableReplicas: 1
  fullyLabeledReplicas: 2
  observedGeneration: 2
  readyReplicas: 1
  replicas: 2
 

Actual results:

Machine scale Failed

Expected results:

The Machine should get to Running; shieldedInstanceConfig should not be validated when it is not set.

Additional info:

Although we found bug https://issues.redhat.com/browse/OCPBUGS-7367, in the upgrade case the users didn't set the parameter (shieldedInstanceConfig) and didn't want to use the feature either, yet they cannot scale up the old machineset. That's not convenient.

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

operator conditions etcd

Significant regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 99.72% to 93.55%.

Sample (being evaluated) Release: 4.18
Start Time: 2025-02-04T00:00:00Z
End Time: 2025-02-11T16:00:00Z
Success Rate: 93.55%
Successes: 87
Failures: 6
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.72%
Successes: 355
Failures: 1
Flakes: 0

View the test details report for additional context.

The issue is currently being discussed via https://redhat-internal.slack.com/archives/C01CQA76KMX/p1739211100312459. It seems to specifically impact periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-azure jobs which in part change instances / instance types during the job and appears to be impacting static pods.

4.19 Test Analysis

4.18 Test Analysis

Description of problem:

The manila controller[1] defines labels that are not based on the asset prefix defined in the manila config[2]. Consequently, when assets that select this resource are generated, they use the asset prefix as the base to define the label, so the controller is not selected, for example in the pod anti-affinity[3] and the controller PDB[4]. We need to change the labels used in the selectors to match the actual labels of the controller.

[1]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L45-L47

[2]https://github.com/openshift/csi-operator/blob/master/pkg/driver/openstack-manila/openstack_manila.go#L51

[3]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L55

[4]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/hypershift/controller_pdb.yaml#L16

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:
machine-approver logs

E0221 20:29:52.377443       1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen

.

oc get csr |wc -l
3818
oc get csr |grep "node-bootstrapper" |wc -l
2152

By approving the pending CSRs manually I can get the cluster to scale up.

We can increase maxPending to a higher number: https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330
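
A minimal Go sketch of the guard described in that log line (not the actual cluster-machine-approver code): once pending CSRs exceed the allowed maximum, everything is ignored, which is why approving CSRs manually (or raising the limit) unblocks scale-up.

package example

// shouldIgnoreCSRs mirrors the behavior in the log above: when the number of
// pending CSRs exceeds the computed maximum, the approver approves nothing.
func shouldIgnoreCSRs(pendingCSRs, maxPending int) bool {
	// From the log: 1871 pending vs. a max of 604 -> all CSRs are ignored.
	return pendingCSRs > maxPending
}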

 

Description of problem:

    We are currently using node 18, but our types are for node 10

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    Always

Steps to Reproduce:

    1. Open frontend/package.json
    2. Observe @types/node and engine version
    3.
    

Actual results:

    They are different 

Expected results:

    They are the same

Additional info:

    

The edit route action shows an "Edit" button in order to save the changes, instead of a "Save" button.

 

 

The button label is "Save" on other forms e.g. Deployment.

Description of problem:

    Routes with SHA1 CA certificates (spec.tls.caCertificate) break HAProxy, preventing it from reloading

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. create Route with SHA1 CA certificates
    2.
    3.
    

Actual results:

    HAProxy router fails to reload

Expected results:

    HAProxy router should either reject Routes with SHA1 CA certificates, or reload successfully

Additional info:

    [ALERT]    (312) : config : parsing [/var/lib/haproxy/conf/haproxy.config:131] : 'bind unix@/var/lib/haproxy/run/haproxy-sni.sock' in section 'frontend' : 'crt-list' : error processing line 1 in file '/var/lib/haproxy/conf/cert_config.map' : unable to load chain certificate into SSL Context '/var/lib/haproxy/router/certs/test:test.pem': ca md too weak.

[ALERT]    (312) : config : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config

[ALERT]    (312) : config : Fatal errors found in configuration.

This is a continuation/variance of https://issues.redhat.com/browse/OCPBUGS-26498
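
As an illustration (not the router's implementation), a minimal Go sketch of how such certificates could be detected up front: parse the route's CA certificate and flag SHA1-based signature algorithms instead of letting HAProxy fail to reload with "ca md too weak".

package example

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
)

// isSHA1Signed reports whether a PEM-encoded certificate uses a SHA1-based
// signature algorithm, which HAProxy rejects as "ca md too weak".
func isSHA1Signed(pemBytes []byte) (bool, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return false, errors.New("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	switch cert.SignatureAlgorithm {
	case x509.SHA1WithRSA, x509.ECDSAWithSHA1, x509.DSAWithSHA1:
		return true, nil
	}
	return false, nil
}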

Description of problem:

When destroying an azure cluster, if the main resource group has already been destroyed, DNS entries are not scrubbed, despite BaseDomainResourceGroup being provided via metadata.
    

Version-Release number of selected component (if applicable):

Probably many. I think I have been reproducing in 4.16 and/or 4.17.
    

How reproducible:

Easy, 100%
    

Steps to Reproduce:

    1. Install an Azure cluster with a distinct BaseDomainResourceGroup. Save metadata.json. Confirm it contains BaseDomainResourceGroup.
    2. Manually (via cloud API/CLI) destroy main resource group.
    3. Destroy cluster, providing metadata.json from step 1.

    

Actual results:

DNS records still exist.
    

Expected results:

DNS records scrubbed.
    

Additional info:
thread
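
For illustration, a minimal Go sketch under stated assumptions (all types and helpers below are hypothetical stubs, not the installer's code) of the expected destroy ordering: DNS cleanup in BaseDomainResourceGroup should not depend on the cluster resource group still existing.

package main

import (
	"errors"
	"fmt"
)

// metadata loosely mirrors the metadata.json fields that matter here.
type metadata struct {
	ResourceGroupName       string
	BaseDomainResourceGroup string
	ClusterDomain           string
}

var errNotFound = errors.New("not found")

// Stubs standing in for the real Azure SDK calls.
func deleteDNSRecords(resourceGroup, domain string) error {
	fmt.Printf("scrubbing DNS records for %s in %s\n", domain, resourceGroup)
	return nil
}

func deleteResourceGroup(name string) error { return errNotFound } // already deleted manually

// destroyCluster scrubs DNS first (or at least independently), then treats a
// missing cluster resource group as success rather than skipping DNS cleanup.
func destroyCluster(m metadata) error {
	if err := deleteDNSRecords(m.BaseDomainResourceGroup, m.ClusterDomain); err != nil {
		return err
	}
	if err := deleteResourceGroup(m.ResourceGroupName); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	_ = destroyCluster(metadata{
		ResourceGroupName:       "cluster-rg",     // already destroyed via the cloud CLI
		BaseDomainResourceGroup: "dns-rg",         // from metadata.json
		ClusterDomain:           "example.domain", // hypothetical
	})
}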

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Issue found in prow ci
periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-gcp-ipi-ovn-ipsec-arm-mixarch-f14 #1890061783440297984
periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-gcp-ipi-ovn-ipsec-amd-mixarch-f28-destructive #1890035862469611520
periodic-ci-openshift-openshift-tests-private-release-4.19-multi-nightly-gcp-ipi-ovn-ipsec-arm-mixarch-f14 #1890279505117843456

must-gather logs for second one https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-o[…]r-must-gather/artifacts/must-gather.tar

% omg get nodes
NAME                                       STATUS  ROLES                 AGE    VERSION
ci-op-9pmd0iim-3eaf1-dcw66-master-0        Ready   control-plane,master  1h12m  v1.32.1
ci-op-9pmd0iim-3eaf1-dcw66-master-1        Ready   control-plane,master  1h13m  v1.32.1
ci-op-9pmd0iim-3eaf1-dcw66-master-2        Ready   control-plane,master  1h11m  v1.32.1
ci-op-9pmd0iim-3eaf1-dcw66-worker-a-d6sw7  Ready   worker                1h0m   v1.32.1
ci-op-9pmd0iim-3eaf1-dcw66-worker-b-97qfp  Ready   worker                58m    v1.32.1
% omg get pods -n openshift-ovn-kubernetes -o wide
NAME                                    READY  STATUS   RESTARTS  AGE   IP          NODE
ovn-ipsec-host-2qfqh                    2/2    Running  0         33m   10.0.0.4    ci-op-9pmd0iim-3eaf1-dcw66-master-2
ovn-ipsec-host-bqh5n                    0/2    Pending  0         33m   10.0.128.3  ci-op-9pmd0iim-3eaf1-dcw66-worker-b-97qfp
ovn-ipsec-host-hdjtx                    2/2    Running  0         33m   10.0.0.3    ci-op-9pmd0iim-3eaf1-dcw66-master-1
ovn-ipsec-host-jwn8s                    2/2    Running  0         33m   10.0.0.6    ci-op-9pmd0iim-3eaf1-dcw66-master-0
ovn-ipsec-host-n4cpv                    0/2    Pending  0         33m   10.0.128.2  ci-op-9pmd0iim-3eaf1-dcw66-worker-a-d6sw7
ovnkube-control-plane-85cbb47f9d-n6rps  2/2    Running  1         55m   10.0.0.6    ci-op-9pmd0iim-3eaf1-dcw66-master-0
ovnkube-control-plane-85cbb47f9d-slb94  2/2    Running  0         47m   10.0.0.3    ci-op-9pmd0iim-3eaf1-dcw66-master-1
ovnkube-node-2hwb6                      8/8    Running  0         1h0m  10.0.128.2  ci-op-9pmd0iim-3eaf1-dcw66-worker-a-d6sw7
ovnkube-node-9nhj6                      8/8    Running  1         53m   10.0.0.4    ci-op-9pmd0iim-3eaf1-dcw66-master-2
ovnkube-node-h2fd2                      8/8    Running  2         53m   10.0.0.3    ci-op-9pmd0iim-3eaf1-dcw66-master-1
ovnkube-node-hwng4                      8/8    Running  0         56m   10.0.0.6    ci-op-9pmd0iim-3eaf1-dcw66-master-0
ovnkube-node-k6rfl                      8/8    Running  0         58m   10.0.128.3  ci-op-9pmd0iim-3eaf1-dcw66-worker-b-97qfp
     % omg get pod ovn-ipsec-host-n4cpv    -n openshift-ovn-kubernetes -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: 'false'
      creationTimestamp: '2025-02-13T14:54:05Z'
      generateName: ovn-ipsec-host-
      labels:
        app: ovn-ipsec
        component: network
        controller-revision-hash: 8b4dd5dc7
        kubernetes.io/os: linux
        openshift.io/component: network
        pod-template-generation: '1'
        type: infra
      managedFields:
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:cluster-autoscaler.kubernetes.io/enable-ds-eviction: {}
              f:target.workload.openshift.io/management: {}
            f:generateName: {}
            f:labels:
              .: {}
              f:app: {}
              f:component: {}
              f:controller-revision-hash: {}
              f:kubernetes.io/os: {}
              f:openshift.io/component: {}
              f:pod-template-generation: {}
              f:type: {}
            f:ownerReferences:
              .: {}
              k:{"uid":"61870386-d205-465b-832c-061c3bf7366e"}: {}
          f:spec:
            f:affinity:
              .: {}
              f:nodeAffinity:
                .: {}
                f:requiredDuringSchedulingIgnoredDuringExecution: {}
            f:containers:
              k:{"name":"ovn-ipsec"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"K8S_NODE"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:lifecycle:
                  .: {}
                  f:preStop:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                f:livenessProbe:
                  .: {}
                  f:exec:
                    .: {}
                    f:command: {}
                  f:failureThreshold: {}
                  f:initialDelaySeconds: {}
                  f:periodSeconds: {}
                  f:successThreshold: {}
                  f:timeoutSeconds: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:securityContext:
                  .: {}
                  f:privileged: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/cni/net.d"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/usr/libexec/ipsec"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/usr/sbin/ipsec"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/lib"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/log/openvswitch/"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/run"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
              k:{"name":"ovn-ipsec-cleanup"}:
                .: {}
                f:command: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:securityContext:
                  .: {}
                  f:privileged: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/ovn/"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/run"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
            f:dnsPolicy: {}
            f:enableServiceLinks: {}
            f:hostNetwork: {}
            f:hostPID: {}
            f:initContainers:
              .: {}
              k:{"name":"ovn-keys"}:
                .: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"K8S_NODE"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:resources:
                  .: {}
                  f:requests:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
                f:securityContext:
                  .: {}
                  f:privileged: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/etc"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/openvswitch"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/etc/ovn/"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/signer-ca"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                  k:{"mountPath":"/var/run"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
            f:nodeSelector: {}
            f:priorityClassName: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:serviceAccount: {}
            f:serviceAccountName: {}
            f:terminationGracePeriodSeconds: {}
            f:tolerations: {}
            f:volumes:
              .: {}
              k:{"name":"etc-openvswitch"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"etc-ovn"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-cni-netd"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-etc"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-var-lib"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-var-log-ovs"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"host-var-run"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"ipsec-bin"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"ipsec-lib"}:
                .: {}
                f:hostPath:
                  .: {}
                  f:path: {}
                  f:type: {}
                f:name: {}
              k:{"name":"signer-ca"}:
                .: {}
                f:configMap:
                  .: {}
                  f:defaultMode: {}
                  f:name: {}
                f:name: {}
        manager: kube-controller-manager
        operation: Update
        time: '2025-02-13T14:54:04Z'
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:status:
            f:conditions:
              k:{"type":"ContainersReady"}:
                .: {}
                f:lastProbeTime: {}
                f:lastTransitionTime: {}
                f:message: {}
                f:reason: {}
                f:status: {}
                f:type: {}
              k:{"type":"Initialized"}:
                .: {}
                f:lastProbeTime: {}
                f:lastTransitionTime: {}
                f:message: {}
                f:reason: {}
                f:status: {}
                f:type: {}
              k:{"type":"PodReadyToStartContainers"}:
                .: {}
                f:lastProbeTime: {}
                f:lastTransitionTime: {}
                f:status: {}
                f:type: {}
              k:{"type":"Ready"}:
                .: {}
                f:lastProbeTime: {}
                f:lastTransitionTime: {}
                f:message: {}
                f:reason: {}
                f:status: {}
                f:type: {}
            f:containerStatuses: {}
            f:hostIP: {}
            f:hostIPs: {}
            f:initContainerStatuses: {}
            f:podIP: {}
            f:podIPs:
              .: {}
              k:{"ip":"10.0.128.2"}:
                .: {}
                f:ip: {}
            f:startTime: {}
        manager: kubelet
        operation: Update
        subresource: status
        time: '2025-02-13T14:54:05Z'
      name: ovn-ipsec-host-n4cpv
      namespace: openshift-ovn-kubernetes
      ownerReferences:
      - apiVersion: apps/v1
        blockOwnerDeletion: true
        controller: true
        kind: DaemonSet
        name: ovn-ipsec-host
        uid: 61870386-d205-465b-832c-061c3bf7366e
      resourceVersion: '38812'
      uid: ce7f6619-3015-414d-9de4-5991d74258fd
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchFields:
              - key: metadata.name
                operator: In
                values:
                - ci-op-9pmd0iim-3eaf1-dcw66-worker-a-d6sw7
      containers:
      - command:
        - /bin/bash
        - -c
        - "#!/bin/bash\nset -exuo pipefail\n\n# Don't start IPsec until ovnkube-node has\
          \ finished setting up the node\ncounter=0\nuntil [ -f /etc/cni/net.d/10-ovn-kubernetes.conf\
          \ ]\ndo\n  counter=$((counter+1))\n  sleep 1\n  if [ $counter -gt 300 ];\n \
          \ then\n          echo \"ovnkube-node pod has not started after $counter seconds\"\
          \n          exit 1\n  fi\ndone\necho \"ovnkube-node has configured node.\"\n\
          \nif ! pgrep pluto; then\n  echo \"pluto is not running, enable the service\
          \ and/or check system logs\"\n  exit 2\nfi\n\n# The ovs-monitor-ipsec doesn't\
          \ set authby, so when it calls ipsec auto --start\n# the default ones defined\
          \ at Libreswan's compile time will be used. On restart,\n# Libreswan will use\
          \ authby from libreswan.config. If libreswan.config is\n# incompatible with\
          \ the Libreswan's compiled-in defaults, then we'll have an\n# authentication\
          \ problem. But OTOH, ovs-monitor-ipsec does set ike and esp algorithms,\n# so\
          \ those may be incompatible with libreswan.config as well. Hence commenting\
          \ out the\n# \"include\" from libreswan.conf to avoid such conflicts.\ndefaultcpinclude=\"\
          include \\/etc\\/crypto-policies\\/back-ends\\/libreswan.config\"\nif ! grep\
          \ -q \"# ${defaultcpinclude}\" /etc/ipsec.conf; then\n  sed -i \"/${defaultcpinclude}/s/^/#\
          \ /\" /etc/ipsec.conf\n  # since pluto is on the host, we need to restart it\
          \ after changing connection\n  # parameters.\n  chroot /proc/1/root ipsec restart\n\
          \n  counter=0\n  until [ -r /run/pluto/pluto.ctl ]; do\n    counter=$((counter+1))\n\
          \    sleep 1\n    if [ $counter -gt 300 ];\n    then\n      echo \"ipsec has\
          \ not started after $counter seconds\"\n      exit 1\n    fi\n  done\n  echo\
          \ \"ipsec service is restarted\"\nfi\n\n# Workaround for https://github.com/libreswan/libreswan/issues/373\n\
          ulimit -n 1024\n\n/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig\n\
          # Check kernel modules\n/usr/libexec/ipsec/_stackmanager start\n# Check nss\
          \ database status\n/usr/sbin/ipsec --checknss\n\n# Start ovs-monitor-ipsec which\
          \ will monitor for changes in the ovs\n# tunnelling configuration (for example\
          \ addition of a node) and configures\n# libreswan appropriately.\n# We are running\
          \ this in the foreground so that the container will be restarted when ovs-monitor-ipsec\
          \ fails.\n/usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec\
          \ \\\n  --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan\
          \ --no-restart-ike-daemon \\\n  --ipsec-conf /etc/ipsec.d/openshift.conf --ipsec-d\
          \ /var/lib/ipsec/nss \\\n  --log-file --monitor unix:/var/run/openvswitch/db.sock\n"
        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - '#!/bin/bash
     
                set -exuo pipefail
     
                # In order to maintain traffic flows during container restart, we
     
                # need to ensure that xfrm state and policies are not flushed.
     
     
                # Don''t allow ovs monitor to cleanup persistent state
     
                kill "$(cat /var/run/openvswitch/ovs-monitor-ipsec.pid 2>/dev/null)" 2>/dev/null
                || true
     
                '
        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - "#!/bin/bash\nif [[ $(ipsec whack --trafficstatus | wc -l) -eq 0 ]]; then\n\
              \  echo \"no ipsec traffic configured\"\n  exit 10\nfi\n"
          failureThreshold: 3
          initialDelaySeconds: 15
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 1
        name: ovn-ipsec
        resources:
          requests:
            cpu: 10m
            memory: 100Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/cni/net.d
          name: host-cni-netd
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /var/log/openvswitch/
          name: host-var-log-ovs
        - mountPath: /etc/openvswitch
          name: etc-openvswitch
        - mountPath: /var/lib
          name: host-var-lib
        - mountPath: /etc
          name: host-etc
        - mountPath: /usr/sbin/ipsec
          name: ipsec-bin
        - mountPath: /usr/libexec/ipsec
          name: ipsec-lib
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
      - command:
        - /bin/bash
        - -c
        - "#!/bin/bash\n\n# When NETWORK_NODE_IDENTITY_ENABLE is true, use the per-node\
          \ certificate to create a kubeconfig\n# that will be used to talk to the API\n\
          \n\n# Wait for cert file\nretries=0\ntries=20\nkey_cert=\"/etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\"\
          \nwhile [ ! -f \"${key_cert}\" ]; do\n  (( retries += 1 ))\n  if [[ \"${retries}\"\
          \ -gt ${tries} ]]; then\n    echo \"$(date -Iseconds) - ERROR - ${key_cert}\
          \ not found\"\n    return 1\n  fi\n  sleep 1\ndone\n\ncat << EOF > /var/run/ovnkube-kubeconfig\n\
          apiVersion: v1\nclusters:\n  - cluster:\n      certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n\
          \      server: https://api-int.ci-op-9pmd0iim-3eaf1.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443\n\
          \    name: default-cluster\ncontexts:\n  - context:\n      cluster: default-cluster\n\
          \      namespace: default\n      user: default-auth\n    name: default-context\n\
          current-context: default-context\nkind: Config\npreferences: {}\nusers:\n  -\
          \ name: default-auth\n    user:\n      client-certificate: /etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\n\
          \      client-key: /etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\n\
          EOF\nexport KUBECONFIG=/var/run/ovnkube-kubeconfig\n\n\n# It is safe to flush\
          \ xfrm states and policies and delete openshift.conf\n# file when east-west\
          \ ipsec is disabled. This fixes a race condition when\n# ovs-monitor-ipsec is\
          \ not fast enough to notice ipsec config change and\n# delete entries before\
          \ it's being killed.\n# Since it's cleaning up all xfrm states and policies,\
          \ it may cause slight\n# interruption until ipsec is restarted in case of external\
          \ ipsec config.\n# We must do this before killing ovs-monitor-ipsec script,\
          \ otherwise\n# preStop hook doesn't get a chance to run it because ovn-ipsec\
          \ container\n# is abruptly terminated.\n# When east-west ipsec is not disabled,\
          \ then do not flush xfrm states and\n# policies in order to maintain traffic\
          \ flows during container restart.\nipsecflush() {\n  if [ \"$(kubectl get networks.operator.openshift.io\
          \ cluster -ojsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipsecConfig.mode}')\"\
          \ != \"Full\" ] && \\\n     [ \"$(kubectl get networks.operator.openshift.io\
          \ cluster -ojsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipsecConfig}')\"\
          \ != \"{}\" ]; then\n    ip x s flush\n    ip x p flush\n    rm -f /etc/ipsec.d/openshift.conf\n\
          \    # since pluto is on the host, we need to restart it after the flush\n \
          \   chroot /proc/1/root ipsec restart\n  fi\n}\n\n# Function to handle SIGTERM\n\
          cleanup() {\n  echo \"received SIGTERM, flushing ipsec config\"\n  # Wait upto\
          \ 15 seconds for ovs-monitor-ipsec process to terminate before\n  # cleaning\
          \ up ipsec entries.\n  counter=0\n  while kill -0 \"$(cat /var/run/openvswitch/ovs-monitor-ipsec.pid\
          \ 2>/dev/null)\"; do\n    counter=$((counter+1))\n    sleep 1\n    if [ $counter\
          \ -gt 15 ];\n    then\n      echo \"ovs-monitor-ipsec has not terminated after\
          \ $counter seconds\"\n      break\n    fi\n  done\n  ipsecflush\n  exit 0\n\
          }\n\n# Trap SIGTERM and call cleanup function\ntrap cleanup SIGTERM\n\ncounter=0\n\
          until [ -r /var/run/openvswitch/ovs-monitor-ipsec.pid ]; do\n  counter=$((counter+1))\n\
          \  sleep 1\n  if [ $counter -gt 300 ];\n  then\n    echo \"ovs-monitor-ipsec\
          \ has not started after $counter seconds\"\n    exit 1\n  fi\ndone\necho \"\
          ovs-monitor-ipsec is started\"\n\n# Monitor the ovs-monitor-ipsec process.\n\
          while kill -0 \"$(cat /var/run/openvswitch/ovs-monitor-ipsec.pid 2>/dev/null)\"\
          ; do\n  sleep 1\ndone\n\n# Once the ovs-monitor-ipsec process terminates, execute\
          \ the cleanup command.\necho \"ovs-monitor-ipsec is terminated, flushing ipsec\
          \ config\"\nipsecflush\n\n# Continue running until SIGTERM is received (or exit\
          \ naturally)\nwhile true; do\n  sleep 1\ndone\n"
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imagePullPolicy: IfNotPresent
        name: ovn-ipsec-cleanup
        resources:
          requests:
            cpu: 10m
            memory: 50Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/ovn/
          name: etc-ovn
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /etc
          name: host-etc
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
      dnsPolicy: Default
      enableServiceLinks: true
      hostNetwork: true
      hostPID: true
      imagePullSecrets:
      - name: ovn-kubernetes-node-dockercfg-sds8g
      initContainers:
      - command:
        - /bin/bash
        - -c
        - "#!/bin/bash\nset -exuo pipefail\n\n# When NETWORK_NODE_IDENTITY_ENABLE is true,\
          \ use the per-node certificate to create a kubeconfig\n# that will be used to\
          \ talk to the API\n\n\n# Wait for cert file\nretries=0\ntries=20\nkey_cert=\"\
          /etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\"\nwhile [ ! -f \"${key_cert}\"\
          \ ]; do\n  (( retries += 1 ))\n  if [[ \"${retries}\" -gt ${tries} ]]; then\n\
          \    echo \"$(date -Iseconds) - ERROR - ${key_cert} not found\"\n    return\
          \ 1\n  fi\n  sleep 1\ndone\n\ncat << EOF > /var/run/ovnkube-kubeconfig\napiVersion:\
          \ v1\nclusters:\n  - cluster:\n      certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n\
          \      server: https://api-int.ci-op-9pmd0iim-3eaf1.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443\n\
          \    name: default-cluster\ncontexts:\n  - context:\n      cluster: default-cluster\n\
          \      namespace: default\n      user: default-auth\n    name: default-context\n\
          current-context: default-context\nkind: Config\npreferences: {}\nusers:\n  -\
          \ name: default-auth\n    user:\n      client-certificate: /etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\n\
          \      client-key: /etc/ovn/ovnkube-node-certs/ovnkube-client-current.pem\n\
          EOF\nexport KUBECONFIG=/var/run/ovnkube-kubeconfig\n\n\n# Every time we restart\
          \ this container, we will create a new key pair if\n# we are close to key expiration\
          \ or if we do not already have a signed key pair.\n#\n# Each node has a key\
          \ pair which is used by OVS to encrypt/decrypt/authenticate traffic\n# between\
          \ each node. The CA cert is used as the root of trust for all certs so we need\n\
          # the CA to sign our certificate signing requests with the CA private key. In\
          \ this way,\n# we can validate that any signed certificates that we receive\
          \ from other nodes are\n# authentic.\necho \"Configuring IPsec keys\"\n\ncert_pem=/etc/openvswitch/keys/ipsec-cert.pem\n\
          \n# If the certificate does not exist or it will expire in the next 6 months\n\
          # (15770000 seconds), we will generate a new one.\nif ! openssl x509 -noout\
          \ -dates -checkend 15770000 -in $cert_pem; then\n  # We use the system-id as\
          \ the CN for our certificate signing request. This\n  # is a requirement by\
          \ OVN.\n  cn=$(ovs-vsctl --retry -t 60 get Open_vSwitch . external-ids:system-id\
          \ | tr -d \"\\\"\")\n\n  mkdir -p /etc/openvswitch/keys\n\n  # Generate an SSL\
          \ private key and use the key to create a certitificate signing request\n  umask\
          \ 077 && openssl genrsa -out /etc/openvswitch/keys/ipsec-privkey.pem 2048\n\
          \  openssl req -new -text \\\n              -extensions v3_req \\\n        \
          \      -addext \"subjectAltName = DNS:${cn}\" \\\n              -subj \"/C=US/O=ovnkubernetes/OU=kind/CN=${cn}\"\
          \ \\\n              -key /etc/openvswitch/keys/ipsec-privkey.pem \\\n      \
          \        -out /etc/openvswitch/keys/ipsec-req.pem\n\n  csr_64=$(base64 -w0 /etc/openvswitch/keys/ipsec-req.pem)\
          \ # -w0 to avoid line-wrap\n\n  # Request that our generated certificate signing\
          \ request is\n  # signed by the \"network.openshift.io/signer\" signer that\
          \ is\n  # implemented by the CNO signer controller. This will sign the\n  #\
          \ certificate signing request using the signer-ca which has been\n  # set up\
          \ by the OperatorPKI. In this way, we have a signed certificate\n  # and our\
          \ private key has remained private on this host.\n  cat <<EOF | kubectl create\
          \ -f -\n  apiVersion: certificates.k8s.io/v1\n  kind: CertificateSigningRequest\n\
          \  metadata:\n    generateName: ipsec-csr-$(hostname)-\n    labels:\n      k8s.ovn.org/ipsec-csr:\
          \ $(hostname)\n  spec:\n    request: ${csr_64}\n    signerName: network.openshift.io/signer\n\
          \    usages:\n    - ipsec tunnel\nEOF\n  # Wait until the certificate signing\
          \ request has been signed.\n  counter=0\n  until [ -n \"$(kubectl get csr -lk8s.ovn.org/ipsec-csr=\"\
          $(hostname)\" --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].status.certificate}'\
          \ 2>/dev/null)\" ]\n  do\n    counter=$((counter+1))\n    sleep 1\n    if [\
          \ $counter -gt 60 ];\n    then\n            echo \"Unable to sign certificate\
          \ after $counter seconds\"\n            exit 1\n    fi\n  done\n\n  # Decode\
          \ the signed certificate.\n  kubectl get csr -lk8s.ovn.org/ipsec-csr=\"$(hostname)\"\
          \ --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].status.certificate}'\
          \ | base64 -d | openssl x509 -outform pem -text -out $cert_pem\n\n  # kubectl\
          \ delete csr/$(hostname)\n\n  # Get the CA certificate so we can authenticate\
          \ peer nodes.\n  openssl x509 -in /signer-ca/ca-bundle.crt -outform pem -text\
          \ -out /etc/openvswitch/keys/ipsec-cacert.pem\nfi\n\n# Configure OVS with the\
          \ relevant keys for this node. This is required by ovs-monitor-ipsec.\n#\n#\
          \ Updating the certificates does not need to be an atomic operation as\n# the\
          \ will get read and loaded into NSS by the ovs-monitor-ipsec process\n# which\
          \ has not started yet.\novs-vsctl --retry -t 60 set Open_vSwitch . other_config:certificate=$cert_pem\
          \ \\\n                                           other_config:private_key=/etc/openvswitch/keys/ipsec-privkey.pem\
          \ \\\n                                           other_config:ca_cert=/etc/openvswitch/keys/ipsec-cacert.pem\n"
        env:
        - name: K8S_NODE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imagePullPolicy: IfNotPresent
        name: ovn-keys
        resources:
          requests:
            cpu: 10m
            memory: 100Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/ovn/
          name: etc-ovn
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /signer-ca
          name: signer-ca
        - mountPath: /etc/openvswitch
          name: etc-openvswitch
        - mountPath: /etc
          name: host-etc
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
      nodeName: ci-op-9pmd0iim-3eaf1-dcw66-worker-a-d6sw7
      nodeSelector:
        kubernetes.io/os: linux
      preemptionPolicy: PreemptLowerPriority
      priority: 2000001000
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ovn-kubernetes-node
      serviceAccountName: ovn-kubernetes-node
      terminationGracePeriodSeconds: 10
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/ovn-ic/etc
          type: ''
        name: etc-ovn
      - hostPath:
          path: /var/log/openvswitch
          type: DirectoryOrCreate
        name: host-var-log-ovs
      - configMap:
          defaultMode: 420
          name: signer-ca
        name: signer-ca
      - hostPath:
          path: /var/lib/openvswitch/etc
          type: DirectoryOrCreate
        name: etc-openvswitch
      - hostPath:
          path: /var/run/multus/cni/net.d
          type: ''
        name: host-cni-netd
      - hostPath:
          path: /var/run
          type: DirectoryOrCreate
        name: host-var-run
      - hostPath:
          path: /var/lib
          type: DirectoryOrCreate
        name: host-var-lib
      - hostPath:
          path: /etc
          type: Directory
        name: host-etc
      - hostPath:
          path: /usr/sbin/ipsec
          type: File
        name: ipsec-bin
      - hostPath:
          path: /usr/libexec/ipsec
          type: Directory
        name: ipsec-lib
      - name: kube-api-access-7rvbc
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
              - key: ca.crt
                path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
                path: namespace
          - configMap:
              items:
              - key: service-ca.crt
                path: service-ca.crt
              name: openshift-service-ca.crt
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: '2025-02-13T14:54:05Z'
        status: 'False'
        type: PodReadyToStartContainers
      - lastProbeTime: null
        lastTransitionTime: '2025-02-13T14:54:05Z'
        message: 'containers with incomplete status: [ovn-keys]'
        reason: ContainersNotInitialized
        status: 'False'
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: '2025-02-13T14:54:05Z'
        message: 'containers with unready status: [ovn-ipsec ovn-ipsec-cleanup]'
        reason: ContainersNotReady
        status: 'False'
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: '2025-02-13T14:54:05Z'
        message: 'containers with unready status: [ovn-ipsec ovn-ipsec-cleanup]'
        reason: ContainersNotReady
        status: 'False'
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: '2025-02-13T14:54:05Z'
        status: 'True'
        type: PodScheduled
      containerStatuses:
      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imageID: ''
        lastState: {}
        name: ovn-ipsec
        ready: false
        restartCount: 0
        started: false
        state:
          waiting:
            reason: PodInitializing
        volumeMounts:
        - mountPath: /etc/cni/net.d
          name: host-cni-netd
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /var/log/openvswitch/
          name: host-var-log-ovs
        - mountPath: /etc/openvswitch
          name: etc-openvswitch
        - mountPath: /var/lib
          name: host-var-lib
        - mountPath: /etc
          name: host-etc
        - mountPath: /usr/sbin/ipsec
          name: ipsec-bin
        - mountPath: /usr/libexec/ipsec
          name: ipsec-lib
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
          recursiveReadOnly: Disabled
      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imageID: ''
        lastState: {}
        name: ovn-ipsec-cleanup
        ready: false
        restartCount: 0
        started: false
        state:
          waiting:
            reason: PodInitializing
        volumeMounts:
        - mountPath: /etc/ovn/
          name: etc-ovn
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /etc
          name: host-etc
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
          recursiveReadOnly: Disabled
      hostIP: 10.0.128.2
      hostIPs:
      - ip: 10.0.128.2
      initContainerStatuses:
      - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7e262b9ed22e74a3a8d7a345b775645267acfbcd571b510e1ace519cc2f658bf
        imageID: ''
        lastState: {}
        name: ovn-keys
        ready: false
        restartCount: 0
        started: false
        state:
          waiting:
            reason: PodInitializing
        volumeMounts:
        - mountPath: /etc/ovn/
          name: etc-ovn
        - mountPath: /var/run
          name: host-var-run
        - mountPath: /signer-ca
          name: signer-ca
        - mountPath: /etc/openvswitch
          name: etc-openvswitch
        - mountPath: /etc
          name: host-etc
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-7rvbc
          readOnly: true
          recursiveReadOnly: Disabled
      phase: Pending
      podIP: 10.0.128.2
      podIPs:
      - ip: 10.0.128.2
      qosClass: Burstable
      startTime: '2025-02-13T14:54:05Z'




1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp of the networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Please review the following PR: https://github.com/openshift/service-ca-operator/pull/246

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CVO manifests contain some feature-gated ones:

  • since at least 4.16, there are feature-gated ClusterVersion CRDs
  • UpdateStatus API feature is delivered through DevPreview (now) and TechPreview (later) feature set

We observed HyperShift CI jobs failing when adding DevPreview-gated deployment manifests to CVO, which was unexpected. Investigating further, we discovered that HyperShift applies them:

error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}

But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:

$ ls -1 manifests/*clusterversions*crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml

In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Always

Steps to Reproduce:

1. inspect the cluster-version-operator-*-bootstrap.log of a HyperShift CI job

Actual results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

= all four ClusterVersion CRD manifests are applied

Expected results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created

= ClusterVersion CRD manifest is applied just once

Additional info

I'm filing this card so that I can link it to the "easy" fix https://github.com/openshift/hypershift/pull/5093, which is not a perfect fix but allows us to add feature-set-gated manifests to CVO without breaking HyperShift. It would be desirable to go further and actually select the correct manifests to apply during CVO bootstrap, but that involves non-trivial logic similar to the one used by CVO itself, and it seems better approached as a properly assessed and implemented feature rather than a bugfix, so I'll file a separate HOSTEDCP card for that.
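
As a rough, filename-based illustration only of what bootstrap-time selection could look like (the real selection logic in CVO is more involved, as noted above; FEATURE_SET is a placeholder):

FEATURE_SET=Default
for f in manifests/0000_00_cluster-version-operator_01_clusterversions-*.crd.yaml; do
  # keep only the manifest whose filename matches the active feature set
  if [[ "$f" == *"-${FEATURE_SET}.crd.yaml" ]]; then
    echo "would apply: $f"
  fi
done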

Description of problem:

Install a cluster into an existing resource group.

After the bootstrap server is destroyed, the inboundNatRule ssh_in in the external load balancer is not deleted, and the ssh NSG rule is also left over in the NSG.

$ az network lb list -g ci-op-vq47c2zq-11f79-rg -otable
Location    Name                                 ProvisioningState    ResourceGroup            ResourceGuid
----------  -----------------------------------  -------------------  -----------------------  ------------------------------------
centralus   ci-op-vq47c2zq-11f79-xhl4q           Succeeded            ci-op-vq47c2zq-11f79-rg  282960e6-014e-4abe-8f61-2782cd82ca82
centralus   ci-op-vq47c2zq-11f79-xhl4q-internal  Succeeded            ci-op-vq47c2zq-11f79-rg  0e3afbf2-f2b2-4f59-8771-ccef9457fd90

$ az network lb inbound-nat-rule list --lb-name ci-op-vq47c2zq-11f79-xhl4q -g ci-op-vq47c2zq-11f79-rg -otable
BackendPort    EnableFloatingIP    EnableTcpReset    FrontendPort    IdleTimeoutInMinutes    Name                               Protocol    ProvisioningState    ResourceGroup
-------------  ------------------  ----------------  --------------  ----------------------  ---------------------------------  ----------  -------------------  -----------------------
22             False               False             22              4                       ci-op-vq47c2zq-11f79-xhl4q_ssh_in  Tcp         Succeeded            ci-op-vq47c2zq-11f79-rg
    

$ az network nsg rule list --nsg-name ci-op-vq47c2zq-11f79-xhl4q-nsg -g ci-op-vq47c2zq-11f79-rg -otable
Name                                                      ResourceGroup            Priority    SourcePortRanges    SourceAddressPrefixes    SourceASG    Access    Protocol    Direction    DestinationPortRanges    DestinationAddressPrefixes    DestinationASG
--------------------------------------------------------  -----------------------  ----------  ------------------  -----------------------  -----------  --------  ----------  -----------  -----------------------  ----------------------------  ----------------
apiserver_in                                              ci-op-vq47c2zq-11f79-rg  101         *                   *                        None         Allow     Tcp         Inbound      6443                     *                             None
ci-op-vq47c2zq-11f79-xhl4q_ssh_in                         ci-op-vq47c2zq-11f79-rg  220         *                   *                        None         Allow     Tcp         Inbound      22                       *                             None
k8s-azure-lb_allow_IPv4_556f7044ec033071ec0dfcf7cd85bc93  ci-op-vq47c2zq-11f79-rg  500         *                   Internet                 None         Allow     Tcp         Inbound      443 80                   48.214.241.65                 None

Version-Release number of selected component (if applicable):

    4.18 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Specify platform.azure.resourceGroupName to pre-created resource group name in install-config
    2. Install cluster
    3.
    

Actual results:

    The inboundNatRule in the external load balancer and the ssh NSG rule in the NSG are left over after the bootstrap server is deleted.

Expected results:

    All resources associated with bootstrap should be removed after bootstrap server is destroyed. 

Additional info:

   It looks like the resource group name is hard-coded as "<infra-id>-rg" in the post-destroy step, see code: https://github.com/openshift/installer/blob/master/pkg/infrastructure/azure/azure.go#L717
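
Until that is fixed, a hedged manual cleanup sketch using the resource names from the az output above (verify the names in your own resource group before deleting):

az network lb inbound-nat-rule delete --lb-name ci-op-vq47c2zq-11f79-xhl4q -g ci-op-vq47c2zq-11f79-rg -n ci-op-vq47c2zq-11f79-xhl4q_ssh_in
az network nsg rule delete --nsg-name ci-op-vq47c2zq-11f79-xhl4q-nsg -g ci-op-vq47c2zq-11f79-rg -n ci-op-vq47c2zq-11f79-xhl4q_ssh_in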

Description of problem:

    This is the testing scenario of QE test case OCP-24405: after a successful IPI installation, add an additional compute/worker node whose name does not have the infra_id as a prefix. The expectation is that "destroy cluster" deletes the additional compute/worker machine smoothly, but the test result is that "destroy cluster" seems unaware of the machine.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-12-17-192034

How reproducible:

    Always

Steps to Reproduce:

1. install an IPI cluster on GCP and make sure it succeeds (see [1])
2. add the additional compute/worker node, and ensure the node's name doesn't have the cluster infra ID (see [2])
3. wait for the node ready and all cluster operators available
4. (optional) scale ingress operator replica into 3 (see [3]), and wait for ingress operator finishing progressing
5. check the new machine on GCP (see [4])
6. "destroy cluster" (see [5])     

Actual results:

    The additional compute/worker node is not deleted, which also seems to leave the k8s firewall-rules / forwarding-rule / target-pool / http-health-check undeleted.

Expected results:

    "destroy cluster" should be able to detect the additional compute/worker node by the label "kubernetes-io-cluster-<infra id>: owned" and delete it along with all resources of the cluster.

Additional info:

    Alternatively, we also tested with creating the additional compute/worker machine by a machineset YAML (rather than a machine YAML), and we got the same issue in such case. 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

If a user updates a deployment config using Form view instead of yaml, the image pull secret is getting duplicated.
~~~
$ oc get pods ubi9-2-deploy 
        message: |
          error: couldn't assign source annotation to deployment ubi9-2: failed to create manager for existing fields: failed to convert new object (testdc-dup-sec/ubi9-2; /v1, Kind=ReplicationController) to smd typed: .spec.template.spec.imagePullSecrets: duplicate entries for key [name="test-pull-secret"]
        reason: Error
~~~
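
A hedged workaround sketch (the DC name and namespace are placeholders taken from the error above) that collapses the duplicated imagePullSecrets entries before the next rollout:

oc get dc ubi9 -n testdc-dup-sec -o json \
  | jq '.spec.template.spec.imagePullSecrets |= unique_by(.name)' \
  | oc replace -f -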

Version-Release number of selected component (if applicable):

    4.13.z,4.14.z,4.15.z

How reproducible:

    

Steps to Reproduce:

    1. Edit DeploymentConfig in Form view
    2. Update image version
    3. Save     

Actual results:

    

Expected results:

    

Additional info:

    Issue is not reproducible on OCP 4.16.7+ version.

Description of problem:

Status card has some styling issues on some browsers    

Version-Release number of selected component (if applicable):

For example
Firefox ESR 128.7.0esr (64-bit)
Firefox 135.0 (aarch64)
Safari 18.1.1 (20619.2.8.11.12)    

How reproducible:

Always    

Steps to Reproduce:

    1. navigate to Home -> Overview page, observe Status card format
    

Actual results:

issue on Safari https://drive.google.com/file/d/1lfHgS-B_bWGN4YnerDPrawN13ujTGKCt/view?usp=drive_link 
issue on Firefox and Firefox ESR   https://drive.google.com/file/d/1dP-ZZ-11EIdcquoZ3_XSKaYLo__O5_ge/view?usp=drive_link

Expected results:

 consistent format across all browsers   

Additional info:

    

OSD-26887: managed services taints several nodes as infrastructure. This taint appears to be applied after some of the platform DS are scheduled there, causing this alert to fire.  Managed services rebalances the DS after the taint is added, and the alert clears, but origin fails this test. Allowing this alert to fire while we investigate why the taint is not added at node birth.

Description of problem:

    The --report and --pxe flags were introduced in 4.18. They should be marked as experimental until 4.19.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:


It would be good to fail the build if the rt RPM does not match the kernel.
Since the 9.4+ based releases, rt comes from the same package as the kernel; with this change, ART's consistency check lost this ability.

This bug is to "shift left" that test and have the build fail at build time.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

    When deploying the disconnected scenario in PowerVS (a network with SNAT disabled), components that rely on NTP fail. IBM Cloud has an NTP server that we can use internally, so we need to point to it through chrony.conf.
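
A minimal chrony.conf sketch for this; the server hostname is an assumption and should be replaced with whichever NTP endpoint is reachable from the PowerVS internal network (the file would be delivered via a MachineConfig on the affected nodes):

# /etc/chrony.conf - point at IBM Cloud's internal NTP server (hostname is an assumption)
server time.adn.networklayer.com iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync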

Version-Release number of selected component (if applicable):

    

How reproducible:

    As easy as deploying a disconnected cluster

Steps to Reproduce:

    1. Deploy a disconnected cluster
    2. image-registry will fail due to NTP mismatch
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When mirroring the OCP payload by digest, oc-mirror fails with the error:
invalid destination name docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release/openshift/release-images:: invalid reference format

Version-Release number of selected component (if applicable):

./oc-mirror.rhel8  version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410251041.p0.g95f0611.assembly.stream.el9-95f0611", GitCommit:"95f0611c1dc9584a4a9e857912b9eaa539234bbc", GitTreeState:"clean", BuildDate:"2024-10-25T11:28:19Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

1. imagesetconfig with digest for ocp payload :
cat config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    release: registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4

2. run the mirror2mirror command :
./oc-mirror.rhel8.18 -c config.yaml docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release --dest-tls-verify=false --v2 --workspace file://out1  --authfile auth.json

 

Actual results:

Hit the error:

 ✗   188/188 : (2s) registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4 
2024/10/31 06:20:03  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/10/31 06:20:03  [ERROR]  : invalid destination name docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release/openshift/release-images:: invalid reference format

Expected results:

No error

Additional info:

Compared with the 4.17 oc-mirror, there is no such issue:

./oc-mirror -c config.yaml docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release --dest-tls-verify=false --v2 --workspace file://out1 --authfile auth.json

2024/10/31 06:23:04  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/10/31 06:23:04  [INFO]   : 👋 Hello, welcome to oc-mirror
...
 ✓   188/188 : (1s) registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4 
2024/10/31 06:27:58  [INFO]   : === Results ===
2024/10/31 06:27:58  [INFO]   : ✅ 188 / 188 release images mirrored successfully
2024/10/31 06:27:58  [INFO]   : 📄 Generating IDMS file...
2024/10/31 06:27:58  [INFO]   : out1/working-dir/cluster-resources/idms-oc-mirror.yaml file created
2024/10/31 06:27:58  [INFO]   : 📄 No images by tag were mirrored. Skipping ITMS generation.
2024/10/31 06:27:58  [INFO]   : 📄 No catalogs mirrored. Skipping CatalogSource file generation.
2024/10/31 06:27:58  [INFO]   : mirror time     : 4m54.452548695s
2024/10/31 06:27:58  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
  

 

golang.org/x/net is a CVE-prone dependency, and even if we are not actually exposed to some issues, carrying an old dep exposes us to version-based vulnerability scanners.

When building a container image using Dockerfile.dev, the resulting image does not include the necessary font files provided by PatternFly (e.g., RedHatText).  As a result, the console renders with a system fallback.  The root cause of this issue is an overly broad ignore introduced with https://github.com/openshift/console/pull/12538.

Description of problem:


The current `oc adm inspect --all-namespaces` command line results in something like this:

oc adm inspect --dest-dir must-gather --rotated-pod-logs csistoragecapacities ns/assisted-installer leases --all-namespaces

This is wrong for two reasons:
- `ns/assisted-installer` is included even though a namespace is not namespaced, so it should go into the `named_resources` variable (this happens only in 4.16+)
- The rest of the items in the `all_ns_resources` variable are group resources, but they are not separated by `,` as we do with `group_resources` (this happens on 4.14+)

As a result, we never collect what is intended with this command.
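
For illustration only, the intended invocation would presumably look more like the following, splitting the namespace into its own named-resource call and comma-separating the group resources:

oc adm inspect --dest-dir must-gather --rotated-pod-logs ns/assisted-installer
oc adm inspect --dest-dir must-gather --rotated-pod-logs csistoragecapacities,leases --all-namespaces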

Version-Release number of selected component (if applicable):


Any 4.14+ version

How reproducible:

Always

Steps to Reproduce:

    1. Get a must-gather
    2.
    3.

Actual results:

Data from "oc adm inspect --all-namespaces" missing

Expected results:

No data missing

Additional info:


Description of problem:

    Layout is incorrect for the Service weight field on the Create Route page.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-05-103644

How reproducible:

Always    

Steps to Reproduce:

    1. Navigate to ‘Create Route’ page, eg: /k8s/ns/default/route.openshift.io~v1~Route/~new/form
    2. Check the field of 'Service weight'
    3.
    

Actual results:

    the input field for 'Service weight' is too long 

Expected results:

    Compared to a similar component in OpenShift, the input field should be shorter

Additional info:

    

Description of problem:

Observed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn/1866088107347021824/artifacts/e2e-gcp-ovn/ipi-install-install/artifacts/.openshift_install-1733747884.log

Distinct issues occurring in this job caused the "etcd bootstrap member to be removed from cluster" gate to take longer than its 5 minute timeout, but there was plenty of time left to complete bootstrapping successfully. It doesn't make sense to have a narrow timeout here because progress toward removal of the etcd bootstrap member begins the moment the etcd cluster starts for the first time, not when the installer starts waiting to observe it.

Version-Release number of selected component (if applicable):

4.19.0

How reproducible:

Sometimes

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When running oc-mirror V2 (both 4.16 and 4.17 have been tested) on a RHEL 9 system with FIPS enabled and a STIG security profile enforced, oc-mirror fails due to a hard-coded PGP key in oc-mirror V2.

Version-Release number of selected component (if applicable):

    At least 4.16-4.17

How reproducible:

    Very reproducible

Steps to Reproduce:

  1. Install latest oc-mirror
  2. Create a cluster-images.yml file
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.18
      maxVersion: 4.16.24
      shortestPath: true
    

3. run oc-mirror with the following flags:

[cnovak@localhost ocp4-disconnected-config]$ /pods/content/bin/oc-mirror --config /pods/content/images/cluster-images.yml file:///pods/content/images/cluster-images --v2

2024/12/18 14:40:01  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/12/18 14:40:01  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/12/18 14:40:01  [INFO]   : ⚙️  setting up the environment for you...
2024/12/18 14:40:01  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/12/18 14:40:01  [INFO]   : 🕵️  going to discover the necessary images...
2024/12/18 14:40:01  [INFO]   : 🔍 collecting release images...
2024/12/18 14:40:02  [ERROR]  : openpgp: invalid data: user ID self-signature invalid: openpgp: invalid signature: RSA verification failure
2024/12/18 14:40:02  [ERROR]  : generate release signatures: error list invalid signature for 3f14e29f5b42e1fee7d7e49482cfff4df0e63363bb4a5e782b65c66aba4944e7 image quay.io/openshift-release-dev/ocp-release@sha256:3f14e29f5b42e1fee7d7e49482cfff4df0e63363bb4a5e782b65c66aba4944e7 
2024/12/18 14:40:02  [INFO]   : 🔍 collecting operator images...
2024/12/18 14:40:02  [INFO]   : 🔍 collecting additional images...
2024/12/18 14:40:02  [INFO]   : 🚀 Start copying the images...
2024/12/18 14:40:02  [INFO]   : images to copy 0 
2024/12/18 14:40:02  [INFO]   : === Results ===
2024/12/18 14:40:02  [INFO]   : 📦 Preparing the tarball archive...
2024/12/18 14:40:02  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/12/18 14:40:02  [ERROR]  : unable to add cache repositories to the archive : lstat /home/cnovak/.oc-mirror/.cache/docker/registry/v2/repositories: no such file or directory 

Expected results/immediate workaround:

[cnovak@localhost ~]$ curl -s https://raw.githubusercontent.com/openshift/cluster-update-keys/d44fca585d081a72cb2c67734556a27bbfc9470e/manifests.rhel/0000_90_cluster-update-keys_configmap.yaml | sed -n '/openshift[.]io/d;s/Comment:.*//;s/^    //p' > /tmp/pgpkey
[cnovak@localhost ~]$ export OCP_SIGNATURE_VERIFICATION_PK=/tmp/pgpkey
[cnovak@localhost ~]$ /pods/content/bin/oc-mirror --config /pods/content/images/cluster-images.yml file:///pods/content/images/cluster-images --v22024/12/19 08:54:42  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/12/19 08:54:42  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/12/19 08:54:42  [INFO]   : ⚙️  setting up the environment for you...
2024/12/19 08:54:42  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/12/19 08:54:42  [INFO]   : 🕵️  going to discover the necessary images...
2024/12/19 08:54:42  [INFO]   : 🔍 collecting release images...
2024/12/19 08:54:42  [INFO]   : 🔍 collecting operator images...
2024/12/19 08:54:42  [INFO]   : 🔍 collecting additional images...
2024/12/19 08:54:42  [INFO]   : 🚀 Start copying the images...
2024/12/19 08:54:42  [INFO]   : images to copy 382 
 ⠸   1/382 : (7s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32f80a2ee0f52e0c07a6790171be70a1b92010d8d395e9e14b4ee5f268e384bb 
 ✓   2/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a61b758c659f93e64d4c13a7bbc6151fe8191c2421036d23aa937c44cd478ace 
 ✓   3/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29ba4e3ff278741addfa3c670ea9cc0de61f7e6265ebc1872391f5b3d58427d0 
 ✓   4/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2809165826b9094873f2bc299a28980f92d7654adb857b73463255eac9265fd8 
 ⠋   1/382 : (19s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32f80a2ee0f52e0c07a6790171be70a1b92010d8d395e9e14b4ee5f268e384bb 
 ✓   2/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a61b758c659f93e64d4c13a7bbc6151fe8191c2421036d23aa937c44cd478ace 
 ✓   3/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29ba4e3ff278741addfa3c670ea9cc0de61f7e6265ebc1872391f5b3d58427d0 
 ✓   4/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2809165826b9094873f2bc299a28980f92d7654adb857b73463255eac9265fd8 
 ✓   5/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e54fc21197c341fe257d2f2f2ad14b578483c4450474dc2cf876a885f11e745 
 ✓   6/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c934b4d95545e29f9cb7586964fd43cdb7b8533619961aaa932fe2923ab40db 
 ✓   7/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:851ba9ac5219a9f11e927200715e666ae515590cd9cc6dde9631070afb66b5d7 
 ✓   8/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f614ef855220f2381217c31b8cb94c05ef20edf3ca23b5efa0be1b957cdde3a4 

Additional info:

The reason this is a critical issue is that Red Hat has a relatively large footprint within the DoD / U.S. Government space, and anyone working in a disconnected environment with a STIG policy enforced on a RHEL 9 machine will run into this problem.


Additionally, below is output from oc-mirror version



[cnovak@localhost ~]$ oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202411251634.p0.g07714b7.assembly.stream.el9-07714b7", GitCommit:"07714b7c836ec3ad1b776f25b44c3b2c2f083aa2", GitTreeState:"clean", BuildDate:"2024-11-26T08:28:42Z", GoVersion:"go1.22.9 (Red Hat 1.22.9-2.el9_5) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

Description of problem:

When the Cluster Settings tab is opened on the console, the below error is displayed:

Oh no! Something went wrong
Type Error
Description: Cannot read properties of null (reading 'major')

Version-Release number of selected component (if applicable):

    OCP Version 4.14.34

How reproducible:

    

Steps to Reproduce:

    1. Go on console.
    2. Go to Cluster Settings.
    
    

Actual results:

Oh no! Something went wrong 

Expected results:

Cluster settings should be visible.

Additional info:

    

Description of problem:

When the master MCP is paused, the below alert is triggered:
Failed to resync 4.12.35 because: Required MachineConfigPool 'master' is paused

The nodes have been rebooted to make sure there is no pending MC rollout.

Affects version

  4.12
    

How reproducible:

Steps to Reproduce:

    1. Create a MC and apply it to master
    2. use below mc
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-cgroupsv2
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    3.Wait until the nodes are rebooted and running
    4. pause the mcp

Actual results:

Pausing the MCP causes the alert to fire.

Expected results:


Alerts should not be fired

Additional info:

    

Description of problem: I'm trying to import this function https://github.com/pierDipi/node-func-logger using the import function UI flow

Version-Release number of selected component (if applicable): 4.14 OCP and Serverless (current development branch)

How reproducible: Always

Steps to Reproduce:
1. Import this function https://github.com/pierDipi/node-func-logger using the import function UI flow
2. Click create

Actual results: An error occurred Cannot read properties of undefined (reading 'filter')

UI error image: https://drive.google.com/file/d/1GrhX2LUNSzvVuhUmeFYEeZwZ1X58LBAB/view?usp=drive_link

Expected results: No errors

Additional info: As noted above, I'm using the Serverless development branch. I'm not sure if it's reproducible with a released Serverless version; either way, we would need to fix it.

Description of problem:

    With user-defined tags, "create cluster" sometimes panics.

Version-Release number of selected component (if applicable):

    4.18.0-rc.9 for example

How reproducible:

    Sometimes (Easy to reproduce in PROW CI, at least today)

Steps to Reproduce:

1. "create install-config", and then insert interested settings (see [1])
2. activate the IAM service account which has the required permissions
3. (optional)"create manifests"
4. "create cluster"

Actual results:

    Sometimes "create manifests" or "create cluster" got panic (see [2]). 

Expected results:

    The installation should either succeed, or tell clear error messages. In any case, there should be no panic. 

Additional info:

    The panic looks to be caused by either a PROW system flake or a GCP flake, for the reasons below:
(1) We tried manual installations locally, of 4.18.0-0.nightly-multi-2025-02-17-042334 and 4.17.0-0.nightly-multi-2025-02-15-095503, and both succeeded.
(2) As for PROW CI tests, both with 4.18.0-rc.9, the Feb. 14 installation succeeded, but today's installation hit the panic issue (see [3]).

FYI the PROW CI debug PR: https://github.com/openshift/release/pull/61698

Description of problem:

The console shows 'View release notes' in several places, but the current link only points to the Y-stream main release notes.

Version-Release number of selected component (if applicable):

4.17.2    

How reproducible:

Always    

Steps to Reproduce:

1. set up 4.17.2 cluster
2. Navigate to the Cluster Settings page and check the 'View release notes' link in the 'Update history' table

Actual results:

The link only points the user to the Y-release main release notes.

Expected results:

The link should point to the release notes of the specific version. The correct link should be:
https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes
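For illustration only, a minimal Go sketch of building the per-version link from a parsed version; the console is TypeScript, and the function name here is hypothetical, not the console's actual code:

package main

import "fmt"

// releaseNotesURL builds the per-version release notes link described above.
// Illustrative sketch, not the console implementation.
func releaseNotesURL(major, minor, patch int) string {
	return fmt.Sprintf(
		"https://access.redhat.com/documentation/en-us/openshift_container_platform/%d.%d/html/release_notes/ocp-%d-%d-release-notes#ocp-%d-%d-%d_release_notes",
		major, minor, major, minor, major, minor, patch)
}

func main() {
	// For 4.17.2 this yields the anchor ...#ocp-4-17-2_release_notes.
	fmt.Println(releaseNotesURL(4, 17, 2))
}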

Additional info:

    

Description of problem:

    The resource-controller endpoint override is not honored in all parts of the machine API provider for Power VS.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The temp folder for extracting the ISO is not cleared in some cases.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Run the installer 

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    $ ls -ld /tmp/agent*
    drwx------. 6 root root 4096 Sep  3 18:25 /tmp/agent1486745391
    drwx------. 6 root root 4096 Sep  3 19:11 /tmp/agent2382395525
    drwx------. 6 root root 4096 Sep  3 16:21 /tmp/agent2495758451
    drwx------. 6 root root 4096 Sep  3 18:44 /tmp/agent2810534235
    drwx------. 6 root root 4096 Sep  3 16:41 /tmp/agent2862979295
    drwx------. 6 root root 4096 Sep  3 17:31 /tmp/agent2935357941
    drwx------. 6 root root 4096 Sep  3 17:00 /tmp/agent2952470601
    drwx------. 6 root root 4096 Sep  3 17:12 /tmp/agent4019363474
    drwx------. 6 root root 4096 Sep  3 19:03 /tmp/agent589005812

Expected results:

The temp folder should be removed once the ISO is created.
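A minimal sketch of the expected cleanup pattern, assuming the installer creates the working directory with os.MkdirTemp; the function and variable names are illustrative, not the installer's actual code:

package main

import (
	"log"
	"os"
)

func buildISO() error {
	// Create the scratch directory used while extracting/assembling the ISO.
	tmpDir, err := os.MkdirTemp("", "agent")
	if err != nil {
		return err
	}
	// Ensure the directory is removed on every return path, including errors.
	defer os.RemoveAll(tmpDir)

	// ... extract the base ISO and assemble the agent ISO under tmpDir ...
	return nil
}

func main() {
	if err := buildISO(); err != nil {
		log.Fatal(err)
	}
}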

Additional info:

 

 

Description of problem:

    When a Multi-NetworkPolicy is defined with a protocol but no port, we expect it to match all ports for that protocol (as per the documentation). Instead, the policy is not applied and all traffic is allowed.

Error Message in multus-networkpolicy logs:
E1127 12:12:22.098844       1 server.go:661] sync rules failed for pod [policy-ns1/pod1]: exit status 2: iptables-restore v1.8.10 (nf_tables): invalid port/service `<nil>' specified
Error occurred at line: 30

https://docs.openshift.com/container-platform/4.17/rest_api/network_apis/multinetworkpolicy-k8s-cni-cncf-io-v1beta1.html#spec-egress-ports-2

Version-Release number of selected component (if applicable):

    4.18.ec2 

How reproducible:

    --> Apply the policy below. The ports array should have only protocol defined, not port.

apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  annotations:
    k8s.v1.cni.cncf.io/policy-for: policy-test-ns1/bond-nad,policy-test-ns2/bond-nad
  name: egress-port
  namespace: policy-test-ns1
spec:
  podSelector:
    matchLabels:
      app: pod1
  policyTypes:
  - Egress
  egress:
  - ports:
     - protocol: TCP

Steps to Reproduce:

    1. Create SRIOV VFs, bond NAD and create pods that attach to bond NAD
    2. Apply MultiNetworkPolicy as mentioned above.
    3. Test egress traffic. 
    

Actual results:

    Egress works as if no policy is applied. 

Expected results:

    Egress should be allowed only for the TCP protocol, to all ports. A hedged sketch of the expected rule generation follows below.
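The iptables-restore error above suggests the current code emits "--dport <nil>" when the port is unset. For illustration, a hedged Go sketch of rule generation that skips --dport when only the protocol is given; this is not the multus-networkpolicy implementation, just the general idea:

package main

import (
	"fmt"
	"strings"
)

// egressRuleArgs builds iptables arguments for one policy port entry.
// When only the protocol is set, no --dport is emitted, so the rule
// matches all ports for that protocol.
func egressRuleArgs(protocol string, port *int) string {
	args := []string{"-p", strings.ToLower(protocol)}
	if port != nil {
		args = append(args, "--dport", fmt.Sprintf("%d", *port))
	}
	args = append(args, "-j", "ACCEPT")
	return strings.Join(args, " ")
}

func main() {
	fmt.Println(egressRuleArgs("TCP", nil)) // -p tcp -j ACCEPT
}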

Additional info:

    Must gather : https://drive.google.com/drive/folders/1Le1PdIGiOt965Hqr2xTUXyeDAUGhYYiN?usp=sharing

Description of problem:

When we enable OCL in the master and the worker pool at the same time, it may happen that one of the MOSB resources is not updated with the osImage value.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2025-01-21-015441
    

How reproducible:

Intermittent
    

Steps to Reproduce:

    1. Enable OCL at the same time in worker and master pool (or with a couple of seconds between creating the MOSC resources)
    
    

Actual results:

One of the MOSB resources may not be updated with the osImage and the image will never be applied.
    

Expected results:

The image should be built and applied without problems
    

Additional info:

More information in this slack conversation: https://redhat-internal.slack.com/archives/GH7G2MANS/p1737652042188709

This scenario may look unlikely, but we need to take into account that, when we upgrade clusters with OCL enabled in the worker and master pools, both pools will always start updating at the same time.

    

Description of problem:

When I use the UI to check the status of a primary UDN, it does not correctly report the value of the lifecycle attribute.

Version-Release number of selected component (if applicable):

4.99

How reproducible:

Always

Steps to Reproduce:

1. create a project
2. create a UDN in said project using the CLI
3. in the UI, access the newly created UDN

Actual results:

The UDN ipam.lifecycle attribute will *not* be properly presented to the user - it'll say "Not available".

Expected results:

The UDN ipam.lifecycle attribute is presented to the user with the correct value.

Additional info:

The API changed recently: the attribute used to be at UDN.spec.layer2.ipamLifecycle and is now at UDN.spec.layer2.ipam.lifecycle.

 

In payloads 4.18.0-0.ci-2024-11-01-110334 and 4.18.0-0.nightly-2024-11-01-101707 we observed GCP install failures

 Container test exited with code 3, reason Error
---
ails:
level=error msg=[
level=error msg=  {
level=error msg=    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
level=error msg=    "domain": "googleapis.com",
level=error msg=    "metadatas": {
level=error msg=      "consumer": "projects/711936183532",
level=error msg=      "quota_limit": "ListRequestsFilterCostOverheadPerMinutePerProject",
level=error msg=      "quota_limit_value": "75",
level=error msg=      "quota_location": "global",
level=error msg=      "quota_metric": "compute.googleapis.com/filtered_list_cost_overhead",
level=error msg=      "service": "compute.googleapis.com"
level=error msg=    },
level=error msg=    "reason": "RATE_LIMIT_EXCEEDED"
level=error msg=  },
level=error msg=  {
level=error msg=    "@type": "type.googleapis.com/google.rpc.Help",
level=error msg=    "links": [
level=error msg=      {
level=error msg=        "description": "The request exceeds API Quota limit, please see help link for suggestions.",
level=error msg=        "url": "https://cloud.google.com/compute/docs/api/best-practices#client-side-filter"
level=error msg=      }
level=error msg=    ]
level=error msg=  }
level=error msg=]
level=error msg=, rateLimitExceeded 

Patrick Dillon noted that ListRequestsFilterCostOverheadPerMinutePerProject cannot have its quota limit increased.

The problem subsided over the weekend, presumably with fewer jobs running, but has started to appear again. Opening this to track the ongoing issue and potential workarounds.

This contributes to the following test failures for GCP

install should succeed: configuration
install should succeed: overall

The following test is failing with the updated 1.32 Kubernetes in OCP 4.19:

[It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod

Since the VolumeGroupSnapshot feature is disabled by default but will become GA in 4.19, the test has been disabled temporarily to unblock the Kubernetes 1.32 rebase.

This ticket tracks the re-enabling of the test once the feature is GA and enabled by default.

Description of problem:

    Home - Search : 'Label' is in English

Version-Release number of selected component (if applicable):

    4.18.0-rc.6

How reproducible:

    always

Steps to Reproduce:

    1. Change the OCP web console UI to a non en_US locale
    2. Navigate to Home - Search
    3. The 'Label' drop-down menu name is in English.
    

Actual results:

    Content is in English.

Expected results:

    Content should be in selected language.

Additional info:

    Reference screenshot added

Description of problem:

    Currently both the nodepool controller and capi controller set the updatingConfig condition on nodepool upgrades. We should only use one to set the condition to avoid constant switching between conditions and to ensure the logic used for setting this condition is the same.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    CAPI and Nodepool controller set a different status because their logic is not consistent.

Expected results:

    CAPI and NodePool controllers set the same status because their logic is consolidated.

Additional info:

    

Description of problem:

When we upgrade an OCL cluster from 4.18 -> 4.18, and we configure a machineset so that its base cloud image is automatically updated in the upgrade, the machine-config CO may become degraded with this message:

  - lastTransitionTime: "2025-01-24T19:58:23Z"
    message: 'Unable to apply 4.18.0-0.nightly-2025-01-24-014549: bootimage update
      failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      | Error(s): error syncing MAPI MachineSet cloned-tc-70813-label: timed out waiting
      for coreos-bootimages config map: mismatch between MCO hash version stored in
      configmap and current MCO version; sync will exit to wait for the MCO upgrade
      to complete'
    reason: MachineConfigurationFailed
    status: "True"
    type: Degraded

Link to the prow execution: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-4.18-upgrade-from-stable-4.18-aws-ipi-ocl-fips-f60/1882788481910968320


    

Version-Release number of selected component (if applicable):

Upgrading from 4.18.0-rc.6 to 4.18.0-0.nightly-2025-01-24-014549
    

How reproducible:

Rarely
    

Steps to Reproduce:

    1. Install 4.18.0-rc.6
    2. Clone an existing machineset
    3. Configure the new machineset so that its base cloud image is updated automatically in the upgrade. Use a label configuration, so that only this machineset is updated.
    4. Upgrade to 4.18.0-0.nightly-2025-01-24-014549
    

Actual results:


The machine-config CO is degraded with this message:

  - lastTransitionTime: "2025-01-24T19:58:23Z"
    message: 'Unable to apply 4.18.0-0.nightly-2025-01-24-014549: bootimage update
      failed: 1 Degraded MAPI MachineSets | 0 Degraded CAPI MachineSets | 0 CAPI MachineDeployments
      | Error(s): error syncing MAPI MachineSet cloned-tc-70813-label: timed out waiting
      for coreos-bootimages config map: mismatch between MCO hash version stored in
      configmap and current MCO version; sync will exit to wait for the MCO upgrade
      to complete'
    reason: MachineConfigurationFailed
    status: "True"
    type: Degraded
    

Expected results:


    

Additional info:

It looks like the coreos-bootimages configmap was never updated with the new MCOVersionHash.

It may not be related to OCL at all.

    

Description of problem:

No zone for master machines    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-15-153405

How reproducible:

Always

Steps to Reproduce:

    1. Install an azure cluster
    2. Run "oc get machine"
    3.
    

Actual results:

No zone info for master machine
$ oc get machine     
NAME                                       PHASE     TYPE              REGION   ZONE   AGE
yingwang-0816-tvqdc-master-0               Running   Standard_D8s_v3   eastus          104m
yingwang-0816-tvqdc-master-1               Running   Standard_D8s_v3   eastus          104m
yingwang-0816-tvqdc-master-2               Running   Standard_D8s_v3   eastus          104m
yingwang-0816-tvqdc-worker-eastus1-54ckq   Running   Standard_D4s_v3   eastus   1      96m
yingwang-0816-tvqdc-worker-eastus2-dwr2j   Running   Standard_D4s_v3   eastus   2      96m
yingwang-0816-tvqdc-worker-eastus3-7wchl   Running   Standard_D4s_v3   eastus   3      96m
$ oc get machine --show-labels  
NAME                                       PHASE     TYPE              REGION   ZONE   AGE    LABELS
yingwang-0816-tvqdc-master-0               Running   Standard_D8s_v3   eastus          104m   machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus
yingwang-0816-tvqdc-master-1               Running   Standard_D8s_v3   eastus          104m   machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus
yingwang-0816-tvqdc-master-2               Running   Standard_D8s_v3   eastus          104m   machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus
yingwang-0816-tvqdc-worker-eastus1-54ckq   Running   Standard_D4s_v3   eastus   1      96m    machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus1,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=1
yingwang-0816-tvqdc-worker-eastus2-dwr2j   Running   Standard_D4s_v3   eastus   2      96m    machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus2,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=2
yingwang-0816-tvqdc-worker-eastus3-7wchl   Running   Standard_D4s_v3   eastus   3      96m    machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus3,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=3  

Expected results:

Zone info should be shown when running "oc get machine".

Additional info:

    

Description of problem:

    When using PublicIPv4Pool, CAPA will try to allocate IP addresses from the supplied pool, which requires the `ec2:AllocateAddress` permission.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Minimal permissions and publicIpv4Pool set
    2.
    3.
    

Actual results:

    time="2024-11-21T05:39:49Z" level=debug msg="E1121 05:39:49.352606     327 awscluster_controller.go:279] \"failed to reconcile load balancer\" err=<"
time="2024-11-21T05:39:49Z" level=debug msg="\tfailed to allocate addresses to load balancer: failed to allocate address from Public IPv4 Pool \"ipv4pool-ec2-0768267342e327ea9\" to role lb-apiserver: failed to allocate Elastic IP for \"lb-apiserver\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-2cr41ill-663fd-minimal-perm is not authorized to perform: ec2:AllocateAddress on resource: arn:aws:ec2:us-east-1:460538899914:ipv4pool-ec2/ipv4pool-ec2-0768267342e327ea9 because no identity-based policy allows the ec2:AllocateAddress action. Encoded authorization failure message: Iy1gCtvfPxZ2uqo-SHei1yJQvNwaOBl5F_8BnfeEYCLMczeDJDdS4fZ_AesPLdEQgK7ahuOffqIr--PWphjOUbL2BXKZSBFhn3iN9tZrDCnQQPKZxf9WaQmSkoGNWKNUGn6rvEZS5KvlHV5vf5mCz5Bk2lk3w-O6bfHK0q_dphLpJjU-sTGvB6bWAinukxSYZ3xbirOzxfkRfCFdr7nDfX8G4uD4ncA7_D-XriDvaIyvevWSnus5AI5RIlrCuFGsr1_3yEvrC_AsLENZHyE13fA83F5-Abpm6-jwKQ5vvK1WuD3sqpT5gfTxccEqkqqZycQl6nsxSDP2vDqFyFGKLAmPne8RBRbEV-TOdDJphaJtesf6mMPtyMquBKI769GW9zTYE7nQzSYUoiBOafxz6K1FiYFoc1y6v6YoosxT8bcSFT3gWZWNh2upRJtagRI_9IRyj7MpyiXJfcqQXZzXkAfqV4nsJP8wRXS2vWvtjOm0i7C82P0ys3RVkQVcSByTW6yFyxh8Scoy0HA4hTYKFrCAWA1N0SROJsS1sbfctpykdCntmp9M_gd7YkSN882Fy5FanA"
time="2024-11-21T05:39:49Z" level=debug msg="\t\tstatus code: 403, request id: 27752e3c-596e-43f7-8044-72246dbca486"

Expected results:

    

Additional info:

Seems to happen consistently with shared-vpc-edge-zones CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-edge-zones/1860015198224519168    
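A hedged sketch of the conditional permission requirement this implies for the installer's minimal-permission checks; the function name and the exact permission list are illustrative assumptions, not the installer's actual code:

package main

import "fmt"

// requiredEIPPermissions returns the extra EC2 permissions needed when a
// public IPv4 pool is configured, since CAPA allocates addresses from it.
func requiredEIPPermissions(publicIPv4Pool string) []string {
	if publicIPv4Pool == "" {
		return nil
	}
	return []string{"ec2:AllocateAddress", "ec2:ReleaseAddress"}
}

func main() {
	fmt.Println(requiredEIPPermissions("ipv4pool-ec2-0768267342e327ea9"))
}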

Description of the problem:

From our CI, reproduced several times lately when trying to install 4.17 + ODF + CNV.
Getting these messages after 40 minutes:

Operator odf status: progressing message: installing: waiting for deployment odf-operator-controller-manager to become ready: deployment "odf-operator-controller-manager" not available: Deployment does not have minimum availability.

"Operator odf status: progressing message: installing: waiting for deployment odf-console to become ready: deployment "odf-console" not available: Deployment does not have minimum availability."



 
The CI job waits 2.5 hours for the cluster installation to complete.
Cluster link - https://console.dev.redhat.com/openshift/assisted-installer/clusters/74b62ea4-61ce-4fde-acbe-cc1cf41f1fb8

Attached the installation logs and a video of installation.

 

How reproducible:

Still checking

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

    The status controller creates a ClusterOperator when one does not exist. In the test case verifying behavior with an already present ClusterOperator, it is required to wait until the ClusterOperator created by the test is ready. Failing to do so can result in the controller attempting to create a duplicate ClusterOperator, causing the test to fail with an "already exists" error.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Sometimes, race condition

Steps to Reproduce:

    1. Run ci/prow/unit presubmit job
    

Actual results:

    Test fails with:
clusteroperators.config.openshift.io \"machine-approver\" already exists",

Expected results:

    Test passes

Additional info:

Unit-test-only issue. No customer impact.

Description of problem:

As reported in https://issues.redhat.com/browse/OCPBUGS-52256, there are memory issues at scale and the use of a development logger setting is contributing to the issue. Production code should not be using the development logger settings.
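A minimal sketch of what this usually comes down to, assuming the component uses controller-runtime's zap wrapper (an assumption here, not confirmed by the bug): development mode disables log sampling and adds caller/stacktrace overhead, which is expensive at scale.

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Use production logger settings: sampled logs, no development-mode
	// stack traces on every warning.
	ctrl.SetLogger(zap.New(zap.UseDevMode(false)))
}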

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Summary: OLMv0 should avoid setting Pod Security Admission (PSA) labels with a fixed value. Instead, it should pin the PSA version to match the Kubernetes API version specified in the go.mod file as agreed and discussed at: https://redhat-internal.slack.com/archives/C06KP34REFJ/p1739880491760029

This Jira originated from: https://issues.redhat.com/browse/OCPBUGS-42526

User Story

As a developer looking to contribute to OCP BuildConfig, I want contribution guidelines that make it easy for me to build and test all the components.

Background

Much of the contributor documentation for openshift/builder is either extremely out of date or buggy. This hinders the ability for newcomers to contribute.

Approach

  1. Document dependencies needed to build openshift/builder from source.
  2. Update "dev" container image for openshift/builder so teams can experiment locally.
  3. Provide instructions on how to test
    1. "WIP Pull Request" process
    2. "Disable operators" mode.
    3. Red Hatter instructions: using cluster-bot

Acceptance Criteria

  • New contributors can compile openshift/builder from GitHub instructions
  • New contributors can test their code changes on an OpenShift instance
  • Red Hatters can test their code changes with cluster-bot.

We should expand upon our current pre-commit hooks:

  • all hooks will run in either the pre-commit stage or the pre-push stage
  • add a pre-push hook to run make verify
  • add a pre-push hook to run make test

This will help prevent errors before code makes it on GitHub and CI.

Description of problem:

A dynamic plugin in Pending status will block the console plugins tab page from loading.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-27-162407    

How reproducible:

Always    

Steps to Reproduce:

1. Create a dynamic plugin which will be in Pending status, we can create from file https://github.com/openshift/openshift-tests-private/blob/master/frontend/fixtures/plugin/pending-console-demo-plugin-1.yaml 

2. Enable the 'console-demo-plugin-1' plugin and navigate to Console plugins tab at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins

Actual results:

2. The page keeps loading indefinitely.

Expected results:

2. The console plugins list table should be displayed.

Additional info:

    

Description of problem:

Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers:

https://issues.redhat.com/browse/OCPBUGS-45924
https://issues.redhat.com/browse/OCPBUGS-46372
https://issues.redhat.com/browse/OCPBUGS-48276

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

We had bugs like https://issues.redhat.com/browse/OCPBUGS-44324 from the payload tests in vsphere and gcp, and this was fixed by https://github.com/openshift/api/commit/ec9bf3faa1aa2f52805c44b7b13cd7ab4b984241

There are a few operators which are missing that openshift/api bump. These operators do not have blocking payload jobs but we still need this fix before 4.18 is released. It affects the following operators:

https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/go.mod#L11
https://github.com/openshift/ibm-powervs-block-csi-driver-operator/blob/main/go.mod#L6
https://github.com/openshift/gcp-filestore-csi-driver-operator/blob/main/go.mod#L8
https://github.com/openshift/secrets-store-csi-driver-operator/blob/main/go.mod#L8

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    All but 4 CSI driver operators have the fix.

Expected results:

    All CSI driver operators have this fix vendored: https://github.com/openshift/api/commit/ec9bf3faa1aa2f52805c44b7b13cd7ab4b984241

Additional info:

    

Description of problem:

   When deploying with an Internal publishing strategy, it is required that you specify a pre-existing VPC via 'platform.powervs.vpcName'. Currently, if you don't, the installer fails and says "VPC not found", when a more accurate error would be "A pre-existing VPC is required when deploying with an Internal publishing strategy".

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Set strategy to Internal, do not specify vpcName
    2. Deploy
    3. Observe error
    

Actual results:

    Confusing error

Expected results:

    Accurate error
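A minimal sketch of the kind of validation the expected error implies; the function signature and field values are assumptions, not the installer's actual validation code:

package main

import (
	"errors"
	"fmt"
)

// validatePowerVSInternalPublish returns a clear error when an Internal
// publishing strategy is used without a pre-existing VPC.
func validatePowerVSInternalPublish(publish, vpcName string) error {
	if publish == "Internal" && vpcName == "" {
		return errors.New("a pre-existing VPC is required when deploying with an Internal publishing strategy; set platform.powervs.vpcName")
	}
	return nil
}

func main() {
	fmt.Println(validatePowerVSInternalPublish("Internal", ""))
}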

Additional info:

    

Description of problem:

    once https://github.com/openshift/kubernetes/pull/2055 merges, the kubelet will excessively print the warning "WARNING: the kernel version is incompatible with the feature gate, which needs as a minimum kernel version" kernelVersion="5.14.0" feature="UserNamespacesSupport" minKernelVersion="6.3" when the cluster is in tech preview

Version-Release number of selected component (if applicable):

4.18.0    

How reproducible:

    100%

Steps to Reproduce:

    1. launch openshift
    2. look at node logs
    3.
    

Actual results:

"WARNING: the kernel version is incompatible with the feature gate, which needs as a minimum kernel version" kernelVersion="5.14.0" feature="UserNamespacesSupport" minKernelVersion="6.3" warning appears    

Expected results:

    "WARNING: the kernel version is incompatible with the feature gate, which needs as a minimum kernel version" warning only appears when the node actually doesn't support idmapped mounts (RHEL 8 or RHEL 9.2)

Additional info:

    

Description of problem:

[Azure disk/file CSI driver] on ARO HCP cannot provision volumes successfully

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-13-083421    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an AKS cluster on Azure.
    2. Install the HyperShift operator on the AKS cluster.
    3. Use the HyperShift CLI to create a hosted cluster with the Client Certificate mode.
    4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.

Actual results:

    In step 4: the Azure disk/file CSI drivers fail to provision volumes on the hosted cluster.

# azure disk pvc provision failed
$ oc describe pvc mypvc
...
  Normal   WaitForFirstConsumer  74m                    persistentvolume-controller                                                                                waiting for first consumer to be created before binding
  Normal   Provisioning          74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  External provisioner is provisioning volume for claim "default/mypvc"
  Warning  ProvisioningFailed    74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Warning  ProvisioningFailed    71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   Provisioning          71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  External provisioner is provisioning volume for claim "default/mypvc"
...

$ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver
W1216 08:07:04.282922       1 main.go:89] nodeid is empty
I1216 08:07:04.290689       1 main.go:165] set up prometheus server on 127.0.0.1:8201
I1216 08:07:04.291073       1 azuredisk.go:213]
DRIVER INFORMATION:
-------------------
Build Date: "2024-12-13T02:45:35Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.29.11
Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c
Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

I1216 08:09:36.814776       1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume
I1216 08:09:36.814803       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]}
I1216 08:09:36.815338       1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d]

goroutine 153 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?})
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8
google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125

# azure file pvc provision failed
$ oc describe pvc mypvc
Name:          mypvc
Namespace:     openshift-cluster-csi-drivers
StorageClass:  azurefile-csi
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com
               volume.kubernetes.io/storage-provisioner: file.csi.azure.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                From                                                                                                      Message
  ----     ------                ----               ----                                                                                                      -------
  Normal   ExternalProvisioning  14s (x2 over 14s)  persistentvolume-controller                                                                               Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/mypvc"
  Warning  ProvisioningFailed    7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: could not list storage accounts for account type Standard_LRS: StorageAccountClient is nil

Expected results:

    In step 4: the Azure disk/file CSI drivers should provision volumes successfully on the hosted cluster.
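For illustration only, a sketch of the nil guard the panic above calls for; the receiver and field names are assumptions, and the real fix more likely belongs in the driver's cloud-provider initialization than at the call site:

package main

import (
	"errors"
	"fmt"
)

type managedDiskController struct{}

type driver struct {
	diskController *managedDiskController
}

// createVolume fails with a clear error instead of dereferencing a nil
// controller when the cloud provider was not initialized.
func (d *driver) createVolume(name string) error {
	if d.diskController == nil {
		return errors.New("ManagedDiskController is not initialized: check the cloud provider configuration")
	}
	// ... call d.diskController to create the managed disk ...
	return nil
}

func main() {
	d := &driver{}
	fmt.Println(d.createVolume("pvc-example"))
}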

Additional info:

    

Description of problem:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-0.nightly-2025-02-07-024732   True        False         133m    Cluster version is 4.19.0-0.nightly-2025-02-07-024732

$ oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0 | grep "Error on ingesting results from rule evaluation with different value but same timestamp" 
ts=2025-02-07T06:50:32.371Z caller=group.go:599 level=warn name=sum:apiserver_request:burnrate5m index=11 component="rule manager" file=/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-kube-apiserver-kube-apiserver-slos-basic-4c40cd93-505e-4e93-a53c-fdbb47f77d9d.yaml group=kube-apiserver.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" num_dropped=1
ts=2025-02-07T06:51:02.376Z caller=group.go:599 level=warn name=sum:apiserver_request:burnrate5m index=11 component="rule manager" file=/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-kube-apiserver-kube-apiserver-slos-basic-4c40cd93-505e-4e93-a53c-fdbb47f77d9d.yaml group=kube-apiserver.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" num_dropped=1
....    

Checked: there are 2 "sum:apiserver_request:burnrate5m" recording rules for 4.19; the second one should be "sum:apiserver_request:burnrate6h", not "sum:apiserver_request:burnrate5m".

$ oc -n openshift-kube-apiserver get prometheusrules kube-apiserver-slos-basic -oyaml
...
    - expr: |
        sum(apiserver_request:burn5m)
        /
        sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE"}[5m]))
      record: sum:apiserver_request:burnrate5m
...
    - expr: |
        sum(apiserver_request:burn6h)
        /
        sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE"}[6h]))
      record: sum:apiserver_request:burnrate5m

issue is in https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.19/bindata/assets/alerts/kube-apiserver-slos-basic.yaml#L213-L217

Version-Release number of selected component (if applicable):

4.19+    

How reproducible:

always

Steps to Reproduce:

1. see the descriptions

Actual results:

2 "sum:apiserver_request:burnrate5m" recording rule for 4.19

Expected results:

only one "sum:apiserver_request:burnrate5m" recording rule for 4.19

Additional info:

issue is only with 4.19

Description of problem:

"Match Labels" is on the same level of 'Projects" while creating CUDN

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

When running the delete command on oc-mirror after a mirrorToMirror, the graph-image is not being deleted.
    

Version-Release number of selected component (if applicable):

    

How reproducible:
With the following ImageSetConfiguration (use the same for the DeleteImageSetConfiguration, only changing the kind and renaming mirror to delete):

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.13
      minVersion: 4.13.10
      maxVersion: 4.13.10
    graph: true
    

Steps to Reproduce:

    1. Run mirror to mirror
./bin/oc-mirror -c ./alex-tests/alex-isc/isc.yaml --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 docker://localhost:6000 --v2 --dest-tls-verify=false

    2. Run the delete --generate
./bin/oc-mirror delete -c ./alex-tests/alex-isc/isc-delete.yaml --generate --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 --delete-id clid-230-delete-test docker://localhost:6000 --v2 --dest-tls-verify=false

    3. Run the delete
./bin/oc-mirror delete --delete-yaml-file /home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230/working-dir/delete/delete-images-clid-230-delete-test.yaml docker://localhost:6000 --v2 --dest-tls-verify=false
    

Actual results:

During the delete --generate the graph-image is not being included in the delete file 

2024/10/25 09:44:21  [WARN]   : unable to find graph image in local cache: SKIPPING. %!v(MISSING)
2024/10/25 09:44:21  [WARN]   : reading manifest latest in localhost:55000/openshift/graph-image: manifest unknown

Because of that the graph-image is not being deleted from the target registry

[aguidi@fedora oc-mirror]$ curl http://localhost:6000/v2/openshift/graph-image/tags/list | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    51  100    51    0     0  15577      0 --:--:-- --:--:-- --:--:-- 17000
{
  "name": "openshift/graph-image",
  "tags": [
    "latest"
  ]
}
    

Expected results:

graph-image should be deleted even after mirrorToMirror
    

Additional info:

    

Description of problem:

Clicking "Don't show again" won't spot "Hide Lightspeed" if current page is on Language/Notifications/Applications tab of "user-preferences"
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-16-065305
    

How reproducible:

Always
    

Steps to Reproduce:

    1.User goes to one of Language/Notifications/Applications tabs on "user-preferences" page.
    2.Open "Lightspeed" modal at the right bottom and click "Don't show again".
    3.
    

Actual results:

2. The URL changes to "/user-preferences/general?spotlight=[data-test="console.hideLightspeedButton%20field"]", but the page still stays on the original tab.
    

Expected results:

2. It should jump to the "Hide Lightspeed" section on the General tab of the "user-preferences" page.
    

Additional info:


    

During the cluster bootstrap, disruption can occur when a kube-apiserver instance doesn't have access to any live etcd endpoints. This happens in one very specific scenario:

  • kube-apiserver is running on a node and is at revision 1. Its etcd-servers list contains the bootstrap node IP and localhost
  • when bootstrap node is deleted, the etcd instance that was running on it will become unavailable
  • when the etcd instance running on the same node as the kube-apiserver instance from above is rolled out to a new revision, it will also become unavailable

When both of these events happen while a kube-apiserver instance is still on revision 1, its readyz probe will fail.

The suggested solution is to add a check in cluster-bootstrap that makes sure there are at least 2 etcd-servers that are neither the bootstrap node nor localhost for each kube-apiserver pod before getting rid of the bootstrap node, as sketched below.
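A rough sketch of such a check, assuming the --etcd-servers flag values are available as a string slice and the bootstrap node IP is known; names are illustrative, not cluster-bootstrap's actual code:

package main

import (
	"fmt"
	"net/url"
)

// countStableEtcdServers counts etcd endpoints that are neither localhost nor
// the bootstrap node, i.e. endpoints expected to survive bootstrap teardown.
func countStableEtcdServers(etcdServers []string, bootstrapIP string) int {
	count := 0
	for _, s := range etcdServers {
		u, err := url.Parse(s)
		if err != nil {
			continue
		}
		host := u.Hostname()
		if host == "localhost" || host == "127.0.0.1" || host == "::1" || host == bootstrapIP {
			continue
		}
		count++
	}
	return count
}

func main() {
	servers := []string{"https://localhost:2379", "https://10.0.0.5:2379", "https://10.0.0.6:2379", "https://10.0.0.7:2379"}
	// Tear down the bootstrap node only once at least 2 stable endpoints exist.
	fmt.Println(countStableEtcdServers(servers, "10.0.0.5") >= 2)
}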

Job where this is happening: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1387/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial/1880358740390055936

Description of problem:

  • While installing the Loki Operator provided by Red Hat through OperatorHub in the OpenShift console, after pressing the install button the UI shows that it is installing the community Loki Operator provided by the Grafana Loki SIG.
  • In the end, the Loki Operator provided by Red Hat is installed.

How reproducible:

  • It is reproducible.

Steps to Reproduce:

  • Navigate to operators > OperatorHub in openshift console.
  • Search for loki, below three operators will appear:
    • Community Loki operator provided by Grafana Loki SIG operator.
    • Loki Helm operator
    • Loki operator provided by Red Hat.
  • Install Loki operator provided by Red Hat.
  • The issue is visible when the operator starts getting installed.

Actual results:

  • While installing the Loki Operator provided by Red Hat through OperatorHub in the OpenShift console, after pressing the install button the UI shows that it is installing the community Loki Operator provided by the Grafana Loki SIG.

Expected results:

  • While installing the Loki Operator provided by Red Hat through OperatorHub in the OpenShift console, after pressing the install button, the Loki Operator provided by Red Hat should be shown as the one being installed.


As a maintainer of the SNO CI lane, I would like to ensure that the following test doesn't fail regularly as part of SNO CI.

[sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times

This issue is a symptom of a greater problem with SNO: there is downtime in resolving DNS after the upgrade reboot, because the DNS operator has an outage while it is deploying the new DNS pods. During that time, Loki exits after hitting the following error:

2024/10/23 07:21:32 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": dial tcp: lookup sso.redhat.com on 172.30.0.10:53: read udp 10.128.0.4:53104->172.30.0.10:53: read: connection refused

This issue is important because it can contribute to payload rejection in our blocking CI jobs.

Acceptance Criteria:

  • Problem is discussed with the networking team to understand the best path to resolution and decision is documented
  • Either the DNS operator or test are adjusted to address or mitigate the issue.
  • CI is free from the issue in test results for an extended period. (Need to confirm how often we're seeing it first before this period can be defined with confidence).

Description of problem:

Setup:
      Set the browser's (e.g. Google Chrome) default language to French or Spanish.
Issue:
      Go to the console login page; the content is in English, not in French or Spanish.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-01-063526

How reproducible:

Always

Steps to Reproduce:

1. Set the browser's (e.g. Google Chrome) default language to French or Spanish.
2. Go to the console login page; the content is in English, not in French or Spanish.
3. The same setup works fine with the rest of the supported locales.

Actual results:

Login page is in English

Expected results:

Login page should be available in browser's default language for supported OCP languages.

Additional info:

Reference screencast attached

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when they are ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Beginning with 4.19.0-0.nightly-2024-11-27-025041 this job failed with a pattern I don't recognize.

I'll note some other aws jobs failed on the same payload which looked like infra issues; however this test re-ran in full and so its timing was very different.

Then it failed with much the same pattern on the next payload too.

The failures are mainly on tests like these:

[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set initially, in a homogeneous default environment, should expose default metrics [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should revert to default collection profile when an empty collection profile value is specified [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should expose information about the applied collection profile using meta-metrics [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should have at least one implementation for each collection profile [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a homogeneous minimal environment, should hide default metrics [Suite:openshift/conformance/parallel]

Each has a run where it looks like something timed out:

fail [github.com/openshift/origin/test/extended/prometheus/collection_profiles.go:99]: Interrupted by User
Ginkgo exit error 1: exit with code 1

and a second run failing to update configmap cluster-monitoring-config

{  fail [github.com/openshift/origin/test/extended/prometheus/collection_profiles.go:197]: Expected
    <*errors.StatusError | 0xc006738280>: 
    Operation cannot be fulfilled on configmaps "cluster-monitoring-config": the object has been modified; please apply your changes to the latest version and try again
    {
        ErrStatus: 
            code: 409
            details:
              kind: configmaps
              name: cluster-monitoring-config
            message: 'Operation cannot be fulfilled on configmaps "cluster-monitoring-config":
              the object has been modified; please apply your changes to the latest version and
              try again'
            metadata: {}
            reason: Conflict
            status: Failure,
    }
to be nil
Ginkgo exit error 1: exit with code 1}

Description

The parseIPList function currently fails to handle IP lists that contain both valid and invalid IPs or CIDRs. When the function encounters an invalid entry, it immediately returns an empty string, which prevents any valid IPs from being processed or returned.

Expected Behavior

  • The function should process the entire list of IPs or CIDRs.
  • It should return a string of all valid IPs and CIDRs, even if there are some invalid entries.
  • Invalid entries should be logged for debugging purposes, but they should not cause the function to exit prematurely.

Current Behavior

  • The function returns an empty string as soon as it encounters an invalid IP or CIDR.
  • No valid IPs are returned if any invalid entries are found.

Steps to Reproduce

  1. Provide a list of IPs or CIDRs that includes both valid and invalid entries to the parseIPList function.
  2. Observe that the function returns an empty string, regardless of the valid entries present.

Suggested Solution

Modify the parseIPList function to (see the sketch after this list):

  • Collect valid IPs and CIDRs while logging invalid ones.
  • Return a space-separated string of valid IPs and CIDRs.
  • Log all invalid entries for visibility and debugging.
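A minimal sketch of the suggested behavior; the real parseIPList lives in the router code, so the logging call and whitespace handling here are assumptions:

package main

import (
	"fmt"
	"log"
	"net"
	"strings"
)

// parseIPList keeps valid IPs and CIDRs, logs invalid entries, and returns
// the valid ones as a space-separated string instead of failing outright.
func parseIPList(list string) string {
	var valid []string
	for _, item := range strings.Fields(list) {
		if _, _, err := net.ParseCIDR(item); err == nil {
			valid = append(valid, item)
			continue
		}
		if net.ParseIP(item) != nil {
			valid = append(valid, item)
			continue
		}
		log.Printf("parseIPList: ignoring invalid IP or CIDR %q", item)
	}
	return strings.Join(valid, " ")
}

func main() {
	fmt.Println(parseIPList("10.0.0.1 not-an-ip 192.168.0.0/24"))
}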

Additional Information

  • A recent PR addresses this issue by enhancing the function to handle mixed validity lists more gracefully.
  • This change improves the robustness of IP list processing and provides better insights into invalid entries.

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

[sig-arch] events should not repeat pathologically for ns/openshift-machine-api

Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 20.00%.

Sample (being evaluated) Release: 4.19
Start Time: 2025-02-20T00:00:00Z
End Time: 2025-02-27T08:00:00Z
Success Rate: 20.00%
Successes: 3
Failures: 12
Flakes: 0

Base (historical) Release: 4.18
Start Time: 2025-01-28T00:00:00Z
End Time: 2025-02-27T08:00:00Z
Success Rate: 100.00%
Successes: 60
Failures: 0
Flakes: 0

View the test details report for additional context.

This has regressed for 5 days with a permafail on GCP serial jobs, both TechPreview and default.

[sig-arch] events should not repeat pathologically for ns/openshift-machine-api 	0s
{  3 events happened too frequently

event happened 24 times, something is wrong: namespace/openshift-machine-api hmsg/30080f8454 machineset/ci-op-qinhsdt8-e68fb-4xk6j-worker-c - reason/ReconcileError error fetching disk information: unable to retrieve image "rhcos-9-6-20250121-0-gcp-x86-64-fake-update" in project "rhcos-cloud": googleapi: Error 403: Required 'compute.images.get' permission for 'projects/rhcos-cloud/global/images/rhcos-9-6-20250121-0-gcp-x86-64-fake-update', forbidden (08:35:32Z) result=reject 
event happened 25 times, something is wrong: namespace/openshift-machine-api hmsg/30080f8454 machineset/ci-op-qinhsdt8-e68fb-4xk6j-worker-f - reason/ReconcileError error fetching disk information: unable to retrieve image "rhcos-9-6-20250121-0-gcp-x86-64-fake-update" in project "rhcos-cloud": googleapi: Error 403: Required 'compute.images.get' permission for 'projects/rhcos-cloud/global/images/rhcos-9-6-20250121-0-gcp-x86-64-fake-update', forbidden (08:35:43Z) result=reject 
event happened 30 times, something is wrong: namespace/openshift-machine-api hmsg/30080f8454 machineset/ci-op-qinhsdt8-e68fb-4xk6j-worker-a - reason/ReconcileError error fetching disk information: unable to retrieve image "rhcos-9-6-20250121-0-gcp-x86-64-fake-update" in project "rhcos-cloud": googleapi: Error 403: Required 'compute.images.get' permission for 'projects/rhcos-cloud/global/images/rhcos-9-6-20250121-0-gcp-x86-64-fake-update', forbidden (08:35:08Z) result=reject }

There are a number of inconsistencies when using the Observe section in the Virtualization perspective.

  • The Virtualization Alerts page doesn't display the navigation bar
  • The Virtualization Alert Details and Alert Rule Details pages don't load
  • The Virtualization Alert Rules page doesn't load
  • The Virtualization Dashboards page doesn't load

 

Description of problem:

2024-05-07 17:21:59 level=debug msg=baremetal: getting master addresses
2024-05-07 17:21:59 level=warning msg=Failed to extract host addresses: open ocp/ostest/.masters.json: no such file or directory

Description of problem:

HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms.
Can't add a Bare Metal worker node to the hosted cluster.
This was discussed on the #project-hypershift Slack channel.

Version-Release number of selected component (if applicable):

MultiClusterEngine v2.7.2 
HyperShift Operator image: 
registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a   

How reproducible:

Always

Steps to Reproduce:

1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform
2. Try to create a NodePool with ARM64 architecture specification

Actual results:

- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms
- Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure"
- Additional validation in NodePool spec also blocks arm64 architecture

Expected results:

- Allow ARM64 architecture specification for NodePools on BareMetal platform 
- Remove or update the CEL validation to support this use case

Additional info:

NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1    

Description of problem:

Creating a CUDN with a mismatch between spec.topology and the topology configuration succeeds, but it should fail because it is invalid.

See the examples below.

Version-Release number of selected component (if applicable):

4.18

How reproducible:

100%

Steps to Reproduce:

1. Create a CUDN CR whose spec.topology does not match the topology configuration:

Example 1:

apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
 name: mynet
spec:
 namespaceSelector:
  matchLabels:
   "kubernetes.io/metadata.name": "red"
 network:
  topology: Layer2 # <--- spec.topology should match
  layer3: # <------------ topology configuration type
   role: Primary
   subnets: [{cidr: 192.168.112.12/24}] 

Example 2:

apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
 name: mynet
spec:
 namespaceSelector:
  matchLabels:
   "kubernetes.io/metadata.name": "red"
 network:
  topology: Layer3 # <--- spec.topology should match
  layer2: # <------------ topology configuration type
   role: Secondary
   subnets: [192.168.112.12/24]  

Actual results:

The CUDN is created successfully.

The ovn-kubernetes control-plane pod gets into crash looping due to the following panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1dd8415]

goroutine 12154 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/util/udn.IsPrimaryNetwork({0x2d8dfd0, 0xc0059746a8})
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/util/udn/udn.go:17 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork.(*Controller).updateNAD(0xc0007340f0, {0x2dcad90, 0xc005974580}, {0xc000012480, 0x3})
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork/controller_helper.go:24 +0x94
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork.(*Controller).syncClusterUDN(0xc0007340f0, 0xc005974420)
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork/controller.go:604 +0xa10
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork.(*Controller).reconcileCUDN(0xc0007340f0, {0xc00651c2d6, 0x5})
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/clustermanager/userdefinednetwork/controller.go:519 +0xff
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/controller.(*controller[...]).processNextQueueItem(0x19a93e0)
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/controller/controller.go:253 +0xd7
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/controller.(*controller[...]).startWorkers.func1()
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/controller/controller.go:163 +0x6f
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/controller.(*controller[...]).startWorkers in goroutine 7794
    /home/omergi/workspace/github.com/ovn-kubernetes/go-controller/pkg/controller/controller.go:160 +0x185 

 

Expected results:

Creating a CUDN with mismatched spec.topology and topology configuration should fail at the API level.
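One possible shape of such an API-level check, sketched as kubebuilder CEL validation markers on a Go type; the field and type names are illustrative and may not match the real ClusterUserDefinedNetwork API:

package v1

// NetworkSpec is an illustrative stand-in for the CUDN network section.
// The CEL rules reject a spec whose topology does not match the supplied
// per-topology configuration.
// +kubebuilder:validation:XValidation:rule="self.topology == 'Layer2' ? has(self.layer2) && !has(self.layer3) : true",message="topology Layer2 requires a layer2 configuration"
// +kubebuilder:validation:XValidation:rule="self.topology == 'Layer3' ? has(self.layer3) && !has(self.layer2) : true",message="topology Layer3 requires a layer3 configuration"
type NetworkSpec struct {
	// +kubebuilder:validation:Enum=Layer2;Layer3
	Topology string `json:"topology"`

	// +optional
	Layer2 *Layer2Config `json:"layer2,omitempty"`

	// +optional
	Layer3 *Layer3Config `json:"layer3,omitempty"`
}

type Layer2Config struct{}
type Layer3Config struct{}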

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when the problem happened, if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

 Failed to create a disconnected cluster using HCP/HyperShift CLI   

Version-Release number of selected component (if applicable):

    4.19 4.18

How reproducible:

    100%

Steps to Reproduce:

    1. create disconnected hostedcluster with hcp cli
    2. The environment where the command is executed cannot access the payload.
    

Actual results:

    /tmp/hcp create cluster agent --cluster-cidr fd03::/48 --service-cidr fd04::/112 --additional-trust-bundle=/tmp/secret/registry.2.crt --network-type=OVNKubernetes --olm-disable-default-sources --name=b2ce1d5218a2c7b561d6 --pull-secret=/tmp/.dockerconfigjson --agent-namespace=hypershift-agents --namespace local-cluster --base-domain=ostest.test.metalkube.org --api-server-address=api.b2ce1d5218a2c7b561d6.ostest.test.metalkube.org --image-content-sources /tmp/secret/mgmt_icsp.yaml --ssh-key=/tmp/secret/id_rsa.pub --release-image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d

2024-12-31T08:01:05Z	ERROR	Failed to create cluster	{"error": "failed to retrieve manifest virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d: failed to create repository client for https://virthost.ostest.test.metalkube.org:5000: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": Internal Server Error"}
github.com/openshift/hypershift/product-cli/cmd/cluster/agent.NewCreateCommand.func1
    /remote-source/app/product-cli/cmd/cluster/agent/create.go:32
github.com/spf13/cobra.(*Command).execute
    /remote-source/app/vendor/github.com/spf13/cobra/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC
    /remote-source/app/vendor/github.com/spf13/cobra/command.go:1117
github.com/spf13/cobra.(*Command).Execute
    /remote-source/app/vendor/github.com/spf13/cobra/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext
    /remote-source/app/vendor/github.com/spf13/cobra/command.go:1034
main.main
    /remote-source/app/product-cli/main.go:59
runtime.main
    /usr/lib/golang/src/runtime/proc.go:272
Error: failed to retrieve manifest virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d: failed to create repository client for https://virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": Internal Server Error
failed to retrieve manifest virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d: failed to create repository client for https://virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": Internal Server Error

Expected results:

    The cluster can be created successfully.

Additional info:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/60159/rehearse-60159-periodic-ci-openshift-hypershift-release-4.18-periodics-mce-e2e-agent-disconnected-ovn-ipv6-metal3-conformance/1873981158618828800 

Description of problem:

    Add E2E test cases for PPC related to the PerPodPowerManagement workload hint

Version-Release number of selected component (if applicable):

4.19.0    

How reproducible:

    

Steps to Reproduce:

  E2E test cases were missing in PPC for the PerPodPowerManagement workload hint

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When updating cypress-axe, new changes and bugfixes in the axe-core accessibility auditing package have surfaced various accessibility violations that have to be addressed     

Version-Release number of selected component (if applicable):

    OpenShift 4.18.0

How reproducible:

    always

Steps to Reproduce:

    1. Update axe-core and cypress-axe to the latest versions
    2. Run test-cypress-console and run a Cypress test (I used other-routes.cy.ts)

Actual results:

    The tests fail with various accessibility violations

Expected results:

    The tests pass without accessibility violations

Additional info:

    

Description of problem:

IBM Cloud relies on a static, hard-coded list of supported Regions for IPI. However, whenever a new Region becomes available, it cannot be used until it is added to this list of available Regions.
https://github.com/openshift/installer/blob/13932601852174e4294e16ff9cfca7df082f1ce0/pkg/types/ibmcloud/validation/platform.go#L15-L30 

Version-Release number of selected component (if applicable):

4.19.0

How reproducible:

100%

Steps to Reproduce:

    1. Create install-config for IBM Cloud
    2. Specify a newly available Region (e.g., ca-mon)
    3. Attempt to create manifests

Actual results:

ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.ibmcloud.region: Unsupported value: "ca-mon": supported values: "eu-gb", "jp-tok", "au-syd", "ca-tor", "jp-osa", "br-sao", "us-south", "us-east", "eu-de", "eu-es"

Expected results:

Successful cluster creation

Additional info:

Rather than keep updating this static list, it would be better if the Regions could be looked up dynamically and provided to the install-config creation path. Validation of the Region would then not be necessary.

For instance current regions:

Listing regions...


Name            Show name
au-syd          Sydney (au-syd)
in-che          Chennai (in-che)
jp-osa          Osaka (jp-osa)
jp-tok          Tokyo (jp-tok)
eu-de           Frankfurt (eu-de)
eu-es           Madrid (eu-es)
eu-gb           London (eu-gb)
ca-mon          Montreal (ca-mon)
ca-tor          Toronto (ca-tor)
us-south        Dallas (us-south)
...
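
A minimal Go sketch of the dynamic lookup suggested above. fetchSupportedRegions is a hypothetical helper standing in for an IBM Cloud API call, not an existing installer function, and the map contents only mirror the listing above.

package main

import (
    "fmt"
    "sort"
)

// fetchSupportedRegions is a hypothetical helper standing in for a call to the
// IBM Cloud catalog/API that returns the regions currently offering the
// services the installer needs.
func fetchSupportedRegions() (map[string]string, error) {
    // A real implementation would query IBM Cloud instead of hard-coding values.
    return map[string]string{
        "ca-mon":   "Montreal (ca-mon)",
        "ca-tor":   "Toronto (ca-tor)",
        "us-south": "Dallas (us-south)",
    }, nil
}

// validateRegion replaces the static allow-list: any region returned by the
// platform lookup is accepted, so new regions work without an installer change.
func validateRegion(region string) error {
    regions, err := fetchSupportedRegions()
    if err != nil {
        return fmt.Errorf("could not list IBM Cloud regions: %w", err)
    }
    if _, ok := regions[region]; !ok {
        names := make([]string, 0, len(regions))
        for r := range regions {
            names = append(names, r)
        }
        sort.Strings(names)
        return fmt.Errorf("unsupported region %q, supported values: %v", region, names)
    }
    return nil
}

func main() {
    fmt.Println(validateRegion("ca-mon")) // <nil> once the lookup is dynamic
}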

When building the machine-os-images container https://github.com/openshift/machine-os-images, we see an error stating that the coreos-installer package cannot be installed:

No match for argument: coreos-installer
Error: Unable to find a match: coreos-installer
error: build error: building at STEP "RUN dnf install -y jq wget coreos-installer": while running runtime: exit status 1

for example see https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-os-images/51/pull-ci-openshift-machine-os-images-main-images/1891424609257918464

After quick troubleshooting with the help of the ART team, it seems the issue is related to the installer image used as the base image, which does not have the correct repositories.

Description of problem:

    Applying a performance profile on an ARM cluster results in the TuneD profile becoming degraded.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

1. Label a worker node with the worker-cnf label
2. Create an MCP referring to that label
3. Apply the performance profile below

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "1-3,4-6"
    reserved: "0,7"
  hugepages:
    defaultHugepagesSize: 512M
    pages:
    - count: 1
      node: 0
      size: 512M
    - count: 128
      node: 1
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ''
  kernelPageSize: 64k
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true

Actual results:

    

Expected results:

    

Additional info:

[root@ampere-one-x-04 ~]# oc get profiles -A
NAMESPACE                                NAME                                             TUNED                                    APPLIED   DEGRADED   MESSAGE                                                            AGE
openshift-cluster-node-tuning-operator   ocp-ctlplane-0.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-ctlplane-1.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-ctlplane-2.libvirt.lab.eng.tlv2.redhat.com   openshift-control-plane                  True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com     openshift-node-performance-performance   False     True       The TuneD daemon profile not yet applied, or application failed.   22h
openshift-cluster-node-tuning-operator   ocp-worker-1.libvirt.lab.eng.tlv2.redhat.com     openshift-node                           True      False      TuneD profile applied.                                             22h
openshift-cluster-node-tuning-operator   ocp-worker-2.libvirt.lab.eng.tlv2.redhat.com     openshift-node                           True      False      TuneD profile applied.                                             22h

[root@ampere-one-x-04 ~]# oc describe performanceprofile
Name:         performance
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  performance.openshift.io/v2
Kind:         PerformanceProfile
Metadata:
  Creation Timestamp:  2025-03-04T15:28:44Z
  Finalizers:
    foreground-deletion
  Generation:        1
  Resource Version:  74234
  UID:               0d9c1817-c12f-4ea8-9c4b-b37badc232e9
Spec:
  Cpu:
    Isolated:  1-3,4-6
    Reserved:  0,7
  Hugepages:
    Default Hugepages Size:  512M
    Pages:
      Count:         1
      Node:          0
      Size:          512M
      Count:         128
      Node:          1
      Size:          2M
  Kernel Page Size:  64k
  Machine Config Pool Selector:
    machineconfiguration.openshift.io/role:  worker-cnf
  Net:
    User Level Networking:  true
  Node Selector:
    node-role.kubernetes.io/worker-cnf:
  Numa:
    Topology Policy:  single-numa-node
  Real Time Kernel:
    Enabled:  false
  Workload Hints:
    High Power Consumption:    true
    Per Pod Power Management:  false
    Real Time:                 true
Status:
  Conditions:
    Last Heartbeat Time:   2025-03-04T15:28:45Z
    Last Transition Time:  2025-03-04T15:28:45Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2025-03-04T15:28:45Z
    Last Transition Time:  2025-03-04T15:28:45Z
    Status:                False
    Type:                  Upgradeable
    Last Heartbeat Time:   2025-03-04T15:28:45Z
    Last Transition Time:  2025-03-04T15:28:45Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2025-03-04T15:28:45Z
    Last Transition Time:  2025-03-04T15:28:45Z
    Message:               Tuned ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com Degraded Reason: TunedError.
Tuned ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com Degraded Message: TuneD daemon issued one or more error message(s) during profile application. TuneD stderr: .

    Reason:       TunedProfileDegraded
    Status:       True
    Type:         Degraded
  Runtime Class:  performance-performance
  Tuned:          openshift-cluster-node-tuning-operator/openshift-node-performance-performance
Events:
  Type    Reason              Age                 From                            Message
  ----    ------              ----                ----                            -------
  Normal  Creation succeeded  112m (x9 over 17h)  performance-profile-controller  Succeeded to create all components

[root@ampere-one-x-04 ~]# oc logs pod/tuned-kjc8j
I0304 15:35:50.346412    3259 controller.go:1666] starting in-cluster ocp-tuned v4.19.0-202502262344.p0.gf166846.assembly.stream.el9-0-g0d9dd16-dirty
I0304 15:35:50.401840    3259 controller.go:671] writing /var/lib/ocp-tuned/image.env
I0304 15:35:50.418669    3259 controller.go:702] tunedRecommendFileRead(): read "openshift-node-performance-performance" from "/etc/tuned/recommend.d/50-openshift.conf"
I0304 15:35:50.419585    3259 controller.go:1728] starting: profile unpacked is "openshift-node-performance-performance" fingerprint "ab0d99d8009d6539b91ed1aeff3e4fa1c629c1cd4e9a32bdc132dcc9737e4fc9"
I0304 15:35:50.419646    3259 controller.go:1424] recover: no pending deferred change
I0304 15:35:50.419666    3259 controller.go:1734] starting: no pending deferred update
I0304 15:36:06.074575    3259 controller.go:382] disabling system tuned...
I0304 15:36:06.121045    3259 controller.go:1546] started events processors
I0304 15:36:06.121492    3259 controller.go:359] set log level 0
I0304 15:36:06.121850    3259 controller.go:1567] monitoring filesystem events on "/etc/tuned/bootcmdline"
I0304 15:36:06.121886    3259 controller.go:1570] started controller
I0304 15:36:06.122603    3259 controller.go:692] tunedRecommendFileWrite(): written "/etc/tuned/recommend.d/50-openshift.conf" to set TuneD profile openshift-node-performance-performance
I0304 15:36:06.122634    3259 controller.go:417] profilesExtract(): extracting 6 TuneD profiles (recommended=openshift-node-performance-performance)
I0304 15:36:06.210862    3259 controller.go:462] profilesExtract(): recommended TuneD profile openshift-node-performance-performance content unchanged [openshift]
I0304 15:36:06.211950    3259 controller.go:462] profilesExtract(): recommended TuneD profile openshift-node-performance-performance content unchanged [openshift-node-performance-performance]
I0304 15:36:06.212311    3259 controller.go:478] profilesExtract(): fingerprint of extracted profiles: "ab0d99d8009d6539b91ed1aeff3e4fa1c629c1cd4e9a32bdc132dcc9737e4fc9"
I0304 15:36:06.212389    3259 controller.go:818] tunedReload()
I0304 15:36:06.212493    3259 controller.go:745] starting tuned...
I0304 15:36:06.212547    3259 run.go:121] running cmd...
2025-03-04 15:36:06,335 INFO     tuned.daemon.application: TuneD: 2.25.1, kernel: 5.14.0-570.el9.aarch64+64k
2025-03-04 15:36:06,335 INFO     tuned.daemon.application: dynamic tuning is globally disabled
2025-03-04 15:36:06,340 INFO     tuned.daemon.daemon: using sleep interval of 1 second(s)
2025-03-04 15:36:06,340 INFO     tuned.daemon.daemon: Running in automatic mode, checking what profile is recommended for your configuration.
2025-03-04 15:36:06,341 INFO     tuned.daemon.daemon: Using 'openshift-node-performance-performance' profile
2025-03-04 15:36:06,342 INFO     tuned.profiles.loader: loading profile: openshift-node-performance-performance
2025-03-04 15:36:06,460 ERROR    tuned.daemon.daemon: Cannot set initial profile. No tunings will be enabled: Cannot load profile(s) 'openshift-node-performance-performance': Cannot find profile 'openshift-node-performance--aarch64-performance' in '['/var/lib/ocp-tuned/profiles', '/usr/lib/tuned', '/usr/lib/tuned/profiles']'.
2025-03-04 15:36:06,461 INFO     tuned.daemon.controller: starting controller

sh-5.1# systemctl --no-pager | grep hugepages
  dev-hugepages.mount                                                                                                                                             loaded active mounted   Huge Pages File System
● hugepages-allocation-2048kB-NUMA1.service                                                                                                                       loaded failed failed    Hugepages-2048kB allocation on the node 1
  hugepages-allocation-524288kB-NUMA0.service                                                                                                                     loaded active exited    Hugepages-524288kB allocation on the node 0

sh-5.1# systemctl status hugepages-allocation-2048kB-NUMA1.service
× hugepages-allocation-2048kB-NUMA1.service - Hugepages-2048kB allocation on the node 1
     Loaded: loaded (/etc/systemd/system/hugepages-allocation-2048kB-NUMA1.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Tue 2025-03-04 15:32:33 UTC; 17h ago
   Main PID: 1002 (code=exited, status=1/FAILURE)
        CPU: 6ms

Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: Starting Hugepages-2048kB allocation on the node 1...
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com hugepages-allocation.sh[1002]: ERROR: /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages does not exist
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: hugepages-allocation-2048kB-NUMA1.service: Main process exited, code=exited, status=1/FAILURE
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: hugepages-allocation-2048kB-NUMA1.service: Failed with result 'exit-code'.
Mar 04 15:32:33 ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Hugepages-2048kB allocation on the node 1.

sh-5.1# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/boot/ostree/rhcos-e032e3de5cffeccaf88bc5dc1945da35b4273c5f5b758a6ca1d0d78344b55e7f/vmlinuz-5.14.0-570.el9.aarch64+64k rw ostree=/ostree/boot.0/rhcos/e032e3de5cffeccaf88bc5dc1945da35b4273c5f5b758a6ca1d0d78344b55e7f/0 ignition.platform.id=openstack console=ttyAMA0,115200n8 console=tty0 root=UUID=96763b3b-e217-4879-a03e-56568ca84bf9 rw rootflags=prjquota boot=UUID=d98055a6-2355-40d3-8e87-98eedd0e8c91 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0
 bash-5.1# ls /var/lib/ocp-tuned/profiles/
openshift                                           openshift-node-performance-intel-x86-performance
openshift-node-performance-amd-x86-performance      openshift-node-performance-performance
openshift-node-performance-arm-aarch64-performance  openshift-node-performance-rt-performance
bash-5.1# cat /var/lib/ocp-tuned/profiles/openshift-node-performance-performance/tuned.conf
[main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)
# The final result of the include depends on cpu vendor, cpu architecture, and whether the real time kernel is enabled
# The first line will be evaluated based on the CPU vendor and architecture
# This has three possible results:
#   include=openshift-node-performance-amd-x86;
#   include=openshift-node-performance-arm-aarch64;
#   include=openshift-node-performance-intel-x86;
# The second line will be evaluated based on whether the real time kernel is enabled
# This has two possible results:
#     openshift-node,cpu-partitioning
#     openshift-node,cpu-partitioning,openshift-node-performance-rt-<PerformanceProfile name>
include=openshift-node,cpu-partitioning${f:regex_search_ternary:${f:exec:uname:-r}:rt:,openshift-node-performance-rt-performance:};
    openshift-node-performance-${f:lscpu_check:Vendor ID\:\s*GenuineIntel:intel:Vendor ID\:\s*AuthenticAMD:amd:Vendor ID\:\s*ARM:arm}-${f:lscpu_check:Architecture\:\s*x86_64:x86:Architecture\:\s*aarch64:aarch64}-performance
# Inheritance of base profiles legend:
# cpu-partitioning -> network-latency -> latency-performance
# https://github.com/redhat-performance/tuned/blob/master/profiles/latency-performance/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/network-latency/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/cpu-partitioning/tuned.conf
# All values are mapped with a comment where a parent profile contains them.
# Different values will override the original values in parent profiles.
[variables]
#> isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
isolated_cores=1-6

not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
#> latency-performance
#> (override)
force_latency=cstate.id:1|3
governor=performance
energy_perf_bias=performance
min_perf_pct=100
 
[service]
service.stalld=start,enable

[vm]
#> network-latency
transparent_hugepages=never

[irqbalance]
# Disable the plugin entirely, which was enabled by the parent profile `cpu-partitioning`.
# It can be racy if TuneD restarts for whatever reason.
#> cpu-partitioning
enabled=false

[scheduler]
runtime=0
group.ksoftirqd=0:f:11:*:ksoftirqd.*
group.rcuc=0:f:11:*:rcuc.*
group.ktimers=0:f:11:*:ktimers.*
default_irq_smp_affinity = ignore
irq_process=false

[sysctl]
#> cpu-partitioning #RealTimeHint
kernel.hung_task_timeout_secs=600
#> cpu-partitioning #RealTimeHint
kernel.nmi_watchdog=0
#> RealTimeHint
kernel.sched_rt_runtime_us=-1
#> cpu-partitioning  #RealTimeHint
vm.stat_interval=10
# cpu-partitioning and RealTimeHint for RHEL disable it (= 0)
# OCP is too dynamic when partitioning and needs to evacuate
#> scheduled timers when starting a guaranteed workload (= 1)
kernel.timer_migration=1
#> network-latency
net.ipv4.tcp_fastopen=3
# If a workload mostly uses anonymous memory and it hits this limit, the entire
# working set is buffered for I/O, and any more write buffering would require
# swapping, so it's time to throttle writes until I/O can catch up.  Workloads
# that mostly use file mappings may be able to use even higher values.
#
# The generator of dirty data starts writeback at this percentage (system default
# is 20%)
#> latency-performance
vm.dirty_ratio=10
# Start background writeback (via writeback threads) at this percentage (system
# default is 10%)
#> latency-performance
vm.dirty_background_ratio=3
# The swappiness parameter controls the tendency of the kernel to move
# processes out of physical memory and onto the swap disk.
# 0 tells the kernel to avoid swapping processes out of physical memory
# for as long as possible
# 100 tells the kernel to aggressively swap processes out of physical memory
# and move them to swap cache
#> latency-performance
vm.swappiness=10
# also configured via a sysctl.d file
# placed here for documentation purposes and commented out due
# to a tuned logging bug complaining about duplicate sysctl:
#   https://issues.redhat.com/browse/RHEL-18972
#> rps configuration
# net.core.rps_default_mask=${not_isolated_cpumask}

[selinux]
#> Custom (atomic host)
avc_cache_threshold=8192

[net]
channels=combined 2
nf_conntrack_hashsize=131072

[bootloader]
# !! The names are important for Intel and are referenced in openshift-node-performance-intel-x86
# set empty values to disable RHEL initrd setting in cpu-partitioning
initrd_remove_dir=
initrd_dst_img=
initrd_add_dir=
# overrides cpu-partitioning cmdline
cmdline_cpu_part=+nohz=on rcu_nocbs=${isolated_cores} tuned.non_isolcpus=${not_isolated_cpumask} systemd.cpu_affinity=${not_isolated_cores_expanded}
# No default value but will be composed conditionally based on platform
cmdline_iommu=

cmdline_isolation=+isolcpus=managed_irq,${isolated_cores}
 
cmdline_realtime_nohzfull=+nohz_full=${isolated_cores}
cmdline_realtime_nosoftlookup=+nosoftlockup
cmdline_realtime_common=+skew_tick=1 rcutree.kthread_prio=11
 
# No default value but will be composed conditionally based on platform
cmdline_power_performance=
 
# No default value but will be composed conditionally based on platform
cmdline_idle_poll=
 
 

[rtentsk]

Trying to solve the root issue from this bug: https://issues.redhat.com/browse/OCPBUGS-39199?focusedId=26104570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26104570 

To fix this, we need each of the sync functions to be able to individually clear a CO degrade that it has set earlier. Our current flow only clears a CO degrade when all of the sync functions are successful, which tends to be problematic if the operator gets stuck in one of the sync functions. We typically see this for syncRequiredMachineConfigPools, which waits until the master nodes have finished updating during an upgrade.
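
A minimal Go sketch of the per-sync-function pattern described above; the types and names are illustrative and not the operator's actual API. Each sync function owns a distinct degraded condition and clears only that condition on its own success, instead of relying on a single clear after all sync functions pass.

package main

import "fmt"

// condition is a minimal stand-in for a ClusterOperator status condition.
type condition struct {
    Type    string
    Status  bool // true == Degraded
    Message string
}

type operatorStatus struct{ conditions map[string]condition }

// setDegraded records or clears a degraded condition scoped to one sync function.
func (s *operatorStatus) setDegraded(syncName string, err error) {
    condType := syncName + "Degraded"
    if err != nil {
        s.conditions[condType] = condition{Type: condType, Status: true, Message: err.Error()}
        return
    }
    // The sync function clears only the condition it owns, so a failure in a
    // later sync (e.g. syncRequiredMachineConfigPools waiting on masters)
    // cannot keep an unrelated degrade stuck.
    delete(s.conditions, condType)
}

func main() {
    status := &operatorStatus{conditions: map[string]condition{}}
    status.setDegraded("syncRenderConfig", fmt.Errorf("render failed"))
    status.setDegraded("syncRenderConfig", nil) // cleared independently
    fmt.Println(len(status.conditions))         // 0
}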

Description of problem:

   After https://github.com/openshift/api/pull/2076, the validation in the image registry operator for the Power VS platform does not match the API's expectations. 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

'Channel' and 'Version' dropdowns do not collapse if the user does not select an option    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-04-113014    

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Operator Installation page OR the Operator Install details page
       eg: /operatorhub/ns/openshift-console?source=["Red+Hat"]&details-item=datagrid-redhat-operators-openshift-marketplace&channel=stable&version=8.5.4
       /operatorhub/subscribe?pkg=datagrid&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=openshift-console&channel=stable&version=8.5.4&tokenizedAuth=
    2. Click the Channel/Update channel OR 'Version' dropdown list
    3. Click the dropdown again
    

Actual results:

The dropdown list does not collapse; it only collapses if the user selects an option or clicks another area.

Expected results:

The dropdown should collapse after it is clicked again.

Additional info:

    

Description of problem:

s2i conformance test appears to fail permanently on OCP 4.16.z
    

Version-Release number of selected component (if applicable):

4.16.z
    

How reproducible:

Since 2024-11-04 at least
    

Steps to Reproduce:

    Run OpenShift build test suite in PR
    

Actual results:

Test fails - root cause appears to be that a built/deployed pod crashloops
    

Expected results:

Test succeeds
    

Additional info:

Job history https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-openshift-controller-manager-release-4.16-e2e-gcp-ovn-builds
    
time="2025-02-11T04:45:39Z" level=debug msg="I0211 04:45:39.801815     670 machine_controller.go:359] \"Skipping deletion of Kubernetes Node associated with Machine as it is not allowed\" controller=\"machine\" controllerGroup=\"cluster.x-k8s.io\" controllerKind=\"Machine\" Machine=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-il9yv8px-da9f5-526v6-bootstrap\" reconcileID=\"1cc19254-df46-4194-a020-c87f9b6eebb5\" Cluster=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-0\" Cluster=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-0\" Node=\"\" cause=\"noderef is nil\""
time="2025-02-11T04:45:39Z" level=debug msg="I0211 04:45:39.809203     712 vimmachine.go:384] \"Updated VSphereVM\" controller=\"vspheremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"VSphereMachine\" VSphereMachine=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-il9yv8px-da9f5-526v6-bootstrap\" reconcileID=\"57d3cf64-4277-4b55-9f98-bfdf6030f4b7\" Machine=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-bootstrap\" Cluster=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-0\" Cluster=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-0\" VSphereCluster=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-0\" VSphereVM=\"openshift-cluster-api-guests/ci-op-il9yv8px-da9f5-526v6-bootstrap\""
time="2025-02-11T04:50:39Z" level=warning msg="Timeout deleting bootstrap machine: context deadline exceeded"
time="2025-02-11T04:50:39Z" level=info msg="Shutting down local Cluster API controllers..."

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview/1889163885148114944/artifacts/e2e-vsphere-ovn-techpreview/ipi-install-install/artifacts/.openshift_install-1739249869.log

Description of problem:

Every import of the react-redux useSelector and useDispatch hooks must have ts and eslint override comments because @types/react-redux is out of sync with react-redux and there are no type definitions for these hooks.

Version-Release number of selected component (if applicable):

4.16.0

How reproducible:

    

Steps to Reproduce:

    1. Import useSelector or useDispatch hook    

Actual results:

A TypeScript error is shown: '"react-redux"' has no exported member named 'useSelector'

Expected results:

These hooks can be imported without TypeScript errors.

Was disabled here https://issues.redhat.com/browse/MON-3959

It'll need a rewrite/some mocking. (It's good to test that MCO detects changes and syncs, but maybe we could do that in a dry-run way.)

Description of problem:

OpenShift internal registry panics when deploying OpenShift on AWS ap-southeast-5 region    

Version-Release number of selected component (if applicable):

OpenShift 4.15.29    

How reproducible:

Always    

Steps to Reproduce:

    1. Deploy OpenShift 4.15.29 on AWS ap-southeast-5 region
    2. The cluster gets deployed but the image-registry Operator is not available and image-registry pods get in CrashLoopBackOff state
    

Actual results:

panic: invalid region provided: ap-southeast-5

goroutine 1 [running]:
github.com/distribution/distribution/v3/registry/handlers.NewApp({0x2983cd0?, 0xc00005c088?}, 0xc000640c00)
    /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:130 +0x2bf1
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp({0x2983cd0, 0xc00005c088}, 0x0?, {0x2986620?, 0xc000377560})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:96 +0xb9
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp({0x2983cd0?, 0xc00005c088}, {0x296fa38?, 0xc0008e4148}, 0xc000640c00, 0xc000aa6140, {0x0?, 0x0})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:138 +0x485
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer({0x2983cd0, 0xc00005c088}, 0xc000640c00, 0xc000aa6140)
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:212 +0x38a
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute({0x2968300, 0xc000666408})
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:166 +0x86b
main.main()
    /go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:93 +0x496
    

Expected results:

The image-registry Operator and pods are available.

Additional info:

We can assume the results will be the same when deploying 4.16 and 4.17, but this cannot be tested yet as only 4.15 is working in this region. Another bug will be opened for the Installer to solve the issues when deploying in this region.

Description of problem:

From the ARO team evaluating oc-mirror: setting --parallel-images processes one more image than specified.
    

Version-Release number of selected component (if applicable):

    4.18
    

How reproducible:

    Always
    

Steps to Reproduce:

Use --parallel-images=1
    

Actual results:

    See attached image

Expected results:

Should process only 1 image at a time

Additional info:

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

4.19 CI payloads have now failed multiple times in a row on hypershift-e2e, on the same two Karpenter tests.

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1881972453123559424

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn/1882074832774893568

: TestKarpenter/Main expand_less 	0s
{Failed  === RUN   TestKarpenter/Main
    util.go:153: Successfully waited for kubeconfig to be published for HostedCluster e2e-clusters-t8sw8/example-vr6sz in 25ms
    util.go:170: Successfully waited for kubeconfig secret to have data in 25ms
    util.go:213: Successfully waited for a successful connection to the guest API server in 25ms
    karpenter_test.go:52: 
        Expected success, but got an error:
            <*meta.NoKindMatchError | 0xc002d931c0>: 
            no matches for kind "NodePool" in version "karpenter.sh/v1"
            {
                GroupKind: {
                    Group: "karpenter.sh",
                    Kind: "NodePool",
                },
                SearchedVersions: ["v1"],
            }
    --- FAIL: TestKarpenter/Main (0.10s)
}
: TestKarpenter expand_less 	27m15s
{Failed  === RUN   TestKarpenter
=== PAUSE TestKarpenter
=== CONT  TestKarpenter
    hypershift_framework.go:316: Successfully created hostedcluster e2e-clusters-t8sw8/example-vr6sz in 24s
    hypershift_framework.go:115: Summarizing unexpected conditions for HostedCluster example-vr6sz 
    util.go:1699: Successfully waited for HostedCluster e2e-clusters-t8sw8/example-vr6sz to have valid conditions in 25ms
    hypershift_framework.go:194: skipping postTeardown()
    hypershift_framework.go:175: skipping teardown, already called
--- FAIL: TestKarpenter (1635.11s)
}

https://github.com/openshift/hypershift/pull/5404 is in the first payload and looks extremely related.

Description of problem:

The Power VS installer uses a hard-coded list of supported machine types. However, this does not keep up as new types are added. Therefore, switch to querying the datacenter for the currently supported types.
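
A minimal Go sketch of the suggested approach, assuming a hypothetical listDatacenterSystemTypes helper in place of the real Power IaaS client call; the system type and zone names are only examples.

package main

import "fmt"

// listDatacenterSystemTypes is a hypothetical stand-in for querying the Power VS
// datacenter for the system types it currently offers.
func listDatacenterSystemTypes(zone string) ([]string, error) {
    // A real implementation would call the service; this just illustrates the shape.
    return []string{"s922", "e980", "s1022"}, nil
}

// validateSystemType accepts whatever the datacenter reports instead of a
// hard-coded list, so newly added machine types need no installer change.
func validateSystemType(zone, sysType string) error {
    types, err := listDatacenterSystemTypes(zone)
    if err != nil {
        return fmt.Errorf("could not query system types for %s: %w", zone, err)
    }
    for _, t := range types {
        if t == sysType {
            return nil
        }
    }
    return fmt.Errorf("system type %q is not offered in %s (available: %v)", sysType, zone, types)
}

func main() {
    fmt.Println(validateSystemType("dal10", "s1022"))
}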
    

Description of problem:

platform.powervs.clusterOSImage is no longer supported, but it is still accepted in the install configuration YAML file.

Version-Release number of selected component (if applicable):

4.19.0    

Expected results:

In case the platform.powervs.clusterOSImage key is specified, a warning "The value of platform.powervs.clusterOSImage will be ignored." should be shown and the value of the key should be ignored.
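
A minimal Go sketch of the expected behavior, using an assumed stand-in for the install-config platform type; the real installer types and warning plumbing differ.

package main

import "fmt"

// powerVSPlatform is a minimal stand-in for the install-config Power VS platform section.
type powerVSPlatform struct {
    ClusterOSImage string
}

// warnDeprecatedFields returns warnings for keys that are still parsed but no
// longer honoured, matching the behaviour described above.
func warnDeprecatedFields(p *powerVSPlatform) []string {
    var warnings []string
    if p.ClusterOSImage != "" {
        warnings = append(warnings,
            "The value of platform.powervs.clusterOSImage will be ignored.")
        p.ClusterOSImage = "" // drop the ignored value so nothing downstream consumes it
    }
    return warnings
}

func main() {
    p := powerVSPlatform{ClusterOSImage: "rhcos-419"}
    for _, w := range warnDeprecatedFields(&p) {
        fmt.Println("WARNING:", w)
    }
}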

Description of problem:

Azure MAG cluster install failed. The azure-cloud-controller-manager pod is stuck in CrashLoopBackOff state; the endpoint appears to be incorrect and should be https://management.usgovcloudapi.net/

The code is here https://github.com/openshift/cloud-provider-azure/blob/main/pkg/provider/azure.go#L490 

2025-02-17T23:49:41.894495557Z E0217 23:49:41.894412       1 azure.go:490] InitializeCloudFromConfig: failed to sync regional zones map for the first time: list zones: GET https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/providers/Microsoft.Compute
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------
2025-02-17T23:49:41.894495557Z RESPONSE 404: 404 Not Found
2025-02-17T23:49:41.894495557Z ERROR CODE: SubscriptionNotFound
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------
2025-02-17T23:49:41.894495557Z {
2025-02-17T23:49:41.894495557Z   "error": {
2025-02-17T23:49:41.894495557Z     "code": "SubscriptionNotFound",
2025-02-17T23:49:41.894495557Z     "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
2025-02-17T23:49:41.894495557Z   }
2025-02-17T23:49:41.894495557Z }
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------
2025-02-17T23:49:41.894495557Z F0217 23:49:41.894467       1 controllermanager.go:353] Cloud provider azure could not be initialized: could not init cloud provider azure: list zones: GET https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/providers/Microsoft.Compute
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------
2025-02-17T23:49:41.894495557Z RESPONSE 404: 404 Not Found
2025-02-17T23:49:41.894495557Z ERROR CODE: SubscriptionNotFound
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------
2025-02-17T23:49:41.894495557Z {
2025-02-17T23:49:41.894495557Z   "error": {
2025-02-17T23:49:41.894495557Z     "code": "SubscriptionNotFound",
2025-02-17T23:49:41.894495557Z     "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
2025-02-17T23:49:41.894495557Z   }
2025-02-17T23:49:41.894495557Z }
2025-02-17T23:49:41.894495557Z --------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Install cluster on azure mag
    2.
    3.
    

Actual results:

Cluster install failed    

Expected results:

Cluster install succeeds

Additional info:

This is only found on 4.19 Azure MAG clusters.

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-azure-mag-ipi-fips-f7/1890522382758580224
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-azure-mag-ipi-fullyprivate-f7/1891606065993224192

Description of problem:

In tech preview, the cluster-capi-operator is unable to automatically generate the core Cluster on Power VS clusters.

It fails with the following error: `failed to get infra cluster`

This is due to a wrong infra cluster kind string for the IBMPowerVSCluster InfraCluster Kind.

Version-Release number of selected component (if applicable):

4.19 4.18    

How reproducible:

    always

Steps to Reproduce:

    1. stand up a TechPreview powervs cluster
    2. Run e2e tests from the cluster-capi-operator repo
    3. check the cluster-capi-operator logs
    

Actual results:

    failed to get infra cluster

Expected results:

should get the infra cluster and create a core cluster    

Additional info:

    

Description of problem:

    Destroying a private cluster doesn't delete the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-23-202329

How reproducible:

    Always

Steps to Reproduce:

1. pre-create vpc network/subnets/router and a bastion host
2. "create install-config", and then insert the network settings under platform.gcp, along with "publish: Internal" (see [1])
3. "create cluster" (use the above bastion host as http proxy)
4. "destroy cluster" (see [2])

Actual results:

    Although "destroy cluster" completes successfully, the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator are not deleted (see [3]), which leads to deleting the vpc network/subnets failure.

Expected results:

    The forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator should also be deleted during "destroy cluster".

Additional info:

FYI one history bug https://issues.redhat.com/browse/OCPBUGS-37683    

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:

When installing 4.19 OCP on OSP, the afterburn-hostname.service is failing. The installation progresses, but that service fails and times out.

Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]: Jan 27 21:19:10.475 WARN failed to locate config-drive, using the metadata service API instead
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]: Jan 27 21:19:10.489 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt #1
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]: Jan 27 21:19:10.498 INFO Fetch successful
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]: Error: failed to run
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]: Caused by:
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]:     0: writing hostname
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]:     1: failed to create file "/dev/stdout"
Jan 27 21:19:10 ostest-qjbdz-master-1 bash[3245]:     2: Permission denied (os error 13)
Jan 27 21:19:10 ostest-qjbdz-master-1 hostnamectl[3249]: Too few arguments.

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

USER STORY:

As an OpenShift administrator, I want to be able to configure thin provisioning for my new data disks so that I can adjust behavior that may differ from my default storage policy.

DESCRIPTION:

Currently, the Machine API changes force the thin provisioned flag to true. We need to add a flag to allow the admin to configure this. The default behavior will be to not set the flag and use the default storage policy.

ACCEPTANCE CRITERIA:

  • API has new flag
  • Machine API has been modified to use the new flag if set; otherwise, do not set the thinProvisioned attribute during clone.
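
A minimal Go sketch of the proposed shape, with assumed type and field names (the final API may differ): the flag is an optional pointer so that leaving it unset falls back to the default storage policy.

package main

import "fmt"

// VSphereDataDisk is a minimal stand-in for the Machine API data disk spec; the
// ThinProvisioned field name is assumed, not the final API shape.
type VSphereDataDisk struct {
    SizeGiB int32
    // ThinProvisioned is optional: nil means "follow the default storage policy",
    // matching the acceptance criteria above.
    ThinProvisioned *bool
}

// cloneDiskSpec shows how the clone path would only set the attribute when the
// admin opted in, instead of always forcing thin provisioning.
func cloneDiskSpec(d VSphereDataDisk) map[string]interface{} {
    spec := map[string]interface{}{"sizeGiB": d.SizeGiB}
    if d.ThinProvisioned != nil {
        spec["thinProvisioned"] = *d.ThinProvisioned
    }
    return spec
}

func main() {
    thick := false
    fmt.Println(cloneDiskSpec(VSphereDataDisk{SizeGiB: 100}))                          // default policy
    fmt.Println(cloneDiskSpec(VSphereDataDisk{SizeGiB: 100, ThinProvisioned: &thick})) // explicit false
}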

 

The UDN CRD is only installed in the OpenShift cluster when the `NetworkSegmentation` feature gate is enabled.

Hence, the tests must reflect this.

 

Description of problem:

The openstack-manila-csi-controllerplugin-csi-driver container is not functional on the first run; it needs to restart once and then it is fine. This causes HCP e2e to fail on the EnsureNoCrashingPods test.

Version-Release number of selected component (if applicable):

4.19, 4.18

How reproducible:

Deploy Shift on Stack with Manila available in the cloud.

Actual results:

The openstack-manila-csi-controllerplugin pod will restart once and then it'll be functional.

Expected results:

No restart should be needed. This is likely an orchestration issue.

 

Issue present in Standalone clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-storage-operator/544/pull-ci-openshift-cluster-storage-operator-master-e2e-openstack-manila-csi/1862115443632771072/artifacts/e2e-openstack-manila-csi/gather-extra/artifacts/pods/openshift-manila-csi-driver_openstack-manila-csi-nodeplugin-5cqcw_csi-driver_previous.log

Also present in HCP clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hypershift/5138/pull-ci-openshift-hypershift-main-e2e-openstack/1862138522127831040/artifacts/e2e-openstack/hypershift-openstack-e2e-execute/artifacts/TestCreateCluster/namespaces/e2e-clusters-kktnx-example-cq7xp/core/pods/logs/openstack-manila-csi-controllerplugin-675ff65bf5-gqf65-csi-driver-previous.log

In the OpenShift Web Console, under Storage → VolumeSnapshots, VolumeSnapshots are not displayed when selecting "All Projects" as the project.
However, when selecting a specific project, the VolumeSnapshots are displayed correctly.
This issue only occurs when "All Projects" is selected.

Currently, in most assisted installer component CI images, we don't have a way to tell from which commit reference the image was built. Since we use an image stream for each component, and we import these streams from one CI component configuration to another, we might end up with images that are not up to date. In this case, we would like to have the ability to check whether this is actually the case.

We should add golangci-lint to our `make verify`. This will catch common golang errors. There are a number of linters we can include if we want - https://golangci-lint.run/usage/linters/.

In addition, we can introduce GCI through golangci-lint so that all imports are sorted in a proper format.

User Story

The Cluster API provider Azure has a deployment manifest that deploys Azure service operator from mcr.microsoft.com/k8s/azureserviceoperator:v2.6.0 image.

We need to set up OpenShift builds of the operator and update the manifest generator to use the OpenShift image.

Background

Azure has split the API calls out of their provider so that they now use the service operator. We now need to ship the service operator as part of the CAPI operator to make sure that we can support CAPZ.

Steps

  • Request for the ASO repo to be created and built in OpenShift
  • Set up release repo configuration and basic testing for the ASO repo
  • Set up ART build for the ASO repo
  • Fill out prodsec survey (https://issues.redhat.com/browse/PSSECDEV-7783)
  • Update the manifest generator in cluster-capi-operator to replace the image with our image stream.
  • The manifest generator should know how to filter parts of ASO so that we only ship the parts of Azure we care about
  • Create manifests for deploying the subset of required ASO that we need

Stakeholders

  • Cluster Infrastructure

Definition of Done

  • CAPZ deploys OpenShift version of ASO
  • Docs
  •  
  • Testing
  •  

Description of problem:

    Use the /livez/ping endpoint to proxy unauthenticated health checks on the master. /livez/ping is a faster and more reliable endpoint, so it should be used over /version.

Version-Release number of selected component (if applicable):

    Impacts all releases, fix can be limited to 4.18

How reproducible:

    Always

Steps to Reproduce:

    1. https://github.com/openshift/hypershift/blob/6356dab0c28b77cca1a74d911f7154f70a3cb68d/hypershift-operator/controllers/nodepool/apiserver-haproxy/haproxy.cfg#L26
    2. https://github.com/openshift/hypershift/blob/6356dab0c28b77cca1a74d911f7154f70a3cb68d/hypershift-operator/controllers/nodepool/haproxy.go#L381     

Actual results:

    /version

Expected results:

    /livez/ping
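
A minimal Go sketch of the intended change in the generated HAProxy health check; the constant and function are illustrative and not the actual HyperShift code.

package main

import "fmt"

// healthCheckPath is the endpoint the data-plane HAProxy probes on the hosted
// API server. Switching from /version to /livez/ping keeps the probe
// unauthenticated while using the faster, more reliable endpoint.
const healthCheckPath = "/livez/ping" // previously "/version"

// haproxyBackendCheck is an illustrative fragment of the generated config;
// the real template lives in the HyperShift HAProxy config generation code.
func haproxyBackendCheck() string {
    return fmt.Sprintf("option httpchk GET %s", healthCheckPath)
}

func main() {
    fmt.Println(haproxyBackendCheck())
}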

Additional info:

    

One of our customers observed this issue. To reproduce, in my test cluster I intentionally increased the overall CPU limits to over 200% and monitored the cluster for more than 2 days. However, I did not see the KubeCPUOvercommit alert, which ideally should trigger after 10 minutes of overcommitment.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2654m (75%)   8450m (241%)
  memory             5995Mi (87%)  12264Mi (179%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

 

OCP console → Observe → Alerting → alerting rules, then select the `KubeCPUOvercommit` alert.

Expression:

sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics"}) - (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})) > 0 and (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})) > 0

Description of problem:

While testing on https://github.com/openshift/machine-config-operator/pull/4800#issuecomment-2628801965, we found the worker cannot fetch its Ignition config successfully, so it could not start properly.

The worker machine console serial log shows:

[*     ] A start job is running for Ignition (fetch) (50min 30s / no limit)
[ 3034.007742] ignition[823]: GET https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:22623/config/worker: attempt #607
[ 3034.044185] ignition[823]: GET error: Get "https://api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com:22623/config/worker": dial tcp: lookup api-int.gpei-0201-gcpdns.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host
    

Version-Release number of selected component (if applicable):

Payload built based on https://github.com/openshift/machine-config-operator/pull/4800, https://prow.ci.openshift.org/view/gs/test-platform-results/logs/release-openshift-origin-installer-launch-aws-modern/1885498038630223872 

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a 4.19 cluster with customDNS feature enabled:

featureSet: CustomNoUpgrade
featureGates:
- GCPClusterHostedDNS=true

    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Instead of the synthesized "operator conditions" test, we should create and schedule real pods via the deployment controller to test kube-apiserver / KCM / scheduler functionality.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The Installer does not seem to be checking the server group quota before installing.

With OSASINFRA-2570, each cluster will need two server groups. Additionally, it might be worth checking that server groups are set to accept at least n members (where n is the number of Compute replicas).

Refer to https://github.com/okd-project/okd/discussions/2060

 

Upgrade to scos-release:4.16.0-okd-scos.0 from 4.15.0-0.okd-scos-2024-01-18-223523 is, for me, stuck on the network-operator rolling out the DaemonSet "/openshift-multus/whereabouts-reconciler".

The whereabouts-reconciler Pod is crashlooping with:

[entrypoint.sh] FATAL ERROR: Unsupported OS ID=centos

Indeed, its image is based on:

cat /etc/os-release:
NAME="CentOS Stream"
VERSION="9"
ID="centos"
...

But entrypoint.sh does not handle the centos ID:

# Collect host OS information
. /etc/os-release
rhelmajor=
# detect which version we're using in order to copy the proper binaries
case "${ID}" in
  rhcos|scos)
    rhelmajor=$(echo ${RHEL_VERSION} | cut -f 1 -d .)
    ;;
  rhel)
    rhelmajor=$(echo "${VERSION_ID}" | cut -f 1 -d .)
    ;;
  fedora)
    if [ "${VARIANT_ID}" == "coreos" ]; then
      rhelmajor=8
    else
      log "FATAL ERROR: Unsupported Fedora variant=${VARIANT_ID}"
      exit 1
    fi
    ;;
  *) log "FATAL ERROR: Unsupported OS ID=${ID}"; exit 1
    ;;
esac
     
    It is the same OS in scos-release:4.16.0-okd-scos.1.

We've seen a high rate of failure for this test since last Thursday.

 


(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]

Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 78.33%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-11-04T00:00:00Z
End Time: 2024-11-18T23:59:59Z
Success Rate: 78.33%
Successes: 47
Failures: 13
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 100.00%
Successes: 133
Failures: 0
Flakes: 0

View the test details report for additional context.

Description of problem:

    I am getting an error while generating PXE images (s390x) through the Agent-based installer, and it seems the error is hit when the base ISO image used is ISO 9660 CD-ROM filesystem data 'rhcos-9.6.20250121-0'.

# ./openshift-install agent create pxe-files
INFO Configuration has 3 master replicas and 2 worker replicas 
INFO The rendezvous host IP (node0 IP) is 172.23.231.117 
INFO Extracting base ISO from release payload     
INFO Verifying cached file                        
INFO Base ISO obtained from release and cached at [/root/.cache/agent/image_cache/coreos-s390x.iso] 
INFO Consuming Install Config from target directory 
INFO Consuming Agent Config from target directory 
FATAL failed to write asset (Agent Installer PXE Files) to disk: open /tmp/agent1716972189/images/pxeboot/kernel.img: no such file or directory 

Has any change gone in which could be causing this issue?
I am not facing it with the older version - ISO 9660 CD-ROM filesystem data 'rhcos-418.94.202410090804-0' (bootable)

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a pxe image with latest 4.19 for ABI on s390x.
    2.
    3.
    

Actual results:

    Image creation fails with error (FATAL failed to write asset (Agent Installer PXE Files) to disk: open /tmp/agent1716972189/images/pxeboot/kernel.img: no such file or directory )

Expected results:

    PXE images should get created successfully

Additional info:

    

Currently there are some cases which have a high failure ratio. We need to identify the root causes and fix the automation issues as soon as possible, so that our automation can be more stable and more efficient for our fast release tests.
With each release and each sprint, this is ongoing work.

Description of problem:

IPI deployment of OCP on vsphere fails to complete.


Network operator in degraded state:

spec: {}
status:
  conditions:
  - lastTransitionTime: "2024-07-04T10:30:43Z"
    message: 'Error while synchronizing spec and status of infrastructures.config.openshift.io/cluster:
      Error on validating API VIPs and Machine Networks: VIP ''10.xx.xx.xx'' cannot
      be found in any machine network'

oc get infrastructure shows the following:

 spec:
    cloudConfig:
      key: config
      name: cloud-provider-config
    platformSpec:
      type: VSphere
      vsphere:
        apiServerInternalIPs:
        - 10.xx.xx.xx
        failureDomains:
        - name: generated-failure-domain
          region: generated-region
          server: vcenter.url.com
          topology:
            computeCluster: /OpenShift-DC/host/OCP
            datacenter: OpenShift-DC
            datastore: /OpenShift-DC/datastore/OCP-PNQ-Datastore
            networks:
            - network-id
            resourcePool: /OpenShift-DC/host/OCP/Resources
            template: /OpenShift-DC/vm/ocp4ipi07-txnr6-rhcos-generated-region-generated-zone
          zone: generated-zone
        ingressIPs:
        - 10.xx.xx.xx
        machineNetworks:
        - 10.0.0.0/16


The machine network is added with the default value `10.0.0.0/16`, which was not provided when running the create cluster command.

Version-Release number of selected component (if applicable):

    

How reproducible:

    100 %

Steps to Reproduce:

    1. openshift-installer create cluster
    2. post installation check network operator status
    3. oc get infrastructure -o yaml
    

Actual results:

Network operator in degraded state    

Expected results:

All operator should be available.

Additional info:

    

Description of problem:

4.18: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-etcd-operator-1384-ci-4.18-e2e-gcp-ovn/1877471255607644160

40e0ff5ee27c98d0 = bootstrap
1edfd46c2c62c92c = master-0
a0dabab5ae6f967b = master-1
7a509502206916e3 = master-2

22:20:03 (term 2)      1 bootstrap_teardown_controller.go:144] Removing bootstrap member [40e0ff5ee27c98d0]
22:20:03 (term 2)      1 bootstrap_teardown_controller.go:151] Successfully removed bootstrap member [40e0ff5ee27c98d0]
22:20:04 (term 2) leader lost
22:20:04 (term 3) leader=master-1
{"level":"warn","ts":"2025-01-09T22:20:19.459912Z","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"a0dabab5ae6f967b stepped down to follower since quorum is not active"}

4.19: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade/1877300145469526016

26129ddf465384ed = bootstrap 10.0.0.3
fb4ff9e6f7280fcb = master-0  10.0.0.5
283343bd1d4e0df2 = master-1  10.0.0.6
896ca4df9c7807c1 = master-2  10.0.0.4

bootstrap_teardown_controller.go:144] Removing bootstrap member [26129ddf465384ed]
I0109 10:48:33.201639       1 bootstrap_teardown_controller.go:151] Successfully removed bootstrap member [26129ddf465384ed]
{"level":"info","ts":"2025-01-09T10:48:34.588799Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"896ca4df9c7807c1 became leader at term 3"}
{"level":"warn","ts":"2025-01-09T10:48:51.583385Z","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"896ca4df9c7807c1 stepped down to follower since quorum is not active"}

Version-Release number of selected component (if applicable):

 

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

ap-southeast-5 is supported as of https://issues.redhat.com/browse/CORS-3640, but it is missing from the installer's survey:

? Region  [Use arrows to move, type to filter, ? for more help]
  ap-southeast-1 (Asia Pacific (Singapore))
  ap-southeast-2 (Asia Pacific (Sydney))
  ap-southeast-3 (Asia Pacific (Jakarta))
> ap-southeast-4 (Asia Pacific (Melbourne))
  ca-central-1 (Canada (Central))
  ca-west-1 (Canada West (Calgary))
  eu-central-1 (Europe (Frankfurt))

    

Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-24-145308

    

How reproducible:
Always

    

Steps to Reproduce:

    1. Run openshift-install create cluster --dir instdir1
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In 4.19, whereabouts-reconciler pods are CrashLoopBackOff after creating additionalNetworks.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-03-04-213423

How reproducible:

Always    

Steps to Reproduce:

1. After an OCP cluster is up, run `oc edit networks.operator.openshift.io cluster` and add:

  additionalNetworks:
    - name: whereabouts-shim
      namespace: default
      type: Raw
      rawCNIConfig: |-
        {
          "cniVersion": "0.3.0",
          "type": "bridge",
          "name": "cnitest0",
          "ipam": {
            "type": "whereabouts",
            "subnet": "192.0.2.0/24"
          }
        }

Actual results:

whereabouts-reconciler pods are created but go into CrashLoopBackOff in 4.19:

 $ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-0.nightly-2025-03-04-213423   True        False         3h22m   Error while reconciling 4.19.0-0.nightly-2025-03-04-213423: the cluster operator network is degraded
$ oc get pod | grep whereabouts-reconciler
whereabouts-reconciler-dd5w8                   0/1     CrashLoopBackOff    14 (5m5s ago)    51m
whereabouts-reconciler-lbtmz                   0/1     CrashLoopBackOff    14 (5m4s ago)    51m
whereabouts-reconciler-m6qx5                   0/1     CrashLoopBackOff    14 (4m47s ago)   51m
whereabouts-reconciler-njkhm                   0/1     CrashLoopBackOff    14 (4m54s ago)   51m
whereabouts-reconciler-pg5hr                   0/1     CrashLoopBackOff    14 (5m5s ago)    51m
whereabouts-reconciler-q24fw                   0/1     CrashLoopBackOff    14 (4m53s ago)   51m
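
To root-cause the crash loop, the logs and last state of a failing pod can be pulled, e.g. (hedged sketch; the DaemonSet is assumed to live in openshift-multus, and the pod name is taken from the listing above):

$ oc -n openshift-multus logs daemonset/whereabouts-reconciler --tail=50
$ oc -n openshift-multus describe pod whereabouts-reconciler-dd5w8 | grep -A 5 'Last State'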
   

Expected results:

whereabouts-reconciler pods should be created and Running, as they are in 4.18:

$ oc get pod | grep whereabouts-reconciler
whereabouts-reconciler-689jq                   1/1     Running   0             33s
whereabouts-reconciler-8nzcv                   1/1     Running   0             33s
whereabouts-reconciler-cr8qp                   1/1     Running   0             33s
whereabouts-reconciler-qqv6k                   1/1     Running   0             33s
whereabouts-reconciler-w48w7                   1/1     Running   0             33s
whereabouts-reconciler-xxl2f                   1/1     Running   0             33s
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2025-03-05-050145   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-03-05-050145
  

Additional info:

This is a 4.19 regression bug.

Description of problem:

The UI to create a new project is asking for a "Subnet", while it actually expects the CIDR for the subnet.
    

Version-Release number of selected component (if applicable):

4.18.0-rc.4
    

How reproducible:

100%
    

Steps to Reproduce:

    1. Switch to the Developer console
    2. Create a new project
    3. Go to the Network tab
    

Actual results:

"Subnet" expects a CIDRO
    

Expected results:

"Subnet CIDR" expects a CIDR
    

Additional info:

https://ibb.co/CQX7RTJ    

Containerized docs build:

  • Uses an old image in quay.io/hypershift/mkdocs-material
  • Defaults to docker
  • Creates a "site" dir in the repository that has no business being there
  • Is privileged, meaning anything it creates can't be modified or removed by the current user

We should make this podman rootless friendly and keep a tighter control on versions.
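
A minimal rootless-friendly sketch, assuming the upstream squidfunk/mkdocs-material image with a pinned version tag (the tag below is illustrative) and the docs living in the current directory; `mkdocs serve` builds to a temporary directory, so no "site" dir lands in the repository:

$ podman run --rm -it \
    --userns=keep-id \
    -p 8000:8000 \
    -v "$(pwd)":/docs:Z \
    docker.io/squidfunk/mkdocs-material:9.5 \
    serve --dev-addr 0.0.0.0:8000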

Adding a new device for accelerator monitoring in order to start inventory data flowing for the current GPUs that we have on the cluster.

Overview

This task is to revisit and refactor the "fairly complex code and logic".

Outcome

Simplify and refactor.

Keep separation of concerns.

Use common functions to reduce duplicated code.

Check why performance slowed down when oc-mirror started to use the OLM logic to filter the catalog.

User Story:

As a (user persona), I want to be able to:

  • install Hypershift with the minimum set of required CAPZ CRDs

so that I can achieve

  • CRDs not utilized by Hypershift shouldn't be installed 

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • Today, the `hypershift install` command installs ALL CAPI provider CRDs, including for example `ROSACluster` and `ROSAMachinePool`, which are not needed by Hypershift.
  • We need to review and remove any CRD that is not required (see the audit sketch below).
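
A quick audit sketch (assuming a cluster where `hypershift install` has already been run) to list which CAPI provider CRDs actually got applied:

$ oc get crd -o name | grep 'cluster.x-k8s.io' | sort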
     

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

Starting with OCPBUGS-41849, "pod-security.kubernetes.io/*-version" is set to "latest". But the openshift-operator-lifecycle-manager and openshift-marketplace namespaces still use the old pod-security.kubernetes.io/*-version values v1.24 and v1.25 respectively, hence this Jira tracker.

 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-09-26-011209
4.17.0-rc.6
4.18.0-0.nightly-2024-09-26-222528

How reproducible:

Always

Steps to Reproduce:

Check `oc get ns -o yaml` in 4.16 / 4.17 / 4.18 envs.
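
A hedged one-liner to surface just the PSa version labels on the affected namespaces:

$ for ns in openshift-operator-lifecycle-manager openshift-marketplace openshift-operators; do
    echo "== ${ns}"
    oc get ns "${ns}" --show-labels --no-headers | tr ',' '\n' | grep 'pod-security.kubernetes.io/.*-version'
  done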

Actual results:

All envs show the openshift-operator-lifecycle-manager and openshift-marketplace namespaces still use old pod-security.kubernetes.io/*-version v1.24 and v1.25 respectively:
- apiVersion: v1
  kind: Namespace
  metadata:
    ...
    labels:
      ...
      pod-security.kubernetes.io/audit: baseline
      pod-security.kubernetes.io/audit-version: v1.25
      pod-security.kubernetes.io/enforce: baseline
      pod-security.kubernetes.io/enforce-version: v1.25
      pod-security.kubernetes.io/warn: baseline
      pod-security.kubernetes.io/warn-version: v1.25
    name: openshift-marketplace
...
- apiVersion: v1
  kind: Namespace
  metadata:
    ...
    labels:
      ...
      pod-security.kubernetes.io/enforce: restricted
      pod-security.kubernetes.io/enforce-version: v1.24
    name: openshift-operator-lifecycle-manager
...
- apiVersion: v1
  kind: Namespace
  metadata:
    ...
    labels:
      kubernetes.io/metadata.name: openshift-operators
      openshift.io/scc: ""
      pod-security.kubernetes.io/enforce: privileged
      pod-security.kubernetes.io/enforce-version: v1.24
    name: openshift-operators
...

Expected results:

As OCPBUGS-41849 sets "pod-security.kubernetes.io/*-version" to "latest" starting in 4.17, the openshift-operator-lifecycle-manager and openshift-marketplace namespaces should not still use the old pod-security.kubernetes.io/*-version values v1.24 and v1.25 respectively.

The openshift-operators namespace should also be mentioned here: it still uses v1.24. Even though https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/2579-psp-replacement/README.md#versioning says "The privileged profile always means fully unconstrained and is effectively unversioned (specifying a version is allowed but ignored)", it is better not to pin v1.24.

Additional info:

As an action item from our most recent retrospective, we would like to investigate how to have a better signal on unit test coverage. One possibility is Codecov. 

USER STORY:

As a developer, I need to create e2e tests for the new vSphere Data Disk feature so that we have proper code coverage and meet the required metrics to allow the feature to become GA at some point in the future.

 

Required:

Need to create e2e tests that meet the metrics defined in the openshift/api project

Nice to have:

Move these e2e tests into the projects that should own them using the new external tests feature being added to origin.

 

ACCEPTANCE CRITERIA:

  • At least 5 tests
  • 95% of past tests have passed

 

ENGINEERING DETAILS:

Let's use this opportunity to start migrating e2e tests to the project where the new features are being added, which will remove the dependency of having to stage merging code in one repo and then updating the e2e tests in origin.