
4.17.8


Changes from 4.16.28

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

This outcome tracks the overall CoreOS Layering story as well as the technical items needed to converge CoreOS with RHEL image mode. This will provide operational consistency across the platforms.

ROADMAP for this Outcome: https://docs.google.com/document/d/1K5uwO1NWX_iS_la_fLAFJs_UtyERG32tdt-hLQM8Ow8/edit?usp=sharing
 

 

 

 

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

Description of problem:

When we activate the on-cluster build functionality in a pool with yum-based RHEL nodes, the pool becomes degraded, reporting this error:

  - lastTransitionTime: "2023-09-20T15:14:44Z"
    message: 'Node ip-10-0-57-169.us-east-2.compute.internal is reporting: "error
      running rpm-ostree --version: exec: \"rpm-ostree\": executable file not found
      in $PATH"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded


Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-09-15-233408

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster and add a yum-based RHEL node to the worker pool

(we used RHEL 8)

2. Create the necessary resources to enable the OCB functionality: the pull and push secrets and the on-cluster-build-config ConfigMap.

For example, we can use the following if we want to use the internal registry:

cat << EOF | oc create -f -
apiVersion: v1
data:
  baseImagePullSecretName: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
  finalImagePushSecretName: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
  finalImagePullspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image"
  imageBuilderType: ""
kind: ConfigMap
metadata:
  name: on-cluster-build-config
  namespace: openshift-machine-config-operator
EOF

The configuration doesn't matter as long as the OCB functionality can work.

3. Label the worker pool so that the OCB functionality is enabled

$ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

Actual results:

The RHEL node shows this log:


I0920 15:14:42.852742    1979 daemon.go:760] Preflight config drift check successful (took 17.527225ms)
I0920 15:14:42.852763    1979 daemon.go:2150] Performing layered OS update
I0920 15:14:42.868723    1979 update.go:1970] Starting transition to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/tc-67566@sha256:24ea4b12acf93095732ba457fc3e8c7f1287b669f2aceec65a33a41f7e8ceb01"
I0920 15:14:42.871625    1979 update.go:1970] drain is already completed on this node
I0920 15:14:42.874305    1979 rpm-ostree.go:307] Running captured: rpm-ostree --version
E0920 15:14:42.874388    1979 writer.go:226] Marking Degraded due to: error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:15:37.570503    1979 daemon.go:670] Transitioned from state: Working -> Degraded
I0920 15:15:37.570529    1979 daemon.go:673] Transitioned from degraded/unreconcilable reason  -> error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:15:37.574942    1979 daemon.go:2300] Not booted into a CoreOS variant, ignoring target OSImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3128a8e42fb70ab6fc276f7005e3c0839795e4455823c8ff3eca9b1050798b9
I0920 15:15:37.591529    1979 daemon.go:760] Preflight config drift check successful (took 16.588912ms)
I0920 15:15:37.591549    1979 daemon.go:2150] Performing layered OS update
I0920 15:15:37.591562    1979 update.go:1970] Starting transition to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/tc-67566@sha256:24ea4b12acf93095732ba457fc3e8c7f1287b669f2aceec65a33a41f7e8ceb01"
I0920 15:15:37.594534    1979 update.go:1970] drain is already completed on this node
I0920 15:15:37.597261    1979 rpm-ostree.go:307] Running captured: rpm-ostree --version
E0920 15:15:37.597315    1979 writer.go:226] Marking Degraded due to: error running rpm-ostree --version: exec: "rpm-ostree": executable file not found in $PATH
I0920 15:16:37.613270    1979 daemon.go:2300] Not booted into a CoreOS variant, ignoring target OSImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e3128a8e42fb70ab6fc276f7005e3c0839795e4455823c8ff3eca9b1050798b9



And the worker pool is degraded with this error:

  - lastTransitionTime: "2023-09-20T15:14:44Z"
    message: 'Node ip-10-0-57-169.us-east-2.compute.internal is reporting: "error
      running rpm-ostree --version: exec: \"rpm-ostree\": executable file not found
      in $PATH"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded


Expected results:


The pool should not be degraded.

Additional info:


Note: phase 2 target is tech preview.

Feature Overview

In the initial delivery of CoreOS Layering, administrators are required to provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or an enterprising administrator with some knowledge of OCP Builds could potentially set one up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc.) will find off-cluster builds to be too much friction for a single driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed (see the sketch after this list). If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.
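
For illustration, a minimal Containerfile of the kind described above could be supplied through the build configuration; the base stage name and the package are placeholders, and the exact interface is whatever the on-cluster build API ends up accepting:

cat > Containerfile <<'EOF'
# Layer one extra package on top of the base MachineConfig image.
# "configs" as the base stage and "usbguard" as the package are illustrative only.
FROM configs AS final
RUN rpm-ostree install usbguard && \
    ostree container commit
EOF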

The goal of this effort is to leverage OVN Kubernetes SDN to satisfy the networking requirements of both traditional and modern virtualization. This Feature describes the envisioned outcome and tracks its implementation.

Current state

In its current state, OpenShift Virtualization provides a flexible toolset allowing customers to connect VMs to the physical network. It also has limited secondary overlay network capabilities and Pod network support.

It suffers from several gaps: the topology of the default pod network is not suitable for typical VM workloads, so we miss out on many of the advanced capabilities of OpenShift networking, and we also don't have a good solution for public cloud. Another problem is that while we provide plenty of tools to build a network solution, we are not very good at guiding cluster administrators through configuring their network, leaving them to rely on their account team.

Desired outcome

Provide:

  • Networking solution for public cloud
  • Advanced SDN networking functionality such as IPAM, routed ingress, DNS and cloud-native integration
  • Ability to host traditional VM workloads imported from other virtualization platforms

... while maintaining networking expectations of a typical VM workload:

  • Sticky IPs allowing seamless live migration
  • External IP reflected inside the guest, i.e. no NAT for east-west traffic

Additionally, make our networking configuration more accessible to newcomers by providing a finite list of user stories mapped to recommended solutions.

User stories

You can find more info about this effort in https://docs.google.com/document/d/1jNr0E0YMIHsHu-aJ4uB2YjNY00L9TpzZJNWf3LxRsKY/edit

Goal

Provide IPAM to customers connecting VMs to OVN Kubernetes secondary networks.

User Stories

  • As a developer running VMs,
    I want to offload IPAM to somebody else,
    so I don't need to manage my own IP pools, DHCP server, or static IP configuration.

Non-Requirements

  • IPv6 support is not required.

Notes

  • KubeVirt cannot support CNI IPAM. For that reason, we cannot utilize the current implementation of IP management in OVN Kubernetes.
  • OVN supports IPAM, where an IP range is defined per port, and the port then offers the assigned IP to the client using DHCP. We can use this (see the sketch below).
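
As a rough sketch of the second note above, an OVN-Kubernetes secondary network can carry a subnet definition so that OVN itself hands out addresses to the attached VMs; the names and CIDR below are placeholders:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vm-secondary
  namespace: vm-workloads
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vm-secondary",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.200.0.0/24",
      "netAttachDefName": "vm-workloads/vm-secondary"
    }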

Done Checklist

Who | What | Reference
DEV | Upstream roadmap issue | <link to GitHub Issue>
DEV | Upstream code and tests merged | <link to meaningful PR>
DEV | Upstream documentation merged | <link to meaningful PR>
DEV | Gap doc updated | <name sheet and cell>
DEV | Upgrade consideration | <link to upgrade-related test or design doc>
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso>
QE | Test plans in Polarion | https://polarion.engineering.redhat.com/polarion/#/project/CNV/workitem?id=CNV-10864
QE | Automated tests merged | <link or reference to automated tests>
DOC | Downstream documentation merged | <link to meaningful PR>

 

Add a knob to CNO to control the installation of the IPAMClaim CRD.

Requires a new OpenShift feature gate that allows the feature to be installed only in Dev / Tech Preview.
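
For reference, Dev/Tech Preview feature gates of this kind are typically enabled by switching the cluster feature set, along the lines of the command below (this makes the cluster un-upgradable; the specific gate name for the IPAMClaim knob is not defined here):

$ oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'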

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.

Description of problem:

InsightsRecommendationActive firing, description link results in "Invalid parameter: redirect_uri" on sso.redhat.com.

Insights recommendation "OpenShift cluster with more or less than 3 control plane node replicas is not supported by Red Hat" with total risk "Moderate" was detected on the cluster. More information is available at https://console.redhat.com/openshift/insights/advisor/clusters/<UID>?first=ccx_rules_ocp.external.rules.control_plane_replicas|CONTROL_PLANE_NODE_REPLICAS.


Version-Release number of selected component (if applicable):

4.15.14

How reproducible:

unknown

Steps to Reproduce:

1. Install 4.15.14 on a cluster that triggers this alert
2. Log out of Red Hat SSO
3. Click the link in the alert description

Actual results:

"Invalid parameter: redirect_uri" on sso.redhat.com

Expected results:

Link successfully navigates through SSO

Additional info:

 

Description of problem:

We have a test, test_cluster_base_domain_obfuscation, that checks that when we set the insights-config ConfigMap with the obfuscation parameter set to "networking", the archive contains no instances of the api_url or the base_hostname of the cluster.
This is currently not happening in HyperShift hosted clusters.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Run test_cluster_base_domain_obfuscation
    Or
    1. Create the insights-config ConfigMap in the openshift-insights namespace (see the sketch after these steps) with:

       dataReporting:
         obfuscation: Networking

    2. Wait until the obfuscation translation table exists
    3. Download the archive
    4. Check every path in the archive and search for instances of the api_url or the base_hostname (easier to do with automation than manually)
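
A minimal sketch of step 1, following the key layout shown in the steps above (the config.yaml key name and the exact schema/casing of the obfuscation value are assumptions):

cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: insights-config
  namespace: openshift-insights
data:
  config.yaml: |
    dataReporting:
      obfuscation: Networking
EOF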
    

Actual results:

Instances are found. 

Expected results:

No instances are found since they've all been obfuscated.

Additional info:

    

This epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.

Description of problem:

Insights operator should replace the %s in https://console.redhat.com/api/gathering/v2/%s/gathering_rules in error messages, like these failed-to-bootstrap entries:

$ jq -r .content osd-ccs-gcp-ad-install.log | sed 's/\\n/\n/g' | grep 'Cluster operator insights'
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%27REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED
level=info msg=Cluster operator insights Disabled is False with AsExpected: 
level=info msg=Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules
level=info msg=Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: 
level=info msg=Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {\"errors\":[{\"meta\":{\"response_by\":\"gateway\"},\"detail\":\"UHC services authentication failed\",\"status\":401}]}

Version-Release number of selected component

Seen in 4.17 RCs. Also in this comment.

How reproducible

Unknown

Steps to Reproduce:

Unknown.

Actual results:

ClusterOperator conditions talking about https://console.redhat.com/api/gathering/v2/%s/gathering_rules

Expected results

URIs we expose in customer-oriented messaging should not have %s placeholders.

Additional detail

Seems like the template is coming in as conditionalGathererEndpoint here. Seems like insights-operator#964 introduced the %s, but I'm not finding the logic that's supposed to populate that placeholder.

Rapid recommendations enhancement defines this built-in configuration when the operator cannot reach the remote endpoint.

The issue is that the built-in configuration (though currently empty) is no taken into account - i.e the data requested in the built-configuration is not gathered.

Goal:
Track Insights Operator Data Enhancements epic in 2024

 

 

 

 

INSIGHTOCP-1557 is a rule to check for any custom Prometheus instances that may impact the management of corresponding resources.

Resource to gather:  Prometheus and Alertmanager in all namespaces

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager

Backport:  OCP 4.12.z; 4.13.z; 4.14.z; 4.15.z

Additional info:
1) Get the Prometheus and Alertmanager in all namespaces

$ oc get prometheus -A 
NAMESPACE              NAME                             VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
openshift-monitoring   k8s                              2.39.1    2         1       True         Degraded    712d
test                   custom-prometheus                          1         0       True         False       25d
$ oc get alertmanager -A 
NAMESPACE              NAME                             VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
openshift-monitoring   main                             2.39.1    2         1       True         Degraded    712d
test                   custom-alertmanager                        1         0       True         False       25d
 

 

Business requirement:

We already have a recommendation that checks the default ingress controller certificate after it has expired. From the referenced KCS, it seems that many customers (hundreds) hit this issue. So Oscar Arribas Arribas suggested that we add a recommendation to alert customers before certificate expiration.

Gathering method:

1. Gather all the ingresscontroller objects (we already gather the default ingresscontroller) with commands:
oc get ingresscontrollers -n openshift-ingress-operator
2. Gather the operator auto-generated certificate's validity dates with commands:

$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate
#### empty output here when certificate created by the operator
$ oc get secret router-ca -n openshift-ingress-operator -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT
$ oc get secret router-certs-default -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT

3. Gather custom certificates' validity dates with commands:

$ oc get ingresscontrollers -n openshift-ingress-operator -o yaml | grep -A1 defaultCertificate
    defaultCertificate:
      name: [custom-cert-secret-1]
#### for each [custom-cert-secret] above
$ oc get secret [custom-cert-secret-1] -n openshift-ingress -o yaml | grep crt | awk '{print $2}' | base64 -d | openssl x509 -noout -dates
notBefore=Dec 28 00:00:00 2022 GMT
notAfter=Jan 22 23:59:59 2024 GMT
 

Other Information:

An RFE to create a cluster alert is under review: https://issues.redhat.com/browse/RFE-4269

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The PowerVS IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing PowerVS Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Phase 2 Deliverable:

GA support for a generic interface for administrators to define custom reboot/drain suppression rules. 

Epic Goal

  • Allow administrators to define which machineconfigs won't cause a drain and/or reboot.
  • Allow administrators to define which ImageContentSourcePolicy/ImageTagMirrorSet/ImageDigestMirrorSet won't cause a drain and/or reboot
  • Allow administrators to define alternate actions (typically restarting a system daemon) to take instead.
  • Possibly (pending discussion) add a switch that allows the administrator to choose a kexec "restart" instead of a full hardware reset via reboot.

Why is this important?

  • There is a demonstrated need from customer cluster administrators to push configuration settings and restart system services without restarting each node in the cluster. 
  • Customers are modifying or adding ICSP/ITMS/IDMS objects post day 1.
  • (kexec - we are not committed on this point yet) Server class hardware with various add-in cards can take 10 minutes or longer in BIOS/POST. Skipping this step would dramatically speed-up bare metal rollouts to the point that upgrades would proceed about as fast as cloud deployments. The downside is potential problems with hardware and driver support, in-flight DMA operations, and other unexpected behavior. OEMs and ODMs may or may not support their customers with this path.

Scenarios

  1. As a cluster admin, I want to reconfigure sudo without disrupting workloads.
  2. As a cluster admin, I want to update or reconfigure sshd and reload the service without disrupting workloads (see the sketch after this list).
  3. As a cluster admin, I want to remove mirroring rules from an ICSP, ITMS, or IDMS object without disrupting workloads, because the scenario in which this might lead to non-pullable images at an undefined later point in time doesn't apply to me.
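
To make the scenarios above concrete, here is a hedged sketch of how such a suppression rule could be expressed on the MachineConfiguration operator object; the field names are illustrative rather than a committed API:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
      - path: /etc/ssh/sshd_config.d/99-custom.conf
        actions:
          - type: Restart          # reload the service instead of draining/rebooting
            restart:
              serviceName: sshd.service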

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Follow-up epic to https://issues.redhat.com/browse/MCO-507, aiming to graduate the feature from tech preview and GA the functionality.

For tech preview we only allow files/units/etc. There are two potential use cases for directories:

  1. namespaced image policy objects
  2. hostname based networking policies

This would allow the MCO to apply a policy to anything under a given path. We should adapt the API and MCO logic to also allow paths.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.

This feature will be used to track all the CAPI preparation work that is common for all the supported providers

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Day 0 Cluster Provisioning
  • Compatibility with existing workflows that do not require a container runtime on the host

Why is this important?

  • This epic would maintain compatibility with existing customer workflows that do not have access to a management cluster and do not have the dependency of a container runtime

Scenarios

  1. openshift-install running in customer automation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

This feature is about providing workloads within an HCP KubeVirt cluster access to GPU devices. This is an important use case that expands usage of HCP KubeVirt to AI and ML workloads.

Goals (aka. expected user outcomes)

  • Users can assign GPUs to HCP KubeVirt worker nodes using NodePool API

Requirements (aka. Acceptance Criteria):

  • Expose the ability to assign GPUs to KubeVirt NodePools
  • Ensure NVIDIA supports the NVIDIA GPU Operator on HCP KubeVirt
  • Document usage and support of NVIDIA GPUs with HCP KubeVirt
  • CI environment and tests to verify that GPU assignment to KubeVirt NodePools functions

 

 

GOAL:

Support running workloads within HCP KubeVirt clusters which need access to GPUs.

Accomplishing this involves multiple efforts

  • The NodePool API must be expanded to allow assignment of GPUs to the KubeVirt worker node VMs.
  • Ensure the NVIDIA operator works within the HCP cluster for GPUs passed through to KubeVirt VMs.
  • Develop a CI environment which allows us to exercise GPU passthrough.

Diagram of multiple nvidia operator layers

https://docs.google.com/document/d/1HwXVL_r9tUUwqDct8pl7Zz4bhSRBidwvWX54xqXaBwk/edit 

1. Design and implement an API at the NodePool (platform.kubevirt) that will allow exposing GPU passthrough or vGPU slicing from the infra cluster to the guest cluster (see the sketch below).

2. Implement logic that sets up the GPU resources to be available to the guest cluster's workloads (by using nvidia-gpu-operator?)
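
As a rough illustration of where item 1 could land, a NodePool might express GPU passthrough along these lines; this is purely a sketch and the kubevirt platform field names are assumptions, not a settled API:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: gpu-workers
  namespace: clusters
spec:
  replicas: 2
  platform:
    type: KubeVirt
    kubevirt:
      hostDevices:
        - deviceName: nvidia.com/GA102GL_A10   # device resource name exposed by the infra cluster (placeholder)
          count: 1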

Feature Overview (aka. Goal Summary)

 

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

Goals (aka. expected user outcomes)

Customers can override the default (three) value and set it to a custom value.

Make sure we document (or link to) the VMware recommendations in terms of performance.

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

https://kb.vmware.com/s/article/1025279

Requirements (aka. Acceptance Criteria):

The setting can be easily configured by the OCP admin and the configuration is automatically applied. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.

No change to the default value.

Use Cases (Optional):

As an OCP admin I would like to change the maximum number of snapshots per volume.

Out of Scope

Anything outside of 

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

Background

The default value can't be overridden; reconciliation prevents it.

Customer Considerations

Make sure the customers understand the impact of increasing the number of snapshots per volume.

https://kb.vmware.com/s/article/1025279

Documentation Considerations

Document how to change the value as well as a link to the best practices. Mention that there is a hard limit of 32. Document other limitations if any.

Interoperability Considerations

N/A

Epic Goal*

The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.

Possible future candidates:

  • configure EFS volume size monitoring (via driver cmdline arg.) - STOR-1422
  • configure OpenStack topology - RFE-11

 
Why is this important? (mandatory)

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

https://kb.vmware.com/s/article/1025279

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to configure the maximum number of snapshots per volume.
  2. As a user I would like to create more than 3 snapshots per volume

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1759)

2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)

3) Update vSphere operator to use the new snapshot options (STOR-1804)

4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)

  • prerequisite: add e2e test and demonstrate stability in CI (STOR-1838)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Enablement
  • Others -

Acceptance Criteria (optional)

Configure the maximum number of snapshots to a higher value. Check that the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
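
A sketch of what that check could look like, assuming the knob ends up on the ClusterCSIDriver driver configuration (the exact field path is an assumption until the API change lands):

$ oc patch clustercsidriver csi.vsphere.vmware.com --type=merge \
    -p '{"spec":{"driverConfig":{"driverType":"vSphere","vSphere":{"globalMaxSnapshotsPerBlockVolume":10}}}}'

After the operator rolls the new configuration out, creating more than three VolumeSnapshots against a single PVC should succeed up to the new limit.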

Drawbacks or Risk (optional)

Setting this to a high value can introduce performance issues. This needs to be documented.

https://kb.vmware.com/s/article/1025279

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

The etc-ca must be rotatable both on-demand and automatically when expiry approaches.

Goals (aka. expected user outcomes)

 

  • Have a tested path for customers to rotate certs manually (see the sketch below)
  • We must have a tested path for auto rotation of certificates when certs need rotation due to age
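
As a starting point for the manual path, the current signer expiry can be inspected the same way the ingress certificates are checked elsewhere in this document; the secret name and namespace below are assumptions and should be adjusted to wherever the etcd signer actually lives:

$ oc get secret etcd-signer -n openshift-etcd -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates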

 

Requirements (aka. Acceptance Criteria):

Deliver rotation and recovery requirements from OCPSTRAT-714 

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview (aka. Goal Summary)  

As cluster admin I would like to configure machinesets to allocate instances from pre-existing Capacity Reservation in Azure.
I want to create a pool of reserved resources that can be shared between clusters of different teams based on their priorities. I want this pool of resources to remain available for my company and not get allocated to another Azure customer.

https://docs.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-associate-vm?tabs=api1%2Capi2%2Capi3

Additional background on the feature for considering additional use cases

https://techcommunity.microsoft.com/t5/azure-compute-blog/guarantee-capacity-access-with-on-demand-capacity-reservations/ba-p/3269202

 

 

  1. Proposed title of this feature request

Machine API support for Azure Capacity Reservation Groups

  2. What is the nature and description of the request?

The customer would like to configure machinesets to allocate instances from pre-existing Capacity Reservation Groups, see Azure docs below

  3. Why does the customer need this? (List the business requirements here)

This would allow the customer to create a pool of reserved resources which can be shared between clusters of different priorities. Imagine a test and prod cluster where the demands of the prod cluster suddenly grow. The test cluster is scaled down, freeing resources, and the prod cluster is scaled up with assurances that those resources remain available and are not allocated to another Azure customer.

  4. List any affected packages or components.

MAPI/CAPI Azure

In this use case, there's no immediate need for install-time support to designate a reserved capacity group for control plane resources; however, we should consider whether that's desirable from a completeness standpoint. We should also consider whether this should be added as an attribute of the install-config compute machinepool or whether altering generated MachineSet manifests is sufficient; this appears to be a relatively new Azure feature which may or may not see wider customer demand. This customer's primary use case is centered around scaling existing clusters up and down, but others may have different uses for this feature.

https://docs.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-associate-vm?tabs=api1%2Capi2%2Capi3

Additional background on the feature for considering additional use cases

https://techcommunity.microsoft.com/t5/azure-compute-blog/guarantee-capacity-access-with-on-demand-capacity-reservations/ba-p/3269202

User Story

As a developer I want to add the field "CapacityReservationGroupID" to "AzureMachineProviderSpec" in openshift/api so that Azure capacity reservation can be supported.

Background

CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.

Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_

Steps

  • Add the field "CapacityReservationGroupID" to "AzureMachineProviderSpec" (see the sketch after this list)
  • The new field should be immutable. Add validations for the same.
  • Add tests to validate the immutability.
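
Once the field exists, a compute MachineSet would reference the reservation group roughly as sketched below (trimmed providerSpec; the serialized field name is assumed from the Go field, and the resource ID is a placeholder):

providerSpec:
  value:
    apiVersion: machine.openshift.io/v1beta1
    kind: AzureMachineProviderSpec
    vmSize: Standard_D8s_v3
    location: eastus
    capacityReservationGroupID: /subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Compute/capacityReservationGroups/<group-name>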

Stakeholders

  • Cluster Infra
  • CFE

Definition of Done

  • The PR should be reviewed and approved.
  • Docs
  • Add appropriate godoc for the field explaining its purpose
  • Testing
  • Add tests to validate the immutability.

User Story

As a developer I want to add support for capacity reservation groups in openshift/machine-api-provider-azure so that Azure VMs can be associated with a capacity reservation group during VM creation.

Background

CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.

Steps

  • Import the latest API changes from openshift/api to get the new field "CapacityReservationGroupID" into openshift/machine-api-provider-azure.
  • If a value is assigned to the field, use it to associate the VM with the capacity reservation group during VM creation.

Stakeholders

  • Cluster Infra
  • CFE

Definition of Done

  • The PR should be reviewed and approved.
  • Testing
  • Add unit tests to validate the implementation.

As a developer I want to add the webhook validation for the "CapacityReservationGroupID" field of "AzureMachineProviderSpec" in openshift/machine-api-operator so that Azure capacity reservation can be supported.

Background

CFE-1036 adds support for Capacity Reservation in upstream CAPZ (PR). The same support needs to be added downstream as well. Please refer to the upstream PR when adding support downstream.

Slack discussion regarding the same: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1713249202780119?thread_ts=1712582367.529309&cid=CBZHF4DHC_

Steps

  • Add the validation for "CapacityReservationGroupID" to "AzureMachineProviderSpec"
  • Add tests to validate.

Stakeholders

  • Cluster Infra
  • CFE

Feature Overview (aka. Goal Summary)  

Add support for standalone secondary networks for HCP kubevirt.

Advanced multus integration involves the following scenarios

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM

Goals (aka. expected user outcomes)

Users of HCP KubeVirt should be able to create a guest cluster that is completely isolated on a secondary network outside of the default pod network. 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations | List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both | self-managed
Classic (standalone cluster) | N/A
Hosted control planes | yes
Multi node, Compact (three node), or Single node (SNO), or all | N/A
Connected / Restricted Network | yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86
Operator compatibility | N/A
Backport needed (list applicable versions) | N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A
Other (please specify) | N/A

Documentation Considerations

ACM documentation should include how to configure secondary standalone networks.

 

This is a continuation of CNV-33392.

Multus Integration for HCP KubeVirt has three scenarios.

1. Secondary network as single interface for VM
2. Multiple Secondary Networks as multiple interfaces for VM
3. Secondary network + pod network (default for kubelet) as multiple interfaces for VM

Item 3 is the simplest use case because it does not require any additional considerations for ingress and load balancing. This scenario [item 3] is covered by CNV-33392.

Items [1,2] are what this epic is tracking, which we are considering advanced use cases.
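
A hedged sketch of what items 1 and 2 could look like on the NodePool; the kubevirt platform field names are assumptions based on the direction described here, not a settled API:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: isolated-workers
  namespace: clusters
spec:
  replicas: 3
  platform:
    type: KubeVirt
    kubevirt:
      attachDefaultNetwork: false          # items 1/2: no pod-network interface on the VM
      additionalNetworks:
        - name: vm-workloads/vm-secondary  # NetworkAttachmentDefinition as <namespace>/<name>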

Feature Overview (aka. Goal Summary)  

The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster while this is deployed on AWS public cloud.

Goals (aka. expected user outcomes)

Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.

Requirements (aka. Acceptance Criteria):

There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
 

Background

Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel

Documentation Considerations

Usual documentation will be required in case there are any new user-facing options available as a result of this feature.

Epic Goal

  • Implement support in openshift-install to install OpenShift clusters using CAPI with a user-provided Public IPv4 Pool ID and create the resources* which consume public IPs when the publish strategy is "External".

*Resources which consume public IPv4: bootstrap, API public NLB, NAT gateways

Why is this important?

  • The default OpenShift installation on AWS uses multiple IPv4 public IPs which Amazon will start charging for starting in February 2024. 

Scenarios

  1. As a customer with BYO public IPv4 pools in AWS, I would like to install an OpenShift cluster on AWS consuming public IPs from my own CIDR blocks, so I can control the IPs used by the services I provide and not be impacted by AWS public IPv4 charges (see the sketch after this list).
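
A sketch of scenario 1 at the install-config level, assuming the pool is surfaced as a platform.aws field (field name and pool ID are placeholders):

apiVersion: v1
baseDomain: example.com
metadata:
  name: byoip-cluster
publish: External
platform:
  aws:
    region: us-east-1
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0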

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Is there a method to use the pool by default for all public IPv4 claims from a given VPC/workload, so the implementation doesn't need to create an EIP and association for each resource and subnet/zone?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first-boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updatable so that customers can avoid this problem and, better still, scale out nodes with boot media that matches the running cluster version.

Phase 1 provided a tech preview for GCP.

In phase 2, GCP support goes to GA. Support for other IPI footprints is new and tech preview.

Requirements

This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589

Currently errors are propagated via a Prometheus alert. Before GA, we will need to make sure that we are placing a condition on the configuration object in addition to the current Prometheus mechanism. This will be done by the MSBIC, but it should be careful not to stomp on the operator, which updates the MachineConfiguration status as well.
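
For context, the phase 1 Tech Preview knob is driven through the same MachineConfiguration object; a sketch of opting all MachineSets into boot image management follows, with the field names taken from the phase 1 work and to be treated as illustrative:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
      - resource: machinesets
        apiGroup: machine.openshift.io
        selection:
          mode: All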

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.29, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platform with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.16/master branch.

Dependencies (internal and external)

  1. ART team creating the go 1.29 image for upgrade to go 1.29.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic? -> Yes, updated below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

We currently have a kube version string used in the call to set up envtest. We should either get rid of this reference and grab it from elsewhere, or update it with every kube bump we do.

In addition, the setup call now requires an additional argument to account for openshift/api's kubebuilder divergence. So the value being used here may not be valid for every kube bump, as the archive is not generated for every kube version. (Doing a bootstrap test run should be able to suss this out; if it doesn't error with the new version, you should be OK.)

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4380

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Epic Goal*

Drive the technical part of the Kubernetes 1.30 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.17 cannot be released without Kubernetes 1.30

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is implementable
  • No outstanding questions about major work breakdown
  • Are all stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

make sure we deliver a 1.30 kube-proxy standalone image

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions:

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Bump kube to 1.30 in CNCC

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions:

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.30
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.17 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
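
One typical way to do this in a Go-based repository, shown here only as an illustrative sketch (the module list, patch version, and vendoring step vary per repository):

$ go get k8s.io/api@v0.30.2 k8s.io/apimachinery@v0.30.2 k8s.io/client-go@v0.30.2
$ go mod tidy
$ go mod vendor    # only for repositories that vendor their dependencies

After the bump, run the repository's usual build and unit tests to catch API changes introduced by the new Kubernetes libraries.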

Epic Goal

The goal of this epic is to upgrade all OpenShift and Kubernetes components that cloud-credential-operator uses to v1.30, which keeps it on par with the rest of the OpenShift components and the underlying cluster version.

 


OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.29
  • target is 4.17 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.17 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature Overview (aka. Goal Summary)  

While installing OpenShift on AWS add support to use of existing IAM instance profiles

Goals (aka. expected user outcomes)

Allow a user to use an existing IAM instance profile while deploying OpenShift on AWS.

Requirements (aka. Acceptance Criteria):

When using an existing IAM role, the Installer tries to create a new IAM instance profile. As of today, the installation will fail if the user does not have permission to create instance profiles.

The Installer will provide an option for the user to use an existing IAM instance profile instead of trying to create a new one.
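
As a minimal sketch of how this could look in install-config.yaml (the iamProfile field name and its placement are assumptions for illustration only and may differ in the final implementation; the profile names are placeholders):

controlPlane:
  name: master
  platform:
    aws:
      iamProfile: my-existing-master-instance-profile   # assumed field, profile pre-created by the user
compute:
- name: worker
  platform:
    aws:
      iamProfile: my-existing-worker-instance-profile   # assumed field, profile pre-created by the user

The expectation is that, when such a profile is supplied, the Installer can skip instance-profile creation calls such as iam:CreateInstanceProfile for that machine pool.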

Background

This work is important not only for self-managed customers who want to reduce the required permissions needed for the IAM accounts but also for the IC regions and ROSA customers.

Previous work

https://github.com/dmc5179/installer/commit/8699caa952d4a9ce5012cca3f86aeca70c499db4

Epic Goal

  • Allow a user to use an existing IAM instance profile while deploying OpenShift on AWS.

Why is this important?

  • This work is important not only for self-managed customers who want to reduce the required permissions needed for the IAM accounts but also for the IC regions and ROSA customers.

Scenarios

  1. When using an existing IAM role, the Installer tries to create a new IAM instance profile. As of today, the installation will fail if the user does not have permission to create instance profiles.

The Installer will provide an option for the user to use an existing IAM instance profile instead of trying to create a new one.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work (Optional):

  1. https://github.com/dmc5179/installer/commit/8699caa952d4a9ce5012cca3f86aeca70c499db4

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Use a pre-existing IAM profile in the install-config.yaml
  • Use a user/role which doesn't have the permissions needed for instance profile creation.

so that I can achieve

  • A cluster created with an existing IAM profile.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA

Goals (aka. expected user outcomes)

Remove the feature gate flag and make the feature accessible to all customers.

Requirements (aka. Acceptance Criteria):

Requires fixes to apiserver to handle etcd client retries correctly

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

Deployment considerations: List applicable specific needs (N/A = not applicable)
  • Self-managed, managed, or both: yes
  • Classic (standalone cluster): yes
  • Hosted control planes: no
  • Multi node, Compact (three node), or Single node (SNO), or all: Multi node and compact clusters
  • Connected / Restricted Network: Yes
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): Yes
  • Operator compatibility: N/A
  • Backport needed (list applicable versions): N/A
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): N/A
  • Other (please specify): N/A

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal*

Graduate the etcd tuning profiles feature delivered in https://issues.redhat.com/browse/ETCD-456 to GA

https://github.com/openshift/api/pull/1538
https://github.com/openshift/enhancements/pull/1447

 
Why is this important? (mandatory)

Graduating the feature to GA makes it accessible to all customers and not hidden behind a feature gate.

As further outlined in the linked stories the major roadblock for this feature to GA is to ensure that the API server has the necessary capability to configure its etcd client for longer retries on platforms with slower latency profiles. See: https://issues.redhat.com/browse/OCPBUGS-18149

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an OpenShift admin, I can change the latency profile of the etcd cluster without causing any downtime to control-plane availability

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd
  • Documentation - etcd docs team
  • QE - etcd qe
  • PX - 
  • Others -

Acceptance Criteria (optional)

Once the cluster is installed, we should be able to change the default latency profile on the API to a slower one and verify that etcd is rolled out with the updated leader election and heartbeat timeouts. During this rollout there should be no disruption or unavailability to the control-plane.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Once https://issues.redhat.com/browse/ETCD-473 is done this story will track the work required to move the "operator/v1 etcd spec.hardwareSpeed" field from behind the feature gate to GA.
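
A minimal sketch of exercising the field once it is ungated, assuming it is exposed on the operator/v1 etcd singleton as controlPlaneHardwareSpeed (this story refers to it as spec.hardwareSpeed; use whichever name the shipped API actually exposes), with "Slower" as one of the accepted profiles:

$ oc patch etcd/cluster --type=merge -p '{"spec": {"controlPlaneHardwareSpeed": "Slower"}}'
$ oc get etcd/cluster -o jsonpath='{.spec.controlPlaneHardwareSpeed}'

After the patch, the etcd static pods should roll out with the updated leader election and heartbeat timeouts, without control-plane disruption (per the acceptance criteria above).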


Note: There is no work pending from the OTA team. This Jira tracks the work pending from other teams.
We started the feature with the assumption that CVO has to implement sigstore key verification like we do with gpg keys.
After investigation we found that sigstore key verification is done at the node level and there is no CVO work. From that point this became a tracking feature for us to help other teams, specifically the Node team, with the "sigstore key verification" tasks. The "sigstore key verification" roadmap is here: https://docs.google.com/presentation/d/16dDwALKxT4IJm7kbEU4ALlQ4GBJi14OXDNP6_O2F-No/edit#slide=id.g547716335e_0_2075

 

Feature Overview (aka. Goal Summary)  

Add sigstore signatures to the core OCP payload and enable verification. Verification is now done via CRI-O.
There is no CVO work in this feature and this is a Tech Preview change.
OpenShift Release Engineering can leverage a mature signing and signature verification stack instead of relying on simple signing.

enhancement - https://github.com/openshift/enhancements/blob/49e25242f5105259d539a6c586c6b49096e5f201/enhancements/api-review/add-ClusterImagePolicy-and-ImagePolicy-for-Signature-Verification.md

Goals (aka. expected user outcomes)

Customers can leverage OpenShift to create trust relationships for running OCP core container images
Specifically, customers can trust signed images from a Red Hat registry and OCP can verify those signatures

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>
– Kubelet/CRIO to verify RH images & release payload sigstore signatures
– ART will add sigstore signatures to core OCP images

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

These acceptance criteria are for all deployment flavors of OpenShift.

Deployment considerations: List applicable specific needs (N/A = not applicable)
  • Self-managed, managed, or both: both
  • Classic (standalone cluster): yes
  • Hosted control planes: yes
  • Multi node, Compact (three node), or Single node (SNO), or all:
  • Connected / Restricted Network:
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x):
  • Operator compatibility:
  • Backport needed (list applicable versions): Not Applicable
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): none
  • Other (please specify):

 

 

Documentation Considerations

Add documentation for sigstore verification and gpg verification

Interoperability Considerations

For folks mirroring release images (e.g. disconnected/restricted-network):

  • oc-mirror need to support sigstore mirroring (OCPSTRAT-1417).
  • Customers using BYO image registries need to support hosting sigstore signatures.

OCP clusters need to add the ability to validate Sigstore signatures for OpenShift release images.

This is part of Red Hat's overall Sigstore strategy.

Today, Red Hat uses "simple signing" which uses an OpenPGP/GPG key and a separate file server to host signatures for container images. 

Cosign is on track to be an industry standard container signing technique. The main difference is that, instead of signatures being stored in a separate file server, the signature is stored in the same registry that hosts the image.

Design document / discussion from software production: https://docs.google.com/document/d/1EPCHL0cLFunBYBzjBPcaYd-zuox1ftXM04aO6dZJvIE/edit

Demo video: https://drive.google.com/file/d/1bpccVLcVg5YgoWnolQxPu8gXSxoNpUuQ/view 
 
Software production will be migrating to cosign over the course of 2024.

ART will continue to sign using simple signing in combination with sigstore signatures until SP stops using it and product documentation exists to help customers migrate from the simple signing signature verification.

Acceptance criteria

  • Help kubelet/CRI-O verify the new Sigstore signatures for OCP release images (TechPreview)

Currently this epic is primarily supporting the Node implementation work in OCPNODE-2231. There's a minor CVO UX tweak planned in OTA-1307 that's definitely OTA work. There's also the enhancement proposal in OTA-1294 and the cluster-update-keys in OTA-1304, which Trevor happens to be doing for inertial reasons, but which he's happy to hand off to OCPNODE and/or shift under OCPNODE-2231.

This is the work described in the OTA-1294 enhancement. The cluster-update-keys repository isn't actually managed by the OTA team, but I expect it will be me opening the pull request, and there isn't a dedicated Jira project covering cluster-update-keys, so I'm creating this ticket under the OTA Epic just because I can't think of a better place to put it.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

The goal of this EPIC is to either ship a cluster wide policy (not enabled by default) to verify OpenShift release/payload images or document how end users can create their own policy to verify them.

Why is this important?

We shipped cluster wide policy support in OCPNODE-1628 which should be used for internal components as well.

Scenarios

  1. Validate the sigstore signatures of OpenShift internal images to security harden the cluster deployment.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open Questions

  • How can we ensure no race condition between the CVO policy and CRI-O doing the verification?
  • Do we need to ensure to have old and new policies in place during an upgrade?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem

As Miloslav Trmač reported upstream, when a ClusterImagePolicy is set on a scope to accept sigstore signatures, the underlying registry needs to be configured with use-sigstore-attachments: true. The current code:

 func generateSigstoreRegistriesdConfig(clusterScopePolicies map[string]signature.PolicyRequirements) ([]byte, error) { 

does do that for the configured scope; but the use-sigstore-attachments option applies not to the "logical name", but to each underlying mirror individually.

I.e. the option needs to be set on every mirror of the scope. Without that, if the image is found on one of those mirrors, the c/image code will not look for signatures on the mirror, and policy enforcement is likely to fail.
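
A minimal sketch of the expected end state, using the scope and mirror host from the reproduction steps below; the exact drop-in file generated under /etc/containers/registries.d/ may differ, but the use-sigstore-attachments option needs to appear for the mirror as well, not only for the logical scope:

docker:
  quay.io/openshift-release-dev/ocp-release:
    use-sigstore-attachments: true
  quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com:
    use-sigstore-attachments: true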

Version-Release number of selected component

Seen in 4.17.0-0.nightly-2024-06-25-162526, but likely all releases which implement ClusterImagePolicy so far, because this is unlikely to be a regression.

How reproducible

Every time.

Steps to Reproduce

Apply the ClusterImagePolicy suggested in OTA-1294's enhancements#1633:

$ cat <<EOF >policy.yaml
apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: openshift
  annotations:
    kubernetes.io/description: Require Red Hat signatures for quay.io/openshift-release-dev/ocp-release container images.
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/feature-set: TechPreviewNoUpgrade
spec:
  scopes:
  - quay.io/openshift-release-dev/ocp-release
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FnOEFNSUlDQ2dLQ0FnRUEzQzJlVGdJQUo3aGxveDdDSCtIcE1qdDEvbW5lYXcyejlHdE9NUmlSaEgya09ZalRadGVLSEtnWUJHcGViajRBcUpWYnVRaWJYZTZKYVFHQUFER0VOZXozTldsVXpCby9FUUEwaXJDRnN6dlhVbTE2cWFZMG8zOUZpbWpsVVovaG1VNVljSHhxMzR2OTh4bGtRbUVxekowR0VJMzNtWTFMbWFEM3ZhYmd3WWcwb3lzSTk1Z1V1Tk81TmdZUHA4WDREaFNoSmtyVEl5dDJLTEhYWW5BMExzOEJlbG9PWVJlTnJhZmxKRHNzaE5VRFh4MDJhQVZSd2RjMXhJUDArRTlZaTY1ZE4zKzlReVhEOUZ6K3MrTDNjZzh3bDdZd3ZZb1Z2NDhndklmTHlJbjJUaHY2Uzk2R0V6bXBoazRjWDBIeitnUkdocWpyajU4U2hSZzlteitrcnVhR0VuVGcyS3BWR0gzd3I4Z09UdUFZMmtqMnY1YWhnZWt4V1pFN05vazNiNTBKNEpnYXlpSnVSL2R0cmFQMWVMMjlFMG52akdsMXptUXlGNlZnNGdIVXYwaktrcnJ2QUQ4c1dNY2NBS00zbXNXU01uRVpOTnljTTRITlNobGNReG5xU1lFSXR6MGZjajdYamtKbnAxME51Z2lVWlNLeVNXOHc0R3hTaFNraGRGbzByRDlkVElRZkJoeS91ZHRQWUkrK2VoK243QTV2UVV4Wk5BTmZqOUhRbC81Z3lFbFV6TTJOekJ2RHpHellSNVdVZEVEaDlJQ1I4ZlFpMVIxNUtZU0h2Tlc3RW5ucDdZT2d5dmtoSkdwRU5PQkF3c1pLMUhhMkJZYXZMMk05NDJzSkhxOUQ1eEsrZyszQU81eXp6V2NqaUFDMWU4RURPcUVpY01Ud05LOENBd0VBQVE9PQotLS0tLUVORCBQVUJMSUMgS0VZLS0tLS0K
EOF
$ oc apply -f policy.yaml

Set up an ImageContentSourcePolicy such as the ones Cluster Bot jobs have by default:

cat <<EOF >mirror.yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: pull-through-mirror
spec:
  repositoryDigestMirrors:
  - mirrors:
    - quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com
    source: quay.io
EOF
$ oc apply -f mirror.yaml

Set CRI-O debug logs, following these docs:

$ cat <<EOF >custom-loglevel.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: custom-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ''
  containerRuntimeConfig:
    logLevel: debug
EOF
$ oc create -f custom-loglevel.yaml

Wait for that to roll out, as described in docs:

$ oc get machineconfigpool master

Launch a Sigstore-signed quay.io/openshift-release-dev/ocp-release image, by asking the cluster to update to 4.16.1:

$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a

Check the debug CRI-O logs:

$ oc adm node-logs --role=master -u crio | grep -i1 sigstore | tail -n5

Actual results

Not looking for sigstore attachments: disabled by configuration entries like:

$ oc adm node-logs --role=master -u crio | grep -i1 sigstore | tail -n5
--
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317169116Z" level=debug msg=" Using transport \"docker\" specific policy section quay.io/openshift-release-dev/ocp-release" file="signature/policy_eval.go:150"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317207897Z" level=debug msg="Reading /var/lib/containers/sigstore/openshift-release-dev/ocp-release@sha256=c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a/signature-1" file="docker/docker_image_src.go:479"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317240227Z" level=debug msg="Not looking for sigstore attachments: disabled by configuration" file="docker/docker_image_src.go:556"
Jun 28 19:06:34.317335 ip-10-0-43-59 crio[2154]: time="2024-06-28 19:06:34.317277208Z" level=debug msg="Requirement 0: denied, done" file="signature/policy_eval.go:285"

Expected results

Something about "we're going to look for Sigstore signatures on quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com, since that's where we found the quay.io/openshift-release-dev/ocp-release@sha256:c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a image". At this point, it doesn't matter whether the retrieved signature is accepted or not, just that a signature lookup is attempted.

Remove the openshift/api ClusterImagePolicy documentation about the restriction of the scope field on the OpenShift release image repository, and install the updated manifest in MCO.

enhancements#1633 is still in flight, but there seems to be some consensus around its API Extensions proposal to drop the following Godocs from ClusterImagePolicy and ImagePolicy:

// Please be aware that the scopes should not be nested under the repositories of OpenShift Container Platform images.
// If configured, the policies for OpenShift Container Platform repositories will not be in effect.

The backing implementation will also be removed. This guard was initially intended to protect cluster administrators from breaking their clusters by configuring policies that blocked critical images. And before Red Hat was publishing signatures for quay.io/openshift-release-dev/ocp-release releases, that made sense. But now that Red Hat is almost (OTA-1267) publishing Sigstore signatures for those release images, it makes sense to allow policies covering those images. And even if a cluster administrator creates a policy that blocks critical image pulls, PodDisruptionBudgets should keep the Kubernetes API server and related core workloads running for long enough for the cluster administrator to use the Kube API to remove or adjust the problematic policy.

There's a possibility that we replace the guard with some kind of pre-rollout validation, but that doesn't have to be part of the initial work.

We want this change in place to unblock testing of enhancements#1633's proposed ClusterImagePolicy, so we can decide if it works as expected, or if it needs tweaks before being committed as a cluster-update-keys manifest. And we want that testing to establish confidence in the approach before we start in on the installer's internalTestingImagePolicy and installer-caller work.

These seem like duplicates, and we should remove ImagePolicy and consolidate around SigstoreImageVerification for clarity.

Feature Overview (aka. Goal Summary)  

Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to have clusters with a large number of worker nodes.

Goals (aka. expected user outcomes)

Max cluster size 250+ worker nodes (mainly about the control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count is above the threshold and smaller cloud instances when it is below the threshold.

Requirements (aka. Acceptance Criteria):

 

Deployment considerations: List applicable specific needs (N/A = not applicable)
  • Self-managed, managed, or both: Managed
  • Classic (standalone cluster): N/A
  • Hosted control planes: Yes
  • Multi node, Compact (three node), or Single node (SNO), or all: N/A
  • Connected / Restricted Network: Connected
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64, ARM
  • Operator compatibility: N/A
  • Backport needed (list applicable versions): N/A
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): N/A
  • Other (please specify):

Questions to Answer (Optional):

Check with OCM and CAPI on the requirements for exposing a larger worker node count.

 

Documentation:

  • Design document detailing the autoscaling mechanism and configuration options
  • User documentation explaining how to configure and use the autoscaling feature.

Acceptance Criteria

  • Configure max-node size from CAPI
  • Management cluster nodes automatically scale up and down based on the hosted cluster's size.
  • Scaling occurs without manual intervention.
  • A set of "warm" nodes are maintained for immediate hosted cluster creation.
  • Resizing nodes should not cause significant downtime for the control plane.
  • Scaling operations should be efficient and have minimal impact on cluster performance.

 

Goal

  • Dynamically scale the serving components of control planes

Why is this important?

  • To be able to have clusters with a large number of worker nodes

Scenarios

  1. A hosted cluster's worker node count increases past a given amount (X), so the serving components are moved to larger cloud instances.
  2. A hosted cluster's worker node count falls below the threshold, so the serving components are moved to smaller cloud instances.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a service provider, I want to be able to:

  • Configure priority and fairness settings per HostedCluster size and force these settings to be applied on the resulting hosted cluster.

so that I can achieve

  • Prevent users of the hosted cluster from bringing down the HostedCluster kube-apiserver with their workloads.

Acceptance Criteria:

Description of criteria:

  • HostedCluster priority and fairness settings should be configurable per cluster size in the ClusterSizingConfiguration CR
  • Any changes in priority and fairness inside the HostedCluster should be prevented and overridden by whatever is configured on the provider side.
  • With the proper settings, heavy use of the API from user workloads should not result in the KAS pod getting OOMKilled due to lack of resources.

This does not require a design proposal.
This does not require a feature gate.

Description of problem:

    In some instances the request serving node autoscaler helper fails to kick off a reconcile when there are pending pods.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    sometimes

Steps to Reproduce:

    1. Setup a mgmt cluster with size tagging
    2. Create a hosted cluster and get it scheduled
    3. Scale down the machinesets of the request serving nodes for the hosted cluster (see the example command after this list).
    4. Wait for the hosted cluster to recover
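
For illustration, step 3 amounts to something like the following; the machineset name is a placeholder for the request serving machinesets assigned to the hosted cluster:

$ oc get machinesets -n openshift-machine-api
$ oc scale machineset <serving-machineset-name> -n openshift-machine-api --replicas=0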

Actual results:

    A placeholder pod is created for the missing node of the hosted cluster, but does not cause a scale up of the corresponding machineset.

Expected results:

    The hosted cluster recovers by scaling up the corresponding machinesets.

Additional info:

    

Description of problem:

    While running a PerfScale test on staging sectors, the script creates 1 HC per minute to load up a Management Cluster to its maximum capacity (64 HCs). There were 2 clusters trying to use the same serving node pair, and they got into a deadlock:
# oc get nodes -l osd-fleet-manager.openshift.io/paired-nodes=serving-12 
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal    Ready    worker   34m   v1.27.11+d8e449a
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   34m   v1.27.11+d8e449a

Each node got assigned to a different cluster:
# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcimf68iudmq2pctkj11os571ahutr1-mukri-dysn-0017 
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-4-127.us-east-2.compute.internal   Ready    worker   33m   v1.27.11+d8e449a

# oc get nodes -l hypershift.openshift.io/cluster=ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-84-196.us-east-2.compute.internal   Ready    worker   36m   v1.27.11+d8e449a

Taints were missing on those nodes, so metrics-forwarder pods from other hosted clusters got scheduled on the serving nodes.

# oc get pods -A -o wide | grep ip-10-0-84-196.us-east-2.compute.internal 
ocm-staging-2bcind28698qgrugl87laqerhhb0u2c2-mukri-dysn-0019   kube-apiserver-86d4866654-brfkb                                           5/5     Running                  0                40m     10.128.48.6      ip-10-0-84-196.us-east-2.compute.internal    <none>           <none>
ocm-staging-2bcins06s2acm59sp85g4qd43g9hq42g-mukri-dysn-0020   metrics-forwarder-6d787d5874-69bv7                                        1/1     Running                  0                40m     10.128.48.7      ip-10-0-84-196.us-east-2.compute.internal    <none>           <none>

and few more

Version-Release number of selected component (if applicable):

MC Version 4.14.17
HC version 4.15.10
HO Version quay.io/acm-d/rhtap-hypershift-operator:c698d1da049c86c2cfb4c0f61ca052a0654e2fb9

How reproducible:

Not Always

Steps to Reproduce:

    1. Create an MC with prod config (non-dynamic serving node)
    2. Create HCs on them at 1 HCP per minute
    3. Cluster stuck at installing for more than 30 minutes
    

Actual results:

Only one replica of the kube-apiserver pods was up and the second was stuck in the Pending state. Upon checking, the machine API had scaled up both nodes in that machineset (serving-12), but only one got assigned (labelled). Further checking showed that the node from one zone (serving-12a) was assigned to a specific hosted cluster (0017), and the other one (serving-12b) was assigned to a different hosted cluster (0019).

Expected results:

Kube-apiserver replicas should be on the same machinesets, and those nodes should be tainted.

Additional info: Slack

    

Description of problem:

    When creating a cluster with OCP < 4.16 and nodepools with a number of workers larger than the smallest size, the placeholder pods for the hosted cluster get continually recycled and the hosted cluster is never scheduled.

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    Always

Steps to Reproduce:

0. Install hypershift with size tagging enabled    
1. Create a hosted cluster with nodepools large enough to not be able to use the default placeholder pods (or simply configure no placeholder pods)
    

Actual results:

    The cluster never schedules

Expected results:

    The cluster is scheduled and the control plane can come up

Additional info:

    

Feature Overview (aka. Goal Summary)  

CRIO wipe is an existing feature in OpenShift. When a node reboots, CRIO wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to an image registry, and it takes time to come up.

The goal of this feature is to adjust CRIO wipe so that it wipes only the images that were corrupted by the sudden reboot, not all images.

Feature Overview

Phase 2 of the enclave support for oc-mirror with the following goals

  • Incorporate feedback from the field from 4.16 TP
  • Performance improvements

Goals

  • Update the batch processing using `containers/image` to do the copying for setting the number of blobs (layers) to download
  • Introduce a worker for concurrency that can also update the number of images to download to improve overall performance (these values can be tweaked via CLI flags). 
  • Collaborate with the UX team to improve the console output while pulling or pushing images. 

For 4.17 timeframe

  • Incorporate feedback from the field from 4.16 TP
  • Performance improvements

Currently the operator catalog image is always deleted by the delete feature. This can lead to broken catalogs in the clusters.

It is necessary to change the implementation to skip the deletion of the operator catalog image according to the following conditions (see the sketch after this list):

  • If packages (operators) were specified under the operator catalog in the DeleteImageSetConfiguration, it means the customer wants to delete only the specified operators, and the operator catalog image should not be deleted.
  • If only the operator catalog was specified, it means all the operators under the specified operator catalog should be deleted AND also the operator catalog image.
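
A minimal sketch of the first condition, assuming the oc-mirror v2 DeleteImageSetConfiguration shape (catalog and package names are placeholders):

kind: DeleteImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
delete:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
    packages:
    - name: example-operator    # only this operator's images are deleted; the catalog image is kept

Listing only the catalog entry, with no packages, would correspond to the second condition, where the catalog image itself is deleted as well.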

It is necessary to create a data structure that records which operators each related image is found in; this information can be obtained from the loop already present in the collection phase.

Having this data structure will make it possible to tell customers which operators failed based on an image that failed during the mirroring.

For example: related image X failed during the mirroring; this related image is present in operators a, b and c, so the mirroring errors file already being generated will include the names of the operators instead of the name of the related image only.

 

  • On a failure for a related image, stop mirroring the whole group of related images for that bundle.
  • And especially, defer mirroring bundle images until all related images are mirrored. Otherwise we end up in the situation where the operator starts upgrading while its related images are missing.

Feature Overview

oc-mirror to notify users when minVersion and maxVersion have a difference of more than one minor version (e.g. 4.14 to 4.16), to advise users to include the interim version in the channels to be mirrored.

This is required when planning to allow upgrades between Extended Upgrade Support (EUS) releases, which require the interim version between the two (e.g. 4.15 is required in the mirrored content to upgrade 4.14 to 4.16).

Goals

oc-mirror will inform clearly users via the command line about this requirement so that users can select the appropriate versions for their upgrade plans.

When doing an OCP upgrade between EUS versions, it is sometimes required to add an intermediate version between the current and target versions.

For example:

current OCP version 4.14
target OCP version 4.16

Sometimes, in order to upgrade from 4.14 to 4.16, an intermediate version such as 4.15.8 is required, and this version needs to be included in the ImageSetConfiguration when using oc-mirror.

The current algorithm in oc-mirror is not accurate enough to give this information, so the proposal is to add a warning in the command line and in the docs about using the Cincinnati graph web page to check whether intermediate versions are required when upgrading OCP EUS versions, and about adding them to the ImageSetConfiguration.

oc-mirror needs to identify when an OCP cluster on an EUS version is trying to upgrade while skipping one version (for example, going from 4.12.14 to 4.14.18).

When this condition is identified, oc-mirror needs to show a warning in the log telling the customer to use the Cincinnati web tool (upgrade tool) to identify the versions required in between.
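
A minimal sketch of an ImageSetConfiguration that includes the interim EUS release; channel names and version numbers are illustrative, and the apiVersion depends on the oc-mirror version in use:

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.14
      minVersion: 4.14.18
      maxVersion: 4.14.18
    - name: stable-4.15    # interim version required for the 4.14 -> 4.16 EUS upgrade path
      minVersion: 4.15.8
      maxVersion: 4.15.8
    - name: stable-4.16
      minVersion: 4.16.1
      maxVersion: 4.16.1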

Feature Overview

Adding nodes to on-prem clusters in OpenShift is, in general, a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansions easier will let users add nodes often and fast, leading to a much improved UX.

This feature adds nodes to any on-prem clusters, regardless of their installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user, regardless of how the cluster was installed.

Goals and requirements

  • Users can install a host on day 2 using a bootable image to an OpenShift cluster.
  • At least platforms baremetal, vSphere, none and Nutanix are supported
  • Clusters installed with any installation method can be expanded with the image
  • Clusters don't need to run any special agent to allow the new nodes to join.

How this workflow could look like

1. Create image:

$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker

2. Boot image

3. Check progress

$ oc adm add-node 

Consolidate options

An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide much simpler experience (See "Why is this important below"). We have official and field-documented ways to do this, that could be removed once this feature is in place, simplifying the experience, our docs and the maintenance of said official paths:

  • UPI: Adding RHCOS worker nodes to a user-provisioned infrastructure cluster
    • This feature will replace the need to use this method for the majority of UPI clusters. The current UPI method consists of many manual steps. The new method would replace them with a couple of commands and apply to probably more than 90% of UPI clusters.
  • Field-documented methods and asks
  • IPI:
    • There are instances where adding a node to a bare metal IPI-deployed cluster can't be done via its BMC. This new feature, while not replacing the day-2 IPI workflow, solves the problem for this use case.
  • MCE: Scaling hosts to an infrastructure environment
    • This method is the most time-consuming and in many cases overkill, but currently, along with the UPI method, it is one of the two options we can give to users.
    • We shouldn't need to ask users to install and configure the MCE operator and its infrastructure for single clusters, as it becomes a project even larger than the UPI method; we should save it for when there's more than one cluster to manage.

With this proposed workflow we eliminate the need to use the UPI method in the vast majority of cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.

In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.

This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.

Why is this important

This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).

Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.

Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, is overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.

Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.

Lastly, this problem is often brought up in the field, where redhatters working with customers have put in place different custom solutions trying to solve the problem with custom automations, adding to inconsistent processes for scaling clusters.

Oracle Cloud Infrastructure

This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared their feedback, making solving the lack of cluster expansion a requirement for Red Hat and Oracle.

Existing work

We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.

Day 2 node addition with agent image.

Yet Another Day 2 Node Addition Commands Proposal

Enable day2 add node using agent-install: AGENT-682

 

Epic Goal

  • Cleanup/carryover work from AGENT-682 for the GA release

Why is this important?

  • Address all the required elements for GA, such as FIPS compliance. This will allow a smoother integration of the node-joiner into the oc tool, as planned in OCPSTRAT-784

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. https://issues.redhat.com/browse/AGENT-682

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

By default (and if available) the same public SSH key used for the installation is reused to add a new node.
It could be useful to allow the user to (optionally) specify a different key in the nodes-config.yaml configuration file.
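
A minimal sketch of what the optional key could look like in nodes-config.yaml; the sshKey field is hypothetical (it is what this item proposes), and the hosts entry is sketched from the existing format of the file, so the exact layout may differ:

hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: 00:aa:bb:cc:dd:ee
sshKey: ssh-ed25519 AAAAC3Nza... user@example.com    # hypothetical optional field; would override the key taken from the day-1 installation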

Allow monitoring more than one node simultaneously.

It may be necessary to coalesce the output of the different assisted-service pre-flight validations accordingly.

User Story:

As a cluster admin, I want to be able to:

  • add a worker node that has a different architecture from the control plane to an existing cluster consuming the multi-arch payload

Open questions:

  • What are the effective use cases for a hybrid cluster (nodes with different architectures)?

Currently, when adding a new node, Assisted Service configures the bootstrap ignition to fetch the node ignition from the insecure port (22624), even though it would be possible to use the secure one (22623). This could be an issue for existing users who didn't want to use the insecure port for the add-node operation.

Implementation notes
Extend the ClusterInfo asset to retrieve the initial ignition details (URL and CA certificate) from the openshift-machine-api/worker-user-data-managed secret, if available in the target cluster. This information will then be used by the agent-installer-client when importing a new cluster, to configure the cluster ignition_endpoint.

(see more context in comment https://github.com/openshift/installer/pull/8242#discussion_r1571664023)
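For reference, a hedged way to see the values the asset would pick up, assuming the usual Ignition pointer-config layout inside the secret's userData key:

oc -n openshift-machine-api get secret worker-user-data-managed -o jsonpath='{.data.userData}' \
  | base64 -d \
  | jq '{url: .ignition.config.merge[0].source, ca: .ignition.security.tls.certificateAuthorities[0].source}'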

To allow running the node-joiner in a pod on a cluster where FIPS is enabled (see the sketch after this list):

  • set CGO_ENABLED=1 in hack/build-node-joiner.sh
  • set the fips=1 karg in the ISO (if not already set; to be verified)
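A minimal sketch of the first item, assuming the script wraps a plain go build of the node-joiner command in the installer repository (everything other than CGO_ENABLED is illustrative):

# hack/build-node-joiner.sh (sketch)
export CGO_ENABLED=1   # needed so the resulting binary can use FIPS-validated crypto
go build -o bin/node-joiner ./cmd/node-joiner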

Performs all the required preliminary validations for adding new nodes:

  • Ensure that the (static) IPs/hostnames, if specified, do not conflict with the existing ones
    If an nmstate config is provided, it should be validated as usual (already done)
  • Validate that the target platform is a supported one

Epic Goal*

Provide a simple command for almost all users to add a node to a cluster where scaling up a MachineSet isn't an option - whether they have installed using UPI, Assisted, or the agent-based installer, or can't use MachineSets for some other reason.

 
Why is this important? (mandatory)

  • Enable easy day2 installation without requiring additional knowledge from the user
  • Unified experience for day1 and day2 installation for the agent based installer
  • Unified experience for day1 and day2 installation for appliance workflow
  • Eliminate the requirement of installing MCE, which has high resource requirements (4 cores and 16GB RAM for a multi-node cluster; if the infrastructure operator is included it also requires storage)
  • Eliminate the requirement of nodes having a BMC available to expand bare metal clusters (see docs).
  • Simplify adding compute nodes based on the UPI method or other methods implemented in the field, such as WKLD-433 or other automations that try to solve this problem

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. A user installed a day-1 cluster with the agent-based installer and wants to add workers or replace failed nodes; currently the alternative is to install MCE or, if connected, use the SaaS offering.

 
Dependencies (internal and external) (mandatory)

AGENT-682

Contributing Teams(and contacts) (mandatory) 

The installer team is developing the main body of the feature, which will run in the cluster to be expanded, as well as a prototype client-side script in AGENT-682. They will then be able to translate the client-side script into native oc adm subcommands.

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

This is the command responsible for monitoring the activity when adding new node(s) to the target cluster.
Similar to the add-nodes-image command, this one will be a simple wrapper around the node-joiner's monitor command.

A list of the expected operations can be found in https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/node-joiner-monitor.sh
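For illustration, the wrapper would eventually be invoked roughly like this; the subcommand and flag names are assumptions based on the linked script:

# monitor the nodes currently being added, identified by their IP addresses
oc adm node-image monitor --ip-addresses 192.168.111.30,192.168.111.31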

Feature Overview

Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.

Prerequisite work: goals completed in OCPSTRAT-1122.
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.

Phases 1 & 2 cover implementing base functionality for CAPI.

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has since moved on, and CAPI is now a better fit and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

Epic Goal

  • As we prepare to move over to using Cluster API (CAPI) we need to make sure that we have the providers in place to work with this. This Epic is to track the tech preview of the provider for Azure

Why is this important?

  • What are the benefits to the customer, or to us, that make this worth
    doing? Fulfills a critical need for a customer? Improves
    supportability/debuggability? Improves efficiency/performance? This
    section is used to help justify the priority of this item vs other things
    we can do.

Drawbacks

  • Reasons we should consider NOT doing this such as: limited audience for
    the feature, feature will be superseded by other work that is planned,
    resulting feature will introduce substantial administrative complexity or
    user confusion, etc.

Scenarios

  • Detailed user scenarios that describe who will interact with this
    feature, what they will do with it, and why they want/need to do that thing.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

On clusters where we use Cluster API instead of Machine API we need to create an empty file to report that the bootstrapping was successful. The file should be placed at "/run/cluster-api/bootstrap-success.complete" (a minimal sketch follows the ToDo list below).

Normally there is a special controller for it, but in OpenShift we use MCO to bootstrap machines, so we have to create this file directly.

ToDo:

  • Ensure that "/run/cluster-api" folder exists on the machine
  • Create CAPI bootstrapping sentinel file "bootstrap-success.complete" in this folder
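A minimal sketch of what the MCO effectively has to do on each machine (equivalent shell, for illustration only):

mkdir -p /run/cluster-api
touch /run/cluster-api/bootstrap-success.complete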

Links: 

https://cluster-api.sigs.k8s.io/developer/providers/bootstrap.html 
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/defaults.go#L81-L83 

 

User Story

As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps

Background

Once the new CAPI manifests generator tool is ready, we want to make use of it directly from the CAPI Providers repositories, so we can avoid storing the generated configuration centrally and can apply it independently based on the running platform.

Steps

  • Install new CAPI manifest generator as a go `tool` to all the CAPI provider repositories
  • Setup a make target under the `/openshift/Makefile` to invoke the generator. Make it output the manifests under `/openshift/manifests`
  • Make sure `/openshift/manifests` is mapped to `/manifests` in the openshift/Dockerfile, so that the files are later picked up by CVO
  • Make sure the manifest generation works by triggering a manual generation
  • Check in the newly generated transport ConfigMap + Credential Requests (to let them be applied by CVO)

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CAPI manifest generator tool is installed 
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Acceptance criteria:

  1. For RFE-2461: Automatically prefill the exposed port from a Dockerfile into the Import from Git flow (if defined and possible)
  2. For RFE-2473: Automatically select new Deployments or Knative Services in Topology after creating them with the Import from Git or Import image container flow, similar to the "Create Helm Release" flow.

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:

Port exposed in Dockerfile not observed in the Ports Dropdown in Git Import Form   

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

Always

Steps to Reproduce:

 Use https://github.com/Lucifergene/knative-do-demo/ to create an application with the Dockerfile strategy.

Actual results:

    Ports not displayed

Expected results:

    Should display

Additional info:

    

Description

As a user, I don't want to see the UI jump around while the Import from Git flow (or other forms) waits for a network call to complete.

The PatternFly Button components have a progress indicator for that: https://www.patternfly.org/components/button/#progress-indicators

This should be preferred over the manually displayed indicator that is sometimes shown (especially on the Import from Git page) on the next line.

Acceptance Criteria

  1. Use the button loading indicator instead of a loading indicator next to the button ("on all forms")
    1. Import from Git
    2. Import Container image
    3. Pipelines Builder (Pipelines create page)

Additional Details:

Description

As a user, I want to quickly select a Git type if OpenShift cannot identify it. Currently, the UI involves opening a dropdown and selecting the desired Git type.

But it would be great if I could select the Git Type directly without opening any dropdown. This will reduce the number of clicks required to complete the action.

Acceptance Criteria

  1. Remove the Git Type dropdown
  2. Use the Tile PF component to design the new Git Types
  3. Make sure it's smaller than the Build Strategy Tiles
  4. Update the E2E tests

Additional Details:

Goal:

The OpenShift Developer Console supports an easy way to import source code from a Git repository and automatically creates a BuildConfig or a Pipeline for the user.

Why is it important?

GitEA is an open-source alternative to GitHub, similar to GitLab. Customers who use or try GitEA will see warnings while importing their Git repositories. We already got the first bug about missing GitEA support: OCPBUGS-31093

Use cases:

  1. Import from Git should support GitEA
  2. Should also work with Serverless function > Import from Git
  3. And when importing a Devfile

Acceptance criteria:

  1. Import from Git should support GitEA
  2. It might switch to the GitEA provider if the domain name contains "gitea" or "git-ea" (not a hard requirement)
  3. The user should have the option to switch to GitEA if the git provider auto-detection doesn't work
  4. It should work with public and private repositories

Dependencies (External/Internal):

None

Design Artifacts:

Not required

Exploration:

  1. We should explore whether GitEA provides an API so that our frontend can fetch file lists and file content via REST (see the example after this list).
  2. We should check if there might be CORS issues or if we can use our internet proxy if needed.
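GitEA does expose a GitHub-style REST API; a hedged example of fetching a file list and a single file (the Authorization header is only needed for private repositories, and host/owner/repo are placeholders):

# list repository contents at the root of the default branch
curl -s -H "Authorization: token <TOKEN>" https://gitea.example.com/api/v1/repos/<owner>/<repo>/contents/

# fetch a single file; the content field comes back base64-encoded
curl -s -H "Authorization: token <TOKEN>" https://gitea.example.com/api/v1/repos/<owner>/<repo>/contents/devfile.yaml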

Note:

None

Description

As a developer, I want to create a new Gitea service to be able to perform all kinds of import operations on repositories hosted in Gitea.

Acceptance Criteria

  1. Analyse the APIs and filter the ones that are required for performing the import functions
  2. Create the new gitea-service file.
  3. Update the UI

Additional Details:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Feature Overview (aka. Goal Summary)

OLM users can easily see in the console if an installed operator package is deprecated and learn how to stay within the support boundary by viewing the alerts/notifications that OLM emits, or by reviewing the operator status rendered by the console with visual representation.

Goals (aka. expected user outcomes)

  • Pre-installation: OLM users can see the deprecation visual representation in the console UI and be warned/discouraged from installing a deprecated package, from deprecated channels, or in a deprecated version, and learn the recommended alternatives to stay within the supported path (with a short description).
  • Post-installation: OLM users can see the deprecation visual representation in the console UI to tell if an installed operator is deprecated entirely, currently subscribed to a deprecated channel, or in a deprecated version, and know the alternatives as in package(s), update channel(s), or version(s) to stay within the support boundary.

Related Information

    • kiali-operator:
      • deprecated package: kiali-operator
      • deprecated channel: alpha
      • deprecated version: kiali-operator.v1.68.0

Acceptance Criteria

  • Pre-installation
    • Operator Hub page - display the deprecation warning if the PackageManifest is deprecated
    • Install Operator details page
      • display the deprecation badge if the PackageManifest is deprecated
      • display the deprecation warning with deprecation message if PackageManifest, Channel or Version is deprecated
      • In both the Channel and Version dropdowns, show a warning icon next to the deprecated entries.
    • Operator install page
      • display the deprecation badge if the PackageManifest is deprecated
      • display the deprecation warning with deprecation message if PackageManifest, Channel or Version is deprecated
      • In both the Channel and Version dropdowns, show a warning icon next to the deprecated entries.
  • Add integration and unit tests

As a user, I would like to customize the modal displayed when a 'Create Project' button is clicked in the UI.

Acceptance criteria

  • add an extension point so plugin creators can provide their own "Create Project" modal
  • the console should implement this extension point using the "useModal" hook from the dynamic plugin SDK
  • update docs with this extension point.
  • add a custom modal to the demo plugin for testing. The modal will create a project and a quota for that project.

Provide a simplified view of the config files belonging to MachineConfig objects, for a more convenient user experience and simpler management.

Current state:

  • When a user/partner needs to retrieve contents from a MachineConfig they need to manually decode the file contents from their URL-encoded form into a readable format. For example:

    $ oc get mc $machineConfigName -o jsonpath='{.spec.config.storage.files[1].contents.source}' | sed "s@+@ @g;s@%@\\\\x@g" | xargs -0 printf "%b\n"
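Note that MachineConfig file contents may also be embedded as base64 data URLs rather than URL-encoded text; in that case a hedged equivalent would be:

    $ oc get mc $machineConfigName -o jsonpath='{.spec.config.storage.files[1].contents.source}' | sed 's/^data:.*base64,//' | base64 -d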

Desired state:

  • Instead of using the command above, the partner would like to get the content of the config files by default in a human-readable format via the OpenShift web console. Please see the attached image. On this screen, a new tab can be introduced to check the content of the configuration files.

 

AC:

  • Add the Configuration Files section into the MachineConfig details page, which renders all the configuration that the resource contains.
  • Add integration tests

 

RFE - https://issues.redhat.com/browse/RFE-5198

Feature Overview (aka. Goal Summary)

OLM users can easily see in the console if an installed operator package is deprecated and learn how to stay within the support boundary by viewing the alerts/notifications that OLM emits, or by reviewing the operator status rendered by the console with visual representation.

Goals (aka. expected user outcomes)

  • Pre-installation: OLM users can see the deprecation visual representation in the console UI and be warned/discouraged from installing a deprecated package, from deprecated channels, or in a deprecated version, and learn the recommended alternatives to stay within the supported path (with a short description).
  • Post-installation: OLM users can see the deprecation visual representation in the console UI to tell if an installed operator is deprecated entirely, currently subscribed to a deprecated channel, or in a deprecated version, and know the alternatives as in package(s), update channel(s), or version(s) to stay within the support boundary.

Related Information

    • kiali-operator:
      • deprecated package: kiali-operator
      • deprecated channel: alpha
      • deprecated version: kiali-operator.v1.68.0

Acceptance Criteria

    • Installed operators page
      • Add the deprecation badge to the operator's Status field, if PackageManifest, Channel or Version is deprecated
    • Operator Details page 
      • Add deprecation badge next to the operator's name if PackageManifest, Channel or Version is deprecated
      • Add deprecation warning to the details page, if PackageManifest, Channel or Version is deprecated
        • Link to the operator's Subscription tab if Channel or Version is deprecated
        • In the Subscription tab, the warning will contain an "Update channel" link. In the case of a deprecated Channel, the "Change Subscription update channel" modal will need to be updated to show warning icons next to deprecated channels.
  • Add integration and unit tests

Problem:

Goal:

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

If there is any PDB with reason SufficientPods that has allowed disruptions equal to 0, then when a cluster admin tries to drain a node it will throw an error because the Pod disruption budget is violated. To avoid this, add a warning message on the Topology page, similar to the resource quota warning message, to let the user know about this violation.
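The console util would read the same data through the Kubernetes API; an equivalent hedged check from the CLI, using only the PDB status field and skipping the SufficientPods reason filter for brevity, looks like this:

# list PDBs in a namespace whose currently allowed disruptions are 0
oc get pdb -n <namespace> -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | .metadata.name'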

Acceptance Criteria

  1. Create a util that fetches the PDBs for a namespace and returns how many PDBs with SufficientPods have disruptionsAllowed equal to 0; if the count is 1, also return the name so that we can redirect to its details page
  2. Use the util on the Topology page and add a warning message similar to the resource quota warning message.
  3. On click of the warning message, redirect to the violated PDB's details page if there is exactly one, or else to the PDB list page
  4. Add a YellowExclamationTriangleIcon to the Allowed disruptions column of the PDB list page, beside the count, for rows where allowed disruptions equal 0
  5. Create unit tests for the util
  6. Add an e2e test (decide whether to automate this or test manually)

Additional Details:

Internal doc for reference - https://docs.google.com/document/d/1pa1jaYXPPMc-XhHt_syKecbrBaozhFg2_gKOz7E2JWw/edit

Feature Overview (aka. Goal Summary)  

This is about GA-ing the work we started with OCPSTRAT-1040.

The goal is to remove the experimental keyword from the new command flag and document this.

As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which cluster admins can use to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers have to dig deeper often to find the nodes for further debugging. 
    The ask has been to bubble up this on the update progress window.
  2. oc update status ?
    From the UI we can see the progress of the update. From oc cli we can see this from "oc get cvo"  
     But the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new command `oc adm upgrade status` command which is backed by an API.  Please find the mock output of the command output attached in this card.
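While the command is still experimental it is gated behind an environment variable; the variable name below is our understanding and should be treated as an assumption:

# enable the experimental subcommand and print the update status in a human-readable form
OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status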

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Current state:

An update is in progress for 28m42s: Working towards 4.14.1: 700 of 859 done (81% complete), waiting on network

= Control Plane =
...
Completion:      91%

Improvement opportunities

1. Inconsistent info: CVO message says "700 of 859 done (81% complete)" but control plane section says "Completion: 91%"
2. Unclear measure of completion: the CVO message counts manifests applied, while the control plane section's "Completion: 91%" counts upgraded COs. Neither message states what it counts. Manifest count is an internal implementation detail which users likely do not understand. COs are less so, but we should be clearer about what the completion means.
3. We could take advantage of this line and communicate progress with more details

Definition of Done

We'll only remove CVO message once the rest of the output functionally covers it, so the inconsistency stays until OTA-1154. Otherwise:

= Control Plane =
...
Completion:      91% (30 operators upgraded, 1 upgrading, 2 waiting)

Upgraded operators are COs that have updated their version, regardless of their conditions
Upgrading operators are COs that haven't updated their version and are Progressing=True
Waiting operators are COs that haven't updated their version and are Progressing=False

=Control Plane Upgrade=
...
Completion: 45% (Est Time Remaining: 35m)
                ^^^^^^^^^^^^^^^^^^^^^^^^^

Do not worry too much about the precision, we can make this more precise in the future. I am thinking of
1. Assigning a fixed amount of time per CO remaining for COs that do not have daemonsets
2. Assign an amount of time proportional to # of workers to each remaining CO that has daemonsets (network, dns)
3. Assign a special amount of time proportional to # of workers to MCO

We can probably take into account the "how long are we upgrading this operator right now" exposed by CVO in OTA-1160

Discovered by Evgeni Vakhonin during OTA-1245, the 4.16 code did not take nodes that are members of multiple pools into account. This surfaced in several ways:

Duplicate insights (we iterate over nodes per pool, so we see the problematic insight once for each pool the node is a member of):

= Update Health =
SINCE   LEVEL     IMPACT           MESSAGE
-	Error     Update Stalled   Node ip-10-0-26-198.us-east-2.compute.internal is degraded
-	Error     Update Stalled   Node ip-10-0-26-198.us-east-2.compute.internal is degraded

Such a node is present in all pool listings, and in some cases, such as paused pools, the output is confusing (paused-ness is a property of a pool, so we list a node as paused in one pool but as outdated/pending in another):

= Worker Pool =
Worker Pool:     mcpfoo
Assessment:      Excluded
...

Worker Pool Node
NAME                                        ASSESSMENT   PHASE    VERSION   EST   MESSAGE
ip-10-0-26-198.us-east-2.compute.internal   Excluded     Paused   4.15.12   -

= Worker Pool =
Worker Pool:     worker
...
Worker Pool Nodes
NAME                                        ASSESSMENT   PHASE     VERSION                              EST   MESSAGE 
ip-10-0-26-198.us-east-2.compute.internal   Outdated     Pending   4.15.12                              ? 

It is not clear to me what the correct presentation of this case would be. Because this is an update status (and not a node or cluster status) command, and only a single pool drives the update of a node, I think the best course of action may be to show only nodes whose version is driven by a given pool, or perhaps to introduce an "externally driven"-like assessment.

As an OTA engineer,
I would like to make sure the node in a single-node cluster is handled correctly in the upgrade-status command.

Context:
According to the discussion with the MCO team,
the node is in MCP/master but not worker.
This card is to make sure that the node is displayed that way too. My feeling is that the current code probably does the job already. In that case, we should add test coverage for the case to avoid regressions in the future.

AC:

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Enable GCP Workload Identity Webhook

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations | List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both | Both, the scope of this is for self-managed
Classic (standalone cluster) | Classic
Hosted control planes | N/A
Multi node, Compact (three node), or Single node (SNO), or all | All
Connected / Restricted Network | All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64
Operator compatibility | TBD
Backport needed (list applicable versions) | N/A
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD
Other (please specify) |

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Just like AWS STS and ARO Entra Workload ID, we want to provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.

  • For AWS, we deploy the AWS STS pod identity webhook as a customer convenience for configuring their applications to utilize service account tokens minted by a cluster that supports STS. When you create a pod that references a service account, the webhook looks for annotations on that service account and, if found, mutates the deployment to set environment variables and mount the service account token on that deployment, so that the pod has everything it needs to make an API client (see the sketch after this list).
  • Our temporary access token (using TAT in place of STS because STS is AWS specific) enablement for (select) third party operators does not rely on the webhook and is instead using CCO to create a secret containing the variables based on the credentials requests. The service account token is also explicitly mounted for those operators. Pod identity webhooks were considered as an alternative to this approach but weren't chosen.
  • Basically, if we deploy this webhook it will be for customer convenience and will enable us to potentially use the Azure pod identity webhook in the future if we so chose. Note that AKS provides this webhook and other clouds like Google offer a webhook solution for configuring customer applications.
  • This is about providing parity with other solutions, but it is not required for anything directly related to the product.
    If we don't provide this Azure pod identity webhook method, customers would need to get the details some other way, like a secret, or set them explicitly as environment variables. With the webhook, you just annotate your service account.
  • For Azure pod identity webhook, see CCO-363 and https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html.
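For the AWS webhook described in the first bullet, the customer-facing convenience is a single service-account annotation; a GCP webhook would follow an analogous annotation-driven pattern (values below are illustrative):

# AWS STS pod identity webhook: annotate the service account with the role to assume
oc -n my-app annotate serviceaccount my-sa eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/my-app-role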

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Will require the following:

Background

  • For AWS, we deploy the AWS STS pod identity webhook as a customer convenience for configuring their applications to utilize service account tokens minted by a cluster that supports STS. When you create a pod that references a service account, the webhook looks for annotations on that service account and if found, the webhook mutates the deployment in order to set environment variables + mounts the service account token on that deployment so that the pod has everything it needs to make an API client.
  • Our temporary access token (using TAT in place of STS because STS is AWS specific) enablement for (select) third party operators does not rely on the webhook and is instead using CCO to create a secret containing the variables based on the credentials requests. The service account token is also explicitly mounted for those operators. Pod identity webhooks were considered as an alternative to this approach but weren't chosen.
  • Basically, if we deploy this webhook it will be for customer convenience and will enable us to potentially use the Azure pod identity webhook in the future if we so chose. Note that AKS provides this webhook and other clouds like Google offer a webhook solution for configuring customer applications.
  • This is about providing parity with other solutions, but it is not required for anything directly related to the product.
    If we don't provide this Azure pod identity webhook method, customers would need to get the details some other way, like a secret, or set them explicitly as environment variables. With the webhook, you just annotate your service account.
  • For Azure pod identity webhook, see CCO-363 and https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html.

 

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The GCP IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing GCP Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.


Epic Goal

  • Provision GCP infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing GCP Terraform provider

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

failed to create control-plane machines using GCP marketplace image

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-multi-2024-06-11-205940 / 4.16.0-0.nightly-2024-06-10-211334

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", and then edit it to insert osImage settings (see [1])
2. "create cluster" (see [2])

Actual results:

1. The bootstrap machine and the control-plane machines are not created.
2. Although it says "Waiting up to 15m0s (until 10:07AM CST)" for control-plane machines being provisioned, it did not time out until around 10:35AM CST.

Expected results:

The installation should succeed.

Additional info:

FYI a PROW CI test also has the issue: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/52816/rehearse-52816-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.16-installer-rehearse-debug/1800431930391400448

Once support for private/internal clusters is added to CAPG in CORS-3252, we will need to integrate those changes into the installer:

  • vendor updated capg
  • update cluster manifest so internal cluster is default
  • update load balancer creation (so external LB is created by installer and MCS configuration is added to CAPI-created LB)
  • update DNS record creation (if needed) to ensure we are associating records with proper LB

Description of problem:

installing into Shared VPC stuck in waiting for network infrastructure ready

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-10-225505

How reproducible:

Always

Steps to Reproduce:

1. "create install-config" and then insert Shared VPC settings (see [1])
2. activate the service account which has the minimum permissions in the host project (see [2])
3. "create cluster"

FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project. 

Actual results:

1. Getting stuck in waiting for network infrastructure to become ready, until Ctrl+C is pressed.
2. 2 firewall-rules are created in the service project unexpectedly (see [3]).

Expected results:

The installation should succeed, and there should be no any firewall-rule getting created either in the service project or in the host project.

Additional info:

 

When installconfig.platform.gcp.userTags is specified, all taggable resources should have the specified user tags.

This requires setting the TechPreviewNoUpgrade featureSet to configure tags.
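A hedged install-config fragment showing the feature gate together with the tag shape (parentID/key/value) as we understand it; verify the field names against the current documentation:

# install-config.yaml fragment (sketch)
featureSet: TechPreviewNoUpgrade
platform:
  gcp:
    userTags:
    - parentID: openshift-qe     # project ID or organization ID that owns the tag key
      key: environment
      value: test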

When creating a Private Cluster with CAPG the cloud-controller-manager generates an error when the instance-group is created:

I0611 00:04:34.998546       1 event.go:376] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-dev-installer/zones/us-east1-b/instances/bfournie-capg-test-6vn69-worker-b-rghf7' is expected to be in the subnetwork 'projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-master-subnet' but is in the subnetwork 'projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-worker-subnet'., wrongSubnetwork"

Three "k8s-ig" instance-groups were created for the Internal LoadBlancer. Of the 3, the first one is using the master subnet

subnetwork: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/regions/us-east1/subnetworks/bfournie-capg-test-6vn69-master-subnet

while the other two are using the worker-subnet. Since this IG uses the master-subnet and the instance is using the worker-subnet it results in this mismatch.

This looks similar to issue tracked (and closed) with cloud-provider-gcp
https://github.com/kubernetes/cloud-provider-gcp/pull/605

We are occasionally seeing this error when using GCP with TechPreview, i.e. using CAPG.

waiting for api to be available
level=warning msg=FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
level=info msg=Creating infrastructure resources...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add worker roles: failed to set project IAM policy: googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff. The request's ETag '\007\006\033\255\347+\335\210' did not match the current policy's ETag '\007\006\033\255\347>%\332'., aborted
Installer exit with code 4
Install attempt 3 of 3

Here is an example:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hive-master-periodic-e2e-gcp-weekly/1813424715671277568/artifacts/e2e-gcp-weekly/test/artifacts/cluster-test-hive-310440d4-8bfb-40f1-a489-ec0a44a7852e-0-m2gqq-provisio29dxg-installer.log

This is a clone of issue OCPBUGS-38152. The following is the description of the original issue:

Description of problem:

    Shared VPC installation using a service account that has all required permissions failed because the ingress cluster operator degraded, reporting the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'"

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-multi-2024-08-07-221959

How reproducible:

    Always

Steps to Reproduce:

1. "create install-config", then insert the interested settings (see [1])
2. "create cluster" (see [2])

Actual results:

    Installation failed, because cluster operator ingress degraded (see [2] and [3]). 

$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
$ 

In fact the mentioned k8s firewall-rule doesn't exist in the host project (see [4]), and the given service account does have enough permissions (see [6]).

Expected results:

    Installation succeeds, and all cluster operators are healthy. 

Additional info:

    

Feature Overview (aka. Goal Summary)  

Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not currently tolerated by default.

API 

Add new field to HostedCluster. hc.Spec.Tolerations

Tolerations   []corev1.Toleration `json:"tolerations,omitempty"` 

Implementation

In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.

CLI

Add a new CLI argument called --tolerations to the hcp CLI tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.

For example, the kubectl client tool can be used to set the following taint on a node.

kubectl taint nodes node1 key1=value1:NoSchedule

And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.

hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" ...
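Equivalently, once the API field exists the toleration could be set on an existing HostedCluster directly; a hedged example (resource name and namespace are illustrative):

oc -n clusters patch hostedcluster example --type merge \
  -p '{"spec":{"tolerations":[{"key":"key1","operator":"Equal","value":"value1","effect":"NoSchedule"}]}}'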

Goals (aka. expected user outcomes)

  • Support for customer defined tolerations for HCP pods

    Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Customers have requested the ability to apply tolerations to the HCP control plane pods. This provides the flexibility to have the HCP pods scheduled to nodes with taints applied to them that are not currently tolerated by default.

API 

Add new field to HostedCluster. hc.Spec.Tolerations

Tolerations   []corev1.Toleration `json:"tolerations,omitempty"` 

Implementation

In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.

CLI

Add a new CLI argument called --tolerations to the hcp CLI tool during cluster creation. This argument should be able to be set multiple times. The syntax of the field should follow the convention set by the kubectl client tool when setting a taint on a node.

For example, the kubectl client tool can be used to set the following taint on a node.

kubectl taint nodes node1 key1=value1:NoSchedule

And then the hcp cli tool should be able to add a toleration for this taint during creation with the following cli arg.

hcp cluster create kubevirt --toleration "key1=value1:NoSchedule" ...

The cluster-network-operator needs to be HCP tolerations aware, otherwise controllers (like multus and ovn) won't be deployed by the CNO with the correct tolerations.

The code that looks at the HostedControlPlane within the CNO can be found in pkg/hypershift/hypershift.go. https://github.com/openshift/cluster-network-operator/blob/33070b57aac78118eea34060adef7f2fb7b7b4bf/pkg/hypershift/hypershift.go#L134

API 

Add new field to HostedCluster. hc.Spec.Tolerations

Tolerations   []corev1.Toleration `json:"tolerations,omitempty"` 

Implementation

In support/config/deployment.go, add hc.spec.tolerations from hc when generating the default config. This will cause the toleration to naturally get spread to the deployments and statefulsets.

Goal

Users want to create OpenShift clusters on Nutanix Cloud Platforms with multiple disks

Requirements (aka. Acceptance Criteria):

  • Can create machine sets with multiple disks.
  • Define amount of storage per disk
  • Define the storage container per disk
  • additional disks are optional

Feature Overview (aka. Goal Summary)

The objective is to create a comprehensive backup and restore mechanism for HCP OpenShift Virtualization Provider. This feature ensures both the HCP state and the worker node state are backed up and can be restored efficiently, addressing the unique requirements of KubeVirt environments.

Goals (aka. Expected User Outcomes)

  • Users will be able to backup and restore the KubeVirt HCP cluster, including both HCP state and worker node state.
  • Ensures continuity and reliability of operations after a restore, minimizing downtime and data loss.
  • Supports seamless re-connection of HCP to worker nodes post-restore.

Requirements (aka. Acceptance Criteria)

  • Backup of KubeVirt CSI infra PVCs
  • Backup of KubeVirt VMs + VM state + (possibly even network attachment definitions)
  • Backup of Cloud Provider KubeVirt Infra Load Balancer services (having IP addresses change here on the service could be problematic)
  • Backup of Any custom network policies associated with VM pods
  • Backup of VMs and state placed on External Infra

Use Cases (Optional)

  1. Disaster Recovery: In case of a disaster, the system can restore the HCP and worker nodes to the previous state, ensuring minimal disruption.
  2. Cluster Migration: Allows migration of hosted clusters across different management clusters.
  3. System Upgrades: Facilitates safe upgrades by providing a reliable restore point.

Out of Scope

  • Real-time synchronization of backup data.
  • Non-disruptive Backup and restore (ideal but not required)

Documentation Considerations

Interoperability Considerations

  • Impact on other projects like ACM/MCE vol-sync.
  • Test scenarios to validate interoperability with existing backup solutions.

The HCP team has delivered OADP backup and restore steps for the Agent and AWS provider here. We need to add the steps necessary to make these steps work for HCP KubeVirt clusters.

Requirements

  • Deliver backup/restore steps that reach feature parity with the documented agent and aws platforms
  • Ensure that kubevirt-csi and cloud-provider-kubevirt LBs can be backup and restored successfully
  • Ensure this works with external infra

 

Non Requirements

  • VMs do not need to be backed up to reach feature parity because the current aws/agent steps require the cluster to scale down to zero before backing up.

Feature Overview (aka. Goal Summary)  

The etcd-operator should automatically rotate the etcd-signer and etcd-metrics-signer certs as they approach expiry.

Goals (aka. expected user outcomes)

 

  • We must have a tested path for auto rotation of certificates when certs need rotation due to age

 

Requirements (aka. Acceptance Criteria):

Deliver rotation and recovery requirements from OCPSTRAT-714 

 

Epic Goal*

The etcd cert rotation controller should automatically rotate the etcd-signer and etcd-metrics-signer certs (and re-sign leaf certs) as they approach expiry.

 
Why is this important? (mandatory)

Automatic rotation of the signer certs will reduce the operational burden of having to manually rotate the signer certs.

 
Scenarios (mandatory) 

etcd-signer and etcd-metrics-signer certs are rotated as they approach the end of their validity period. For the signer certs this is 4.5 years.
https://github.com/openshift/cluster-etcd-operator/blob/d8f87ecf9b3af3cde87206762a8ca88d12bc37f5/pkg/tlshelpers/tlshelpers.go#L32
 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs team
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

We shall never allow new leaf certificates to be generated when a revision rollout is in progress AND when the bundle was just changed.

From ETCD-606 we know when a bundle has changed, so we can save the current revision in the operator status and only allow leaf updates on the next higher revision.

NOTE: this assumes etcd rolls out slower than the apiserver in practice. We should also think about how we can incorporate the revision rollout on the apiserver static pods.

 

In ETCD-565 we added tests to manually rotate certificates.

In the recovery test suite, depending on the order of execution we have the following failures:

1. [sig-etcd][Feature:CertRotation][Suite:openshift/etcd/recovery] etcd can recreate trust bundle [Timeout:15m]

Here the tests usually time out waiting for a revision rollout. We couldn't find a deeper cause; the timeout may simply not be large enough.

2. [sig-etcd][Feature:CertRotation][Suite:openshift/etcd/recovery] etcd can recreate dynamic certificates [Timeout:15m]

The recovery test suite creates several new nodes. When choosing a peer secret, we sometimes choose one that has no member/node anymore and thus it will never be recreated.

3. After https://github.com/openshift/cluster-etcd-operator/pull/1269

After the leaf-cert gating merged, some certificates are no longer in their original place, which invalidates the manual rotation procedure.

For backward compatibility we tried to keep the previously named certificates the way they were:

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/starter.go#L614-L639

Many of those are currently merely copied with the ResourceSyncController and could be replaced with their source configmap/secret. 

This should make the codebase easier to understand and reduce mental load.

Some replacement suggestions:

  • etcd-serving-ca -> etcd-ca-bundle
  • etcd-peer-client-ca -> etcd-ca-bundle
  • etcd-metrics-proxy-serving-ca -> etcd-metrics-ca-bundle
  • etcd-metrics-proxy-client-ca -> etcd-metrics-ca-bundle

 

AC:

  • replaced the above suggestions 
  • updated static pod manifests and references in backups
  • updated docs/etcd-tls-assets.md

All OpenShift TLS artifacts (secrets and configmaps) are now required to carry an annotation with a user-facing description, per the metadata registry for TLS artifacts.
https://github.com/openshift/origin/tree/master/tls

There is a guideline for how these descriptions must be written:
https://github.com/openshift/origin/blob/master/tls/descriptions/descriptions.md#how-to-meet-the-requirement

The descriptions for etcd's TLS artifacts don't meet that requirement and should be updated to point out the required details, e.g. hostnames, subjects, and what kind of certificates the signer is signing.
https://github.com/openshift/origin/blob/8ffdb0e38af1319da4a67e391ee9c973d865f727/tls/descriptions/descriptions.md#certificates-22-1

https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/tlshelpers/tlshelpers.go#L74

See also:
https://github.com/openshift/origin/blob/master/tls/descriptions/descriptions.md#Certificates-85

Currently a new revision is created when the CA bundle configmaps (etcd-signer / metrics-signer) have changed.

As of today, this change is not transactional across invocations of EnsureConfigMapCABundle, meaning that four revisions (at most, one for each function call) could be created. 

For gating the leaf cert generation on a fixed revision number, it's important to ensure that any bundle change will only ever result in exactly one revision change.

We currently ensure this for leaf certificates via a single update to "etcd-all-certs"; we can use the exact same trick again.

AC: 

  • create a single revisioned configmap that contains all relevant CA bundles
  • update all static pod manifests to read from that configmap instead of the two existing ones

This feature request proposes adding support for specifying the Partition Number within a Placement Group for OpenShift MachineSets and in CAPI. Currently, OCP 4.14 supports pre-created Placement Groups (RFE-2194), but the ability to define the Partition Number within those groups is missing.

Partition placement groups offer a more granular approach to instance allocation within an Availability Zone on AWS, particularly beneficial for deployments on AWS Outpost (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups-outpost.html). It also allows users to further enhance high availability by distributing instances across isolated hardware units within the chosen Placement Group. This improves fault tolerance by minimizing the impact of hardware failures to only instances within the same partition.

Some Benefits are listed below.

  • By leveraging Partition Numbers, users can achieve a higher level of availability for their OpenShift clusters on AWS Outpost, as failures within a partition won't affect instances in other partitions.
  • Distributing instances across isolated hardware units minimizes the impact of hardware failures, ensuring service continuity.
  • It provides optimized resource utilization.

 

Update MAPI to Support AWS Placement Group Partition Number

Based on RFE-2194, support for pre-created Placement Groups was added in OCP. Following that, RFE-4965 requests the ability to specify the Partition Number of the Placement Group, as this allows more precise allocation.

NOTE: The Placement Group (and Partition) will be pre-created by the user. The user should be able to specify the Partition Number along with the PlacementGroupName at the EC2 level to improve availability.
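For illustration, the relevant AWSMachineProviderConfig excerpt in a MachineSet would look roughly like this (the placement group name is a placeholder; a partition placement group supports partitions 1-7):

providerSpec:
  value:
    apiVersion: machine.openshift.io/v1beta1
    kind: AWSMachineProviderConfig
    placementGroupName: pg-partition-example    # pre-created partition placement group (placeholder)
    placementGroupPartition: 2                  # partition number within the group (1-7)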

References

Upstream changes: CFE-1041

Description of problem:

    Creating a machineset with an invalid placementGroupPartition of 0 results in the value being cleared in the machineset; the machine is created successfully and placed inside an auto-chosen partition number, but creation should fail instead.

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-07-01-124741

How reproducible:

    Always

Steps to Reproduce:

    1. Create a machineset with a pre-created partition placement group and placementGroupPartition = 0
          placementGroupName: pgpartition
          placementGroupPartition: 0

    2. The placementGroupPartition was cleared in the machineset, and the machine was created successfully in the pgpartition placement group and placed inside an auto-chosen partition number
    

Actual results:

    The machine is created and placed inside an auto-chosen partition number.

Expected results:

    Machine creation should fail with an error message, as it does for other invalid values, for example:

errorMessage: 'error launching instance: Value ''8'' is not a valid value for PartitionNumber.     Use a value between 1 and 7.' 

Additional info:

    It's a new feature test for https://issues.redhat.com/browse/CFE-1066

Implement changes in machine-api-provider-aws to support the partition number when creating instances.

 

Acceptance criteria:

  • Add implementation to support PlacementGroupPartition during AWS instance creation
  • Add unit tests
  • Vendor CFE-1063 into machine-api-operator to support PlacementGroupPartition in AWSMachineProviderConfig, allowing users to specify the Partition Number of a placement group.
  • Add webhook validation to check for a non-empty PlacementGroupName when PlacementGroupPartition is used
  • Add necessary unit tests

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have multiple customers asking us to disable the vSphere CSI driver/operator as a day 2 operation. The goal of this epic is to provide a safe API that will remove the vSphere CSI driver/operator. This will also silence the VPD alerts, as we have received several complaints about VPD raising too many.

 

IMPORTANT: As disabling a storage driver from a running environment can be risky, the use of this API will only be allowed through a RH customer support case. Support will ensure that it is safe to proceed and guide the customer through the process.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Customers want to disable vSphere CSI because it requires several vSphere permissions that they don't want to allow at the OCP level, and not setting these permissions results in constant, recurring alerting. These customers usually don't want to use vSphere CSI because they use another storage solution.

 

The goal is to provide an API that disables the vSphere storage integration as well as the VPD alerts, which will still be present in the logs but will not be raised (no alerts, lower frequency of checks, lower severity).

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

The vSphere CSI driver/operator is disabled and no VPD alerts are raised. Logs are still present.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both self managed
Classic (standalone cluster) yes vsphere only
Hosted control planes N/A
Multi node, Compact (three node), or Single node (SNO), or all vsphere only usually not SNO
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all applicable to vsphere
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM) none
Other (please specify) Available only through RH support case

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an admin I want to disable the vsphere CSI driver because I am using another storage solution. 

As an admin I want to disable the vsphere CSI driver because it requires too many vsphere permissions and keep raising OCP alerts.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

What do we do if there are existing PVs?

How do we manage general alerts if VPD alerts are silenced?

What do we do if customer tries to install the upstream vsphere CSI?

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Replace the Red Hat vSphere CSI with the vmware upstream driver. We can consider this use case in a second phase if there is an actual demand.

Public availability. To begin with this will be only possible through RH support.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

There have been several customer requests asking for the ability to disable the vSphere CSI driver.

see https://issues.redhat.com/browse/RFE-3821

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Understand why the customer wants to disable it in the first place. Be extra careful with pre-flight checks in order to make sure that it is safe to proceed.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

No public doc; we need detailed documentation for our support organisation that includes pre-flight checks, the different steps, confirmation that everything works as expected, and a basic troubleshooting guide. Likely an internal KB article or whatever works best for support.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Applies to vsphere only

Epic Goal*

Provide a nice and user-friendly API to disable integration between OCP storage and vSphere.

 
Why is this important? (mandatory)

This is a continuation of STOR-1766. For older releases (4.12 - 4.16) we want to provide a dirty way to disable the vSphere CSI driver.

This epic provides a nice and explicit API to disable the CSI driver in 4.17 (or whichever release this epic is implemented in), and ensures the cluster can be upgraded to any future OCP version.

Scenarios (mandatory) 

  • As an OCP cluster admin, I want to disable the CSI driver as a day 2 operation (i.e. I can't disable the Storage capability), so the vSphere CSI driver and its operator do not mess with my vSphere.
  • As an OCP cluster admin, I want to update my 4.12 cluster with the vSphere CSI driver removed in a dirty way (STOR-1766) all the way to 4.17 and then use a nice API to disable the CSI driver forever.

Dependencies (internal and external) (mandatory)

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Allow customers to enable EFS CSI usage metrics.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.

The EFS metrics are not enabled by default for a good reason: collecting them can potentially impact performance. It's disabled in OCP because the CSI driver would walk through the whole volume, which can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers would need to explicitly opt in.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Clear procedure on how to enable it as a day 2 operation. Default remains no metrics. Once enabled the metrics should be available for visualisation.

 

We should also have a way to disable metrics.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all AWS only
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all AWS/EFS supported
Operator compatibility EFS CSI operator
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Should appear in OCP UI automatically
Other (please specify) OCP on AWS only

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user I want to be able to visualise the EFS CSI metrics.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Additional metrics

Enabling metrics by default.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Customer request as per 

https://issues.redhat.com/browse/RFE-3290

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

We need to be extra clear on the potential performance impact

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document how to enable CSI metrics + warning about the potential performance impact.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

It can benefit any cluster on AWS using EFS CSI including ROSA

Epic Goal*

The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could impact performance (the CSI driver would walk through the whole volume), this option will not be enabled by default; admins will need to explicitly opt in.

 
Why is this important? (mandatory)

Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to turn on EFS CSI metrics 
  2. As an admin I would like to visualise how much EFS space is used by OCP.

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Yes, knowledge transfer
  • Others -

Acceptance Criteria (optional)

Enable CSI metrics via the operator - ensure the driver is started with the proper cmdline options. Verify that the metrics are sent and exposed to the users.
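As a sketch only: the upstream aws-efs-csi-driver exposes a volume-metrics opt-in flag, so once the admin opts in, the operator would start the driver container with something like the following arguments (the exact operator-level API for opting in is not defined here, and the flag name should be treated as an assumption):

containers:
- name: csi-driver
  image: <efs-csi-driver-image>        # managed by the operator
  args:
  - --endpoint=$(CSI_ENDPOINT)
  - --vol-metrics-opt-in=true          # assumed upstream flag enabling volume usage metrics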

Drawbacks or Risk (optional)

Metrics are calculated by walking through the whole volume, which can impact performance. For this reason enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:

The original PR had all the labels, but it didn't merge in time for code freeze due to CI flakes.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Feature Overview (aka. Goal Summary)  

Some customers have expressed the need to have a control plane with nodes of a different architecture from the compute nodes in a cluster. This may be to realise cost or power savings, or simply to benefit from cloud providers' in-house hardware.

While this config can be achieved with Hosted Control Planes the customers also want to use Multi-architecture compute to achieve these configs, ideally at install time. 

This feature is to track the implementation of this offering with Arm nodes running in the AWS cloud.

Goals (aka. expected user outcomes)

Customers will be able to install OpenShift clusters that contain control plane and compute nodes of different architectures

Requirements (aka. Acceptance Criteria):

  • Install a cluster with an x86 control plane and arm workers
  • Install a cluster with an arm control plane and x86 workers
  • Validate that the installer can support the above configurations (i.e. fails gracefully with a single-arch payload)
  • Provide a way for the user to know that their installer can support multi-arch installs (i.e. openshift-install version)

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Yes
Classic (standalone cluster) Yes
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all Yes
Connected / Restricted Network Yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86 and Arm
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) n/a
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Install a cluster in AWS with different cpu architectures for the control plane and workers.

Acceptance Criteria

  • Install a cluster with an x86 control plane and arm workers
  • Install a cluster with an arm control plane and x86 workers
  • Validate that the installer can support the above configurations (i.e. fails gracefully with a single-arch payload)
  • Provide a way for the user to know that their installer can support multi-arch installs (i.e. openshift-install version)

Allow mixing control plane and compute CPU architectures; bypass with warnings if the user overrides the release image. Put this behind a feature gate.

Add validation in the Installer to not allow install with multi-arch nodes using a single-arch release payload.

The validation needs to be skipped (or just a warning) when the release payload architecture cannot be determined (e.g. when using OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE).
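For illustration, a mixed-architecture install-config.yaml excerpt that the installer should only accept against a multi-arch release payload (domain, names, and region are placeholders):

apiVersion: v1
baseDomain: example.com
metadata:
  name: mixed-arch-cluster
controlPlane:
  name: master
  architecture: amd64        # x86 control plane
  replicas: 3
compute:
- name: worker
  architecture: arm64        # Arm workers
  replicas: 3
platform:
  aws:
    region: us-east-1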

Currently the installer doesn't expose if the release payload is multi or single arch:

./openshift-install version
./openshift-install 4.y.z
built from commit xx
release image quay.io/openshift-release-dev/ocp-release@sha256:xxx
release architecture amd64 

Feature Overview (aka. Goal Summary)  

Some customers have expressed the need to have a control plane with nodes of a different architecture from the compute nodes in a cluster. This may be to realise cost or power savings, or simply to benefit from cloud providers' in-house hardware.

While this config can be achieved with Hosted Control Planes the customers also want to use Multi-architecture compute to achieve these configs, ideally at install time. 

This feature is to track the implementation of this offering with Arm nodes running in the AWS cloud.

Goals (aka. expected user outcomes)

Customers will be able to install OpenShift clusters that contain control plane and compute nodes of different architectures

Requirements (aka. Acceptance Criteria):

  • Install a cluster with an x86 control plane and arm workers
  • Install a cluster with an arm control plane and x86 workers
  • Validate that the installer can support the above configurations (i.e. fails gracefully with a single-arch payload)
  • Provide a way for the user to know that their installer can support multi-arch installs (i.e. openshift-install version)

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Yes
Classic (standalone cluster) Yes
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all Yes
Connected / Restricted Network Yes
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86 and Arm
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) n/a
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Install a cluster in GCP with different cpu architectures for the control plane and workers.

Why is this important?

  •  

Scenarios
1. …

Acceptance Criteria

  • Install a cluster with an x86 control plane and arm workers
  • Install a cluster with an arm control plane and x86 workers
  • Validate that the installer can support the above configurations (i.e. fails gracefully with a single-arch payload)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1.

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Feature Overview (aka. Goal Summary)  

Enable CPU manager on s390x.

Why is this important?

CPU manager is an important component for managing OpenShift performance and utilizing the respective platforms.

Goals (aka. expected user outcomes)

Enable CPU manager on s390x.

Requirements (aka. Acceptance Criteria):

CPU manager works on s390x.
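Enabling the CPU manager works the same way on s390x as on other architectures: label the target MachineConfigPool and apply a KubeletConfig with the static policy, for example (names and the pool label are placeholders):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled                     # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled       # label previously added to the target MachineConfigPool
  kubeletConfig:
    cpuManagerPolicy: static                   # enable the static CPU manager policy
    cpuManagerReconcilePeriod: 5s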

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Y
Classic (standalone cluster) Y
Hosted control planes Y
Multi node, Compact (three node), or Single node (SNO), or all Y
Connected / Restricted Network Y
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) IBM Z
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) n/a
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview

oc-mirror to include the RHCOS image for HyperShift KubeVirt provider when mirroring the OCP release payload

Goals

When using the KubeVirt (OpenShift Virtualization) provider for HyperShift, the KubeVirt VMs that are going to serve as nodes for the hosted clusters consume an RHCOS image shipped as a container-disk image for KubeVirt.

In order to have it working on disconnected/air-gapped environments, its image must be part of the mirroring process.

Overview

Refer to RFE-5468

The CoreOS image is needed to ensure seamless deployment of HyperShift KubeVirt functionality in disconnected/air-gapped environments.

Solution

This story will address this issue: oc-mirror will include the CoreOS KubeVirt container image in the release payload.

The image is found in the file release-manifests/0000_50_installer_coreos-bootimages.yaml

A field kubeVirtContainer (default false) will be added to the current v2 imagesetconfig. If set to true, the release collector will have logic to read and parse the yaml file correctly to extract the "DigestRef" (digest) and add it to the release payload.
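A minimal oc-mirror v2 ImageSetConfiguration sketch with the new field; its exact placement under mirror.platform is an assumption based on the description above:

apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    kubeVirtContainer: true          # opt in to mirroring the RHCOS container-disk image
    channels:
    - name: stable-4.17              # placeholder channel
      minVersion: 4.17.8
      maxVersion: 4.17.8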

 

Feature Overview (aka. Goal Summary)  

As a product manager or business owner of OpenShift Lightspeed, I want to track who is using which feature of OLS and why. I also want to track the product adoption rate so that I can make decisions about the product (add/remove features, add new investment).

Requirements (aka. Acceptance Criteria):

Notes:

Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default.

Users will have the ability to disable the monitoring ----> by unchecking the box.

 

Refer to this slack conversation :https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489 

 

Description of problem:

When installing the OpenShift Lightspeed operator, cluster monitoring should be enabled by default.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Click OpenShift Lightspeed in operator catalog
    2. Click Install
    

Actual results:

"Enable Operator recommended cluster monitoring on this Namespace" checkbox is not selected by default.

Expected results:

"Enable Operator recommended cluster monitoring on this Namespace" checkbox should be selected by default.

Additional info:
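For context, the console checkbox ultimately applies the standard cluster-monitoring opt-in label to the operator's namespace, so the expected end state after installation with the checkbox selected would look roughly like this (the namespace name is an assumption):

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-lightspeed                   # assumed operator namespace
  labels:
    openshift.io/cluster-monitoring: "true"    # applied when the monitoring checkbox is selected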

    

Feature Overview (aka. Goal Summary)  

This ticket focuses on a reduced scope compared to the initial Tech Preview outlined in OCPSTRAT-1327.

Specifically, the console in the 4.17 Tech Preview release allows customers to:

  • discover collections of Kubernetes extension/operator content released in FBC format within a new ecosystem catalog UI in the 'Administrator Perspective' of the console, powered by the OLM v1 catalog API.
  • view a list of installed Kubernetes extension/operator objects (previously installed via CLI) and easily edit them using the built-in YAML editor in the console.

Goals (aka. expected user outcomes)

1) Pre-installation:

  • Both cluster-admins and non-privileged end-users can explore and discover the layered capabilities or workloads provided by Kubernetes extensions/operators in a new unified ecosystem catalog UI within the console's Administrator Perspective. 
    • (This catalog will be expanded to include content packaged as Helm charts in the future.)
  • Users can filter available offerings by the provider (Red Hat, ISV, community, etc), valid subscription, infrastructure features, and other criteria in the new unified ecosystem catalog UI.
  • In this Tech Preview release, users can view detailed descriptions and other metadata for the latest version within the default channel. 
    • (Future releases will expand this to allow users to discover all versions across all channels defined by an offering or package within a catalog and select a specific version from a desired channel.)

2) Post-installation: 

  • In this Tech Preview release, users with access to OLM v1's ClusterExtension API can view a list of installed Kubernetes extension/operator objects (previously installed via CLI) and easily create, read, update, and delete them using the console's built-in YAML editor.

Requirements (aka. Acceptance Criteria):

All the expected user outcomes and the acceptance criteria in the engineering epics are covered.
 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Our customers will experience a streamlined approach to managing layered capabilities and workloads delivered through operators, operators packaged in Helm charts, or even plain Helm charts.  The next generation OLM will power this central distribution mechanism within OpenShift in the future. 

Customers will be able to explore and discover the layered capabilities or workloads, and then install those offerings and make them available on their OpenShift clusters.  Similar to the experience with the current OperatorHub, customers will be able to sort and filter the available offerings based on the delivery mechanism (i.e., operator-backed or plain helm charts), source type (i.e., from Red Hat or ISVs), valid subscriptions, infrastructure features, etc.  Once they click on a specific offering, they see the details, which include the description, usage, and requirements of the offering, the services provided as APIs, and the rest of the relevant metadata for making decisions.  

The next-gen OLM aims to unify workload management.  This includes operators packaged for current OLM, operators packaged in Helm charts, and even plain Helm charts for workloads.  We want to leverage the current support for managing plain Helm charts within OpenShift and the console for leveraging our investment over the years. 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Refer to the “Documentation Considerations” section of the OLM v1 GA feature.

Relevant documents

 

This epic contains all the OLM-related stories for OCP release-4.17. This was cloned from the original epic, which contained a spike to create stories and a plan to support the OLM v1 API.

Some docs that detail the OLM v1 Upgrade: https://docs.google.com/document/d/1D--lL8gnoDvs0vl72WZl675T23XcsYIfgW4yhuneCug/edit#heading=h.3l98owr87ve

Epic Goal

  • Results from planning spike in 4.16

AC: Implement a catalog view, which lists available OLM v1 packages

  • MVP - Both cluster-admins and non-privileged end-users can explore and discover the layered capabilities or workloads delivered by k8s extensions/operators or plain helm charts from a unified ecosystem catalog UI in the ‘Administrator Perspective’ in the console.
  • Stretch - Users can filter the available offerings based on the delivery mechanism/source type (i.e., operator-backed or plain helm charts), providers (i.e., from Red Hat or ISVs), valid subscriptions, infrastructure features, etc.
  • Enable the Catalog view only when the OLMv1 is installed on the cluster.

Feature Overview (aka. Goal Summary)  

This ticket outlines the scope of the Tech Preview release for OCP 4.17

This Tech Preview release grants early access to upcoming features in the next-generation Operator Lifecycle Manager (OLM v1).  Customers can now test these functionalities and provide valuable feedback during development.

Goals (aka. expected user outcomes)

Highlights of OLM v1 Phase 4 Preview:

  • Safe CRD upgrades: Prevent data loss due to CRD schema changes
  • Clear compatibility reporting: Improved status reporting for supported and unsupported operator bundles
  • Clear ownership: Prevent conflicts between multiple ClusterExtensions managing the same resources
  • Least privilege principle: Adhere to security best practices by using dedicated ServiceAccounts for installing/upgrading content
  • Secure communication: Protect catalog data with HTTPS encryption for catalogd webserver responses
  • Laying the groundwork for native Helm chart support: OLM v1 embeds Helm, doing the heavy lifting to enable future native support for Helm chart-packaged content

Requirements (aka. Acceptance Criteria):

All the expected user outcomes and the acceptance criteria in the engineering epics are covered.

Background

Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul.

With OpenShift 4.17, we are one step closer to the highly anticipated general availability (GA) of the next-generation OLM.  

See the OCPSTRAT feature for OLM v1 GA:

Documentation Considerations

  • Safe CRD Upgrades: [TP release] Docs explain OLM v1's current approach to prevent data loss due to CRD schema changes during the ClusterExtension upgrade.
  • Clear compatibility reporting: [TP release] Docs introduce OLM v1's current approach to communicating the supported and unsupported operator bundles during installation.
  • Clear Ownership: [TP release] Docs explain OLM v1's effort to prevent conflicts between multiple ClusterExtensions managing the same resources.
  • Least Privilege Principle: [TP release] Docs explain OLM v1's design rationale behind adhering to security best practices by using dedicated ServiceAccounts for installing/upgrading content, showcasing the installation/upgrade flow with ServiceAccounts w/o and w/ enough permissions associated with it.
  • Secure Communication: [TP release] Docs explain OLM v1's security stance in protecting catalog data with HTTPS encryption for catalogd webserver responses.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Operator controller no longer depends on the rukpak APIs and controllers, and we do not intend to support them in OLMv1 going forward. We need to remove the rukpak APIs and controllers from the payload to ensure they are not present/do not run.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Use HTTPS for catalogd webserver before we GA the Catalog API

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Upstream parent issue: https://github.com/operator-framework/catalogd/issues/242 

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

operator-controller manifests will need to be updated to create a configmap for service-ca-operator to inject the CA bundle into. In order to not break the payload, cluster-olm-operator will need to be updated to have create, update, patch permissions for the configmap we are creating. Following the principle of least privilege, the permissions should be scoped to the resource name "operator-controller-openshift-ca" (this will be the name of the created configmap)
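A minimal sketch of the injected ConfigMap described above (the namespace is an assumption; only the name and annotation come from the description):

apiVersion: v1
kind: ConfigMap
metadata:
  name: operator-controller-openshift-ca
  namespace: openshift-operator-controller            # assumed namespace
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"  # service-ca-operator injects the CA bundle here

cluster-olm-operator's RBAC would then be scoped to this resource name for update and patch (in Kubernetes RBAC, create cannot actually be restricted by resourceNames, so only update and patch can be name-scoped in practice).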

Feature Overview (aka. Goal Summary)  

OVN-Kubernetes Developer's Preview for BGP as a routing protocol, enabling User Defined Network (Segmentation) pod and VM addressability via common data center networking and removing the need to negotiate NAT at the cluster's edge.

Goals (aka. expected user outcomes)

OVN-Kubernetes currently has no native routing protocol integration, and relies on a Geneve overlay for east/west traffic, as well as third party operators to handle external network integration into the cluster. The purpose of this Developer's Preview enhancement is to introduce BGP as a supported routing protocol with OVN-Kubernetes. The extent of this support will allow OVN-Kubernetes to integrate into different BGP user environments, enabling it to dynamically expose cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN. In a follow-on release, this enhancement will provide support for EVPN, which is a common data center networking fabric that relies on BGP.

Requirements (aka. Acceptance Criteria):

  • Provide a user-facing API to allow configuration of iBGP or eBGP peers, along with typical BGP configurations to include communities, route targets, vpnv4/v6, etc
  • Support for advertising Egress IP addresses
  • Enable BFD to BGP peers
  • Support EVPN configuration and integration with a user’s DC fabric, along with MAC-VRFs and IP-VRFs
  • ECMP routing support within OVN for BGP learned routes
     
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • Integration with 3rdparty load balancers that send packets directly to OpenShift nodes with the destination IP address of a targeted pod, without needing custom operators to detect which node a pod is scheduled to and then add routes into the load balancer to send the packet to the right node.

Questions to Answer (Optional):

Out of Scope

  • EVPN integration
  • Support of any other routing protocol
  • Running separate BGP instances per VRF network
  • Support for any other type of L3VPN with BGP, including MPLS
  • Providing any type of API or operator to automatically connect two Kubernetes clusters via L3VPN
  • Replacing the support that MetalLB provides today for advertising service IPs
  • Asymmetric Integrated Routing and Bridging (IRB) with EVPN

Background

BGP

Importing Routes from the Provider Network
Today in OpenShift there is no API for a user to be able to configure routes into OVN. In order for a user to change how cluster traffic is routed egress into the cluster, the user leverages local gateway mode, which forces egress traffic to hop through the Linux host's networking stack, where a user can configure routes inside of the host via NM State. This manual configuration would need to be performed and maintained across nodes and VRFs within each node.

Additionally, if a user chooses to not manage routes within the host and use local gateway mode, then by default traffic is always sent to the default gateway. The only other way to affect egress routing is by using the Multiple External Gateways (MEG) feature. With this feature the user may choose to have multiple different egress gateways per namespace to send traffic to.

As an alternative, configuring BGP peers and which route-targets to import would eliminate the need to manually configure routes in the host, and would allow dynamic routing updates based on changes in the provider’s network.

Exporting Routes into the Provider Network
There exists a need for provider networks to learn routes directly to services and pods today in Kubernetes. Metal LB is already one solution whereby load balancer IPs are advertised by BGP to provider networks, and this feature development does not intend to duplicate or replace the function of Metal LB. Metal LB should be able to interoperate with OVN-Kubernetes, and be responsible for advertising services to a provider’s network.

However, there is an alternative need to advertise pod IPs on the provider network. One use case is integration with 3rd party load balancers, where they terminate a load balancer and then send packets directly to OCP nodes with the destination IP address being the pod IP itself. Today these load balancers rely on custom operators to detect which node a pod is scheduled to and then add routes into its load balancer to send the packet to the right node.

By integrating BGP and advertising the pod subnets/addresses directly on the provider network, load balancers and other entities on the network would be able to reach the pod IPs directly.

EVPN (to be integrated with BGP in a follow-on release targeting 4.18)

Extending OVN-Kubernetes VRFs into the Provider Network
This is the most powerful motivation for bringing support of EVPN into OVN-Kubernetes. A previous development effort enabled the ability to create a network per namespace (VRF) in OVN-Kubernetes, allowing users to create multiple isolated networks for namespaces of pods. However, the VRFs terminate at node egress, and routes are leaked from the default VRF so that traffic is able to route out of the OCP node. With EVPN, we can now extend the VRFs into the provider network using a VPN. This unlocks the ability to have L3VPNs that extend across the provider networks.

Utilizing the EVPN Fabric as the Overlay for OVN-Kubernetes
In addition to extending VRFs to the outside world for ingress and egress, we can also leverage EVPN to handle extending VRFs into the fabric for east/west traffic. This is useful in EVPN DC deployments where EVPN is already being used in the TOR network, and there is no need to use a Geneve overlay. In this use case, both layer 2 (MAC-VRFs) and layer 3 (IP-VRFs) can be advertised directly to the EVPN fabric. One advantage of doing this is that with Layer 2 networks, broadcast, unknown-unicast and multicast (BUM) traffic is suppressed across the EVPN fabric. Therefore the flooding domain in L2 networks for this type of traffic is limited to the node.

Multi-homing, Link Redundancy, Fast Convergence
Extending the EVPN fabric to OCP nodes brings other added benefits that are not present in OCP natively today. In this design there are at least 2 physical NICs and links leaving the OCP node to the EVPN leaves. This provides link redundancy, and when coupled with BFD and mass withdrawal, it can also provide fast failover. Additionally, the links can be used by the EVPN fabric to utilize ECMP routing.

Customer Considerations

  • For customers using MetalLB, it will continue to function correctly regardless of this development.

Documentation Considerations

Interoperability Considerations

  • Multiple External Gateways (MEG)
  • Egress IP
  • Services
  • Egress Service
  • Egress Firewall
  • Egress QoS

 

Epic Goal

OVN Kubernetes support for BGP as a routing protocol.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Introduce snapshots support for Azure File as Tech Preview

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece to support snapshots as Tech Preview

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the RN. Since this feature is TP we can still introduce it with known issues.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all with Azure
Connected / Restricted Network all
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility Azure File CSI
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Already covered
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Are there any known issues? If so, they should be documented.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

N/A

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

We have support for CSI snapshots on other cloud providers; we need to align capabilities in Azure with its File CSI driver. Upstream support has lagged.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

User experience should be the same as other CSI drivers.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Can be leveraged by ARO or OSD on Azure.

Epic Goal*

Add support for snapshots in Azure File.

 

Why is this important? (mandatory)

We should track upstream issues and ensure enablement in OpenShift. Snapshots are a standard CSI feature, and the reason we did not support them until now was the lack of upstream support for snapshot restoration.

Snapshot restore feature was added recently in upstream driver 1.30.3 which we rebased to in 4.17 - https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1904

Furthermore, we already included the azcopy CLI, which is a dependency of cloning (and snapshots). Enabling snapshots in 4.17 is therefore just a matter of adding a sidecar, a VolumeSnapshotClass and RBAC in csi-operator, which is cheap compared to the gain.

However, we've observed a few issues with cloning that might need further fixes before it can graduate to GA, so we intend to release the cloning feature as Tech Preview in 4.17. Since snapshots are implemented with azcopy too, we expect similar issues and suggest releasing the snapshot feature as Tech Preview in 4.17 as well.

 
Scenarios (mandatory) 

Users should be able to create a snapshot and restore PVC from snapshots.
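
A minimal sketch of the expected user flow, assuming the standard snapshot.storage.k8s.io/v1 API and the file.csi.azure.com driver; the resource names and the azurefile-csi storage class name are illustrative placeholders:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: azurefile-csi-snapclass
driver: file.csi.azure.com
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-azurefile-snapshot
spec:
  volumeSnapshotClassName: azurefile-csi-snapclass
  source:
    persistentVolumeClaimName: my-azurefile-pvc
---
# Restore the snapshot into a new PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-azurefile-restore
spec:
  storageClassName: azurefile-csi
  dataSource:
    name: my-azurefile-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi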

 
Dependencies (internal and external) (mandatory)

azcopy - already added in scope of cloning epic

upstream driver support for snapshot restore - already added via 4.17 rebase

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 


Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Develop tooling to support migrating CNS volumes between datastores in a safe way for OpenShift users.

This tool relies on a new VMware CNS API and requires vSphere 8.0.2 or 7.0 Update 3o as a minimum version

https://docs.vmware.com/en/VMware-vSphere/8.0/rn/vsphere-vcenter-server-802-release-notes/index.html

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Our customers often need to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant one. Previously this was almost impossible, or required modifying PV specs by hand, which was very error prone.

 

As a first version, we will develop a CLI tool that is shipped as part of the vSphere CSI operator. We will keep this tooling internal for now; support can guide customers on a per-request basis. This addresses current urgent customer requests: a CLI tool is easier and faster to develop, and it can also easily be used with previous OCP releases.

Ultimately we want to develop an operator that would take care of migrating CNS between datastores.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Tool is able to take a list of volumes and migrate from one datastore to another. It also performs the necessary pre-flight tests to ensure that the volume is safe to migrate.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) Yes
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all Yes
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility vsphere CSI operator
Backport needed (list applicable versions) no
UI need (e.g. OpenShift Console, dynamic plugin, OCM) no
Other (please specify) OCP on vsphere only

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an admin, I want to migrate all my PVs, or optionally the PVCs belonging to a certain namespace, to a different datastore within the cluster without requiring extended downtime.

  1. I want to move volumes to another datastore that has better performance
  2. I want to move volumes to another datastore because the current one is getting full
  3. I want to move all volumes to another datastore because the current one is being decommissioned.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

How to ship the binary?

Which versions of OCP can this tool support?

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

This feature tracks the implementation with a CLI binary. The operator implementation will be tracked by another Jira feature.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

We have had a lot of requests to migrate volumes between datastores, for multiple reasons. Until now this was not natively supported by VMware. In 8.0.2 they added a CNS API and a vSphere UI feature to perform volume migration.

We want to avoid customers using the feature directly from the vSphere UI, so we have to develop a wrapper for them. It is easier to ship a CLI tool first to cover the current requests and then take some time to develop an official, operator-led approach.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Given that this tool is shipped as part of the vSphere CSI operator and requires extraction and careful manipulation, we are not going to document it publicly.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Will be documented as an internal KCS

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

OCP on vSphere only

Epic Goal*

Develop tooling to support migrating CNS volumes between datastores in a safe way for OpenShift users.

As a first version, we will develop a CLI tool that is shipped as part of the vSphere CSI operator. We will keep this tooling internal for now; support can guide customers on a per-request basis. This addresses current urgent customer requests: a CLI tool is easier and faster to develop, and it can also easily be used with previous OCP releases.

Ultimately we want to develop an operator that would take care of migrating CNS between datastores.

 

Why is this important? (mandatory)

Our customers often need to migrate volumes between datastores because they are running out of space in the current datastore or want to move to a more performant one. Previously this was almost impossible, or required modifying PV specs by hand, which was very error prone.

 
Scenarios (mandatory) 

As an admin, I want to migrate all my PVs, or optionally the PVCs belonging to a certain namespace, to a different datastore within the cluster without requiring extended downtime.

  1. I want to move volumes to another datastore that has better performance
  2. I want to move volumes to another datastore because the current one is getting full
  3. I want to move all volumes to another datastore because the current one is being decommissioned.

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Goal

The goals of this feature are to:

  • Optimize and streamline the operations of the HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters
  • Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Placeholder epic to capture all Azure tickets.

TODO: review.

User Story:

As an end user of a hypershift cluster, I want to be able to:

  • Not see internal host information when inspecting a serving certificate of the kubernetes API server

so that I can achieve

  • No knowledge of internal names for the kubernetes cluster.

From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739 

We need 4 different certs:

  • common SANs
  • internal SAN
  • FQDN
  • SVC IP
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Add e2e tests to openshift/origin to test the improvement in integration between CoreDNS and EgressFirewall as proposed in the enhancement https://github.com/openshift/enhancements/pull/1335.

 

As the feature is currently targeted for Tech Preview, the e2e tests should enable the Tech Preview feature set in order to test the feature.

 

The e2e test should create an EgressFirewall with DNS rules after enabling Tech Preview, and the EgressFirewall rules should work correctly. E.g. https://github.com/openshift/origin/blob/master/test/extended/networking/egress_firewall.go
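
For reference, a minimal sketch of the kind of EgressFirewall with a DNS rule such a test could create (the namespace and dnsName are illustrative placeholders; the EgressFirewall object must be named "default"):

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test-egress-firewall
spec:
  egress:
  # Allow egress to a destination resolved by DNS (the CoreDNS/EgressFirewall integration under test)
  - type: Allow
    to:
      dnsName: www.example.com
  # Deny all other egress traffic
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0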

Goal

As an OpenShift installer I want to update the firmware of the hosts I use for OpenShift on day 1 and day 2.

As an OpenShift installer I want to integrate the firmware update in the ZTP workflow.

Description

The firmware updates are required in BIOS, GPUs, NICs, DPUs, on hosts that will often be used as DUs in Edge locations (commonly installed with ZTP).

Acceptance criteria

  • Firmware can be updated (upgrade/downgrade)
  • Existing firmware version can be checked

Out of Scope

  • Day 2 host firmware upgrade

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:

After running a firmware update the new version is not displayed in the status of the HostFirmwareComponents    

Version-Release number of selected component (if applicable):

    

How reproducible:

100%    

Steps to Reproduce:

    1. Execute a firmware update, after it succeeds check the Status to find the information about the new version installed.    

Actual results:

    The Status only shows the initial information about the firmware components.

Expected results:

    The Status should show the new information about the firmware components.

Additional info:

    

When executing a firmware update for a BMH, there is a problem updating the Status of the HostFirmwareComponents CRD, causing the BMH to repeat the update multiple times since it stays in the Preparing state.
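
For reference, a rough sketch of the resource in question, assuming the Metal3 HostFirmwareComponents schema (field values and component names are illustrative and may differ); after a successful update the status is expected to reflect the newly installed version:

apiVersion: metal3.io/v1alpha1
kind: HostFirmwareComponents
metadata:
  name: worker-0                # matches the BareMetalHost name
  namespace: openshift-machine-api
spec:
  updates:
  - component: bios
    url: http://example.com/firmware/bios-v2.0.0.bin
status:
  components:
  - component: bios
    initialVersion: "1.0.0"
    currentVersion: "2.0.0"     # should be updated once the firmware update succeeds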

As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP update and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers often have to dig deeper to find these nodes for further debugging.
    The ask has been to bubble this up in the update progress window.
  2. oc update status?
    From the UI we can see the progress of the update. From the oc CLI we can see this with "oc get cvo",
    but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API.  Please find the mock output of the command attached in this card.
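
A usage sketch: while the command is Tech Preview it is assumed to be gated behind an opt-in environment variable (OC_ENABLE_CMD_UPGRADE_STATUS here; the exact gate may differ), for example:

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all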

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process. 
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Description of problem:

The cluster version is not updating (Progressing=False).

  Reason: <none>
  Message: Cluster version is 4.16.0-0.nightly-2024-05-08-222442 

When the cluster is not updating, it shows the Failing=True condition content, which is potentially confusing. I think we can just show "The cluster version is not updating".

Description of problem:

The newly available TP upgrade status command has a formatting issue when expanding update health using the --details flag: a plural "s: <resource>" is displayed. According to dev, the plural "s" is supposed to be appended to group.kind, but only the plural itself is displayed instead:

Resources:
  s: version

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-08-222442

How reproducible:

100%

Steps to Reproduce:

oc adm upgrade status --details=all
while there is any health issue with the cluster

Actual results:

  Resources:
    s: ip-10-0-76-83.us-east-2.compute.internal
  Description: Node is unavailable
  Resources:
    s: version
  Description: Cluster operator control-plane-machine-set is not available
  Resources:
    s: ip-10-0-58-8.us-east-2.compute.internal
  Description: failed to set annotations on node: unable to update node "&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}": Patch "https://api-int.evakhoni-1514.qe.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-58-8.us-east-2.compute.internal": read tcp 10.0.58.8:48328->10.0.27.41:6443: read: connection reset by peer

Expected results:

should mention the correct <group.kind>s:<resource> ?     

Additional info:
OTA-1246
slack thread

Using the alerts-in-CLI PoC from OTA-1080, show relevant firing alerts in the OTA-1087 section. Probably do not show all firing alerts.

I propose showing

  • Alerts that started to fire during the upgrade
  • Allow list of alerts that we know are relevant during upgrades? Insight severity can match alert severity.

Impact can probably be a simple alertname -> impact type classifier. Message can be "Alert name: Alert message":

=Update Health= 
SINCE	        LEVEL 		        IMPACT 			MESSAGE
3h		Warning		        API Availability	KubeDaemonSetRolloutStuck: DaemonSet openshift-ingress-canary/ingress-canary has not finished or progressed for at least 30 minutes.

Definition of done

  • Alerts that started firing during the upgrade are shown as an upgrade health insight in the upgrade health section
  • Alerts that started firing before the upgrade but are present on an allowlist (hardcoded for now) of alerts relevant for updates are also shown
  • Create an allow list structure for alerts which controls which alerts are shown in this section.
  • We do not plan to decide which alerts should be in the allow list as part of this card (that is a future card)

Feature Overview

Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.

Goals

Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).

Use Cases

Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.

 


Done Done Done Criteria

This section contains all the test cases that we need to make sure work as part of the done^3 criteria.

  • Clean install of new cluster with multi vCenter configuration
  • Clean install of new cluster with single vCenter still working as previously
  • VMs / machines can be scaled across all vCenters / Failure Domains
  • PVs should be able to be created on all vCenters

Out-of-Scope

This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.

  • Migration of a single vCenter OCP cluster to a multi vCenter one (stretch goal)

User Story

As an OpenShift administrator, I would like the vSphere CSI Driver Operator to not become degraded due to the vSphere Multi vCenter feature gate being enabled, so that I can begin to install my cluster across multiple vCenters and create PVs.

Description

The purpose of this story is to perform the needed changes to get the vSphere CSI Driver Operator allowing the configuration of the new Feature Gate for vSphere Multi vCenter support.  By default, the operator will still only allow one vCenter definition and support that config; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter.  Initially, the plan is to only allow a max of 3 vCenter definitions, which will be controlled via the CRD for the vSphere infrastructure definitions.

Required:

The vSphere CSI Driver Operator must not fail after install due to the number of vCenters configured.  The operator will also need to allow the creation of PVs.  Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.

ACCEPTANCE CRITERIA

  • multi vcenter enabled: Operator is not degraded from having more than one vCenter defined in the infrastructure custom resource
  • multi vcenter disabled: Operator will become degraded if vCenter count is greater than 1

ENGINEERING DETAILS

  • Migrate operator to use new YAML cloud config
  • Fix csi driver controller roles to include correct permissions
  • Update openshift/api to be >= version with new VSphereMultiVCenters feature gate
  • Enhance operator to be able to monitor feature gates
  • Enhance operator to support multiple vCenters
    • apply tags
    • create storage policies
  • Update all check logic
  • Update pod creation to not use env vars and put user/pass in the config.  Env vars do not allow multiple user/password pairs for communication with multiple vCenters

 

User Story

As an OpenShift administrator, I would like Machine API Operator (MAO) to not become degraded due to vSphere Multi vCenter feature gate being enabled so that I can begin to install my cluster across multiple vcenters.

Description

The purpose of this story is to perform the needed changes to get MAO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support.  By default, the operator will still only allow one vCenter definition and support that config; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter.  Initially, the plan is to only allow a max of 3 vCenter definitions which will be controlled via the CRD for the vSphere infrastructure definitions.  Also, this operator will need to be enhanced to handle the new YAML format cloud config.

Required:

The Machine API Operator must not fail after install due to the number of vCenters configured.  It will also need to allow the creation of Machines across the configured vCenters.  Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.

ACCEPTANCE CRITERIA

  • multi vcenter enabled: Operator is not degraded from having more than one vCenter defined in the infrastructure custom resource
  • multi vcenter disabled: Operator will become degraded if vCenter count is greater than 1
  • Operator is now using the new YAML cloud config for vSphere

ENGINEERING DETAILS

  • Migrate operator to use new YAML cloud config
  • Update openshift/api to be >= version with new VSphereMultiVCenters feature gate

 

USER STORY

As a cluster administrator, I would like to enhance the CAPI installer to support multiple vCenters for installation of cluster so that I can spread my cluster across several vcenters.

DESCRIPTION:

The purpose of this story is to enhance the installer to support multiple vCenters.  Today OCP only allows the use of one vCenter to install the cluster into.  With the development of this feature, cluster admins will be able to configure multiple vCenters via the install-config and allow creation of VMs in all specified vCenter instances.  Failure Domains will encapsulate the vCenter definitions.
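
A rough install-config sketch of what a multi-vCenter configuration could look like, assuming the existing platform.vsphere failureDomains schema (server names, datacenters, and paths are placeholders; more than one vCenter is only accepted once the feature gate is enabled):

platform:
  vsphere:
    vcenters:
    - server: vcenter-1.example.com
      user: administrator@vsphere.local
      password: <password>
      datacenters:
      - dc-1
    - server: vcenter-2.example.com
      user: administrator@vsphere.local
      password: <password>
      datacenters:
      - dc-2
    failureDomains:
    - name: fd-vc1
      region: region-1
      zone: zone-1
      server: vcenter-1.example.com
      topology:
        datacenter: dc-1
        computeCluster: /dc-1/host/cluster-1
        datastore: /dc-1/datastore/datastore-1
        networks:
        - VM Network
    - name: fd-vc2
      region: region-2
      zone: zone-2
      server: vcenter-2.example.com
      topology:
        datacenter: dc-2
        computeCluster: /dc-2/host/cluster-2
        datastore: /dc-2/datastore/datastore-2
        networks:
        - VM Network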

ACCEPTANCE CRITERIA

  • Installer can create a cluster with multiple vCenters defined in the install-config when using the new CAPI installer.
  • Installer (Terraform) should still fail when feature gate is enabled and multiple vcenter definitions are detected.
  • Installer uses new feature gate to allow use of multiple vcenters
    • If feature gate enabled and vCenter count > 1, allow install
    • If feature gate disabled and vCenter count > 1, error out with message that more than one is not allowed
  • Installer can destroy a cluster that is spread across multiple vcenters
  • Create unit tests to cover all scenarios

 

ENGINEERING DETAILS

This will require changes in the API to provide the new feature gate.  Once the feature gate is created, the installer can be enhanced to leverage it to allow the user to install the VMs of the cluster across multiple vCenters.

We will need to verify how we are handling unit testing CAPI in the installer.  The unit tests should cover the cases of checking for the new FeatureGate.

User Story

As an OpenShift administrator, I would like MCO to not become degraded due to vSphere Multi vCenter feature gate being enabled so that I can begin to install my cluster across multiple vcenters.

Description

The purpose of this story is to perform the needed changes to get MCO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support.  There will be other stories created to track the functional improvements of MCO.  By default, the operator will still only allow one vCenter definition; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter.  Initially, the plan is to only allow a max of 3 vCenter definitions which will be controlled via the CRD for the vSphere infrastructure definitions.

Required:

The MCO after install must not fail due to the number of vCenters configured.  Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.

ACCEPTANCE CRITERIA

  • multi vcenter enabled: MCO is not degraded from having more than one vCenter defined in the infrastructure custom resource
  • multi vcenter disabled: MCO will become degraded if vCenter count is greater than 1

ENGINEERING DETAILS

We will need to enhance all logic that has hard coded vCenter size to now look to see if vSphere Multi vCenter feature gate is enabled.  If it is enabled, the vCenter count may be larger than 1, else it will still need to fail with the error message of vCenter count may not be greater than 1.

User Story

As an OpenShift administrator, I would like CSO to not become degraded due to multi vcenter feature gate being enabled so that I can begin to install my cluster across multiple vcenters.

Description

The purpose of this story is to perform the needed changes to get CSO allowing the configuration of the new Feature Gate for vSphere Multi vCenter support.  There will be other stories created to track the functional improvements of CSO.  By default, the operator will still only allow one vcenter definition; however, once the feature gate for vSphere Multi vCenter is enabled, we will allow more than one vCenter.  Initially, the plan is to only allow a max of 3 vCenter definitions which will be controlled via the CRD for the vSphere infrastructure definitions.

Required:

The CSO after install must not fail due to the number of vCenters configured.  Any other failure reported based on issues performing operator tasks is valid and should be addressed via a new story.

ACCEPTANCE CRITERIA

  • multi vcenter enabled: CSO is not degraded from having more than one vCenter defined in the infrastructure custom resource
  • multi vcenter disabled: CSO will become degraded if vcenter count is greater than 1

ENGINEERING DETAILS

We will need to enhance all logic that has hard coded vCenter size to now look to see if multi vcenter feature gate is enabled.  If it is enabled, the vcenter count may be larger than 1, else it will still need to fail with the error message of vcenter count may not be greater than 1.

User Story

As an OpenShift administrator, I would like the vSphere Problem Detector (VPD) to not log error messages related to new YAML config so that I can begin to install my cluster across multiple vcenters and create PVs and have VPD verify all vCenters and their configs.

Description

The purpose of this story is to perform the needed changes to get vSphere Problem Detector (VPD) allowing the configuration of the new Feature Gate for vSphere Multi vCenter support.  This involves a new YAML config that needs to be supported.  Also, we need to make sure the VPD checks all vCenters / failure domains for any and all checks that it performs.

Required:

The VPD must not fail after install due to the number of vCenters configured.  The VPD may be logging error messages that are not causing the storage operator to become degraded.  We should verify the logs and make sure all vCenters / FDs are checked as we expect.

ACCEPTANCE CRITERIA

  • multi vcenter enabled (YAML): VPD logs no error message.
  • multi vcenter disabled (INI): VPD logs no error messages.
  • all existing unit tests pass
  • new unit tests added for multi vcenter

ENGINEERING DETAILS

  • Migrate operator to use new YAML cloud config

Feature Overview

Add authentication to the internal components of the Agent Installer so that the cluster install is secure.

Goals

  • Day1: Only allow agents booted from the same agent ISO to register with the assisted-service and use the agent endpoints
  • Day2: Only allow agents booted from the same node ISO to register with the assisted-service and use the agent endpoints
  • Only allow access to write endpoints to the internal services
  • Use authentication to read endpoints

 

Epic Goal

  • This epic scope was originally to encompass both authentication and authorization but we have split the expanding scope into a separate epic.
  • We want to add authorization to the internal components of Agent Installer so that the cluster install is secure. 

Why is this important?

  • The Agent Installer API server (assisted-service) has several methods for authorization, but none of the existing methods are applicable to the Agent Installer use case. 
  • During the MVP of Agent Installer we attempted to turn on the existing authorization schemes but found we didn't have access to the correct API calls.
  • Without proper authorization it is possible for an unauthorized node to be added to the cluster during install. Currently we expect this would happen by mistake rather than maliciously.

Brainstorming Notes:

Requirements

  • Allow only agents booted from the same ISO to register with the assisted-service and use the agent endpoints
  • Agents already know the InfraEnv ID, so if read access requires authentication then that is sufficient in some existing auth schemes.
  • Prevent access to write endpoints except by the internal systemd services
  • Use some kind of authentication for read endpoints
  • Ideally use existing credentials - admin-kubeconfig client cert and/or kubeadmin-password
  • (Future) Allow UI access in interactive mode only

 

Are there any requirements specific to the auth token?

  • Ephemeral
  • Limited to one cluster: Reuse the existing admin-kubeconfig client cert

 

Actors:

  • Agent Installer: example wait-for
  • Internal systemd: configurations, create cluster infraenv, etc
  • UI: interactive user
  • User: advanced automation user (not supported yet)

 

Do we need more than one auth scheme?

Agent-admin - agent-read-write

Agent-user - agent-read

Options for Implementation:

  1. New auth scheme in assisted-service
  2. Reverse proxy in front of assisted-service API
  3. Use an existing auth scheme in assisted-service

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. AGENT-60 Originally we wanted to just turn on local authorization for Agent Installer workflows. It was discovered this was not sufficient for our use case.

Open questions::

  1. Which API endpoints do we need for the interactive flow?
  2. What auth scheme does the Assisted UI use if any?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Once the new auth type is implemented, update assisted-service-env.template from AUTH_TYPE:none to AUTH_TYPE: agent-installer-local

Read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted-service to perform authentication. Use separate auth headers for the API requests in the places where we make curl requests from the systemd services.

User Story:

As a user using agent installer on day2 to add a new node to the cluster, I want to be able to:

  • verify if the token is unexpired

so that I can achieve

  • successful authentication from assisted service and be able to add a node to the cluster
  • an error in the case if token is expired

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user, I want to be able to:

so that I can achieve

  • API authentication

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

When running `./openshift-install agent wait-for bootstrap-complete` and `./openshift-install agent wait-for install-complete`, read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted-service to perform authentication.
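
Roughly, the authenticated requests these commands (and the internal systemd services) make against the assisted-service REST API on the rendezvous host could look like the sketch below; the header format and the AGENT_AUTH_TOKEN variable name are assumptions, while the port and endpoint follow the existing agent-based installer API:

$ curl -s -H "Authorization: ${AGENT_AUTH_TOKEN}" \
    "http://<node-zero-ip>:8090/api/assisted-install/v2/infra-envs"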

User Story:

As an ABI user responsible for day-2 operations, I want to be able to:

  • Verify the Status of the Authentication Token:
    • Quickly and easily check whether the authentication token used for booting up nodes with node.iso is currently valid or has expired. 
  • Receive Guidance on Expired Tokens:
    • If the authentication token has expired, receive clear and actionable instructions on the necessary steps to renew or replace the token. This includes understanding how to generate a new token by running the add-nodes command to create a new node ISO.
    • Display a status message on the boot-up screen where other status messages are shown. The message could be:
      • The auth token is expired. Re-run the add-nodes command to generate a new node ISO and boot it up to continue.
      • The auth token is valid up to AGENT_AUTH_TOKEN_EXPIRY

so that I can

  • effectively manage the authentication aspect of booting up nodes using node.iso, ensuring that all operations run smoothly and securely. This will provide a clear path for corrective actions in the event of authentication issues.

Additional Details:

A new systemd service will be introduced to check and display the status of the authentication token—whether it is valid or expired. This service will run immediately after the agent-interactive-console systemd service. If the authentication token is expired, cluster installation or adding new nodes will be halted until a new node ISO is generated.

Acceptance Criteria:

Description of criteria:

  • A new systemd service 
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Read the generated local JWT token and ECDSA public key from the asset store and pass them to the newly implemented auth type in assisted-service to perform authentication. Use separate auth headers for agent API requests, similar to the wait-for commands and internal systemd services.

User Story:

As a user, I want to be able to

  • create an ISO to add nodes to an existing cluster 
  • make authenticated API requests to add a new node

so that I can achieve

  • cluster expansion ( adding new nodes)
  •  

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Note: phase 2 target is tech preview.

Feature Overview

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations (see the sketch after this list) and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.
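
A minimal sketch of the kind of Containerfile a customer might supply (the base image pullspec and package name are placeholders, not the exact on-cluster build interface):

# Layer an extra package onto the RHCOS base image
FROM <rhel-coreos-base-image-pullspec>
RUN rpm-ostree install <custom-package> && \
    rpm-ostree cleanup -m && \
    ostree container commit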

Feature Overview

Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
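
A rough sketch of how the user-defined labels and tags surface on the Infrastructure object, assuming the GCP platformSpec fields introduced for this work (resourceLabels / resourceTags); the key, value, and parentID entries are placeholders:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: GCP
    gcp:
      resourceLabels:
      - key: team
        value: platform
      resourceTags:
      - parentID: "1234567890"   # GCP organization or project ID that owns the tag key
        key: environment
        value: production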

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing in-tree/out-of-tree split of the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").

Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.

 
Goals

  • Functionality on GCP GA
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2455 / CFE-719 work, where support for GCP tags & labels was delivered as TechPreview in 4.14; the goal is to make it GA in 4.15. It involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

Dependent on https://issues.redhat.com/browse/CFE-918. Once the driver is updated to support tags, the operator should be extended with the functionality to pass the user-defined tags found in Infrastructure as an arg to the driver.

https://issues.redhat.com/browse/CFE-918 is for enabling tag functionality in the driver. The driver will have a provision to receive, as process args, user-defined tags to be added to the resources it manages, and the operator should read the user-defined tags found in the Infrastructure object and pass them as CSV to the driver.

Acceptance Criteria

  • Code linting, validation and best practices adhered to
  • Unit tests should be added for the new changes.
  • Compute Disks, Images, Snapshots created by the driver should have user-defined tags attached to it.

The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.

The new featureGate added in openshift/api should also be removed.

Acceptance Criteria

  • Should be able to define userLabel and userTags without setting featureSet.

The installer validates the existence of tags and fails the installation if the tags defined are not present. But if the tags processed by the installer are removed later, an operator referencing these tags through Infrastructure would fail.

Enhance the checks to distinguish between non-existent tags and insufficient-permission errors, as GCP doesn't differentiate between them.

Epic Goal*

GCP Filestore instances are not automatically deleted when the cluster is destroyed.

 
Why is this important? (mandatory)

The need to manually delete GCP Filestore instances is documented. This is, however, inconsistent with other storage resources (GCP PD), which get removed automatically, and it may also lead to resource leaks.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. User installs a GCP cluster
  2. User creates a new Filestore instance as per the documentation: https://docs.openshift.com/container-platform/4.13/storage/container_storage_interface/persistent-storage-csi-google-cloud-file.html
  3. User does not delete the Filestore instance but destroys the cluster
  4. The Filestore is not removed and may impose additional cloud costs

 
Dependencies (internal and external) (mandatory)

This requires changes in the GCP Filestore Operator, GCP Filestore Driver and the OpenShift Installer

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

As a GCP Filestore user, I would like all the resources belonging to the cluster to be automatically deleted upon cluster destruction. This currently only works for GCP PD volumes but has to be done manually for GCP Filestore ones:

https://docs.openshift.com/container-platform/4.13/storage/container_storage_interface/persistent-storage-csi-google-cloud-file.html#persistent-storage-csi-gcp-cloud-file-delete-instances_persistent-storage-csi-google-cloud-file

Exit criteria:

  • All the cluster-provisioned GCP Filestore volumes are labelled as belonging to a cluster
  • All the labelled GCP Filestore volumes are removed from the cloud when the cluster gets destroyed (a cleanup sketch follows this list).
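A hedged sketch of the destroy-time cleanup with gcloud, assuming the instances carry a kubernetes-io-cluster-<infra-id> label (the exact label key used by the driver is an assumption here):

# List Filestore instances labelled as owned by the cluster (label key is an assumption).
gcloud filestore instances list \
  --filter="labels.kubernetes-io-cluster-<infra-id>=owned" \
  --format="value(name)"

# Delete each labelled instance; this is what `openshift-install destroy cluster` would automate.
# Use --zone for zonal instances or --location for regional ones, as appropriate.
gcloud filestore instances delete <instance-name> --zone=<zone> --quiet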

DoD

We need to ensure we have parity with OCP and support heterogeneous clusters.

https://github.com/openshift/enhancements/pull/1014

Goal

Why is this important?

  • Necessary to enable workloads with different architectures in the same Hosted Clusters.
  • Cost savings brought by more cost-effective ARM instances

Scenarios

  1. I have an x86 hosted cluster and I want to have at least one NodePool running ARM workloads (see the CLI sketch after this list)
  2. I have an ARM hosted cluster and I want to have at least one NodePool running x86 workloads
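A hedged CLI sketch for scenario 1, assuming the hypershift CLI's --arch flag (flag names can vary between versions; values are placeholders):

# Add an ARM NodePool to an existing x86 hosted cluster.
hypershift create nodepool aws \
  --cluster-name my-hosted-cluster \
  --name arm-pool \
  --node-count 2 \
  --instance-type m6g.large \
  --arch arm64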

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides

Dependencies (internal and external)

  1. The management cluster must use a multi architecture payload image.
  2. The target architecture is in the OCP payload
  3. MCE has builds for the architecture used by the worker nodes of the management cluster

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

Using a multi-arch NodePool requires the HostedCluster release image to be multi-arch as well. This makes it easy for users to shoot themselves in the foot. We need to automate the required input via the CLI for multi-arch NodePools to work, e.g. on HostedCluster creation, enable a multi-arch flag which sets the right release image (a sketch follows).
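A hedged sketch of the intended UX, assuming a --multi-arch flag on cluster creation that validates and selects a multi-arch release image (values are placeholders):

# Sketch only: create a hosted cluster that is safe to use with NodePools of either architecture.
hypershift create cluster aws \
  --name my-hosted-cluster \
  --pull-secret ./pull-secret.json \
  --aws-creds ./aws-creds \
  --region us-east-1 \
  --release-image quay.io/openshift-release-dev/ocp-release:<version>-multi \
  --multi-arch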

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a HyperShift/HCP CLI user, I want:

  • the multi-arch flag to be enabled by default

so that 

  • the multi-arch validation is triggered by default for customers

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • multi-arch flag enabled by default for the HyperShift and HCP CLIs

Out of Scope:

N/A

Engineering Details:

  • This shouldn't affect CI testing since there is a related e2e flag setting the multi-arch flag to false - Slack thread.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

  • Add a heterogeneous NodePool e2e to the AWS test suite

Why is this important?

  • Ensure we don't regress on this feature

Scenarios

  1. A HostedCluster with both an x86 NodePool and an ARM NodePool

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

As a result of HashiCorp's license change to BSL, Red Hat OpenShift needs to remove the use of HashiCorp's Terraform from the installer, specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The Azure IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing Azure Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provision Azure infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing Azure Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As an administrator, I want to be able to:

  • Create cluster without public endpoints
  • Create cluster that isn't reachable publicly

Acceptance Criteria:

Description of criteria:

  • No public endpoints
  • No public resources
  • Single private load balancer
  • Only reachable internally or through VPN or some other layer 2 or 3 protocol

Engineering Details:
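A minimal install-config sketch for such a private cluster, using existing documented fields (values are placeholders):

apiVersion: v1
metadata:
  name: private-cluster
publish: Internal
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    networkResourceGroupName: example-rg
    virtualNetwork: example-vnet
    controlPlaneSubnet: example-master-subnet
    computeSubnet: example-worker-subnet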

Description of problem:

Launched a CAPI-based installation on the Azure platform; the default HyperVGeneration on each master node is V1, while the expected value should be V2 if the instance type supports HyperVGeneration V2.
$ az vm get-instance-view --name jimadisk01-xphq8-master-0 -g jimadisk01-xphq8-rg --query 'instanceView.hyperVGeneration' -otsv
V1

Also, when setting the instance type to Standard_DC4ds_v3, which only supports HyperVGeneration V2, in the following install-config:
========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_DC4ds_v3


and continuing to create the cluster, the installer failed and timed out while waiting for machine provisioning.

INFO Waiting up to 15m0s (until 6:46AM UTC) for machines [jimadisk-nmkzj-bootstrap jimadisk-nmkzj-master-0 jimadisk-nmkzj-master-1 jimadisk-nmkzj-master-2] to provision... 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded 
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
WARNING process cluster-api-provider-azure exited with error: signal: killed 
INFO Stopped controller: azure infrastructure provider 
INFO Stopped controller: azureaso infrastructure provider 

In openshift-install.log, got below error:
time="2024-06-25T06:42:57Z" level=debug msg="I0625 06:42:57.090269 1377336 recorder.go:104] \"failed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jimadisk-nmkzj-rg/jimadisk-nmkzj-master-2 (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimadisk-nmkzj-rg/providers/Microsoft.Compute/virtualMachines/jimadisk-nmkzj-master-2\\n--------------------------------------------------------------------------------\\nRESPONSE 400: 400 Bad Request\\nERROR CODE: BadRequest\\n--------------------------------------------------------------------------------\\n{\\n  \\\"error\\\": {\\n    \\\"code\\\": \\\"BadRequest\\\",\\n    \\\"message\\\": \\\"The selected VM size 'Standard_DC4ds_v3' cannot boot Hypervisor Generation '1'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '1' VM Size. For more information, see https://aka.ms/azuregen2vm\\\"\\n  }\\n}\\n--------------------------------------------------------------------------------\\n\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureMachine\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jimadisk-nmkzj-master-2\",\"uid\":\"c2cdabed-e19a-4e88-96d9-3f3026910403\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1600\"} reason=\"ReconcileError\""
time="2024-06-25T06:42:57Z" level=debug msg="E0625 06:42:57.090701 1377336 controller.go:329] \"Reconciler error\" err=<"
time="2024-06-25T06:42:57Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jimadisk-nmkzj-rg/jimadisk-nmkzj-master-2 (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimadisk-nmkzj-rg/providers/Microsoft.Compute/virtualMachines/jimadisk-nmkzj-master-2"
time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-06-25T06:42:57Z" level=debug msg="\tRESPONSE 400: 400 Bad Request"
time="2024-06-25T06:42:57Z" level=debug msg="\tERROR CODE: BadRequest"
time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-06-25T06:42:57Z" level=debug msg="\t{"
time="2024-06-25T06:42:57Z" level=debug msg="\t  \"error\": {"
time="2024-06-25T06:42:57Z" level=debug msg="\t    \"code\": \"BadRequest\","
time="2024-06-25T06:42:57Z" level=debug msg="\t    \"message\": \"The selected VM size 'Standard_DC4ds_v3' cannot boot Hypervisor Generation '1'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '1' VM Size. For more information, see https://aka.ms/azuregen2vm\""
time="2024-06-25T06:42:57Z" level=debug msg="\t  }"
time="2024-06-25T06:42:57Z" level=debug msg="\t}"
time="2024-06-25T06:42:57Z" level=debug msg="\t--------------------------------------------------------------------------------"

 

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410    

How reproducible:

 Always   

Steps to Reproduce:

    1. Set the instance type to Standard_DC4ds_v3, which only supports HyperVGeneration V2, or leave the instance type unset in install-config
    2. Launch the installation
    3.
    

Actual results:

 1. Without an instance type setting, the default HyperVGeneration on each master instance is V1
 2. Creating master instances with instance type Standard_DC4ds_v3 fails

Expected results:

1. Without an instance type setting, the default HyperVGeneration on each master instance should be V2.
2. Cluster creation with instance type Standard_DC4ds_v3 should succeed.

Additional info:

    

Remove the vendored terraform-provider-azure and all the Terraform code for Azure installs.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

Enable diskEncryptionSet under defaultMachinePlatform in install-config:
=============
platform:
  azure:
    defaultMachinePlatform:
      encryptionAtHost: true
      osDisk:
        diskEncryptionSet:
          resourceGroup: jimades01-rg
          name: jimades01-des
          subscriptionId: 53b8f551-f0fc-4bea-8cba-6d1fefd54c8a

Created the cluster and checked diskEncryptionSet on each master instance's osDisk; all of them are empty.

$ az vm list -g jimades01-8ktkn-rg --query '[].[name, storageProfile.osDisk.managedDisk.diskEncryptionSet]' -otable
Column1                               Column2
------------------------------------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
jimades01-8ktkn-master-0
jimades01-8ktkn-master-1
jimades01-8ktkn-master-2
jimades01-8ktkn-worker-eastus1-9m8p5  {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'}
jimades01-8ktkn-worker-eastus2-cmcn7  {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'}
jimades01-8ktkn-worker-eastus3-nknss  {'id': '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jimades01-rg/providers/Microsoft.Compute/diskEncryptionSets/jimades01-des', 'resourceGroup': 'jimades01-rg'}

The same happens when setting diskEncryptionSet under controlPlane in install-config: there is no DES setting in the cluster API manifests 10_inframachine_jima24c-2cmlf_*.yaml.

$ yq-go r 10_inframachine_jima24c-2cmlf-bootstrap.yaml 'spec.osDisk'
cachingType: ReadWrite
diskSizeGB: 1024
managedDisk:
  storageAccountType: Premium_LRS
osType: Linux

$ yq-go r 10_inframachine_jima24c-2cmlf-master-0.yaml 'spec.osDisk'
cachingType: ReadWrite
diskSizeGB: 1024
managedDisk:
  storageAccountType: Premium_LRS
osType: Linux

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410    

How reproducible:

Always

Steps to Reproduce:

    1. Configure disk encryption set under controlPlane or defaultMachinePlatform in install-config
    2. Create cluster
    3.
    

Actual results:

    DES does not take effect on master instances

Expected results:

    DES should be configured on all master instances

Additional info:

    

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

We see that the MachinePool feature gate has become default=true in a recent version of CAPZ. See https://github.com/openshift/installer/pull/8627#issuecomment-2178061050 for more context.

 

We should probably disable this feature gate. Here's an example of disabling a feature gate using a flag for the AWS controller:

https://github.com/openshift/installer/blob/master/pkg/clusterapi/system.go#L153
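For the Azure provider the equivalent is an extra argument on the locally launched cluster-api-provider-azure process; the process listings later in this document show it in effect:

# Excerpt-style sketch: the azure infrastructure provider launched with the MachinePool gate disabled.
<install-dir>/cluster-api/cluster-api-provider-azure -v=2 \
  --health-addr=127.0.0.1:44743 \
  --webhook-port=35373 \
  --feature-gates=MachinePool=false \
  --kubeconfig=<install-dir>/.clusterapi_output/envtest.kubeconfig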

 

Description of problem:

Created an Azure IPI cluster by using CAPI, interrupted the installer at the stage of waiting for bootstrapping to complete, then ran the command "openshift-install gather bootstrap --dir <install_dir>" to gather the bootstrap logs.

$ ./openshift-install gather bootstrap --dir ipi --log-level debug
DEBUG OpenShift Installer 4.17.0-0.test-2024-07-25-014817-ci-ln-rcc2djt-latest 
DEBUG Built from commit 91618bc6507416492d685c11540efb9ae9a0ec2e 
...
DEBUG Looking for machine manifests in ipi/.clusterapi_output 
DEBUG bootstrap manifests found: [ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml] 
DEBUG found bootstrap address: 10.0.0.7            
DEBUG master machine manifests found: [ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-0.yaml ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-1.yaml ipi/.clusterapi_output/Machine-openshift-cluster-api-guests-jima25-m-4sq6j-master-2.yaml] 
DEBUG found master address: 10.0.0.4               
DEBUG found master address: 10.0.0.5               
DEBUG found master address: 10.0.0.6               
...
DEBUG Added /home/fedora/.ssh/openshift-qe.pem to installer's internal agent 
DEBUG Added /home/fedora/.ssh/id_rsa to installer's internal agent 
DEBUG Added /home/fedora/.ssh/openshift-dev.pem to installer's internal agent 
DEBUG Added /tmp/bootstrap-ssh2769549403 to installer's internal agent 
INFO Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.0.0.7:22: connect: connection timed out 
...

Checking Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml under the CAPI artifact folder, only the private IP is there.
$ yq-go r Machine-openshift-cluster-api-guests-jima25-m-4sq6j-bootstrap.yaml status.addresses
- type: InternalDNS
  address: jima25-m-4sq6j-bootstrap
- type: InternalIP
  address: 10.0.0.7

From https://github.com/openshift/installer/pull/8669/, the installer creates an inbound NAT rule that forwards port 22 on the public load balancer to the bootstrap host instead of creating a public IP directly for the bootstrap node. I was able to SSH into the bootstrap server by using the frontend IP of the public load balancer. But since no public IP is saved in the bootstrap machine CAPI artifact, the installer fails to connect to the bootstrap machine using the private IP.

Version-Release number of selected component (if applicable):

 4.17 nightly build   

How reproducible:

 Always

Steps to Reproduce:

    1. Create Azure IPI cluster by using CAPI
    2. Interrupt installer when waiting for bootstrap complete
    3. gather bootstrap logs
    

Actual results:

    Only the serial console logs and local CAPI artifacts are collected; logs from the bootstrap and control plane nodes fail to be collected because the SSH connection to the bootstrap node times out.

Expected results:

    Bootstrap logs are gathered successfully.

Additional info:

    

Attach identities to VMs so that service principals are not placed on the VMs. CAPZ is issuing warnings about this when creating VMs.

Description of problem:

When creating a cluster with a service principal certificate (known issue OCPBUGS-36360), the installer exited with an error.

# ./openshift-install create cluster --dir ipi6 
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
INFO Consuming Install Config from target directory 
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig 
WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azureaso infrastructure provider with args [-v=0 -metrics-addr=0 -health-addr=127.0.0.1:45179 -webhook-port=37401 -webhook-cert-dir=/tmp/envtest-serving-certs-1364466879 -crd-pattern= -crd-management=none] 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready) 
INFO Shutting down local Cluster API control plane... 
INFO Local Cluster API system has completed operations 

According to the output, the local Cluster API system is shut down. But when checking the processes, only the parent installer process has exited; the CAPI-related processes are still running.

When local control plane is running:
# ps -ef|grep cluster | grep -v grep
root       13355    6900 39 08:07 pts/1    00:00:13 ./openshift-install create cluster --dir ipi6
root       13365   13355  2 08:08 pts/1    00:00:00 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
root       13373   13355 55 08:08 pts/1    00:00:10 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       13385   13355  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
root       13394   13355  6 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig

After installer exited:
# ps -ef|grep cluster | grep -v grep
root       13365       1  1 08:08 pts/1    00:00:01 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
root       13373       1 45 08:08 pts/1    00:00:35 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       13385       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
root       13394       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig


In another scenario, the CAPI-based installer was run on a small disk; the installer got stuck and did not exit until interrupted with <Ctrl> + C. All CAPI-related processes were still running afterwards; only the installer process was killed.

[root@jima09id-vm-1 jima]# ./openshift-install create cluster --dir ipi4
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
INFO Consuming Install Config from target directory 
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
FATAL failed to extract "ipi4/cluster-api/cluster-api-provider-azureaso": write ipi4/cluster-api/cluster-api-provider-azureaso: no space left on device 
^CWARNING Received interrupt signal                    
^C[root@jima09id-vm-1 jima]#
[root@jima09id-vm-1 jima]# ps -ef|grep cluster | grep -v grep
root       12752       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:38889 --data-dir=ipi4/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:38889 --listen-peer-urls=http://127.0.0.1:38859 --unsafe-no-fsync=true
root       12760       1  4 07:38 pts/1    00:00:09 ipi4/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_3790461974 --client-ca-file=/tmp/k8s_test_framework_3790461974/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:38889 --secure-port=44429 --service-account-issuer=https://127.0.0.1:44429/ --service-account-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       12769       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
root       12781       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
root       12851    6900  1 07:41 pts/1    00:00:00 ./openshift-install destroy cluster --dir ipi4
 

Version-Release number of selected component (if applicable):

   4.17 nightly build 

How reproducible:

    Always

Steps to Reproduce:

    1. Run the CAPI-based installer
    2. The installer fails to start some CAPI process and exits
    3.
    

Actual results:

    The installer process exited, but the CAPI-related processes are still running

Expected results:

    Both the installer and all CAPI-related processes exit.

Additional info:

 

 

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Currently, CAPZ only allows a single API load balancer to be specified. OpenShift requires both a public and private load balancer. It is desirable to allow multiple load balancers to be specified in the API load balancer field.

We need to modify the Azure CAPI NetworkSpec to add support for an array of load balancers. For each LB in the array, the existing behavior of a load balancer needs to be implemented (adding VMs into the backend pool).

Description of problem:

Created a VM instance on Azure, assigned a managed identity to it, then created a cluster from this VM; the installer got the error below:

# ./openshift-install create cluster --dir ipi --log-level debug
...
time="2024-07-01T00:52:43Z" level=info msg="Waiting up to 15m0s (until 1:07AM UTC) for network infrastructure to become ready..."
...
time="2024-07-01T00:52:58Z" level=debug msg="I0701 00:52:58.528931    7149 recorder.go:104] \"failed to create scope: failed to configure azure settings and credentials for Identity: failed to create credential: secret can't be empty string\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jima0701-hxtzd\",\"uid\":\"63aa5b17-9063-4b33-a471-1f58c146da8a\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1083\"} reason=\"CreateClusterScopeFailed\""

Version-Release number of selected component (if applicable):

  4.17 nightly build  

How reproducible:

  Always

Steps to Reproduce:

    1. Created VM and assigned managed identity to it 
    2. Create cluster in this VM
    3.
    

Actual results:

    Cluster creation fails

Expected results:

    cluster is installed successfully

Additional info:

    

 

Description of problem:

Specify controlPlane.architecture as arm64 in install-config
===
controlPlane:
  architecture: arm64
  name: master
  platform:
    azure:
      type: null
compute:
- architecture: arm64
  name: worker
  replicas: 3
  platform:
    azure:
      type: Standard_D4ps_v5

Launching the installer to create the cluster, the installer exits with the error below:

time="2024-07-26T06:11:00Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource ci-op-wtm3h6km-72f4b-fdwtz-rg/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/virtualMachines/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap"
time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-26T06:11:00Z" level=debug msg="\tRESPONSE 400: 400 Bad Request"
time="2024-07-26T06:11:00Z" level=debug msg="\tERROR CODE: BadRequest"
time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-26T06:11:00Z" level=debug msg="\t{"
time="2024-07-26T06:11:00Z" level=debug msg="\t  \"error\": {"
time="2024-07-26T06:11:00Z" level=debug msg="\t    \"code\": \"BadRequest\","
time="2024-07-26T06:11:00Z" level=debug msg="\t    \"message\": \"Cannot create a VM of size 'Standard_D8ps_v5' because this VM size only supports a CPU Architecture of 'Arm64', but an image or disk with CPU Architecture 'x64' was given. Please check that the CPU Architecture of the image or disk is compatible with that of the VM size.\""
time="2024-07-26T06:11:00Z" level=debug msg="\t  }"
time="2024-07-26T06:11:00Z" level=debug msg="\t}"
time="2024-07-26T06:11:00Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-26T06:11:00Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/ci-op-wtm3h6km-72f4b-fdwtz-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-wtm3h6km-72f4b-fdwtz-bootstrap\" reconcileID=\"60b1d513-07e4-4b34-ac90-d2a33ce156e1\""

Checking the gallery image definitions (Gen1 & Gen2), the architecture is still x64.
$ az sig image-definition show --gallery-image-definition ci-op-wtm3h6km-72f4b-fdwtz -g ci-op-wtm3h6km-72f4b-fdwtz-rg --gallery-name gallery_ci_op_wtm3h6km_72f4b_fdwtz
{
  "architecture": "x64",
  "hyperVGeneration": "V1",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/galleries/gallery_ci_op_wtm3h6km_72f4b_fdwtz/images/ci-op-wtm3h6km-72f4b-fdwtz",
  "identifier": {
    "offer": "rhcos",
    "publisher": "RedHat",
    "sku": "basic"
  },
  "location": "southcentralus",
  "name": "ci-op-wtm3h6km-72f4b-fdwtz",
  "osState": "Generalized",
  "osType": "Linux",
  "provisioningState": "Succeeded",
  "resourceGroup": "ci-op-wtm3h6km-72f4b-fdwtz-rg",
  "tags": {
    "kubernetes.io_cluster.ci-op-wtm3h6km-72f4b-fdwtz": "owned"
  },
  "type": "Microsoft.Compute/galleries/images"
}

$ az sig image-definition show --gallery-image-definition ci-op-wtm3h6km-72f4b-fdwtz-gen2 -g ci-op-wtm3h6km-72f4b-fdwtz-rg --gallery-name gallery_ci_op_wtm3h6km_72f4b_fdwtz 
{
  "architecture": "x64",
  "hyperVGeneration": "V2",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-wtm3h6km-72f4b-fdwtz-rg/providers/Microsoft.Compute/galleries/gallery_ci_op_wtm3h6km_72f4b_fdwtz/images/ci-op-wtm3h6km-72f4b-fdwtz-gen2",
  "identifier": {
    "offer": "rhcos-gen2",
    "publisher": "RedHat-gen2",
    "sku": "gen2"
  },
  "location": "southcentralus",
  "name": "ci-op-wtm3h6km-72f4b-fdwtz-gen2",
  "osState": "Generalized",
  "osType": "Linux",
  "provisioningState": "Succeeded",
  "resourceGroup": "ci-op-wtm3h6km-72f4b-fdwtz-rg",
  "tags": {
    "kubernetes.io_cluster.ci-op-wtm3h6km-72f4b-fdwtz": "owned"
  },
  "type": "Microsoft.Compute/galleries/images"
}   

Version-Release number of selected component (if applicable):

4.17 nightly build    

How reproducible:

Always

Steps to Reproduce:

    1. Configure controlPlane.architecture as arm64
    2. Create the cluster by using a multi-arch nightly build
    

Actual results:

    Installation fails because the bootstrap/master machines cannot be created

Expected results:

    Installation succeeds.

Additional info:

    

Outbound Type defines how egress is provided for the cluster. Currently 3 options are supported: Load Balancer (default), User Defined Routing, and NAT Gateway (Tech Preview).

As part of the move away from Terraform, the `UserDefinedRouting` outboundType needs to be supported.
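For reference, the outbound type is chosen in install-config; a minimal sketch for user-defined routing, which requires a private publish strategy and pre-existing egress (values are placeholders):

publish: Internal
platform:
  azure:
    region: eastus
    networkResourceGroupName: example-rg
    virtualNetwork: example-vnet
    controlPlaneSubnet: example-master-subnet
    computeSubnet: example-worker-subnet
    outboundType: UserDefinedRouting   # other values: Loadbalancer (default), NatGateway (Tech Preview)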

Description of problem:

Failed to create a second cluster in a shared vnet. The error below is thrown while creating the network infrastructure for the 2nd cluster; the installer timed out and exited.
==============
07-23 14:09:27.315  level=info msg=Waiting up to 15m0s (until 6:24AM UTC) for network infrastructure to become ready...
...
07-23 14:16:14.900  level=debug msg=	failed to reconcile cluster services: failed to reconcile AzureCluster service loadbalancers: failed to create or update resource jima0723b-1-x6vpp-rg/jima0723b-1-x6vpp-internal (service: loadbalancers): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal
07-23 14:16:14.900  level=debug msg=	--------------------------------------------------------------------------------
07-23 14:16:14.901  level=debug msg=	RESPONSE 400: 400 Bad Request
07-23 14:16:14.901  level=debug msg=	ERROR CODE: PrivateIPAddressIsAllocated
07-23 14:16:14.901  level=debug msg=	--------------------------------------------------------------------------------
07-23 14:16:14.901  level=debug msg=	{
07-23 14:16:14.901  level=debug msg=	  "error": {
07-23 14:16:14.901  level=debug msg=	    "code": "PrivateIPAddressIsAllocated",
07-23 14:16:14.901  level=debug msg=	    "message": "IP configuration /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal/frontendIPConfigurations/jima0723b-1-x6vpp-internal-frontEnd is using the private IP address 10.0.0.100 which is already allocated to resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd.",
07-23 14:16:14.902  level=debug msg=	    "details": []
07-23 14:16:14.902  level=debug msg=	  }
07-23 14:16:14.902  level=debug msg=	}
07-23 14:16:14.902  level=debug msg=	--------------------------------------------------------------------------------

Install-config for 1st cluster:
=========
metadata:
  name: jima0723b
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    networkResourceGroupName: jima0723b-rg
    virtualNetwork: jima0723b-vnet
    controlPlaneSubnet: jima0723b-master-subnet
    computeSubnet: jima0723b-worker-subnet
publish: External

Install-config for 2nd cluster:
========
metadata:
  name: jima0723b-1
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    networkResourceGroupName: jima0723b-rg
    virtualNetwork: jima0723b-vnet
    controlPlaneSubnet: jima0723b-master-subnet
    computeSubnet: jima0723b-worker-subnet
publish: External

shared master subnet/worker subnet:
$ az network vnet subnet list -g jima0723b-rg --vnet-name jima0723b-vnet -otable
AddressPrefix    Name                     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -----------------------  --------------------------------  -----------------------------------  -------------------  ---------------
10.0.0.0/24      jima0723b-master-subnet  Disabled                          Enabled                              Succeeded            jima0723b-rg
10.0.1.0/24      jima0723b-worker-subnet  Disabled                          Enabled                              Succeeded            jima0723b-rg

internal lb frontedIPConfiguration on 1st cluster:
$ az network lb show -n jima0723b-49hnw-internal -g jima0723b-49hnw-rg --query 'frontendIPConfigurations'
[
  {
    "etag": "W/\"7a7531ca-fb02-48d0-b9a6-d3fb49e1a416\"",
    "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd",
    "inboundNatRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-0",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-1",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-2",
        "resourceGroup": "jima0723b-49hnw-rg"
      }
    ],
    "loadBalancingRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/LBRuleHTTPS",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/sint-v4",
        "resourceGroup": "jima0723b-49hnw-rg"
      }
    ],
    "name": "jima0723b-49hnw-internal-frontEnd",
    "privateIPAddress": "10.0.0.100",
    "privateIPAddressVersion": "IPv4",
    "privateIPAllocationMethod": "Static",
    "provisioningState": "Succeeded",
    "resourceGroup": "jima0723b-49hnw-rg",
    "subnet": {
      "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-rg/providers/Microsoft.Network/virtualNetworks/jima0723b-vnet/subnets/jima0723b-master-subnet",
      "resourceGroup": "jima0723b-rg"
    },
    "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations"
  }
]

From the above output, privateIPAllocationMethod is Static and the privateIPAddress is always allocated as 10.0.0.100, which is likely what causes the 2nd cluster installation failure.

Checking the same on a cluster created by using Terraform, privateIPAllocationMethod is Dynamic.
===============
$ az network lb show -n wxjaz723-pm99k-internal -g wxjaz723-pm99k-rg --query 'frontendIPConfigurations'
[
  {
    "etag": "W/\"e6bec037-843a-47ba-a725-3f322564be58\"",
    "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/frontendIPConfigurations/internal-lb-ip-v4",
    "loadBalancingRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/api-internal-v4",
        "resourceGroup": "wxjaz723-pm99k-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/sint-v4",
        "resourceGroup": "wxjaz723-pm99k-rg"
      }
    ],
    "name": "internal-lb-ip-v4",
    "privateIPAddress": "10.0.0.4",
    "privateIPAddressVersion": "IPv4",
    "privateIPAllocationMethod": "Dynamic",
    "provisioningState": "Succeeded",
    "resourceGroup": "wxjaz723-pm99k-rg",
    "subnet": {
      "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-rg/providers/Microsoft.Network/virtualNetworks/wxjaz723-vnet/subnets/wxjaz723-master-subnet",
      "resourceGroup": "wxjaz723-rg"
    },
    "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations"
  },
...
]

Version-Release number of selected component (if applicable):

  4.17 nightly build

How reproducible:

  Always

Steps to Reproduce:

    1. Create shared vnet / master subnet / worker subnet
    2. Create 1st cluster in shared vnet
    3. Create 2nd cluster in shared vnet
    

Actual results:

    2nd cluster installation failed

Expected results:

    Both clusters are installed successfully.

Additional info:

    

 

Description of problem:

Regardless of the vmNetworkingType setting under controlPlane in install-config, "Accelerated networking" on the master instances is always disabled.

In install-config.yaml, set controlPlane.platform.azure.vmNetworkingType to 'Accelerated', or leave the setting out of controlPlane entirely:
=======================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      vmNetworkingType: 'Accelerated'
 
Create the cluster and check "Accelerated networking" on the master instances; all are disabled.
$ az network nic show --name jima24c-tp7lp-master-0-nic -g jima24c-tp7lp-rg --query 'enableAcceleratedNetworking'
false

After creating manifests, the CAPI machine manifests show acceleratedNetworking set to false.
$ yq-go r 10_inframachine_jima24c-qglff-master-0.yaml 'spec.networkInterfaces'
- acceleratedNetworking: false

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410

How reproducible:

Always

Steps to Reproduce:

1. Set vmNetworkingType to 'Accelerated', or omit the vmNetworkingType setting, under controlPlane in install-config
2. Create cluster
3.

Actual results:

AcceleratedNetworking on all master instances is always disabled.

Expected results:

1. Without a vmNetworkingType setting in install-config, AcceleratedNetworking on all master instances should be enabled by default, matching the behavior of the Terraform-based installation.
2. AcceleratedNetworking on all master instances should be consistent with the setting in install-config.

Additional info:

 

CAPZ expects a VM extension to report back that CAPI bootstrapping is successful, but RHCOS does not support extensions (because it, by design, does not support the Azure Linux agent).

We need https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/4792 to be able to disable the default extension in capz.

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage GCP Workload Identity Federation-based authorization when using GCP APIs as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around GCP WIF with layered products, for increased security posture. Enable OLM-managed operators to implement support for this in well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile GCP credential requests for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their GCP IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage GCP WIF
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which GCP IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens for other cloud providers like AWS. This capability is now also being implemented for GCP as part of CCO-1898 and CCO-285. The support should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with GCP APIs. The process today ranges from cumbersome to non-existent, depending on the operator in question, and is seen as an adoption blocker for OpenShift on GCP.

 

Customer Considerations

This is particularly important for OSD on GCP customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in GCP WIF mode
    • how to become aware of operators that support GCP WIF and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with OSD on GCP
  • this needs to work with self-managed OCP on GCP

CCO needs to support the CredentialsRequest API with GCP Workload Identity (just like we did for AWS STS and Azure Entra Workload ID) to enable OCPSTRAT-922 (CloudCredentialOperator-based workflows for OLM-managed operators and GCP WIF).
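For context, a minimal CredentialsRequest sketch with a GCPProviderSpec, as an OLM-managed operator would ship it today; the WIF-specific wiring this work adds (e.g. federated service account details) is not shown because its exact fields are still being defined:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: my-operator-gcp-credentials
  namespace: openshift-cloud-credential-operator
spec:
  serviceAccountNames:
  - my-operator-controller-manager
  secretRef:
    name: gcp-credentials
    namespace: my-operator-namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    predefinedRoles:
    - roles/storage.admin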

Feature Overview (aka. Goal Summary)

When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like azure AD), the console must work for human users of the external OIDC issuer.

Goals (aka. expected user outcomes)

An end user can use the OpenShift console without a notable difference in experience. This must eventually work on both HyperShift and standalone, but HyperShift is the first priority if it impacts delivery.

Requirements (aka. Acceptance Criteria):

  1. User can log in and use the console
  2. User can get a kubeconfig that functions on the CLI with matching oc
  3. Both of those work on hypershift
  4. both of those work on standalone.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • When installed with external OIDC, the clientID and clientSecret need to be configurable to match the external (and unmanaged) OIDC server

Why is this important?

  • Without a configurable clientID and clientSecret, the console likely cannot identify the user.
  • There must be a mechanism to do this on both HyperShift and standalone OpenShift, though the API may be very similar (a hedged configuration sketch follows this list).
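A hedged sketch of what this could look like on standalone OpenShift via the Authentication resource, following the external OIDC API as currently proposed (field names may differ by release; values are placeholders):

apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  name: cluster
spec:
  type: OIDC
  oidcProviders:
  - name: entra-id
    issuer:
      issuerURL: https://login.microsoftonline.com/<tenant-id>/v2.0
      audiences:
      - <console-client-id>
    oidcClients:
    - componentNamespace: openshift-console
      componentName: console
      clientID: <console-client-id>
      clientSecret:
        name: console-oidc-client-secret   # Secret holding the client secret issued by the external IdP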

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

    When a cluster is configured for direct OIDC configuration (authentication.config/cluster .spec.type=OIDC), console pods will be in crashloop until an OIDC client is configured for the console.

Version-Release number of selected component (if applicable):

    4.15.0

How reproducible:

100% in Hypershift; 100% in TechPreviewNoUpgrade featureset on standalone OpenShift   

Steps to Reproduce:

    1. Update authentication.config/cluster so that Type=OIDC
    

Actual results:

    The console operator tries to create a new console rollout, but the pods crashloop. This is because the operator sets the console pods to "disabled". This would normally amount to a privilege escalation; fortunately, the configuration prevents a successful deployment.

Expected results:

    Console pods are healthy, they show a page which says that no authentication is currently configured.

Additional info:

    

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.29, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest kubernetes/OpenShift libraries and the kubelet, kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platform with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.16/master branch.

Dependencies (internal and external)

  1. ART team creating the Go toolchain image required for the Go version upgrade.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic? Yes; see the updated checklist below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

This feature is now re-opened because we want to run z-stream rollback CI. This feature doesn't block the release of 4.17. This is not going to be exposed as a customer-facing feature and will not be documented within the OpenShift documentation. It is strictly going to be covered as a Red Hat Support guided solution, with a KCS article providing guidance. A public-facing KCS will basically point to contacting Support for help on z-stream rollback; y-stream rollback is not supported.

NOTE:
Previously this was closed as "won't do" because we didn't have a plan to support y-stream and z-stream rollbacks in standalone OpenShift.
For single-node OpenShift, please check TELCOSTRAT-160. The "won't do" decision was made after further discussion with leadership.
The e2e tests are tracked in https://docs.google.com/spreadsheets/d/1mr633YgQItJ0XhbiFkeSRhdLlk6m9vzk1YSKQPHgSvw/edit?gid=0#gid=0. We have identified a few bugs that need to be resolved before the General Availability (GA) release. Ideally, these should be addressed in the final month before GA, when all features are development complete. However, asking component teams to commit to fixing critical rollback bugs during this time could potentially delay the GA date.

------

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Red Hat Support assisted z-stream rollback from 4.16+

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Red Hat Support may, at their discretion, assist customers with z-stream rollback once it’s determined to be the best option for restoring a cluster to the desired state whenever a z-stream rollback compromises cluster functionality.

Engineering will take a “no regressions, no promises” approach, ensuring there are no major regressions between z-streams, but not testing specific combinations or addressing case-specific bugs.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Public Documentation (or KCS?) that explains why we do not advise unassisted z-stream rollback and what to do when a cluster experiences loss of functionality associated with a z-stream upgrade.
  • Internal KCS article that provides a comprehensive plan for troubleshooting and resolving issues introduced after applying a z-stream update, up to and including complete z-stream rollback.
  • Should include alternatives such as limited component rollback (single operator, RHCOS, etc) and workaround options
  • Should include incident response and escalation procedures for all issues incurred during application of a z-stream update so that even if rollback is performed we’re tracking resolution of defects with highest priority
  • Foolproof command to initiate z-stream rollback with Support’s approval, aka a hidden command that ensures we don’t typo the pull spec or initiate A->B->C version changes, only A->B->A
  • Test plan and jobs to ensure that we have high confidence in ability to rollback a z-stream along happy paths
  • Need not be tested on all platforms and configurations, likely metal or vSphere and one foolproof platform like AWS
  • Test should not monitor for disruption since it’s assumed disruption is tolerable during an emergency rollback provided we achieve availability at the end of the operation
  • Engineering agrees to fix bugs which inhibit rollback completion before the current master branch release ships, aka they’ll be filed as blockers for the current master branch release. This means bugs found after 4.N branches may not be fixed until the next release without discussion.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations: List applicable specific needs (N/A = not applicable)
  • Self-managed: all
  • Multi node, Compact (three node): all
  • Connected and Restricted Network: all
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
  • Release payload only: all
  • Starting with 4.16, including all future releases: all

While this feature applies to all deployments, we will only run a single-platform canary test on a high-success-rate platform, such as AWS. Any specific ecosystems which require more focused testing should bring their own testing.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an admin who has determined that a z-stream update has compromised cluster functionality I have clear documentation that explains that unassisted rollback is not supported and that I should consult with Red Hat Support on the best path forward.

As a support engineer I have a clear plan for responding to problems which occur during or after a z-stream upgrade, including the process for rolling back specific components, applying workarounds, or rolling the entire cluster back to the previously running z-stream version.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Should we allow rollbacks whenever an upgrade doesn’t complete? No, not without fully understanding the root cause. If it’s simply a situation where workers are in process of updating but stalled, that should never yield a rollback without credible evidence that rollback will fix that.

Similar to our "foolproof command" to initiate rollback to the previous z-stream, should we also craft a foolproof command to override select operators to previous z-stream versions? Part of the goal of the foolproof command is to avoid the potential for moving to an unintended version; the same risk may apply at the single-operator level, and though the impact would be smaller, it could still be catastrophic.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Non-HA clusters, Hosted Control Planes – those may be handled via separately scoped features

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

Occasionally clusters either upgrade successfully and encounter issues after the upgrade or may run into problems during the upgrade. Many customers assume that a rollback will fix their concerns but without understanding the root cause we cannot assume that’s the case. Therefore, we recommend anyone who has encountered a negative outcome associated with a z-stream upgrade contact support for guidance.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

It’s expected that customers should have adequate testing and rollout procedures to protect against most regressions, i.e. roll out a z-stream update in pre-production environments where it can be adequately tested prior to updating production environments.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

This is largely a documentation effort, i.e. we should create either a KCS article or new documentation section which describes how customers should respond to loss of functionality during or after an upgrade.
KCS Solution : https://access.redhat.com/solutions/7083335 

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Given we test as many upgrade configurations as possible and for whatever reason the upgrade still encounters problems, we should not strive to comprehensively test all configurations for rollback success. We will only test a limited set of platforms and configurations necessary to ensure that we believe the platform is generally able to roll back a z-stream update.

Epic Goal

  • Validate z-stream rollbacks in CI starting with 4.10 by ensuring that a rollback completes unassisted and e2e testsuite passes
  • Provide internal documentation (private KCS article) that explains when this is the best course of action versus working around a specific issue
  • Provide internal documentation (private KCS article) that explains the expected cluster degradation until the rollback is complete
  • Provide internal documentation (private KCS article) outlining the process and any post rollback validation

Why is this important?

  • Even if upgrade success is 100% there's some chance that we've introduced a change which is incompatible with a customer's needs and they desire to roll back to the previous z-stream
  • Previously we've relied on backup and restore here, however due to many problems with time travel, that's only appropriate for disaster recovery scenarios where the cluster is either completely shut down already or it's acceptable to do so while also accepting loss of any workload state change (PVs that were attached after the backup was taken, etc)
  • We believe that we can reasonably roll back to a previous z-stream

Scenarios

  1. Upgrade from 4.10.z to 4.10.z+n
  2. oc adm upgrade rollback-z-stream – initially a hidden command; it will look at the ClusterVersion history and roll back to the previous version if and only if that version is a z-stream away
  3. Rollback from 4.10.z+n to exactly 4.10.z, during which the cluster may experience degraded service and/or periods of service unavailability but must eventually complete with no further admin action
  4. Must pass 4.10.z e2e testsuite
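For illustration, a hedged sketch of what scenarios 2 and 3 could look like from the admin's side, assuming the subcommand lands with the name used above (flags and output are not final):

# Confirm the previous entry in the ClusterVersion history is exactly one z-stream away
oc get clusterversion version -o jsonpath='{.status.history[*].version}'

# Hypothetical invocation of the (initially hidden) rollback subcommand
oc adm upgrade rollback-z-stream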

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Fix all bugs listed here
    project = "OpenShift Bugs" AND affectedVersion in( 4.12, 4.14, 4.15) AND labels = rollback AND status not in (Closed ) ORDER BY status DESC

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. At least today we intend to only surface this process internally and work through it with customers actively engaged with support; where do we put that?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Validate z-stream rollbacks in CI starting with 4.16 by ensuring that a rollback completes unassisted and e2e testsuite passes
  • Provide internal documentation (private KCS article) that explains when this is the best course of action versus working around a specific issue
  • Provide internal documentation (private KCS article) that explains the expected cluster degradation until the rollback is complete
  • Provide internal documentation (private KCS article) outlining the process and any post rollback validation

Why is this important?

  • Even if upgrade success is 100% there's some chance that we've introduced a change which is incompatible with a customer's needs and they desire to roll back to the previous z-stream
  • Previously we've relied on backup and restore here, however due to many problems with time travel, that's only appropriate for disaster recovery scenarios where the cluster is either completely shut down already or it's acceptable to do so while also accepting loss of any workload state change (PVs that were attached after the backup was taken, etc)
  • We believe that we can reasonably roll back to a previous z-stream

Scenarios

  1. Upgrade from 4.16.z to 4.16.z+n
  2. oc adm upgrade rollback-z-stream – initially a hidden command; it will look at the ClusterVersion history and roll back to the previous version if and only if that version is a z-stream away
  3. Rollback from 4.16.z+n to exactly 4.16.z, during which the cluster may experience degraded service and/or periods of service unavailability but must eventually complete with no further admin action
  4. Must pass 4.16.z e2e testsuite

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Fix all bugs listed here
    project = "OpenShift Bugs" AND affectedVersion in( 4.16, 4.17) AND labels = rollback AND status not in (Closed ) ORDER BY status DESC

Documentation

KCS : https://access.redhat.com/solutions/7089715 

Open questions::

  1. At least today we intend to only surface this process internally and work through it with customers actively engaged with support; where do we put that?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem

OTA-941 landed a rollback guard in 4.14 that blocked all rollbacks. OCPBUGS-24535 drilled a hole in that guard to allow limited rollbacks to the previous release the cluster had been aiming at, as long as that previous release was part of the same 4.y z-stream. We decided to block that hole back up in OCPBUGS-35994, and now folks want the hole re-opened in this bug. We also want to bring back the oc adm upgrade rollback ... subcommand. Hopefully this new plan sticks.

Version-Release number of selected component

Folks want the guard-hole and rollback subcommand restored for 4.16 and 4.17.

How reproducible

Every time.

Steps to Reproduce

Try to perform the rollbacks that OCPBUGS-24535 allowed.

Actual results

They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.

Expected results

They work, as verified in OCPBUGS-24535.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

openshift-sdn is no longer part of OCP in 4.17, so CNO must stop referring to its image

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

This feature is to track automation in ODC, related packages, upgrades, and some tech debt.

Goals

  • Improve automation for Pipelines dynamic plugins
  • Improve automation for OpenShift Developer console
  • Move cypress script into frontend to make it easier to approve changes
  • Update to latest PatternFly QuickStarts

Requirements

  • TBD
  • CI - MUST be running successfully with test automation. This is a requirement for ALL features. (isMvp: YES)
  • Release Technical Enablement - Provide necessary release enablement details and documents. (isMvp: No)

 

Questions to answer…

  • Is there overlap with what other teams at RH are already planning?  No overlap

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

This won't impact documentation; this feature is mostly to enhance end-to-end tests and job runs on CI.

Assumptions

  • ...

Customer Considerations

  • No direct impact to customer

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Here is our overall tech debt backlog: ODC-6711

See the included tickets that we want to clean up in 4.16.

Description

This is a follow-up on https://github.com/openshift/console/pull/13931.

We should move test-cypress.sh from the root of the console project into the frontend folder or a new frontend integration-tests folder to allow more people to approve changes in the test-cypress.sh script.
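A minimal sketch of the move, assuming a plain git rename plus manual fix-up of any CI or docs references to the old path (paths are illustrative):

git mv test-cypress.sh frontend/test-cypress.sh
# update any CI configuration or docs that reference the old ./test-cypress.sh path, then:
git commit -am "Move test-cypress.sh into frontend/"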

Acceptance Criteria

  1. More people can approve changes in the test-cypress.sh script.

Additional Details:

Problem:

Improving existing tests in CI to run more tests

Goal:

Why is it important?

Use cases:

  1. Improving test execution to get more tests run on CI

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

The goal is to replace the present Operator installation via UI with a command-line method to reduce flakiness and test time.
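A hedged sketch of a command-line method using the standard OLM Subscription flow; the package name openshift-pipelines-operator-rh, the latest channel, and the redhat-operators catalog are assumptions based on the Pipelines context of this feature:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator-rh
  namespace: openshift-operators
spec:
  channel: latest
  name: openshift-pipelines-operator-rh
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# Wait for the operator's CSV to reach the Succeeded phase before running tests
oc get csv -n openshift-operators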

Acceptance Criteria

  1. <criteria>

Additional Details:

Description

The goal is to replace the present Operator installation via UI with a command-line method to reduce flakiness and test time.

Acceptance Criteria

  1. <criteria>

Additional Details:

Executive Summary

Provide mechanisms for the builder service account to be made optional in core OpenShift.

Goals

< Who benefits from this feature, and how? What is the difference between today’s current state and a world with this feature? >

  • Let cluster administrators disable the automatic creation of the "builder" service account when the Build capability is disabled on the cluster. This reduces potential attack vectors for clusters that do not run build or other CI/CD workloads. Example - fleets for mission-critical applications, edge deployments, security-sensitive environments.
  • Let cluster administrators enable/disable the generation of the "builder" service account at will. Applies to new installations with the "Build" capability enabled as well as upgraded clusters. This helps customers who are not able to easily provision new OpenShift clusters and block usage of the Build system through other means (ex: RBAC, 3rd party admission controllers (ex OPA, Kyverno)).
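For orientation, a hedged sketch of how an admin can check today whether the Build capability is enabled and whether the "builder" service account is being created (the project name is a placeholder):

# Which capabilities are enabled on this cluster?
oc get clusterversion version -o jsonpath='{.status.capabilities.enabledCapabilities}'

# Is the "builder" service account present in a given project?
oc get serviceaccount builder -n <some-project>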

Requirements

  • Disable the service account controller related to Build/BuildConfig when the Build capability is disabled: when the API is marked as removed or disabled, stop creating the "builder" service account and its associated RBAC. (MVP: Yes)
  • Option to disable the "builder" service account: even if the Build capability is enabled, allow admins to disable "builder" service account generation; admins will need to bring their own service accounts/RBAC for builds to work. (MVP: Yes)

(Optional) Use Cases

< What are we making, for who, and why/what problem are we solving?>

  • Build as an installation capability - see WRKLDS-695
  • Disabling the Build system through RBAC or admission controllers. The "builder" service account is the only thing that RBAC and admission control cannot block without significant cluster impact.

Out of scope

<Defines what is not included in this story>

  • Disabling the Build API separately from the capabilities feature

Dependencies

< Link or at least explain any known dependencies. >

  • Build capability: WRKLDS-695
  • Separate controllers for default service accounts: API-1651

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

  • In OCP 4.14, "Build" was introduced as an optional installation capability. This means that the BuildConfig API and subsystems are not guaranteed to be present on new clusters.
  • The "builder" service account is granted permission to push images to the OpenShift internal registry via a controller. There is risk that the service account can be used as an attack vector to overwrite images in the internal registry.
  • OpenShift has an existing API to configure the build system. See OCP documentation on the current supported options. The current OCP build test suite includes checks for these global settings. Source code.
  • Customers with larger footprints typically separate "CI/CD clusters" from "application clusters" that run production workloads. This is because CI/CD workloads (and building container images in particular) can have "noisy" consumption of resources that risk destabilizing running applications.

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

  • Must work for new installations as well as upgraded clusters.

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

  • Update the "Build configurations" doc so admins can understand the new feature.
  • Potential updates to the "Understanding BuildConfig" doc to include references to the serviceAccount option in the spec, as well as a section describing the permissions granted to the "builder" service account.

What does success look like?

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content, Release Note, or No Doc Impact?>

QE Contact

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Impact

< If the feature is ordered with other work, state the impact of this feature on the other work>

Related Architecture/Technical Documents

  • Disabling OCM Controllers (slides). Note that the controller names may be a bit out of date once API-1651 is done.
  • Install capabilities - OCP docs

Done Checklist

  • Acceptance criteria are met
  • Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
  • User Journey automation is delivered
  • Support and SRE teams are provided with enough skills to support the feature in production environment
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description of problem:


When a cluster is deployed with no capabilities enabled, and the Build capability is later enabled, its related cluster configuration CRD is not installed. This prevents admins from fine-tuning builds and ocm-o from fully reconciling its state.

    

Version-Release number of selected component (if applicable):

4.16.0
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Launch a cluster with no capabilities enabled (via cluster-bot: launch 4.16.0.0-ci aws,no-capabilities)
    2. Edit the clusterversion to enable the build capability: oc patch clusterversion/version --type merge -p '{"spec":{"capabilities":{"additionalEnabledCapabilities":["Build"]}}}'
    3. Wait for the openshift-apiserver and openshift-controller-manager to roll out
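A hedged way to check the rollout mentioned in step 3 (assuming the standard clusteroperator names; re-run until the operators report Available=True and Progressing=False):

    oc get clusteroperators openshift-apiserver openshift-controller-manager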
    

Actual results:

APIs for BuildConfig (build.openshift.io) are enabled.
Cluster configuration API for build system is not:

$ oc api-resources | grep "build"
buildconfigs  bc  build.openshift.io/v1 true         BuildConfig
builds                   build.openshift.io/v1 true         Build
    

Expected results:

Cluster configuration API is enabled.

$ oc api-resources | grep "build"
buildconfigs  bc  build.openshift.io/v1    true  BuildConfig
builds                   build.openshift.io/v1    true  Build
builds                   config.openshift.io/v1  true  Build     
    

Additional info:

This causes list errors in openshift-controller-manager-operator, breaking the controller that reconciles state for builds and the image registry.

W0523 18:23:38.551022       1 reflector.go:539] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
E0523 18:23:38.551334       1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.Build: failed to list *v1.Build: the server could not find the requested resource (get builds.config.openshift.io)
    

Story (Required)

As a cluster admin trying to disable the Build, DeploymentConfig, and Image Registry capabilities, I want the RBAC controllers for the builder and deployer service accounts and the default image-registry rolebindings disabled when their respective capability is disabled.

<Describes high level purpose and goal for this story. Answers the questions: Who is impacted, what is it and why do we need it? How does it improve the customer's experience?>

Background (Required)

<Describes the context or background related to this story>

In WRKLDS-695, ocm-o was enhanced to disable the Build and DeploymentConfig controllers when the respective capability was disabled. This logic should be extended to include the controllers that set up the service accounts and role bindings for these respective features.

Out of scope

<Defines what is not included in this story>

Approach (Required)

<Description of the general technical path on how to achieve the goal of the story. Include details like json schema, class definitions>

    • Needs manual testing (OpenShift cluster deployed with all/some capabilities disabled). 

Dependencies

<Describes what this story depends on. Dependent Stories and EPICs should be linked to the story.>

Acceptance Criteria (Mandatory)

  • Build and DeploymentConfig systems remain functional when the respective capability is enabled.
  • Build, DeploymentConfig, and Image-Puller RoleBinding controllers are not started when the respective capability is disabled.

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated

  • Engineering: 5
  • QE: 2
  • Doc: 2

Legend

Unknown

Verified

Unsatisfied

Done Checklist

  • Code is completed, reviewed, documented and checked in
  • Unit and integration test automation have been delivered and running cleanly in continuous integration/staging/canary environment
  • Continuous Delivery pipeline(s) is able to proceed with new code included
  • Customer facing documentation, API docs etc. are produced/updated, reviewed and published
  • Acceptance criteria are met

In OCP 4.16.0, the default role bindings for image puller, image pusher, and deployer are created, even if the respective capabilities are disabled on the cluster.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.

Feature goal (what are we trying to solve here?)

Design a secure device lifecycle from provisioning and on-boarding of devices, attesting their integrity, rotating device certificates, to decommissioning with a frictionless user experience in a way that facilitates later IEC 62443 certification.

Implement the MVP parts of it, namely the secure device enrollment, certificate rotation, and decommissioning, preparing the way to use a hardware root-of-trust for device identity where possible. Certificates must be signable by a user-provided, production-grade CA.

User stories:

  • As OT admin provisioning virtual devices on a virtualisation or cloud platform, I want a secure, zero-touch way of onboarding these devices into Flight Control.
  • As OT admin provisioning physical devices on Red Hat Satellite or a 3rd party bare metal management platform, I want a secure, zero-touch way of onboarding these devices into Flight Control.
  • As OT admin provisioning physical devices delivered from the factory directly at a deployment site without local provisioning infrastructure, I want a secure, zero-touch way of onboarding these devices into Flight Control.
  • As a security engineer, I want to onboard new devices efficiently using clear authentication and security protocols that allow me to trust that the device is what it claims to be.
  • As an IT admin, I want to use Flight Control with self-signed certificates.
  • As an IT admin, I want to use Flight Control with certificates signed by my own production-grade CA.

DoD (Definition of Done)

  • IT admins can configure Flight Control to use a built-in CA using self-signed certs.
  • IT admins can configure Flight Control to have agent certificates signed by their production-grade CA.
  • IT admins can generate enrollment certificates that they can use to enroll agents.
  • IT admins can generate both shared, long-validity certificates for embedding into OS images and dedicated, short-lived certificates for enrollment during provisioning.
  • Agents can use the enrollment certificate to bootstrap their agent keys and bootstrap their management certificates from that.
  • Agents can renew management certificates before they expire, or after expiration if certain security checks are met.
  • IT admins can securely decommission a device.

Market Problem

As a stakeholder aiming to adopt KubeSaw as a Namespace-as-a-Service solution, I want the project to provide streamlined tooling and a clear code-base, ensuring seamless adoption and integration into my clusters.

Why it Matters

Efficient adoption of KubeSaw, especially as a Namespace-as-a-Service solution, relies on intuitive tooling and a transparent codebase. Improving these aspects will empower stakeholders to effortlessly integrate KubeSaw into their Kubernetes clusters, ensuring a smooth transition to enhanced namespace management.

Illustrative User Stories

As a Stakeholder, I want a streamlined setup of the KubeSaw project and a fully automated way of upgrading this setup along with the updates of the installation.

Expected Outcomes

  • Intuitive and user-friendly tooling for seamless configuration and management of KubeSaw instance.
  • A transparent and well-documented codebase, facilitating a quick understanding of KubeSaw internals.

Effect

The expected outcome within the market is both growth and retention. The improved tooling and codebase will attract new stakeholders (growth) and enhance the experience for existing users (retention) by providing a straightforward path to adopting KubeSaw's Namespace-as-a-Service features in their clusters.

Partner

  • Developer Sandbox
  • Konflux

Additional/tangential areas for future development

  • Integration with popular Kubernetes management platforms and tooling for enhanced interoperability.
  • Regular updates to compatibility matrices to support evolving Kubernetes technologies.
  • Collaboration with stakeholders to gather feedback and continuously improve the integration experience, including advanced namespace management features tailored to user needs.

This epic is to track all the unplanned work related to security incidents, fixing flaky e2e tests, and other urgent and unplanned efforts that may arise during the sprint.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“In order to have a consistent metrics UI/UX, we as the Observability UI Team need to reuse the metrics code from the admin console in the dev console”

 

The metrics page is the one that receives a PromQL query and is able to display a line chart with the results.

Goals & Outcomes

Product Requirements:

  • Metrics from the dev console use the same components as the admin console

Open Questions

  • How to store / offer the possibility to select predefined queries, as the current dev console supports

Background

In order to keep the dev and admin console metrics consistent, users need to be able to select a predefined query from a list. The dev perspective metrics page is scoped to the currently selected namespace, so we should adjust the code so that the current namespace is used in the soft-tenancy requests to the Thanos querier.
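For reference, a hedged sketch of what a namespace-scoped (soft-tenancy) query against the Thanos querier looks like; the 9092 tenancy port and the namespace parameter reflect the usual OpenShift monitoring setup and are assumptions here:

# Token of a user with view access to the selected namespace
TOKEN=$(oc whoami -t)

# Run from inside the cluster (e.g. from a pod) so the service DNS name resolves
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?namespace=my-namespace&query=up"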

Outcomes

  • Users can select predefined queries from a list, as the dev console metrics page currently allows; the selected query will be added first in the query list.

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.

Outcomes

  • The dev console metrics page is loaded from monitoring-plugin, and the code that is not shared with other components in the console is removed from the console codebase.
  • The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.

 

Duplicate issue of https://issues.redhat.com/browse/CONSOLE-4187

To pass the CI/CD requirements of openshift/console, each PR needs to have an issue in an OCP-owned Jira board.

This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin.

openshift/console PR#4187: Removes the Metrics Page.

openshift/monitoring-plugin PR#138: Adds the Metrics Page and consolidates the code to use the same components as the Administrative > Observe > Metrics Page.

Testing

Both openshift/console PR#4187 and openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both PRs you should see a page like the screenshot attached below.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

The admin console's alerts list page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

There are also some UX differences between the two pages, but we want to change the dev console to have the same UX as the admin console.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.

The admin console's alerts list page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

There are also some UX differences between the two pages, but we want to change the dev console to have the same UX as the admin console.

Outcomes

  • The dev console page for listing alerts is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

Proposed title of this feature request

Fleet / Multicluster Alert Management User Interface

What is the nature and description of the request?

Large enterprises are drowning in cluster alerts.

side note: Just within my demo RHACM Hub environment, across 12 managed clusters (OCP, SNO, ARO, ROSA, self-managed HCP, xKS), I have 62 alerts being reported! And I have no idea what to do about them!

Customers need the ability to interact with alerts in a meaningful way, to leverage a user interface that can filter, display, multi-select, sort, etc. To multi-select and take actions, for example:

  • alert filter state is warning
  • clusters filter is label environment=development
  • multi-select this result set
  • take action to Silence the alerts!

Why does the customer need this? (List the business requirements)

Platform engineering (sys admin; SRE etc) must maintain the health of the cluster and ensure that the business applications are running stable. There might indeed be another tool and another team which focuses on the Application health itself, but for sure the platform team is interested to ensure that the platform is running optimally and all critical alerts are responded to.

As of TODAY, what the customer must do is perform alert management via the CLI. This is tedious, ad hoc, and error-prone (see blog link).

The requirements are:

  • filtering fleet alerts
  • multiselect for actions like silence
  • as a bonus, configuring alert forwarding will be amazing to have.

List any affected packages or components.

OCP console Observe dynamic plugin

ACM Multicluster observability (MCO operator)

Description

"In order to provide ACM with the same monitoring capabilities OCP has, we as the Observability UI Team need to allow the monitoring plugin to be installed and work in ACM environments."

Goals & Outcomes

Product Requirements:

  • Be able to install the monitoring plugin without CMO, using COO
  • Allow the monitoring plugin to use a different backend endpoint to fetch alerts; ACM has its own Alertmanager
  • Add a column to the alerts list to display the cluster that originated the alert
  • Include only the alerting parts, which include the alerts list, alert detail, and silences

UX Requirements:

  • Align UX text and patterns between ACM concepts (hub cluster, spoke cluster, core operators) and the current monitoring plugin

Open Questions

  • Do the current monitoring plugin and the ACM monitoring plugin need to coexist in a cluster?
  • Do we need to connect to a different Prometheus/Thanos, or is it just a different Alertmanager?

Background

In order to enable/disable features for monitoring in different OpenShift flavors, the monitoring plugin should support feature flags

Outcomes

  • The monitoring plugin has a lightweight Go backend that can be configured with feature toggles to enable dynamic plugin extensions
  • Do not include specific flags that will break FIPS

Proposed title of this feature request
Q2 - Rapid Recommendations Iteration 1 - Containers/Pod logs gathering
What is the nature and description of the request?

As an Insights/Observability user I'd like the collection mechanism to be more dynamic in order to cover more scenarios and provide recommendations faster.

Why does the customer need this? (List the business requirements)

Rapid recommendations is a set of collection mechanism changes that enables Insights Rule development and Analytics functions to request data collection enhancement and trim time for actual implementation of rule/dashboard from a month+ to days.

List any affected packages or components.

Insights Operator

Goal

The main goal is to implement Rapid Recommendations Iteration 1 (Containers/Pod logs gathering) in Insights Operator as per Openshift Enhancement Proposal - PR link

In more details:

  • Enable Insights recommendations that target existing OCP versions and that require
    new container log data.
  • Reduce time to fleet-wide impact for Insights recommendations that require new container log data.
  • Reduce effort to develop Insights recommendations that require new container log data.
  • Enable one-off queries about the fleet that utilize container log data.
  • Provide a solid base for future extensions of the remote configuration feature to node logs
    and API resources.
     

Importance

This improvement has huge potential in terms of the additional value IO data could bring to the table.

Scenarios

Previous Work

The conditional data gathering can be considered as the previous work. This idea generally builds on it.
Scoping is done in previous epic and described in the Openshift Enhancement Proposal - PR link

Unknowns

There are several unknowns:

  • The area of Prometheus metrics and alerts remains open with many unknowns at the moment.

Risks

Even if the scope was reduced to Pod logs only, it remains a sizable chunk of work and might require project management involvement (not necessarily a project manager).

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

By having the OpenShift Insights Operator gather data from OSP CRDs, this data can be ingested into Insights and plug into the existing tools and processes under the Insights/CCX teams.

This will allow us to create dedicated Superset dashboards or query the data using SQL via the Trino API to, for example:

  • Tell how many OSP installations we have
  • See trends in eg. the number of nodes per deployment
  • Adoption rate of a particular OSP (>18) release
  • Upgrade/update rates
  • Answer questions such as: how many OSP deployments are using Fibre Channel as their Cinder driver?
  • ... 

Besides, customers will benefit from any Insights rules that we'll be adding over time to for example anticipate issues or detect misconfigurations, suggest parameter tunings, etcetera.

Examples of how OCP Insights uses this data can be seen in the "Let's Do The Numbers" series of monthly presentations.

This epic is targeted only at RHOSO (so OSP 18 and newer). There are no changes nor support for this planned in OSP-17.1 or older.

It is implementation of the solution 1 from the document https://docs.google.com/document/d/1r3sC_7ZU7qkxvafpEkAJKMTmtcWAwGOI6W_SZGkvP6s/edit#heading=h.kfjcs2uvui3g

We need to, based on Yatin Karel's patch, properly integrate our CRs with the insights-operator. It needs to collect data from the 'OpenstackControlPlane', 'OpenstackDataPlaneDeployment' and 'OpenstackDataPlaneNodeSet' CRs with proper anonymization of data such as IP addresses. It also needs to set a "good" ID to identify the OpenStack cluster, as we cannot rely on the OpenShift clusterID because we may have more than one OpenStack cluster on the same OCP cluster.

To identify the OpenStack cluster, perhaps the UUID of the OpenstackControlPlane CR can be used. If not, we will need to figure out something else.

Definition of done:

  • insights-operator knows about the OpenStack-related CRs, can collect data from them, anonymize that data, and send it to the Insights server,
    • it can collect data from CRs from more than one OpenStack cluster and send them separately to the Insights server
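For illustration, a hedged sketch of the data such a gatherer would read, using the CR kinds named above (whether each kind is namespaced, and its exact API group, are assumptions here):

# The CRs the insights-operator gatherer would collect and anonymize
oc get OpenstackControlPlane -A -o yaml
oc get OpenstackDataPlaneDeployment -A -o yaml
oc get OpenstackDataPlaneNodeSet -A -o yaml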

Feature Overview

As a cluster-admin, I want to run updates in discrete steps, updating the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.

 

Background:

This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.

  1. OTA-700 Reduce False Positives (such as Degraded) 
  2. OTA-922 - Better able to show the progress made in each discrete step 
  3. [Covered by the status command] Better visibility into any errors during the upgrades and documentation of what the errors mean and how to recover.

Goals

  1. Have an option to do upgrades in more discrete steps under admin control (see the sketch after this list). Specifically, these steps are:
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i.e. Router, other components) or infra nodes
  2. A user experience around an end-to-end back-up and restore after a failed upgrade
  3. MCO-530 - Support in Telemetry for the discrete steps of upgrades 
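As referenced in goal 1, a hedged example of decoupling the worker-node step from the control-plane step with today's primitives is pausing the worker MachineConfigPool around the update; this is a sketch of the existing pattern, not the final UX:

# Hold back worker node updates while the control plane updates
oc patch mcp/worker --type merge -p '{"spec":{"paused":true}}'

# ...after the control plane update completes, let the workers proceed
oc patch mcp/worker --type merge -p '{"spec":{"paused":false}}'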

References

Epic Goal

  • Eliminate the gap between measured availability and Available=true

Why is this important?

  • Today it's not uncommon, even for CI jobs, to have multiple operators which blip through either Degraded=True or Available=False conditions
  • We should assume that if our CI jobs do this then when operating in customer environments with higher levels of chaos things will be even worse
  • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
  • Since our product is self-hosted, we can reasonably expect that the instability that we experience on our platform workloads (kube-apiserver, console, authentication, service availability), will also impact customer workloads that run exactly the same way: we're just better at detecting it.

Scenarios

  1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
  2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of upgrade
  3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (ex: metal's DHCP server is singleton)
  4. Address all identified issues

Acceptance Criteria

  • openshift/enhancements CONVENTIONS outlines these requirements
  • CI - Release blocking jobs include these new/updated tests
  • Release Technical Enablement - N/A if we do this we should need no docs
  • No outstanding identified issues

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in this query; teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which may be deferred to 4.10:
    https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Tests in place
  • DEV - No outstanding failing tests
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

: [bz-Image Registry] clusteroperator/image-registry should not change condition/Available expand_less
Run #0: Failed expand_less	1h58m17s
{  0 unexpected clusteroperator state transitions during e2e test run, as desired.
3 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

Jan 09 12:43:04.348 E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
Jan 09 12:43:04.348 - 56s   E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
Jan 09 12:44:00.860 W clusteroperator/image-registry condition/Available reason/Ready status/True Available: The registry is ready\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created (exception: Available=True is the happy case)
}

And the job still passed.

Definition of done:

  • Same as OTA-362, except filled in here.
  • File bugs or link the existing issues.
  • If a bug exists, then add the tests to the exception list.
  • Unless tests are in the exception list, they should fail if we see Available != True.

Feature Overview (aka. Goal Summary)  

In order to continue with the evolution of rpm-ostree in RHEL CoreOS, we should adopt bootc. This will keep us aligned with future development work in bootc and provide operational consistency between RHEL image mode and RHEL CoreOS.

Goals

  • bootc is integrated into the RHCOS update flow

Requirements (aka. Acceptance Criteria):

  • OpenShift itself no longer calls rpm-ostree (unless needed for local package installs)
  • CoreOS extensions still work
  • Existing local package layering still works

 

Deployment considerations: List applicable specific needs (N/A = not applicable)
  • Operator compatibility: Does an operator use `rpm-ostree install`?

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

The whole bootc integration into the MCO can be divided into three epics, each announced in an enhancement

Epic 1: Bootc Update Path 

Summary: An additional image update path added to the backend of the MCO. Keeping in sync with CoreOS, the MCO currently utilizes an rpm-ostree-based system for base image specification and layering. By adding a bootc update path, the MCO will, in the future, embrace Image Mode for RHEL and become ready to support an image-based upgrade process.

Proposal:

Phase 1 Get bootc working in the MCO 

  • Goal:
    • Discuss the feasibility of switching and identify switching risks sooner than later - enhancement 
    • Create a bootc update path in the MCO
    • Ensure the functionality of osImageURL Update and Kernel Argument Update
    • Emit warnings for Extension Change & Kernel Type Switch, which are not supported by bootc
  • Implementation: 
    • Mimic everything we have in rpm-ostree.go by creating a bootc.go, and make bootc.go work with the update paths in the daemons (see the sketch after this list)
    • Create a bootc wrapper (similar to rpmostree-client-go) for bootc status reporting purposes 
    • Code merge behind FG 
  • People who need to be aware of this:
  • Done when:
    • We have an enhancement (or comparable) to describe the future bootc switching plan in the MCO, identifying dependencies, challenges, plans and blockers. 
    • All goals achieved 
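A hedged sketch of the bootc invocations such a wrapper (mentioned under Implementation above) would shell out to in place of today's rpm-ostree calls; exact flags and the image pullspec are illustrative and may differ by bootc version:

# Query the current and staged image state (what the status wrapper would parse)
bootc status

# Point the host at a new base image (the osImageURL update path)
bootc switch quay.io/example/rhcos-custom:latest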

Phase 2 Get bootc working with OCL 

  • Goal:
    • Based on the findings in spike https://issues.redhat.com/browse/MCO-1027, theoretically no additional work is needed for wiring up bootc switching with OCL. Prove the theory with testing.
    • Draw attention to unexpected behaviours in the actual testing and address them.
    • Make sure support for Extension Change & Kernel Type Switch is back online with the bootc TP turned on and the OCL TP turned on (if not GA).
  • Implementation: 
    • Update the image rollout command for OCL built images 
    • Code merge behind FG 
  • Done when:
    • We have confidence in adapting bootc switching in OCL or concerns are raised 
    • All goals achieved 

Phase 3 GA graduation criteria 

  • Goal: 
    • Create a bootc update path in the MCO so that the MCO will support bootable container images.
    • Ensure that MCO is able to understand both rpm-ostree and bootc mutations
    • Enable the user to install a cluster with bootc and transfer an old cluster to a bootc-enabled one 
    • Have users not notice any difference in user experiences other than that more customization and mutation options are supported 
    • Make bootc update path the default layered OS update path
  • Dependencies:
    • Openshift’s decision on whether bootc will be supported - 4.19
    • bootc GA

Epic 2: Unified Update Interface @ https://issues.redhat.com/browse/MCO-1189

Summary: Currently, there are mainly three update paths built in parallel within the MCO. They separately take care of non-image updates, image updates, and updates for pools that have opted in to On Cluster Layering. As a new bootc update path will be added with the introduction of this enhancement, the MCO is looking for a better way to manage these four update paths, which handle different types of updates but also share a lot in common (e.g. checking reconcilability). Interest and proposals in refactoring the MCD functions and creating a unified update interface have been raised several times in previous discussions:

Epic 3: Bootc Day-2 Configuration Tools @ https://issues.redhat.com/browse/MCO-1190 

Summary: Bootc has opened a door to disk image customization via ConfigMap. In switching from rpm-ostree to bootc, the MCO should not only make sure all existing functionality remains but also proactively extend its support so that all the customization power brought in by bootc is surfaced to the user, allowing them to take full advantage of it. This will involve creating a new user-side API for fetching admin-defined configuration and pairing the MCO with a bootc day-2 configuration tool for applying the customizations. Interest in and proposals for this have been raised several times in previous discussions:

Several discussions have happened on the question: are Go bindings for bootc a necessity for adopting a bootc update path in the MCO?

    • Discussion: Currently, bootc does not have Go bindings, which makes it hard to read its status and deployments. It seems that we would need to create a Go library for client-side interactions with bootc (similar to https://github.com/coreos/rpmostree-client-go). Colin has suggested a way to auto-generate it (see the comment in https://issues.redhat.com/browse/MCO-1026), which would be the most ideal solution. Workarounds are also available, making Go bindings a non-goal item for this enhancement.
    • Result: Go bindings for bootc are not a must and, as a result, are outside the scope of this enhancement and will not be implemented. The main purpose of creating Go bindings for rpm-ostree/bootc is to have an easier way to parse the generated JSON and read the status of rpm-ostree/bootc. This can be done by creating a simple wrapper function inside the MCO. It will make sense to separate it out into a standalone helper/library in the future when demand rises, but that is a non-goal for now.

Action item: 

(1) Mimic the rpm-ostree.go we have for shelling out rpm-ostree commands and build something similar for bootc so that we can call bootc commands for OS updates.

(2) This is a merging card for the result of https://issues.redhat.com/browse/MCO-1191

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal

Goals

  • Validate OpenShift on OCI bare metal to make it officially supported.
  • Enable installation of OpenShift 4 on OCI bare metal using the Assisted Installer.
  • Provide published installation instructions for how to install OpenShift on OCI bare metal.
  • OpenShift 4 on OCI bare metal can be updated, resulting in a cluster and applications that are in a healthy state when the update completes.
  • Telemetry reports back on clusters using OpenShift 4 on OCI bare metal for connected OpenShift clusters (e.g. platform=external or none, plus some other indicator to know it's running on OCI bare metal); a hedged install-config sketch follows this list.
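A hedged sketch of the external platform identification mentioned above; the field values shown (platformName, the CCM setting) are assumptions for illustration rather than validated syntax:

# install-config.yaml excerpt (illustrative)
platform:
  external:
    platformName: oci                  # assumption: identifies the infrastructure as OCI
    cloudControllerManager: External   # assumption: Oracle-provided CCM is expected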

Use scenarios

  • As a customer, I want to run OpenShift Virtualization on OpenShift running on OCI baremetal.
  • As a customer, I want to run Oracle BRM on OpenShift running OCI baremetal.

Why is this important

  • Customers who want to move from on-premises to Oracle cloud baremetal
  • OpenShift Virtualization is currently only supported on baremetal

Requirements

 

Requirement / Notes
  • OCI Bare Metal Shapes must be certified with RHEL. Notes: It must also work with RHCOS (see iSCSI boot notes) as OCI BM standard shapes require RHCOS iSCSI to boot (OCPSTRAT-1246). Certified shapes: https://catalog.redhat.com/cloud/detail/249287
  • Successfully passing the OpenShift Provider conformance testing. Notes: this should be fairly similar to the results from the OCI VM test results. Oracle will do these tests.
  • Updating Oracle Terraform files
  • Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. Notes: Support Oracle Cloud in Assisted-Installer CI: MGMT-14039

 

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

OCI Bare Metal Shapes to be supported

Any bare metal Shape to be supported with OCP has to be certified with RHEL.

From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.

As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes 

Assumptions

  • Pre-requisite: RHEL certification which includes RHEL and OCI baremetal shapes (instance types) has successfully completed.

 

 

 

 
 

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Feature Overview (aka. Goal Summary)  

Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.

Goals (aka. expected user outcomes)

  • Provide a configurable way to indicate that a pod should be connected to a unique network of a specific type via its primary interface.
  • Allow networks to have overlapping IP address space.
  • The primary network defined today will remain in place as the default network that pods attach to when no unique network is specified.
  • Support cluster ingress/egress traffic for unique networks, including secondary networks.
  • Support for ingress/egress features where possible, such as:
    • EgressQoS
    • EgressService
    • EgressIP
    • Load Balancer Services

Requirements (aka. Acceptance Criteria):

  • Support for 10,000 namespaces
  •  

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • As an OpenStack or vSphere/vCenter user, who is migrating to OpenShift Kubernetes, I want to guarantee my OpenStack/vSphere tenant network isolation remains intact as I move into Kubernetes namespaces.
  • As an OpenShift Kubernetes user, I do not want to have to rely on Kubernetes Network Policy and prefer to have native network isolation per tenant using a layer 2 domain.
  • As an OpenShift Network Administrator with multiple identical application deployments across my cluster, I require a consistent IP-addressing subnet per deployment type. Multiple applications in different namespaces must always be accessible using the same, predictable IP address.

Questions to Answer (Optional):

  •  

Out of Scope

  • Multiple External Gateway (MEG) Support - support will remain for default primary network.
  • Pod Ingress support - support will remain for default primary network.
  • Cluster IP Service reachability across networks. Services and endpoints will be available only within the unique network.
  • Allowing different service CIDRs to be used in different networks.
  • Localnet will not be supported initially for primary networks.
  • Allowing multiple primary networks per namespace.
  • Allow connection of multiple networks via explicit router configuration. This may be handled in a future enhancement.
  • Hybrid overlay support on unique networks.

Background

OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.

As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.

Network Policy has its issues:

  • it can be cumbersome to configure and manage for a large cluster
  • it can be limiting as it only matches TCP, UDP, and SCTP traffic
  • large amounts of network policy can cause performance issues in CNIs

With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.

Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and to allow pods to connect to different types of networks, independent of the networks other pods connect to.
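As a hedged illustration of the direction (not the final API), a unique layer 2 primary network for a namespace could be expressed as a NetworkAttachmentDefinition; the OVN-Kubernetes config keys below, notably "role": "primary", are assumptions based on upstream discussion:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tenant-blue-primary
  namespace: tenant-blue
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "tenant-blue-primary",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.100.0.0/16",
      "role": "primary",
      "netAttachDefName": "tenant-blue/tenant-blue-primary"
    }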

Customer Considerations

  •  

Documentation Considerations

  •  

Interoperability Considerations

Test scenarios:

  • E2E upstream and downstream jobs covering supported features across multiple networks.
  • E2E tests ensuring network isolation between OVN networked and host networked pods, services, etc.
  • E2E tests covering network subnet overlap and reachability to external networks.
  • Scale testing to determine limits and impact of multiple unique networks.

See https://github.com/ovn-org/ovn-kubernetes/pull/4276#discussion_r1628111584 for more details

  1. Currently we seem to be handling the same network from multiple threads when different NADs refer to the same network.
  2. This leads to race conditions.
  3. We need a level-driven, single-threaded way of handling networks.
  4. This card tracks the refactoring needed for this as the first step.

In order for the network API related CRDs to be installed and usable out of the box, the new CRD manifests should be replicated to the CNO repository in a way that installs them along with the other OVN-K CRDs.

Example https://github.com/openshift/cluster-network-operator/pull/1765

The goal of this task is simply to add a feature gate, both upstream in OVNK and downstream in ocp/api, to then leverage via CNO once the entire feature merges. This is going to be a huge EPIC, so with the breakdown, this card is intentionally ONLY tracking the glue work to get the feature gate piece done in both places.

  1. Controller changes must leverage this feature gate,
  2. test changes must leverage this
  3. all topology changes that depend on "specific ifs for this feature" need this feature gate
  4. Be smart about the naming because it will be user-facing in docs
  5. also expose it via KIND

This card DOES NOT HAVE TO USE THE FEATURE GATE. It is meant to allow other cards to use this.

  • Make the necessary API changes, if needed, to indicate that this is modifiable
  • Make any CNO changes needed to ensure that a new range set on day 2 does not conflict with other ranges
  • Make OVNK changes to allow for day-2 config changes - disruption is not an issue
  • Ensure users can provide any value for this, not just 169.x.x.x, along with allowing expansion of the current range

Feature Overview (aka. Goal Summary)  

Crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

Benefits of Crun is covered here https://github.com/containers/crun 

 

FAQ.:  https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

***Note -> making crun the default does not mean we will remove support for runc, nor do we have any plans to do so in the foreseeable future; either runtime can still be selected explicitly (see the hedged sketch below).
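A minimal sketch, assuming the documented defaultRuntime field, of how an admin pins a pool to a specific runtime with a ContainerRuntimeConfig (the pool label is illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc   # or crun to opt in explicitly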

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

Upgrading from 4.17 to 4.18 results in crun becoming the default runtime. If a user didn't have a ContainerRuntimeConfig set to select crun, they should continue to use runc.

Version-Release number of selected component (if applicable):

    4.17.z

How reproducible:

    100%

Steps to Reproduce:

    1. upgrade 4.17 to 4.18
    

Actual results:

    crun is the default

Expected results:

    runc should be the default

Additional info:

    

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Support kubevirt csi volume cloning. 

As an application developer on a KubeVirt-HCP hosted cluster I want to use CSI clone workflows when provisioning PVCs when my infrastructure storage supports it.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Support kubevirt csi volume cloning. 

As an application developer on a KubeVirt-HCP hosted cluster I want to use CSI clone workflows when provisioning PVCs when my infrastructure storage supports it.

In order to implement csi-clone in the tenant cluster, we can simply pass the csi-clone request to the infra cluster (if it supports csi-clone) and have that take care of creating the csi-clone.
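For reference, the clone request on the tenant side is the standard CSI cloning pattern; the names and storage class below are placeholders:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cloned-pvc
spec:
  storageClassName: kubevirt-csi-infra-default   # placeholder
  dataSource:
    kind: PersistentVolumeClaim
    name: source-pvc                             # PVC being cloned
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi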

Goal

The goals of this feature are:

  • As part of a Microsoft guideline/requirement for implementing ARO HCP, we need to design a shared-ingress to kube-apiserver because MSFT has internal restrictions on IPv4 usage.  

Background

Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.

 

Interoperability Considerations

  • Impact: Which versions will be impacted by the changes?
  • Test Scenarios: Must test across various network and deployment scenarios to ensure compatibility and scale (perf/scale)

There are currently multiple ingress strategies we support for hosted cluster service endpoints (kas, nodePort, router...).
In a context of uncertainty about which use cases would be most critical to support, we initially exposed this in a flexible API that makes it possible to choose any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above and to be more cost effective in terms of infra, we want to have a common shared ingress solution for the whole hosted cluster fleet.

Current implementation reproduces https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit for simplicity.

This has a caveat: when the management KAS SVC changes, it would require changes to the .sh files that bind the IP to the lo interface and to the haproxies running in the data plane.

We should find a way to either

  • Ensure the IPs are always static for the kas SVCs so there is no need to worry about this.
  • Or let them be static from the data plane point of view, and have the shared proxy always use that same initial IP for discriminating while refreshing the real IP that requests are forwarded to.
  • Or find a way to refresh the IP on the fly on the data plane side (seems unlikely since these are disk files, systemd services and static pods).
  • Or use something else as the discriminating criteria.

The current implementation reproduces https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit for simplicity.
We should explore whether it's possible to use one single proxy on the data plane side, e.g. we could change the kube endpoint IP advertised address in the data plane to match the management kas SVC IP, or rely on the source IP instead of the destination IP to discriminate.

User Story:

As a consumer (managed services) I want HyperShift to support shared ingress.
As a dev I want to transition towards a prod-ready solution iteratively.

Acceptance Criteria:

A shared ingress solution PoC is merged and lets the Azure e2e pass: https://docs.google.com/document/d/1o6kd61gBVvUtYAqTN6JqJGlUAsatmlq2mmoBkQThEFg/edit#heading=h.hsujpqw67xkr

The initial goal is to start transitioning the code structure (reconcilers, APIs, helpers...) towards a shared ingress oriented solution using haproxy as presented above for simplicity.
Then we can iteratively follow up to harden security, evaluate more sophisticated alternatives and progress towards being prod ready.

Market Problem

  • As a Managed OpenShift cluster administrator, I want to use the priority-based expander for cluster-autoscaler to select instance types based on priorities assigned by a user to scaling groups.
    The configuration is based on the values stored in a ConfigMap.
  • The --expander=priority flag needs to be set on the cluster-autoscaler.
    The user can give a list of expanders in order to have the autoscaler make better choices. For example, we used --expander=priority,least-waste in our cost analysis testing;
    this first tries to use the priority ConfigMap, and if it finds multiple groups it then tries to find the group with the least waste.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10: 
      - .*t2\.large.*
      - .*t3\.large.*
    50: 
      - .*m4\.4xlarge.*
  • OCM will need to expose the option to do that.
  • HyperShift just sets the --expander=priority flag on the autoscaler, and OCM will be responsible for creating the cluster-autoscaler-priority-expander ConfigMap, based on customer configuration.
  • This feature should be available in CAPA.

Expected Outcomes

 

Acceptance Criteria

  • Add code in HCP to support priority expander
  • Add code in OCM to support optional configmap field for priority expander - XCMSTRAT-769 
  • Add code upstream CAPA to support optional configmap field for priority expander - HOSTEDCP-1728 

 

 

Documentation Considerations

Documentation will need to be updated to point out the new maximum for ROSA HCP clusters, and any expectations to set with customers.

Goal

  • As a Managed OpenShift cluster administrator, I want to use the Priority based expander for cluster-autoscaler to select instance types based on priorities assigned by a user to scaling groups.

The `--expander=priority` flag needs to be set on the cluster-autoscaler.

The configuration is based on the values stored in a ConfigMap called `cluster-autoscaler-priority-expander`, which will be created by the user/OCM.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal:
Graduate to GA (full support) Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.

Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The plug-able nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.

Continue work for E2E tests for the implementation of "gatewaycontroller" in https://issues.redhat.com/browse/NE-1206 

Some ideas for the test include:

  1. Test installation: Create GatewayClass with spec.controllerName: openshift.io/gateway-controller and ensure the following:
    1. Istio Operator successfully gets installed
    2. The expected Gateway API CRDs are installed (done in NE-1208)
    3. The SMCP gets created
    4. Istiod control plane gets stood up
    5. Istio-ingressgateway exists and is healthy
  2. Automatic deployments work by creating a Gateway controller - Istio creates an envoy proxy for the gateway
  3. End-to-End Ingress Check: Create HTTPRoute, and attach to Gateway, ensure connectivity
  4. Ingress op creates the catalog source and subscription automatically for an Istio image that is publicly available

Create E2E tests for the implementation of "gatewaycontroller" in https://issues.redhat.com/browse/NE-1206 

Some ideas for the test include:

  1. Test installation: Create GatewayClass with spec.controllerName: openshift.io/gateway-controller and ensure the following:
    1. Istio Operator successfully gets installed
    2. The expected Gateway API CRDs are installed
    3. The SMCP gets created
    4. Istiod control plane gets stood up
    5. Istio-ingressgateway exists and is healthy
  2. Automatic deployments work by creating a Gateway controller - need details on AC for this (Istio creates an envoy proxy for you)
  3. [Stretch] End-to-End Ingress Check: Create HTTPRoute, and attach to Gateway, ensure connectivity (see the hedged manifest sketch after this list)
  4. [Tech Preview] Ingress op creates the catalog source and subscription automatically for an Istio image that is publicly available
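For reference, a hedged sketch of the objects these checks exercise; the controllerName comes from the card above, while the gateway name, hostname, and backend service are placeholders:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: openshift-default
spec:
  controllerName: openshift.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-route
  namespace: demo
spec:
  parentRefs:
    - name: demo-gateway            # placeholder Gateway created from the class above
  hostnames:
    - demo.apps.example.com         # placeholder
  rules:
    - backendRefs:
        - name: demo-service
          port: 8080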

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Track the stories that cannot be completed before live migration GA.

Why is this important?

These tasks shall not block the live migration GA, but we still need to get them done.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the "Discussion Needed: Service Delivery Architecture Overview" checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the "Discussion Needed: Service Delivery Architecture Overview" checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn't have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. ...

Open questions::

1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As the live migration process may take hours for a large cluster, the workload in the cluster may trigger cluster expansion by adding new nodes. We need to support adding new nodes while an SDN live migration is in progress.

We need to backport this to 4.15.

The SDN live migration cannot work properly in a cluster with specific configurations. CNO shall refuse to proceed with the live migration in such cases. We need to add pre-migration validation to CNO.

The live migration shall be blocked for clusters with the following configuration

  • OpenShiftSDN multitenant mode.
  • Egress Router
  • cluster network or service network ranges conflict with the OVN-K internal subnets

The SD team manages many clusters. Metrics can help them monitor the status of many clusters at a time. Something similar has been done for cluster upgrades; we may want to follow the same recipe.

Feature Overview

This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed (a hedged Containerfile sketch follows this list). If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.
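A minimal Containerfile sketch of the kind of customization a customer might supply, assuming the documented pattern of layering packages on top of the pool's RHCOS base image; the base pullspec and package name are placeholders:

# Containerfile (illustrative)
FROM quay.io/example/rhel-coreos-base:latest         # placeholder: the pool's RHCOS base image
RUN rpm-ostree install some-vendor-driver && \
    ostree container commit                           # placeholder package; commit finalizes the layer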

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

Occasionally, transient network issues will cause image builds to fail. This can be remedied by adding retry capabilities to the Buildah build and push operations within the build phase, allowing these operations to be automatically retried without cluster admin intervention.

 

Done When:

  • The Buildah build and push operations have automatic retries.

Description of problem:

The CurrentImagePullSecret field on the MachineOSConfig is not being consumed by the rollout process. This is evident when the designated image registry is private and the only way to pull an image is to present a secret.

 

How reproducible:

Always

 

Steps to Reproduce:

  1. Configure a private image registry to be the designated image registry of choice for on-cluster layering.
  2. Get the pull secret for the registry and apply it to the cluster in the MCO namespace.
  3. Configure on-cluster layering to use this secret for pushing the image.
  4. Wait for the build to complete.
  5. Wait for the rollout to occur.

 

Actual results:

The node and MachineConfigPool will degrade because rpm-ostree is unable to pull the newly-built image because it does not have access to the credentials even though the MachineOSConfig has a field for them.

 

Expected results:

Rolling out the newly-built OS image should succeed.

 

Additional info:

It looks like we'll need to make the getImageRegistrySecrets() function aware of all MachineOSConfigs and pull the secrets from there. Where this could be problematic is when there are two image registries with different secrets, because the secrets are merged based on the image registry hostname. Instead, what we may want to do is have the MCD write only the contents of the referenced secret to the node's filesystem before calling rpm-ostree to consume it (see the hedged sketch below). This could potentially also reduce or eliminate the overall complexity introduced by getImageRegistrySecrets() while simultaneously resolving the concerns found under https://issues.redhat.com//browse/OCPBUGS-33803.

It is worth mentioning that even though we use a private image registry to test the rollout process in OpenShift CI, the reason it works is that it uses an ImageStream which the machine-os-puller service account and its image pull secret are associated with. This secret is surfaced to all of the cluster nodes by the getImageRegistrySecrets() process. So in effect, it may appear to be working when it does not work as intended. A way to test this would be to create an ImageStream in a separate namespace along with a separate pull secret and then attempt to use that ImageStream and pull secret within a MachineOSConfig.

Finally, to add another wrinkle to this problem: if a cluster admin wants to use a different final image pull secret for each MachineConfigPool, merging those will get more difficult. Assuming the image registries have the same hostname, the last secret merged wins, and that is the secret that gets used, which may be the incorrect secret.
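A hedged sketch of the per-node alternative described above, where the MCD materializes only the referenced secret and lets rpm-ostree consume it; the auth file path and pullspec are assumptions for illustration:

# write only the MachineOSConfig's referenced pull secret to the node (path is an assumption)
oc get secret final-image-pull-secret -n openshift-machine-config-operator \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > /etc/ostree/auth.json

# rebase to the newly built OS image using those credentials (placeholder pullspec)
rpm-ostree rebase ostree-unverified-registry:registry.example.com/ocb/os-image@sha256:<digest>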

For the custom pod builder, we have a hardcoded dependency upon the Buildah container image (quay.io/buildah/stable:latest). This causes two problems: 1) It breaks the guarantees that OpenShift makes about everything being built and stable together since that image can change at any time. 2) This makes on-cluster builds not work in a disconnected environment since that image cannot be pulled.

It is worth mentioning that we cannot simply delegate to the image that the OpenShift Image Builder uses and use a Buildah binary there. To remedy this, we'll need to decide on and implement an approach as defined below:

Approach #1: Include Buildah in the MCO image

As part of our container build process, we'll need to install Buildah. Overall, this seems pretty straightforward since the package registry we'd be installing from (the default package registry for a given OCP release) has the appropriate package versions for a given OCP release.

This has the downside that the MCO image size will increase as a result.

Approach #2: Use the OS base image

The OS base image technically has Buildah within it albeit embedded inside Podman. By using this base image, we can effectively lifecycle Buildah in lockstep with the OCP release without significant cognitive or process overhead. Basically, we'd have an e2e test that would perform a build using this image and if it passes, we can be reasonably confident that it will continue to work.

However, it is worth mentioning that I encountered significant difficulty while attempting to make this work in an unprivileged pod.

 

Done When:

  • An approach has been decided upon and implemented.

Feature Overview (aka. Goal Summary)  

Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.

 

Through a combination of TopologySpreadConstraints and the descheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools; a hedged constraint sketch follows.
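A minimal sketch of the kind of constraint the nodepool's VM (virt-launcher) pods could carry so that the descheduler can detect and correct clumping; the label identifying a nodepool's pods is an assumption:

# pod spec excerpt (illustrative)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway      # descheduler + live migration rebalance later
    labelSelector:
      matchLabels:
        hypershift.openshift.io/nodepool: my-nodepool   # assumed label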

 

Goals (aka. expected user outcomes)

VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Feature Overview (aka. Goal Summary)  

Today VMs for a single nodepool can “clump” together on a single node after the infra cluster is updated. This is due to live migration shuffling around the VMs in ways that can result in VMs from the same nodepool being placed next to each other.

 

Through a combination of TopologySpreadConstraints and the descheduler, it should be possible to continually redistribute VMs in a nodepool (via live migration) when clumping occurs. This will provide stronger HA guarantees for nodepools.

 

Goals (aka. expected user outcomes)

VMs within a nodepool should re-distribute via live migration in order to best satisfy topology spread constraints.

Feature Overview 

ETCD backup API was delivered behind a feature gate in 4.14. This feature is to complete the work for allowing any OCP customer to benefit from the automatic etcd backup capability.

The feature introduces automated backups of the etcd database and cluster resources in OpenShift clusters, eliminating the need for user-supplied configuration. This feature ensures that backups are taken and stored on each master node from the day of cluster installation, enhancing disaster recovery capabilities.

Why is it important?

The current method of backing up etcd and cluster resources relies on user-configured CronJobs, which can be cumbersome and prone to errors. This new feature addresses the following key issues:

  • User Experience: Automates backups without requiring any user configuration, improving the overall user experience.
  • Disaster Recovery: Ensures backups are available on all master nodes, significantly improving the chances of successful recovery in disaster scenarios where multiple control-plane nodes are lost.
  • Cluster Stability: Maintains cluster availability by avoiding any impact on etcd and API server operations during the backup process.

Requirements

Complete the work to auto-provision internal PVCs when using the local PVC backup option (right now, the user needs to create the PVC before enabling the service).

Out of Scope

The feature does not include saving cluster backups to remote cloud storage (e.g., S3 Bucket), automating cluster restoration, or providing automated backups for non-self-hosted architectures like Hypershift. These could be future enhancements (see OCPSTRAT-464)

 

Epic Goal*

Provide automated backups of etcd saved locally on the cluster on Day 1 with no additional config from the user.

 
Why is this important? (mandatory)

The current etcd automated backups feature requires some configuration on the user's part to save backups to a user specified PersistentVolume.
See: https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L46

Before the feature can be shipped as GA, we require the capability to save backups automatically by default without any configuration. This would help all customers have an improved disaster recovery experience by always having a reasonably recent backup (the current tech-preview API is sketched below).
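For context, a hedged sketch of the configuration the tech-preview API currently expects (field names follow the linked types_backup.go; values are placeholders), which this epic aims to make unnecessary by default:

apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
spec:
  etcd:
    schedule: "0 */6 * * *"          # placeholder cron schedule
    timeZone: "UTC"
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5
    pvcName: etcd-backup-pvc         # the user-created PVC this epic wants to stop requiring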

 
Scenarios (mandatory) 

  • After a cluster is installed the etcd-operator should take etcd backups and save them to local storage.
  • The backups must be pruned according to a "reasonable" default retention policy so it doesn't exhaust local storage.
  • A warning alert must be generated upon failure to take backups.

Implementation details:
One issue we need to figure out during the design of this feature is how the current API might change as it is inherently tied to the configuration of the PVC name.
See:
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L99
and 
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/operator/v1alpha1/types_etcdbackup.go#L44

Additionally, we would need to figure out how the etcd-operator knows about the available space on the host's local storage so it can prune and spread backups accordingly.
 

Dependencies (internal and external) (mandatory)

Depends on changes to the etcd-operator and the tech preview APIs 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs team
  • QE - Sandeep Kundu
  • PX - 
  • Others -

Acceptance Criteria (optional)

Upon installing a tech-preview cluster backups must be saved locally and their status and path must be visible to the user e.g on the operator.openshift.io/v1 Etcd cluster object.

An e2e test to verify that the backups are being saved locally with some default retention policy.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing -  Basic e2e automationTests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

As a developer, I want to implement the logic of etcd backup so that backups are taken without configuration.

Feature description

oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:

  • Manage complex air-gapped scenarios, providing support for the enclaves feature
  • Faster and more robust: introduces caching, it doesn’t rebuild catalogs from scratch
  • Improves code maintainability, making it more reliable and easier to add features and fixes, and includes a feature plugin interface

 

Feature Overview (aka. Goal Summary)  

Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.

Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.

We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.

Goals (aka. expected user outcomes)

As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.

As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

TBD
 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • ...

Why is this important?

Scenarios

\

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

cloud-provider-openstack is not the only service needing access to the cloud credentials. The list also includes:

  • image-registry (Swift and Glance access)
  • cloud-network-config-controller (Neutron and Nova access for EgressIPs support)
  • CSIs (Cinder and Manila access)
  • Ingress (it is not clear why it needs the credentials, but it is configured on other platforms)

Normally this is solved by cloud-credentials-operator, but in HyperShift we don't have it. hosted-control-plane-operator needs to take care of this alone. The code goes here: https://github.com/openshift/hypershift/blob/1af078fe4b9ebd63a9b6e506f03abc9ae6ed4edd/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1156

We also need to pass CA here! It might be non-trivial!

  • Run e2e hypershift conformance against a hosted cluster
  • Get it passing

Note: this is a first iteration and we might improve how we do things later.

For example, I plan to work in openshift/release for this one. In the future we might add it directly to the HCP e2e framework.

This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.

When the management cluster runs on AWS, make sure we update the DNS record for *apps, so ingress can work out of the box.

Currently, CCM is not configured with a floating IP network:

oc get cm openstack-cloud-config -n clusters-openstack -o yaml

We need to change that because, if a cloud has multiple external networks, CCM needs to know where to create the floating IPs, especially since the user can specify which external network they want to use with the --openstack-external-network cluster create CLI argument (a hedged sketch of the expected config follows).
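A hedged sketch of what the ConfigMap could carry so CCM knows which external network to use; the cloud.conf key shown ([LoadBalancer] floating-network-id) follows the upstream cloud-provider-openstack convention, while the data key layout and UUID are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: openstack-cloud-config
  namespace: clusters-openstack
data:
  cloud.conf: |
    [LoadBalancer]
    floating-network-id=2a3b4c5d-1111-2222-3333-444455556666   # placeholder external network UUID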

HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.

Combining the different tasks in this EPIC, we use that Jira task to track the first PR that is being submitted to Hypershift.

In HyperShift, cluster-api is responsible for deploying cluster resources on the cloud and the Machines for worker nodes. This means we need to configure and deploy CAPO.

As CAPO is responsible for deploying OpenStack resources for each hosted cluster, we need a way to translate a HostedCluster into an OpenStackCluster, so that CAPO will take care of creating everything for the cluster. This goes here: https://github.com/openshift/hypershift/pull/3243/files#diff-0de67b02a8275f7a8227b3af5101786e70e5c651b852a87f6160f6563a9071d6R28-R31

The probably tedious part is SGs; let's make sure we understand them.

Feature Overview 

Support using a proxy in oc-mirror.

Why this is important?

We have customers who want to use Docker Registries as proxies / pull-through cache's.

This means that customers would need a way to get the ICSP/IDMS/ITMS and the image list, which seems relevant to the "generating mapping files" work for the V2 tooling. We would like to make sure this is addressed in your use cases (a hedged IDMS sketch follows the quote below).

From our IBM sync

"We have customers who want to use Docker Registries as proxies / pull-through cache's. This means that customers would need a way to get the ICSP/IDMS/ITMS and image list which seems relevant to the "generating mapping files" for “V2 tooling”. Would like to make sure this is addressed in your use cases."

Description of problem:

When recovering signatures for releases, the HTTP connection doesn't use the system proxy configuration.
    

Version-Release number of selected component (if applicable):

WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407100906.p0.g75da281.assembly.stream.el9-75da281", GitCommit:"75da281989a147ead237e738507bbd8cec3175e5", GitTreeState:"clean", BuildDate:"2024-07-10T09:48:28Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

Always

Image set config:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.16
    - name: stable-4.15
    

Steps to Reproduce:

    1. Run oc-mirror with the above ImageSetConfiguration in mirror-to-mirror mode, in an environment that requires a proxy

    

Actual results:

2024/07/15 14:02:11  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/07/15 14:02:11  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/07/15 14:02:11  [INFO]   : ⚙️  setting up the environment for you...
2024/07/15 14:02:11  [INFO]   : 🔀 workflow mode: mirrorToMirror
2024/07/15 14:02:11  [INFO]   : 🕵️  going to discover the necessary images...
2024/07/15 14:02:11  [INFO]   : 🔍 collecting release images...
I0715 14:02:11.770186 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6
2024/07/15 14:02:12  [INFO]   : detected minimum version as 4.16.1
2024/07/15 14:02:12  [INFO]   : detected minimum version as 4.16.1
I0715 14:02:12.321748 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6
I0715 14:02:12.485330 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6
2024/07/15 14:02:12  [INFO]   : detected minimum version as 4.15.20
2024/07/15 14:02:12  [INFO]   : detected minimum version as 4.15.20
I0715 14:02:12.844366 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.16&channel=stable-4.16&channel=stable-4.15&channel=stable-4.15&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6&id=3808de83-dfe4-42f6-8d5b-196ed1b5bbc6
I0715 14:02:13.115004 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000
I0715 14:02:13.784795 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20
I0715 14:02:13.965936 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20
I0715 14:02:14.136625 2475426 core-cincinnati.go:477] Using proxy 10.18.7.5:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&arch=amd64&arch=amd64&arch=amd64&channel=stable-4.15&channel=stable-4.15&channel=stable-4.16&channel=stable-4.16&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&id=00000000-0000-0000-0000-000000000000&version=4.15.20&version=4.16.1
W0715 14:02:14.301982 2475426 core-cincinnati.go:282] No upgrade path for 4.15.20 in target channel stable-4.16
2024/07/15 14:02:14  [ERROR]  : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=c17d4489c1b283ee71c76dda559e66a546e16b208a57eb156ef38fb30098903a/signature-1": dial tcp: lookup mirror.openshift.com: no such host
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d02c56]

goroutine 1 [running]:
github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({{0x54bb930, 0xc000c80738}, {{{0x4c6edb1, 0x15}, {0xc00067ada0, 0x1c}}, {{{...}, {...}, {...}, {...}, ...}, ...}}, ...}, ...)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:96 +0x656
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc000fdc000, {0x54aef28, 0x74cf1c0})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:208 +0x70a
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000184e00, {0x54aef28, 0x74cf1c0})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:62 +0x47f
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ace000, {0x54aef28, 0x74cf1c0})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:942 +0x115
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToMirror(0xc000ace000, 0xc0007a5800, {0xc000f3f038?, 0x17dcbb3?, 0x2000?})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:748 +0x73
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ace000, 0xc0004f9730?, {0xc0004f9730?, 0x0?, 0x0?})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:443 +0x1b6
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc000ad0e00?, {0xc0004f9730, 0x1, 0x7})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:203 +0x32a
github.com/spf13/cobra.(*Command).execute(0xc0007a5800, {0xc000052110, 0x7, 0x7})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc0007a5800)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x72d7738?)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
	/go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
    

Expected results:

Signature retrieval honours the configured system proxy instead of dialing mirror.openshift.com directly, and a failed signature fetch surfaces as an error rather than a nil-pointer panic.
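A minimal sketch of a proxy-aware client, assuming the fix is simply to wire the standard proxy environment handling into the transport used for signature requests; newSignatureClient is an illustrative name, not an actual oc-mirror symbol:

package main

import (
	"net/http"
	"time"
)

func newSignatureClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// Without this, requests to mirror.openshift.com bypass the
			// system proxy and fail in proxied environments with errors like
			// the "no such host" seen above.
			Proxy: http.ProxyFromEnvironment,
		},
	}
}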

Additional info:


    

Goal

Stop using the openshift/installer-aro repo during installation of ARO clusters. installer-aro is a fork of openshift/installer with carried patches. Currently it is vendored into openshift/installer-aro-wrapper in place of the upstream installer.

Benefit Hypothesis

Maintaining this fork requires considerable resources from the ARO team and results in delays in offering new OCP releases through ARO. Removing the fork will eliminate the work involved in keeping it up to date.

Resources

https://docs.google.com/document/d/1xBdl2rrVv0EX5qwhYhEQiCLb86r5Df6q0AZT27fhlf8/edit?usp=sharing

It appears that the only work required to complete this is to move the additional assets that installer-aro adds for the purpose of adding data to the ignition files. These changes can be directly added to the ignition after it is generated by the wrapper. This is the same thing that would be accomplished by OCPSTRAT-732, but that ticket involves adding a Hive API to do this in a generic way.

Responsibilities

The OCP Installer team will contribute code changes to installer-aro-wrapper necessary to eliminate the fork. The ARO team will review and test changes.

Success Criteria

The fork repo is no longer vendored in installer-aro-wrapper.

Results

Add results here once the Initiative is started. Recommend discussions & updates once per quarter in bullets.

 

Epic Goal

  • Eliminate the need to use the openshift/installer-aro fork of openshift/installer during the installation of an ARO cluster.

Why is this important?

  • Maintaining the fork is time-consuming for the ARO team and causes delays in rolling out new releases of OpenShift to ARO.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CORS-1888
  2. CORS-2743
  3. CORS-2744
  4. https://github.com/openshift/installer/pull/7600/files

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The installer makes heavy use of its data/data directory, which contains hundreds of files in various subdirectories that are mostly used for inserting into ignition files. From these files, autogenerated code is created that embeds their contents in the installer binary.

Unfortunately, subdirectories that do not contain .go files are not regarded as Go packages and are therefore not included when building the installer as a library: https://go.dev/wiki/Modules#some-needed-files-may-not-be-present-in-populated-vendor-directory

This is currently handled in the installer fork repo by deleting the compile-time autogeneration and instead doing a one-time autogeneration that is checked in to the repo: https://github.com/openshift/installer-aro/pull/27/commits/26a5ed5afe4df93b6dde8f0b34a1f6b8d8d3e583

Since this does not exist in the upstream installer, we will need some way to copy the data/data associated with the current installer version into the wrapper repo; we should probably encapsulate this in a make vendor target (a sketch of such a copy step follows). The wiki page above links to https://github.com/goware/modvendor, which unfortunately doesn't work because it assumes you know the file extensions of all of the files (e.g. .c, .h) and it can't handle directory names matching the glob. We could probably fix this easily by forking the tool and teaching it to ignore directories in the source. Alternatively, John Hixson has a script that can do something similar.
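An illustrative sketch only (not existing tooling) of what the copy step behind such a make vendor target could do: walk the vendored installer's data/data tree and copy every file, regardless of extension, into the wrapper repo. Paths and the copyDataAssets name are assumptions:

package main

import (
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func copyDataAssets(srcRoot, dstRoot string) error {
	return filepath.WalkDir(srcRoot, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(srcRoot, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dstRoot, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		// Copy every file verbatim; unlike modvendor we do not need to know
		// the set of file extensions in advance.
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}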

Currently the Azure client can only be mocked in unit tests of the pkg/asset/installconfig/azure package. Using the mockable interface consistently and adding a public interface to set it up will allow other packages to write unit tests for code involving the Azure client.
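A hedged sketch of the pattern described above: expose a small interface for the Azure API surface the installer needs, plus a public hook to substitute a mock in unit tests. The API interface, SetClientFactory, and package layout are illustrative, not the actual pkg/asset/installconfig/azure symbols:

package azure

import "context"

// API is the subset of Azure calls that install-config validation needs.
type API interface {
	GetGroup(ctx context.Context, groupName string) (string, error)
}

// clientFactory is resolved lazily so unit tests in other packages can inject a mock.
var clientFactory = func() (API, error) { return newLiveClient() }

// SetClientFactory lets other packages substitute a fake API in unit tests.
func SetClientFactory(f func() (API, error)) { clientFactory = f }

func newLiveClient() (API, error) {
	// The real implementation would wrap the Azure SDK; omitted in this sketch.
	return nil, nil
}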

Feature Overview (aka. Goal Summary)  

Add support to GCP N4 Machine Series to be used as Control Plane and Compute Nodes when deploying Openshift on Google Cloud

Goals (aka. expected user outcomes)

As a user, I want to deploy OpenShift on Google Cloud using N4 Machine Series for the Control Plane and Compute Node so I can take advantage of these new Machine types

Requirements (aka. Acceptance Criteria):

OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes

 

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: both
  • Classic (standalone cluster):
  • Hosted control planes:
  • Multi node, Compact (three node), or Single node (SNO), or all: all
  • Connected / Restricted Network:
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x):
  • Operator compatibility:
  • Backport needed (list applicable versions):
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM):
  • Other (please specify):

Background

Google has made the N4 Machine Series available in their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported.

Documentation Considerations

The documentation will be updated to add the new disk type that needs to be supported as part of this enablement. The N4 Machine Series will also be added to the tested machine types for deploying OpenShift on Google Cloud.

Epic Goal

Why is this important?

  • This is a new Machine Series Google has introduced that customers will use for their OpenShift deployments

Scenarios

  1. Deploy an OpenShift Cluster with both the Control Plane and Compute Nodes running on N4 GCP Machines

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/CORS-3561

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code by moving all CSI driver operators into a single repo. Having a common repo across drivers will ease the maintenance burden.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: yes
  • Classic (standalone cluster): yes
  • Hosted control planes: all
  • Multi node, Compact (three node), or Single node (SNO), or all: all
  • Connected / Restricted Network: all
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
  • Operator compatibility:
  • Backport needed (list applicable versions): no
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): no
  • Other (please specify):

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

N/A; this effort includes all the CSI operators Red Hat manages as part of OCP.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

This effort started with the CSI operators that we included for HCP; we want to align all CSI operators to use the same approach in order to limit maintenance efforts.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Not customer facing, this should not introduce any regression.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

No doc needed

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

N/A, it's purely tech debt / internal

Epic Goal*

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

 
Why is this important? (mandatory)

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code by moving all CSI driver operators into a single repo.

 
Scenarios (mandatory) 

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

Note: we do not plan to do any changes for HyperShift. The EFS CSI driver will still fully run in the guest cluster, including its control plane.

Dependencies (internal and external) (mandatory)

None, this can be done just by the storage team and independently on other operators / features.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview (aka. Goal Summary)  

Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.

Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.

Goals (aka. expected user outcomes)

Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
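A hedged sketch of how this could be measured on a node (this is not the actual MCO implementation): parse rpm-ostree status --json and count layered packages in the booted deployment that are not CoreOS extensions. The JSON field names and the isExtension lookup are assumptions for illustration:

package main

import (
	"encoding/json"
	"os/exec"
)

type rpmOstreeStatus struct {
	Deployments []struct {
		Booted   bool     `json:"booted"`
		Packages []string `json:"packages"`
	} `json:"deployments"`
}

func locallyLayeredPackages(isExtension func(string) bool) ([]string, error) {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return nil, err
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		return nil, err
	}
	var layered []string
	for _, d := range status.Deployments {
		if !d.Booted {
			continue
		}
		for _, pkg := range d.Packages {
			// Skip packages that correspond to supported CoreOS extensions.
			if !isExtension(pkg) {
				layered = append(layered, pkg)
			}
		}
	}
	return layered, nil
}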

Requirements (aka. Acceptance Criteria):

This needs to be backported to 4.14 so we have a better sense of the fleet as it is.

4.12 might be useful as well, but is optional.

Questions to Answer (Optional):

Why not simply block upgrades if there are locally layered packages?

That is indeed an option. This card is only about gathering data.

Customer Considerations

Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.

Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521

 

Feature Overview (aka. Goal Summary)  

Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.

Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.

Goals (aka. expected user outcomes)

Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.

Requirements (aka. Acceptance Criteria):

This needs to be backported to 4.14 so we have a better sense of the fleet as it is.

4.12 might be useful as well, but is optional.

Questions to Answer (Optional):

Why not simply block upgrades if there are locally layered packages?

That is indeed an option. This card is only about gathering data.

Customer Considerations

Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.

Feature Overview

Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:

  • Caching the RHCOS image for faster node addition (i.e. no extraction of the image every time)
  • Add a single node with just one command, no need to write config files describing the node
  • Support creating PXE artifacts 

Goals

Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.

Epic Goal

  • Cleanup/carryover work from AGENT-682 and WRKLDS-937 that were non-urgent for GA of the day 2 implementation

In order to cover simpler scenarios (i.e. adding just one node without any static networking configuration), it could be useful for the user to provide the minimum required input via option flags on the command line rather than providing a full-fledged nodes-config.yaml file.
Internally, the oc command will take care of always generating the required nodes-config.yaml that is passed to the node-joiner tool.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

A set of capabilities needs to be added to the Hypershift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Build capabilities into HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Antoni Segura Puimedon Please help with providing what Hypershift will need on the OCPSTRAT side.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: (perhaps) both
  • Classic (standalone cluster):
  • Hosted control planes: yes
  • Multi node, Compact (three node), or Single node (SNO), or all:
  • Connected / Restricted Network:
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and Arm
  • Operator compatibility:
  • Backport needed (list applicable versions): 4.14+
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): no (this is an advanced feature not being exposed via web-UI elements)
  • Other (please specify): ROSA w/ HCP

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

"Shared VPCs" are a unique AWS infrastructure design: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

See prior work/explanations/etc here: https://issues.redhat.com/browse/SDE-1239

 

Summary is that in a Shared VPC environment, a VPC is created in Account A and shared to Account B. The owner of Account B wants to create a ROSA cluster, however Account B does not have permissions to create a private hosted zone in the Shared VPC. So they have to ask Account A to create the private hosted zone and link it to the Shared VPC. OpenShift then needs to be able to accept the ID of that private hosted zone for usage instead of creating the private hosted zone itself.

QE should have some environments or testing scripts available to test the Shared VPC scenario

 

The AWS endpoint controller in the CPO currently uses the control plane operator role to create the private link endpoint for the hosted cluster, as well as the corresponding DNS records in the hypershift.local hosted zone. If a role is created to allow it to create that VPC endpoint in the VPC owner's account, the controller would have to explicitly assume that role so it can create the VPC endpoint, and potentially a separate role for populating DNS records in the hypershift.local zone.

The users would need to create a custom policy to enable this. A sketch of the role assumption follows.
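A hedged sketch of the role assumption using aws-sdk-go-v2; the role ARN, endpoint parameters, and function name are illustrative, and the real CPO code may differ:

package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func createEndpointInOwnerAccount(ctx context.Context, sharedVPCRoleARN, vpcID, serviceName string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	// Exchange the control-plane-operator identity for the cross-account role
	// that is allowed to act in the VPC owner's account.
	assumed := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), sharedVPCRoleARN)
	cfg.Credentials = aws.NewCredentialsCache(assumed)

	ec2Client := ec2.NewFromConfig(cfg)
	_, err = ec2Client.CreateVpcEndpoint(ctx, &ec2.CreateVpcEndpointInput{
		VpcId:       aws.String(vpcID),
		ServiceName: aws.String(serviceName),
	})
	return err
}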

Add the necessary API fields to support a Shared VPC infrastructure, and enable development/testing of Shared VPC support by adding the Shared VPC capability to the hypershift CLI.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Currently the same security group (SG) is used for both the workers and the VPC endpoint. Create a separate SG for the VPC endpoint and open only the necessary ports on each.

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a HC consumer, I expect the API UX to be meaningful and coherent,
i.e. writes to immutable fields should fail at the API level when possible.
ServicePublishingStrategy Name and Type should be made immutable via CEL (a sketch follows).
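A hedged illustration of enforcing this with CEL transition rules via kubebuilder markers; the simplified type below is not the actual HyperShift API definition:

package v1beta1

type ServicePublishingStrategyMapping struct {
	// Service is immutable once set.
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="service is immutable"
	Service string `json:"service"`

	// Type is immutable once set.
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="type is immutable"
	Type string `json:"type"`
}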

This requires/does not require a design proposal.
This requires/does not require a feature gate.

placeholder for linking all stories aimed to improve hypershift CI:

  • Reduce failure rate
  • Improve signal to root cause
  • Increase coverage

Description of problem:

    HyperShift e2e tests will start out with jUnit available but end up without it, making it hard to read the results.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

We can improve the robustness, maintainability, and developer experience of asynchronous assertions in end-to-end tests by adhering to a couple of rules (a sketch follows the list):

  • Bounded Lifecycle: every asynchronous assertion should have an explicit bound for when we expect it to have passed
  • Explain What You're Doing: every major code-path should log what it's doing - a user waiting on an assertion should never be left wondering what's going on
  • Terse Output: "don't say the same thing twice" - only output when something has changed
  • Keep Track Of Time: at a minimum, output how long the assertion took to run its course; if it's useful, output the durations between state deltas
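A minimal sketch of these rules using Gomega, assuming the e2e helpers can wrap gomega.Eventually; the waitFor helper below is illustrative, not an existing HyperShift test utility:

package e2e

import (
	"testing"
	"time"

	"github.com/onsi/gomega"
)

func waitFor(t *testing.T, what string, timeout time.Duration, check func() (bool, string)) {
	t.Helper()
	g := gomega.NewWithT(t)
	start := time.Now()
	// Bounded lifecycle plus a single up-front explanation of what we wait for.
	t.Logf("waiting up to %s for %s", timeout, what)

	last := ""
	g.Eventually(func() bool {
		ok, state := check()
		if state != last {
			// Terse output: only log when the observed state changes.
			t.Logf("%s: %s", what, state)
			last = state
		}
		return ok
	}).WithTimeout(timeout).WithPolling(5 * time.Second).Should(gomega.BeTrue())

	// Keep track of time: report how long the assertion took.
	t.Logf("%s after %s", what, time.Since(start).Round(time.Second))
}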

Description of problem:

    HyperShift E2E tests have so many files in the artifacts buckets in GCS that the pages in Deck load super slowly. Using a .tar.gz for must-gather content like the OCP E2Es do will improve this significantly.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Our tests are pretty chatty today. This makes it exceedingly difficult to determine what's going on and, when something fails, what went wrong. We need to do a full pass through them to enact changes to the following ends:

  • don't repeat yourself - if code is making an asynchronous assertion or polling to make a change, never output the same message more than once
  • honor $EVENTUALLY_VERBOSE - when running in CI, our tests should emit the minimal number of log lines (see the sketch after this list)
    • any log line that's announcing that the test will do something next should be quiet by default (${EVENTUALLY_VERBOSE} != "false") and only emit when running locally and the dev wants a streaming output
    • any log line that's announcing expected or wanted results, as opposed to failures or anomalies, should be quiet by default
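A sketch of honoring $EVENTUALLY_VERBOSE, assuming a shared test logging helper; the exact variable semantics here (verbose only when explicitly enabled) are a simplification of the behaviour described above:

package e2e

import (
	"os"
	"testing"
)

func logVerbose(t *testing.T, format string, args ...interface{}) {
	t.Helper()
	// In CI we keep output minimal; a developer can opt in to streaming logs.
	if os.Getenv("EVENTUALLY_VERBOSE") == "true" {
		t.Logf(format, args...)
	}
}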

Goal

As a dev I want the base code to be easier to read, maintain and test

Why is this important?

If devs don't have a healthy dev environment, the project will suffer and the business won't make money.

Scenarios

  1. ...

Acceptance Criteria

  • 80% unit tested code
  • No file > 1000 lines of code

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:

  •  Have a hotfix process that is customer/support-exception targeted rather than fleet targeted
  • Can take weeks to be available for Managed OpenShift

This feature seeks to provide mechanisms that bring the upper time bound for delivering such fixes in line with the current HyperShift Operator expectation of <24h.

Goals (aka. expected user outcomes)

  • Hosted Control Plane fixes are delivered through Konflux builds
  • No additional upgrade edges
  • Release specific
  • Adequate, fleet representative, automated testing coverage
  • Reduced human interaction

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Overriding Hosted Control Plane components can be done automatically once the PRs are ready and the affected versions have been properly identified
  • Managed OpenShift Hosted Clusters have their Control Planes fix applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident it is solving
  • Fix can be promoted through integration, stage and production canary with a good degree of observability

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: managed (ROSA and ARO)
  • Classic (standalone cluster): No
  • Hosted control planes: Yes
  • Multi node, Compact (three node), or Single node (SNO), or all: All supported ROSA/HCP topologies
  • Connected / Restricted Network: All supported ROSA/HCP topologies
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): All supported ROSA/HCP topologies
  • Operator compatibility: CPO and Operators depending on it
  • Backport needed (list applicable versions): TBD
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): No
  • Other (please specify): No

Use Cases (Optional):

  • Incident response when the engineering solution is partially or completely in the Hosted Control Plane side rather than in the HyperShift Operator

Out of Scope

  • HyperShift Operator binary bundling

Background

Discussed previously during incident calls. Design discussion document

Customer Considerations

  • Because the Managed Control Plane version does not change but it is overridden, customer visibility and impact should be limited as much as possible.

Documentation Considerations

SOP needs to be defined for:

  • Requesting and approving the fleet wide fixes described above
  • Building and delivering them
  • Identifying clusters with deployed fleet wide fixes

Goal

  • Have a Konflux build for every supported branch on every pull request / merge that modifies the Control Plane Operator

Why is this important?

  • In order to build the Control Plane Operator images to be used for management cluster wide overrides.
  • To be able to deliver managed Hosted Control Plane fixes to managed OpenShift with a similar SLO as the fixes for the HyperShift Operator.

Scenarios

  1. A PR that modifies the control plane in a supported branch is posted for a fix affecting managed OpenShift

Acceptance Criteria

  • Dev - Konflux application and component per supported release
  • Dev - SOPs for managing/troubleshooting the Konflux Application
  • Dev - Release Plan that delivers to the appropriate AppSre production registry
  • QE - HyperShift Operator versions that encode an override must be tested with the CPO Konflux builds that they make

Dependencies (internal and external)

  1. Konflux

Previous Work (Optional):

  1. HOSTEDCP-2027

Open questions:

  1. Antoni Segura Puimedon  How long or how many times should the CPO override be tested?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Konflux App link: <link to Konflux App for CPO>
  • DEV - SOP: <link to meaningful PR or GitHub Issue>
  • QE - Test plan in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance criteria:

  • Workspace: crt-redhat-acm
  • Components: One per supported branch
  • Separate Containerfile
  • Should only build for area/control-plane-operator

Feature Overview

The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments. 

Goals

The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but has also been useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.

Key enhancements include observability and blocking traffic on affected paths if IPsec encryption is not functioning properly.

Requirements

 

  • Requirement: CI - MUST be running successfully with test automation. Notes: This is a requirement for ALL features. isMvp?: YES
  • Requirement: Release Technical Enablement. Notes: Provide necessary release enablement details and documents. isMvp?: YES

Questions to answer…

  •  

Out of Scope

  • Configuration of external-to-cluster IPsec endpoints for N-S IPsec. 

Background, and strategic fit

The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default.  This encryption must scale to the largest of deployments. 

Assumptions

  •  

Customer Considerations

  • Customers require the option to use their own certificates or CA for IPsec. 
  • Customers require observability of configuration (e.g. is the IPsec tunnel up and passing traffic)
  • If the IPsec tunnel is not up or otherwise functioning, traffic across the intended-to-be-encrypted network path should be blocked. 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Feature Overview (aka. Goal Summary)  

Have Hosted Clusters' cloud resource cleanup be managed directly by HyperShift instead of delegating to the operators that run in the Hosted Control Plane, so that we can achieve better SLO performance and more control over what fails to delete.

Goals (aka. expected user outcomes)

  • Quicker cloud resource deletion
  • Information about resources that can't be deleted

Requirements (aka. Acceptance Criteria):

  • Comply with the cloud resource deletion SLO
  • Metrics agreed with SRE on resources that can't be deleted

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: both
  • Classic (standalone cluster): no
  • Hosted control planes: yes
  • Multi node, Compact (three node), or Single node (SNO), or all: All supported Hosted Control Planes topologies and configurations
  • Connected / Restricted Network: All supported Hosted Control Planes topologies and configurations
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): All supported Hosted Control Planes topologies and configurations
  • Operator compatibility: N/A
  • Backport needed (list applicable versions): No
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): Maybe the failure to delete resources could be shown in the console.
  • Other (please specify):

Use Cases (Optional):

  • Hosted Cluster is successfully deleted within SLO time
  • Hosted Cluster fails to have some resources deleted (typically due to permissions changes) and emits metrics to make it observable.

Out of Scope

Background

The OpenShift installer and Hive manage it this way

Customer Considerations

We need to come up with the right level of granularity for the emitted metrics and the right UX to show it

Documentation Considerations

The metrics and the UX need to be documented. An SOP for tracking failures should be written.

Interoperability Considerations

ROSA/HCP and ARO/HCP

User Story:

As a Hosted Cluster admin, I want to be able to:

  • Delete hosted clusters in the minimum time

so that I can achieve

  • Minimum cloud resource consumption

Service provider achieves

  • Better UX
  • Less computation related to deleting resources

Acceptance Criteria:

Description of criteria:

  • HyperShift directly manages resource deletion
  • Resource deletion failures alert the customer (especially important for billable items)

Out of Scope:

Cloud resource deletion throttling detection

Engineering Details:

  • Currently cloud resource cleanup is delegated to operators that run in the hosted control plane (registry operator cleans up its bucket, ingress operator removes additional dns entries, cloud controller manager removes load balancers and persistent volumes, etc). The benefit with this approach is that we don't need cloud-specific code in the CPO to destroy resources. The drawback is that this cleanup can sometimes take a long time and depends on the hosted cluster's API server to be in a healthy state.
  • A different approach, which could make this process faster, is to directly destroy resources in a similar way to `openshift-installer destroy cluster` or even `hypershift destroy cluster infra`. Instead of waiting for controllers to do the right thing, we can destroy resources directly. This would make the process more straightforward and likely much faster (a sketch follows this list).
  • One consideration with this approach is that unlike the CLI tools, the CPO doesn't have a single role that can destroy all resources. We would have to access AWS with different operator roles to destroy the different types of resources. This can be done via API calls similar to what the token-minter command makes to obtain tokens for the different service accounts.
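A hedged sketch of that direct-destroy flow: iterate the resource types HyperShift knows about, delete each with the operator role that owns it, and collect what could not be deleted so it can be surfaced as metrics. Types and names are illustrative only:

package main

import (
	"context"
	"fmt"
)

type resourceDestroyer struct {
	name    string // e.g. "image-registry bucket"
	roleARN string // operator role able to delete this resource type
	destroy func(ctx context.Context, roleARN string) error
}

func destroyHostedClusterResources(ctx context.Context, destroyers []resourceDestroyer) []error {
	var failures []error
	for _, d := range destroyers {
		if err := d.destroy(ctx, d.roleARN); err != nil {
			// Do not abort: collect failures so they can be reported and
			// emitted as metrics (especially important for billable items).
			failures = append(failures, fmt.Errorf("%s: %w", d.name, err))
		}
	}
	return failures
}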

Feature Overview 

OpenShift relies on internal certificates for communication between components, with automatic rotations ensuring security. For critical components like the API server, rotations occur via a rollout process, replacing certificates one instance at a time.

In clusters with high transaction rates and SNO, this can lead to transient errors for in-flight transactions during the transition.

This feature ensures seamless TLS certificate rotations in OpenShift, eliminating downtime for the Kubernetes API server during certificate updates, even under heavy loads or in SNO deployments.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Goal

Provide a way to install NodePools with varied architectures on the Agent platform.

Why is this Important?

This feature is important to enable workloads with different architectures (IBM Power/Z in this case) within the same Hosted Cluster.

Scenarios

  • Cross-Architecture Use Case: A user has an x86 Hosted Cluster and wants to include at least one NodePool running with a ppc64le architecture.
  • Multi-Architecture Workload Support: Scenarios where workloads demand compatibility with both x86 and alternative architectures like ARM64 or PowerPC.

Acceptance Criteria

Development (DEV):

  • A valid enhancement is implemented if necessary, following community or internal guidelines.
  • All upstream code and tests related to this feature are merged successfully.
  • Upstream documentation is submitted and approved.

Continuous Integration (CI):

  • All CI pipelines must run successfully with automated tests covering new functionality.
  • Existing multi-architecture CI jobs are not broken by this enhancement.

Quality Engineering (QE):

  • Polarion test plans are created and cover all necessary test cases.
  • Automated tests are implemented and merged.
  • Verification of documentation is completed during QE testing.

Dependencies (Internal and External)

  • Management Cluster: Requires multi-architecture payload images for compatibility.
  • Target Architecture: Architecture must already be included in the OCP payload for the feature to be functional.
  • Multi-Cluster Engine (MCE): Must have builds compatible with the architecture used by worker nodes in the management cluster.

Add a doc to explain the steps for creating heterogeneous node pools on the Agent platform.

Goal:
Provide a Technical Preview of Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.

Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The pluggable nature of the Gateway API implementation enables support for additional and optional 3rd-party Ingress technologies.

At its core, OpenShift's implementation of Gateway API will be based on the existing Cluster Ingress Operator and OpenShift Service Mesh (OSSM). The Ingress Operator will manage the Gateway API CRDs (gatewayclasses, gateways, httproutes), install and configure OSSM, and configure DNS records for gateways. OSSM will manage the Istio and Envoy deployments for gateways and configure them based on the associated httproutes. Although OSSM in its normal configuration does support service mesh, the Ingress Operator will configure OSSM without service mesh features enabled; for example, using Gateway API will not require the use of sidecar proxies. Istio will be configured specifically to support Gateway API for cluster ingress. See the gateway-api-with-cluster-ingress-operator enhancement proposal for more details.
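
To make the API surface concrete, a minimal Gateway API configuration looks roughly like the following sketch; the resources are standard gateway.networking.k8s.io/v1 kinds, while the GatewayClass controllerName shown is an assumption about what the Istio-based implementation registers.

    apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    metadata:
      name: openshift-default
    spec:
      controllerName: openshift.io/gateway-controller   # assumption: controller name claimed by the OSSM/Istio implementation
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: example-gateway
      namespace: openshift-ingress
    spec:
      gatewayClassName: openshift-default
      listeners:
      - name: http
        protocol: HTTP
        port: 80
        allowedRoutes:
          namespaces:
            from: All
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: example-route
      namespace: demo
    spec:
      parentRefs:
      - name: example-gateway
        namespace: openshift-ingress
      rules:
      - backendRefs:
        - name: demo-service
          port: 8080

In this model the Ingress Operator would configure DNS for the gateway's addresses, while OSSM/Istio programs the Envoy deployment for the attached httproutes, as described above.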

Epic Goal

  • Add Gateway API via Istio Gateway implementation as Tech Preview in 4.x

Problem: As an administrator, I would like to securely expose cluster resources to remote clients and services while providing a self-service experience to application developers.

Tech Preview:  A feature is implemented as Tech Preview so that developers can issue an update to the Dev Preview MVP and:

  • can still change APIs that are clearly indicated as tech preview, without following a deprecating or backwards compatibility process.
  • are not required to fix bugs customers uncover in your TP feature.
  • do not have to provide an upgrade path from a customer using your TP feature to the GA version of your feature.
  • TBD - must still support upgrading the cluster and your component, but it’s ok if the TP feature doesn’t work after the upgrade.
  • still need to provide docs (which should make it clear the feature is tech preview)
  • still need to provide education to CEE about the feature (tech enablement)
  • must also follow Red Hat's support policy for tech preview
    From https://github.com/openshift/enhancements/blob/master/guidelines/techpreview.md

Why is this important?

  • Reduces the burden on Red Hat developers to maintain IngressController and Route custom resources
  • Brings OpenShift ingress configuration more in line with standard Kubernetes APIs
  • Demonstrates Red Hat’s leadership in the Kubernetes community.

Scenarios

  1. ...

Acceptance Criteria (draft)

  • Gateway API and Istio Gateway are in an acceptable standing for Tech Preview
  • Now that we've decided on single control plane (shared between OSSM and Network Edge functionality), complete the feature in collaboration with OSSM
  • Decide on whether we can make an existing OSSM control plane work when the GWAPI feature is enabled on the cluster
  • Decide if the cluster admin can/should configure SMCP, what options are exposed, and API for configuration
  • Document limitations and collect information to plan future work needed to accommodate HyperShift architecture - in OSSM and elsewhere
  • Initial security model
  • Enhancement Proposals, Migration details, Tech Enablement, and other input for QA and Docs as needed
  • Web console
  • Must-gather updates
  • CI, E2E tests on GA OSSM
  • Metrics/Telemetry as needed
  • Installation, Upgrade details (keep OSSM and c-i-o in sync)
  • [stretch] oc updates

Dependencies (internal and external)

  1. OSSM release schedule aligned with OpenShift's cadence, or workaround designed
  2. ...tbd

Previous Work (Optional):

  1. https://issues.redhat.com/browse/NE-993
  2. https://issues.redhat.com/browse/NE-1036

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a user, I want the ingress operator to automatically recover when a GWAPI CRD is deleted, instead of having to manually re-add the CRD.

Currently, if you delete one or more of the GWAPI CRDs, they are not recreated until you restart the ingress operator.

  • Put a watch on the CRDs so they get recreated without restarting the ingress operator.

 

Epic Goal

  • Test GWAPI release v1.0.0-* custom resources with current integration

    Why is this important?

  • Help find bugs in the v1.0.0 upstream release
  • Determine if any updates are needed in ingress-cluster-operator based on v1.0.0

    Planning Done Checklist

    The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

The details of this Jira Card are restricted (Only Red Hat employees and contractors)

This involves grabbing the cluster's global tlsSecurityProfile (from the APIServer object) in the operator pod and then storing it locally in the renderConfig so it can be applied to the MCO's templates and the MCS.

Note: The kubelet config controller already fetches this object and updates the kubeletconfig with it, so this can probably be refactored into a common function.
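
For reference, the cluster-scoped object in question is the APIServer config; a minimal sketch of the field being consumed (a custom profile here, purely illustrative):

    apiVersion: config.openshift.io/v1
    kind: APIServer
    metadata:
      name: cluster
    spec:
      tlsSecurityProfile:
        type: Custom
        custom:
          minTLSVersion: VersionTLS12
          ciphers:
          - ECDHE-ECDSA-AES128-GCM-SHA256
          - ECDHE-RSA-AES128-GCM-SHA256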

Done when:

  • new RBAC rules for the operator and controller service account to access APIServer object are setup
  • the operator/controller can grab these values and store them locally in renderconfig.

 

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
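
For context, the per-namespace labels involved are the standard Pod Security Admission labels; a namespace enforcing the restricted profile looks roughly like this sketch:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: example-app
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/enforce-version: latest
        pod-security.kubernetes.io/warn: restricted
        pod-security.kubernetes.io/audit: restricted
        # label synchronization can be opted out of per namespace:
        # security.openshift.io/scc.podSecurityLabelSync: "false"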

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

What

Add an annotation that shows what the label syncer would set.

Why

If a customer takes ownership of the audit and warn labels, it is unclear what the label syncer would enforce without evaluating all the SCCs of all users in the namespace.

This:

  • creates a blind spot in CFE / cluster-debug-tools
  • makes it hard to block upgrades confidently.

Notes

  • Must be set when the label syncer would set the value.
  • Can be set in all other cases (easier debugging on customer side).

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
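
As a concrete sketch, pinning is expressed per workload through the required-scc pod annotation (assuming the openshift.io/required-scc annotation is the mechanism referred to here); names and images below are illustrative:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-operator
      namespace: openshift-example                     # illustrative platform namespace
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-operator
      template:
        metadata:
          labels:
            app: example-operator
          annotations:
            openshift.io/required-scc: restricted-v2   # pin the least-privileged SCC the workload needs
        spec:
          containers:
          - name: operator
            image: registry.example.com/operator:latest   # illustrative image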

The following tables track progress.

Progress summary

  1. namespaces
                                 4.19  4.18  4.17  4.16  4.15  4.14
monitored                          82    82    82    82    82    82
fix needed                         68    68    68    68    68    68
fixed                              39    39    35    32    39     1
remaining                          29    29    33    36    29    67
~ remaining non-runlevel            8     8    12    15     8    46
~ remaining runlevel (low-prio)    21    21    21    21    21    21
~ untested                          2     2     2     2    82    82

Progress breakdown

# namespace 4.19 4.18 4.17 4.16 4.15 4.14
1 oc debug node pods #1763 #1816 #1818  
2 openshift-apiserver-operator #573 #581  
3 openshift-authentication #656 #675  
4 openshift-authentication-operator #656 #675  
5 openshift-catalogd #50 #58  
6 openshift-cloud-credential-operator #681 #736  
7 openshift-cloud-network-config-controller #2282 #2490 #2496    
8 openshift-cluster-csi-drivers #6 #118 #524 #131 #306 #265 #75   #170 #459 #484  
9 openshift-cluster-node-tuning-operator #968 #1117  
10 openshift-cluster-olm-operator #54 n/a n/a
11 openshift-cluster-samples-operator #535 #548  
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211  
13 openshift-cluster-version       #1038 #1068  
14 openshift-config-operator #410 #420  
15 openshift-console #871 #908 #924  
16 openshift-console-operator #871 #908 #924  
17 openshift-controller-manager #336 #361  
18 openshift-controller-manager-operator #336 #361  
19 openshift-e2e-loki #56579 #56579 #56579 #56579  
20 openshift-image-registry       #1008 #1067  
21 openshift-ingress   #1032        
22 openshift-ingress-canary   #1031        
23 openshift-ingress-operator   #1031        
24 openshift-insights #1033 #1041 #1049 #915 #967  
25 openshift-kni-infra #4504 #4542 #4539 #4540  
26 openshift-kube-storage-version-migrator #107 #112  
27 openshift-kube-storage-version-migrator-operator #107 #112  
28 openshift-machine-api #1308
#1317 
#1311 #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443  
29 openshift-machine-config-operator #4636 #4219 #4384 #4393  
30 openshift-manila-csi-driver #234 #235 #236  
31 openshift-marketplace #578 #561 #570
32 openshift-metallb-system #238 #240 #241    
33 openshift-monitoring #2298 #366 #2498   #2335 #2420  
34 openshift-network-console #2545        
35 openshift-network-diagnostics #2282 #2490 #2496    
36 openshift-network-node-identity #2282 #2490 #2496    
37 openshift-nutanix-infra #4504 #4539 #4540  
38 openshift-oauth-apiserver #656 #675  
39 openshift-openstack-infra #4504   #4539 #4540  
40 openshift-operator-controller #100 #120  
41 openshift-operator-lifecycle-manager #703 #828  
42 openshift-route-controller-manager #336 #361  
43 openshift-service-ca #235 #243  
44 openshift-service-ca-operator #235 #243  
45 openshift-sriov-network-operator #995 #999 #1003  
46 openshift-user-workload-monitoring #2335 #2420  
47 openshift-vsphere-infra #4504 #4542 #4539 #4540  
48 (runlevel) kube-system            
49 (runlevel) openshift-cloud-controller-manager            
50 (runlevel) openshift-cloud-controller-manager-operator            
51 (runlevel) openshift-cluster-api            
52 (runlevel) openshift-cluster-machine-approver            
53 (runlevel) openshift-dns            
54 (runlevel) openshift-dns-operator            
55 (runlevel) openshift-etcd            
56 (runlevel) openshift-etcd-operator            
57 (runlevel) openshift-kube-apiserver            
58 (runlevel) openshift-kube-apiserver-operator            
59 (runlevel) openshift-kube-controller-manager            
60 (runlevel) openshift-kube-controller-manager-operator            
61 (runlevel) openshift-kube-proxy            
62 (runlevel) openshift-kube-scheduler            
63 (runlevel) openshift-kube-scheduler-operator            
64 (runlevel) openshift-multus            
65 (runlevel) openshift-network-operator            
66 (runlevel) openshift-ovn-kubernetes            
67 (runlevel) openshift-sdn            
68 (runlevel) openshift-storage            
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We should be able to correlate flows with network policies:

  • which policy allowed that flow?
  • what are the dropped flows?
  • provide global stats on dropped / accepted traffic

 

PoC doc: https://docs.google.com/document/d/14Y3YYFxuOs3o-Lkipf-d7ZZp5gpbk6-01ZT_fTraCu8/edit

There are two possible approaches in terms of implementation:

  • Add new "netpolicy flows" on top of existing flows
  • Enrich existing flows with netpolicy info.

The PoC describes the former; however, it is probably most interesting to aim for the latter. (95% of the PoC is valid in both cases, i.e. all the "low level" parts: OvS, OVN). The latter involves more work in FLP.

Epic Goal

Implement observability for ovn-k using OVS sampling.

Why is this important?

This feature should improve packet tracing and debuggability.

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Users/customers of OpenShift on AWS (ROSA) want to use static IPs (and therefore AWS Elastic IPs) so that they can configure appropriate firewall rules. They want the default AWS Load Balancer that they use (NLB) for their router to use these EIPs.

Kubernetes does define a service annotation for configuring EIP
allocations, which should work in OCP:

    // ServiceAnnotationLoadBalancerEIPAllocations is the annotation used on the
    // service to specify a comma separated list of EIP allocations to use as
    // static IP addresses for the NLB. Only supported on elbv2 (NLB)
    const ServiceAnnotationLoadBalancerEIPAllocations = "service.beta.kubernetes.io/aws-load-balancer-eip-allocations"

Source: https://github.com/openshift/kubernetes/blob/eab9cc98fe4c002916621ace6cdd623afa519203/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L227-L230

We do not provide an API field on the IngressController API to configure
this annotation.  

This is a feature request to enhance the IngressController API to be able to support static IPs from install time and upon reconfiguration of the router (may require destroy/recreate LB)
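
For orientation, a custom IngressController consuming pre-provisioned EIPs via an NLB could look roughly like the sketch below; the exact name and placement of the eipAllocations field reflect our reading of the proposed API and should be treated as an assumption.

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: eip-ingress
      namespace: openshift-ingress-operator
    spec:
      domain: apps.eip.example.com                 # illustrative domain
      endpointPublishingStrategy:
        type: LoadBalancerService
        loadBalancer:
          scope: External
          providerParameters:
            type: AWS
            aws:
              type: NLB
              networkLoadBalancer:
                eipAllocations:                    # assumption: one allocation per public subnet
                - eipalloc-0123456789abcdef0
                - eipalloc-0fedcba9876543210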

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

  1. User can provision EIPs and use them with an IngressController via NLB
  2. User can ensure EIPs are used with NLB on default router at install time
  3. User can reconfigure default router to use EIPs

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  1. User can use existing EIPs (one per subnet) for cluster install or router configuration
  2. Router NLB and DNS can be inspected to have those (and only those) EIPs attached to the associated ingress.
  3. EIPs will survive cluster deletion: they are detached and remain available for subsequent cluster reuse

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  1. Management of EIPs (provision/cleanup) outside of selection/association with IngressController
  2. Static IP usage with NLBs for API server

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

Allowing those EIPs to be provisioned outside the cluster and to survive cluster reconfiguration, or even creation/deletion, supports our "don't treat clusters as pets" philosophy. It also removes the additional burden of wrapping the cluster or our managed service with yet another global IP service, which would be unnecessary and bring more complexity. That aligns precisely with customers' interest in the functionality, and we should pursue making this seamless.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

See Parent Feature for details.

This enhancement allows users to set AWS EIPs for the default or a custom NLB ingress controller.
This is a feature request to enhance the IngressController API to support static IPs during:

  • Custom NLB ingress controller creation 
  • Reconfiguration of the router.

Epic Goal

R&D spike for 4.14 and  implementation in 4.15

  • Users are able to use EIP for a NLB Ingress Controller.
  • Any existing ingress controller should be able to be reconfigured to use EIPs. 

Why is this important?

  • Reasons are mentioned in the scenario section.

Scenarios

  • As an administrator, I want to provision EIPs and use them with an NLB IngressController.
  • As a user or admin, I want to ensure EIPs are used with NLB on default router at install time. This scenario will be addressed in a separate epic.
  • As a user, I want to reconfigure default router to use EIPs.
  • As a user of OpenShift on AWS (ROSA), I want to use static IPs (and therefore AWS Elastic IPs) so that 
      I can configure appropriate firewall rules.
      I want the default AWS Load Balancer that I use (NLB) for my router to use these EIPs.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The service annotation service.beta.kubernetes.io/aws-load-balancer-eip-allocations of the LoadBalancer service for the IngressController is set with the value of the IngressController CR's eipAllocations field.
  • This EP will also be covered by end-to-end tests.

  The end-to-end test will cover the scenario where the user sets an eipAllocations field in the IngressController CR and verify that the LoadBalancer-type service has the service.beta.kubernetes.io/aws-load-balancer-eip-allocations annotation set with the value of the eipAllocations field from the IngressController CR that was created.

  The end-to-end test will also cover the scenario where the user updates the eipAllocations field in the IngressController CR and verify that the LoadBalancer-type service's service.beta.kubernetes.io/aws-load-balancer-eip-allocations annotation is updated with the new value of eipAllocations from the IngressController CR.
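
In practice, the check reduces to the operator-managed LoadBalancer service carrying the annotation named above; a sketch (service name and allocation IDs illustrative):

    apiVersion: v1
    kind: Service
    metadata:
      name: router-eip-ingress
      namespace: openshift-ingress
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
        service.beta.kubernetes.io/aws-load-balancer-eip-allocations: eipalloc-0123456789abcdef0,eipalloc-0fedcba9876543210
    spec:
      type: LoadBalancer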

Dependencies (internal and external)

  1. Code in the ingress operator will be written to use the value of the field eipAllocations from the IngressController CR and set and update it accordingly.
  2. openshift/api will be updated to add the eipAllocations API field in the IngressController CR and cluster config infra object.

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Add the logic to the ingress operator to create a load balancer service with a specific subnet as provided by the API. The Ingress Operator will need to set a platform-specific annotation on the load balancer type service to specify a subnet.
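
Taking AWS as an example (an assumption; other platforms use their own annotations), the rendered LoadBalancer service would carry the cloud provider's subnet annotation, roughly:

    apiVersion: v1
    kind: Service
    metadata:
      name: router-subnet-example              # illustrative service name
      namespace: openshift-ingress
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-0123456789abcdef0
    spec:
      type: LoadBalancer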

Develop the API updates and create a PR against the openshift/api repo for allowing users to select a subnet for Ingress Controllers. 

Get review from API team.

After the implementation https://github.com/openshift/cluster-ingress-operator/pull/1046 is merged, and there is a 95+% pass rate over at least 14 CI runs (slack ref), we will need to open a PR to promote the feature from the TechPreview to the Default feature set.

Feature Overview (aka. Goal Summary)  

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

This builds on Phase 1, which incorporated the assets from different repositories to simplify asset management.

Background, and strategic fit

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

  • Initially, CAPI did not meet the requirements for cluster/machine management that OCP had. The project has since moved on, and CAPI is now a better fit and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect to customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • To add support for generating Cluster and Infrastructure Cluster resources on Cluster API based clusters

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-aws/openshift
  • Create a controller in the above Go module to manage the AWSCluster resource for non-CAPI bootstrapped clusters
  • Ensure the AWSCluster controller is only enabled for AWS platform clusters
  • Create an "externally-managed" AWSCluster resource and manage its status to ensure Machines can be created correctly (see the sketch after this list)
  • Populate any required spec/status fields in the AWSCluster spec using the controller
  • (Refer to openstack implementation)
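
A minimal sketch of the externally managed pairing described above, assuming CAPA's v1beta2 AWSCluster type and the upstream cluster.x-k8s.io/managed-by annotation convention for externally managed infrastructure:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: example-cluster                      # illustrative; generated by the operator
      namespace: openshift-cluster-api
    spec:
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSCluster
        name: example-cluster
    ---
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    metadata:
      name: example-cluster
      namespace: openshift-cluster-api
      annotations:
        cluster.x-k8s.io/managed-by: ""          # mark the infra cluster as externally managed
    spec:
      region: us-east-1                          # illustrative; the controller would populate this from platform status

The controller would also be responsible for marking the infrastructure cluster ready in its status so that Machine creation can proceed.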

Stakeholders

  • Cluster Infra

Definition of Done

  • AWSCluster resource is correctly created and populated on AWS clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-azure/openshift
  • Create a controller in the above Go module to manage the AzureCluster resource for non-CAPI bootstrapped clusters
  • Ensure the AzureCluster controller is only enabled for Azure platform clusters
  • Create an "externally-managed" AzureCluster resource and manage its status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the AzureCluster spec using the controller
  • (Refer to openstack implementation)
  •  

Stakeholders

  • Cluster Infra

Definition of Done

  • AzureCluster resource is correctly created and populated on Azure clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-vsphere/openshift
  • Create a controller in the above Go module to manage the VSphereCluster resource for non-CAPI bootstrapped clusters
  • Ensure the VSphereCluster controller is only enabled for VSphere platform clusters
  • Create an "externally-managed" VSphereCluster resource and manage its status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the VSphereCluster spec using the controller
  • (Refer to openstack implementation)
  •  

Stakeholders

  • Cluster Infra

Definition of Done

  • VSphereCluster resource is correctly created and populated on VSphere clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-gcp/openshift
  • Create a controller in the above Go module to manage the GCPCluster resource for non-CAPI bootstrapped clusters
  • Ensure the GCPCluster controller is only enabled for GCP platform clusters
  • Create an "externally-managed" GCPCluster resource and manage its status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the GCPCluster spec using the controller
  • (Refer to openstack implementation)
  •  

Stakeholders

  • Cluster Infra

Definition of Done

  • GCPCluster resource is correctly created and populated on GCP clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

When customers switch over to using CAPI, there must be no negative effect: Machine resources must migrate seamlessly, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

We would like new code going into the CAPI operator repository to be linted based on the teams linting standards.

The CPMS is a good source for an initial configuration, though it may be a touch strict; we can make a judgement on whether to disable some of the linters it uses for this project.

Some linters enabled may be easy to bulk fix in the existing code, others may be harder.

We can make a judgement as to whether to fix the issues or set the CI configuration to only check new code.

Steps

Stakeholders

  • Cluster infra

Definition of Done

  • As we create PRs to the CAPI operator repo, linting feedback is given
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To implement https://github.com/openshift/enhancements/pull/1465, we need to include an `authoritativeAPI` field on Machines, MachineSets, ControlPlaneMachineSets and MachineHealthChecks.

This field will control which version of a resource is considered authoritative, and therefore, which controllers should implement the functionality.

The field details are outlined in the enhancement.

The status of each resource should also be updated to indicate the authority of the API.

APIs should be added behind a ClusterAPIMigration feature gate.
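
As a purely illustrative sketch of the shape described above (the field name comes from the enhancement; the enum values shown are assumptions):

    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    metadata:
      name: example-machineset
      namespace: openshift-machine-api
    spec:
      authoritativeAPI: MachineAPI      # assumption: e.g. MachineAPI or ClusterAPI
      replicas: 2
      # ...remaining MachineSet fields unchanged...
    status:
      authoritativeAPI: MachineAPI      # status reports which API is currently authoritative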

Steps

  • Add API fields to the spec of Machine API resources based on the enhancement
  • Get API review
  • Get API merged

Stakeholders

  • Cluster Infra

Definition of Done

  • API for authoritative APIs is merged as a feature gated API
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

When the Machine and MachineSet MAPI resources are non-authoritative, the Machine and MachineSet controllers should observe this condition and should exit, pausing reconciliation.

When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
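
For example (wording illustrative), the status stanza of a non-authoritative MAPI MachineSet would then carry something like:

    status:
      authoritativeAPI: ClusterAPI
      conditions:
      - type: Paused
        status: "True"
        reason: AuthoritativeAPINotMachineAPI          # illustrative reason
        message: spec.authoritativeAPI is not MachineAPI; reconciliation is paused.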

Behaviours

  • Should not reconcile when .status.authoritativeAPI is not MachineAPI
  • Except when it is empty (prior to defaulting migration webhook)

Steps

  • Ensure MAO has new API fields vendored
  • Add checks in Machine/MachineSet for authoritative API in status not Machine API
  • When not machine API, set paused condition == true, otherwise paused == false (same as CAPI)
    • Condition should be giving reasons for both false and true
  • This feature must be gated on the ClusterAPIMigration feature gate

Stakeholders

  • Cluster Infra

Definition of Done

  • When the status of Machine indicates that the Machine API is not authoritative, the Paused condition should be set and no action should be taken.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Problem & Overview

Currently, the existing procedure for full rotation of all cluster CAs/certs/keys is not suitable for Hypershift. Several oc helper commands added for this flow are not functional in Hypershift. Therefore, a separate and tailored procedure is required specifically for Hypershift post its General Availability (GA) stage.

 

Background

Most of the rotation procedure can be performed on the management side, given the decoupling between the control-plane and workers in the HyperShift architecture.

That said, it is important to ensure and assess the potential impacts on customers and guests during the rotation process, especially on how they affect SLOs and disruption budgets. 

 

Why care? 

  • Additional Security: Regular rotation of cluster CAs/certs/keys is essential for maintaining a secure environment. Adapting the rotation procedure for Hypershift ensures that security measures align with its specific requirements and limitations.
  • Compliance and Governance: Maintaining compliance (e.g., FIPS). Rotating certificates produced by non-compliant modules in Hypershift clusters is essential to align with FIPS requirements and mitigate future compliance risks.

User Story:

As a hypershift QE, I want to be able to:

  • Set the certificate expiration time of a HostedCluster's certificates to a custom duration via HostedCluster annotation.
  • Set the certificate rotation time of a HostedCluster's certificates to a custom duration via HostedCluster annotation.

so that I can achieve

  • Verify that a HostedCluster continues to function after its certificates have rotated and old ones have expired.

Acceptance Criteria:

  • Annotation to set custom expiration and rotation times takes effect on certificates created in a HostedCluster namespace.

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

As an engineer I would like to customize the self-signed certificates rotation used in the HCP components using an annotation over the HostedCluster object.

As an engineer I would like to customize the self-signed certificates expiration used in the HCP components using an annotation over the HostedCluster object.

 

Feature Overview

Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.

Overarching Goal

Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for Baremetal disconnected cluster.

Background

  • For a single node cluster effectively cut off from all other networking, update the cluster despite the lack of access to image registries, local or remote.
  • For multi-node clusters that could have a complete power outage, recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.
  • Allow cluster node(s) to boot without any access to a registry in case all the required images are pinned

 

As described in https://github.com/openshift/enhancements/pull/1483, we would like the cluster to be able to upgrade properly without accessing an external registry in case all the required images already exist and are pinned on the relevant nodes. The same goes for boot.

The required functionality is mostly to add blocking tests that ensure that is the case, and to address any issues that these tests might reveal in the OCP behavior.

Details:

These are the tests that will need to be added:

 

  1. Verify that an SNO cluster can be rebooted without a registry present.
  2. Verify that all the nodes of a multi-node cluster can be rebooted without a registry present.
  3. Verify that an SNO cluster can be upgraded without a registry present.
  4. Verify that a multi-node cluster can be upgraded without a registry server.

 

All these tests will have a preparation step that will use the PinnedImageSet (see MCO-838 for more details) support to ensure that all the images required are present in all the nodes.
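
For context, a PinnedImageSet is a small MCO resource listing the image pull specs to keep on the nodes; a minimal sketch (API version and pull spec illustrative, and the association with a pool is elided):

    apiVersion: machineconfiguration.openshift.io/v1alpha1   # tech-preview API group version (assumption)
    kind: PinnedImageSet
    metadata:
      name: worker-pinned-images
    spec:
      pinnedImages:
      - name: quay.io/example/release@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef   # illustrative digest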

 

Our proposal for this set of tests is to create a set of machines inside a virtual network. This network will initially have access to the registry server. That registry will be used to install the cluster and to pull the pinned images. After that, access to the registry server will be blocked using a firewall rule in the virtual network. Then the reboots or upgrades will be performed and verified.

 

So these are the things that need to be done:

 

  1. Create the mechanism to deploy the cluster inside a virtual network, and then block access to the registry.
  2. Create the SNO reboot test.
  3. Create the multi-node reboot test.
  4. Create the SNO upgrade test.
  5. Create the multi-node upgrade test.
  6. Configure the tests as a gating criteria for OCP releases.

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.

Epic Goal

  • Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

  • MSI authentication is required for any component that will run on the control plane side in ARO hosted control planes.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Problem

Today, Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, and it risks access to resources by adversaries due to the lack of credential rotation.

Goal

Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.

Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.  

Operators running management side needing to access azure customer account will use MSI.
Operands running in the guest cluster should rely on workload identity.
This ticket is to solve the latter.

 

We need to implement workload identity support in our components that run on the spoke cluster.

 

Address any TODOs in the code related to this ticket.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

Controllers that require cloud access and run on the control plane side in ARO hosted clusters will need to use MSI to acquire tokens to interact with the hosted cluster's cloud resources.

The cluster network operator runs the following pods that require cloud credentials:

  • cloud-network-config-controller

The following components use the token-minter but do not require cloud access:

  • network-node-identity
  • ovnkube-control-plane

 

These pods will need to use MSI when running in hosted control plane mode.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

Controllers that require cloud access and run on the control plane side in ARO hosted clusters will need to use MSI to acquire tokens to interact with the hosted cluster's cloud resources.

The cluster ingress controller will need to support MSI.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (Goal Summary)

This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.

User Story:

As ARO/HCP provider, I want to be able to:

  • receive a customer's subnet id 

so that I can achieve

  • parse out the resource group name, vnet name, subnet name for use in CAPZ (creating VMs) and Azure CCM (setting up Azure cloud provider).

Acceptance Criteria:

Description of criteria:

  • Customer hosted cluster resources get created in a managed resource group.
  • Customer vnet is used in setting up the VMs.
  • Customer resource group remains unchanged.

Engineering Details:

  • Benjamin Vesel said the managed resource group and customer resource group would be under the same subscription id.
  • We are only supporting BYO VNET at the moment; we are not supporting the VNET being created with the other cloud resources in the managed resource group.

This requires a design proposal so OCM knows where to specify the resource group.
This might require a feature gate in case we don't want it for self-managed.

Goal

  • All Azure API fields should have appropriate definitions as to their use and purpose.
  • All Azure API fields should have appropriate Kubernetes CEL validation added.

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

Azure stipulates that any VM that's created must have a NIC in the same location as the virtual machine. The NIC must also belong to a subnet in the same location as well. Furthermore, the network security group attached to a subnet must also be in the same location as the subnet and therefore vnet.

So all these resources, virtual network (including subnets), network security group, and resource group (and all the associated resources in there including VMs and NICs), all have to be in the same location.

Acceptance Criteria:

Description of criteria:

  • Useful reporting of the incompatibility
  • Covered in CI

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Goal

  • Support configuring azure diagnostics for boot diagnostics on nodepools

Why is this important?

  • When a node fails to join the cluster, serial console logs are useful in troubleshooting, especially for managed services. 

Scenarios

  1. Customer scales / creates nodepool
    1. nodes created
    2. one or more nodes fail to join the cluster
    3. cannot ssh to nodes because ssh daemon did not come online
    4. Can use diagnostics + managed storage account to fetch serial console logs to troubleshoot

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. CAPZ already supports this, so the dependency should be on the HyperShift team implementing it (see the sketch below): https://github.com/openshift/cluster-api-provider-azure/blob/master/api/v1beta1/azuremachine_types.go#L117
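
For reference, the CAPZ surface linked above exposes boot diagnostics roughly as follows on an AzureMachine (a sketch of the upstream API as we understand it; the NodePool-level API this epic would add is not shown):

    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureMachine
    metadata:
      name: example-machine
      namespace: example-namespace        # illustrative
    spec:
      vmSize: Standard_D4s_v4
      diagnostics:
        boot:
          storageAccountType: Managed     # Managed, UserManaged, or Disabled, per upstream CAPZ types (assumption)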

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

Before GAing Azure let's make sure we do a final API review

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

We should move MachineIdentityID to the NodePool API rather than the HostedCluster API. This field is specifically related to a NodePool and shouldn't be an HC-wide field.

Acceptance Criteria:

  • MachineIdentityID is moved to the NodePool API
  • CLI & infra code updated appropriately for the API move

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

  • Allow Arm NodePools to be created from Arm Azure Marketplace RHCOS images

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions:

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user of Azure HostedClusters on AKS, I want to be able to:

  • create Arm NodePools

so that I can

  • run my workloads on Arm Azure VMs

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Pull request with updated code

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

  • An assistant to help developers in ODC edit configuration YAML files

Goals

  • Perform an architectural spike to better assess feasibility and value of pursuing this further

Requirements

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • Is there overlap with what other teams at RH are already planning? 

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • More details in the outcome parent RHDP-985

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem: 

As a developer, I want to be able to generate YAMLs using generative AI. For that I need to open the LightSpeed chat interface easily to start a conversation.

Goal:

Add support to open the LightSpeed chat interface directly from the YAML Editor Code Toolbar.

Why is it important?

Use cases:

  1. A user who wants to create some OpenShift resource can quickly open the OLS Chat interface from the new button and ask for the resources in the OLS Chat.

Acceptance criteria:

  1. Should open the OLS Chat Interface from the "Ask LightSpeed" button on the YAML Editor 
  2. The button should be shown only when the LightSpeed Configuration is properly set up.

Dependencies (External/Internal):

OpenShift Lightspeed.

Design Artifacts:

Exploration: Refinement DOC

Note:

Description

As a user, I want to quickly open the Lightspeed Chat interface directly from the YAML Editor UI so that I can get help generating sample YAMLs.

Acceptance Criteria

  1. Create the button that will show on top of the YAML Editor
  2. Add the logic to make it open the Lightspeed Chat Interface once it is clicked.

Additional Details:

Feature Overview

Improve onboarding experience for using Shipwright Builds in OpenShift Console

Goals

Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright 

Requirements

 

Requirements | Notes | IS MVP
Enable creating Shipwright Builds using a form | | Yes
Allow use of Shipwright Builds for image builds during import flows | | Yes
Enable access to build strategies through navigation | | Yes

Use Cases

  • Use Shipwright Builds for image builds in import flows
  • Enable form-based creation of Shipwright Builds without requiring YAML expertise
  • Provide access to Shipwright resources through navigation

Out of scope

TBD

Dependencies

TBD

Background, and strategic fit

Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.

Assumptions

TBD

Customer Considerations

TBD

Documentation/QE Considerations

TBD

Impact

TBD

Related Architecture/Technical Documents

TBD

Definition of Ready

  • The objectives of the feature are clearly defined and aligned with the business strategy.
  • All feature requirements have been clearly defined by Product Owners.
  • The feature has been broken down into epics.
  • The feature has been stack ranked.
  • Definition of the business outcome is in the Outcome Jira (which must have a parent Jira).

 
 

Problem:

Builds for OpenShift (Shipwright) offers Shipwright builds for building images on OpenShift but it's not available as a choice in the import flows. Furthermore, BuildConfigs could be turned off through the Capabilities API in OpenShift which would prevent users from importing their applications into the cluster through the import flows, even when they have Shipwright installed on their clusters.

Goal:

As a developer, I want to use Shipwright Builds when importing applications through the import flows so that I can take advantage of Shipwright Build strategies such as buildpacks for building my application.

Users should be provided the option to use Shipwright Builds when the OpenShift Builds operator is installed on the cluster.

Why is it important?

To enable developers to take advantage of new build capabilities on OpenShift.

Acceptance criteria:

  1. When importing from Git, user can choose Shipwright Build strategies such as S2I, buildpack and Dockerfile (buildah) strategy for building the image

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I need to be able to use the Shipwright Builds to build my application in the Git Import Flow.

Acceptance Criteria

  1. Add the Shipwright option in the Build Type.
  2. Create a function in the `import-submit-utils` to create SW Builds and BuildRuns.
  3. If the user wants to build their application with a Dockerfile (has selected the Dockerfile import strategy), go forward with the "buildah" BuildStrategy, without asking the user to select a build strategy
  4. If the user wants to build their application with the predefined Builder Images available on OpenShift (has selected the BuilderImage import strategy), show a dropdown in the Build section of the form that allows users to select the "s2i" or "buildpack" BuildStrategy.
  5. Write and fix the Unit tests

Additional Details:

Description

As a developer, I need to ensure that all the current import workflows are working properly after this update and also fix and add new unit and e2e tests to ensure this feature does not get broken in the future.

Acceptance Criteria

  1. Test all the current Import forms
  2. Write and fix broken E2E tests

Additional Details:

Description

As a user, I need to be able to use the Edit Application form to edit the resource I have created using the Git Import Form.

Acceptance Criteria

  1. The Edit Application option should be visible in the Right-click dropdown on the Topology page
  2. The user should be able to edit the application URL, etc. in the Edit Application form
  3. Write and fix the Unit tests

Additional Details:

Problem:

The build strategy list page is missing from the admin console

Goal:

Add a page for Shipwright BuildStrategies to the admin console with two tabs:

  • ClusterBuildStrategy
  • BuildStrategy (namespace-scoped)

Acceptance criteria:

  • A navigation item is available for Shipwright build strategies under the Build bucket in admin perspective
  • A Shipwright build strategies list page provides access to ClusterBuildStrategies and namespace-scoped BuildStrategies

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to see the Shipwright lists pages under one navigation option in the administrative perspective

Acceptance Criteria

  1. Create a multi-tab Shipwright list page in the admin perspective
  2. It should have four tabs 
    1. Shipwright builds (Builds)
    2. Shipwright buildRuns (BuildRuns)
    3. BuildStrategy (namespace-scoped)
    4. Cluster BuildStrategy

Additional Details:

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network, and for that reason access to the host’s BMC is not allowed; additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it is at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore even after OpenShift has booted the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all, telco operators do not want to install and maintain an OCP / ACM cluster at the Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver in order to be satisfied. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question | Outcome

 

 

Feature goal (what are we trying to solve here?)

Allow users to use the openshift installer to generate the IBI artifacts.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

Reasoning (why it’s important?)

  • IBI installs should have the same experience as the agent-installer: the user runs the openshift-installer with some input and gets an ISO to boot the machine with.
  • Currently users need MCE or an operator to generate the configuration; we want to enable users to generate the configuration independently using the openshift-installer.
  • That should allow the flexibility to add the final configuration while generating the installation ISO (in case it’s provided).

Competitor analysis reference

  • Do our competitors have this feature?
    • No, it's unique or explicit to our product

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist yet

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

We should add the Image-based Installer imagebased create image and imagebased create installation-config-template subcommands to the OpenShift Installer, conforming to the respective enhancement, for the generation of the IBI installation ISO image.

We should add the Image-based Installer imagebased create config and imagebased create config-template subcommands to the OpenShift Installer, conforming to the respective enhancement, for the generation of the IBI config ISO image.

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To have PAO working in a Hypershift environment with feature parity with SNO (as much as possible)
  • Not to be enabled prior to 4.17 (at minimum)
  • Add sanity test(s) in Hypershift CI
  • Note: eventually Hypershift tests will run in Hypershift CI lanes

Why is this important?

  • Hypershift is a very interesting platform for clients, and PAO has a key role in node tuning, so making it work in Hypershift is a good way to ease the migration to this new platform, as clients will not lose their tuning capabilities.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No regressions are introduced

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When a `PerformanceProfile` is mirrored into the HCP namespace, it is given a name composed from a constant (perfprof) + the node pool name.

When there are multiple `PerformanceProfiles`, one per hosted cluster, the approach above makes it hard to understand which `PerformanceProfile` in the clusters namespace is associated with each hosted cluster.

It requires examining the node pool `spec.tuningConfig` to figure out the relation between the user profile and the mirrored one.

We should embed the configMap (that encapsulates the PerformanceProfile) name in the mirrored profile, to make the relation clear and observable at first sight.
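
A small sketch of the proposed naming, assuming a hypothetical helper; the "perfprof" prefix and node pool name come from the description above, while the exact separator and function name are made up for illustration:

package hcpconfig

import "fmt"

// mirroredProfileName embeds the name of the user's ConfigMap (which wraps the
// PerformanceProfile) into the mirrored object's name, so the association with
// the hosted cluster is visible without inspecting nodePool.spec.tuningConfig.
func mirroredProfileName(nodePoolName, configMapName string) string {
	return fmt.Sprintf("perfprof-%s-%s", nodePoolName, configMapName)
}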

We already have a lot of test cases to check PAO behaviour on OCP.
As we want to cover the same basic behaviour on Hypershift deployments as in standalone ones, adapting the test cases to run on both types of deployments seems a good way to go.

Target: to have the same test coverage in Hypershift as we already have in standalone deployments.

The PAO controller takes certain tuning decisions based on the container runtime being used by the cluster.

As of now Hypershift does not support ContainerRuntimeConfiguration at all.

We should address this gap when ContainerRuntimeConfiguration gets supported on Hypershift.

UPDATE [15.05.2024]

Hypershift does support ContainerRuntimeConfiguration but does not populate the object into the HCP namespace, so the PAO still cannot read it.

Currently, running a specific suite or a spec for example:

ginkgo run --focus="Verify rcu_nocbs kernel argument on the node" 

that executes commands on the node (using the node inspector) will fail, because the user must run the 0_config suite beforehand (the node inspector is created in the 0_config suite).
This places a burden on the developer, so we should consider how to ensure the node inspector is available for these scenarios.

A potential fix is to add lazy initialization for the node inspector, triggered the first time we execute commands on a node.
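
A minimal sketch of such lazy initialization, assuming a hypothetical helper that wraps the existing creation logic from the 0_config suite:

package nodeinspector

import "sync"

var (
	once    sync.Once
	initErr error
)

// Ensure lazily deploys the node inspector the first time a command is
// executed against a node, so individual specs no longer depend on the
// 0_config suite having run first.
func Ensure() error {
	once.Do(func() {
		initErr = deployInspector() // placeholder for the existing creation code
	})
	return initErr
}

func deployInspector() error {
	// hypothetical: create the inspector daemonset and wait for it to be ready
	return nil
}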

We need to update NTO roles with configmap/finalizers permissions so it is able to set the controller reference properly for the dependent objects.

 

PerformanceProfile objects are handled in a different way on Hypershift, so modifications in the Performance Profile controller are needed to handle this.

Basically, the Performance Profile controller has to reconcile ConfigMaps which have PerformanceProfile objects embedded into them, create the different manifests as usual, and then hand them to the hosted cluster using different methods.

More info in the enhancement proposal

Target: to have feature equivalence between Hypershift and standalone deployments


We need to make sure that the controller evaluates feature gates on Hypershift.

This is needed for features which are behind a feature gate, such as MixedCPUs.

 

Some insightful conversation regarding the subject: https://redhat-internal.slack.com/archives/C01C8502FMM/p1712846255671449?thread_ts=1712753642.635209&cid=C01C8502FMM 

On hypershift we don't have the machine-config-daemon pod, so we cannot execute commands directly on the nodes during the e2e test run.

In order to preserve the ability to execute commands on Hypershift nodes, we should create a daemonset which creates high-privileged pods that mount the host filesystem.

The daemonset should be spun up at the beginning of the tests and deleted at the end of them.

In addition, the API should remain as similar as possible to the existing one.

Relevant section in the design doc:
https://docs.google.com/document/d/1_NFonPShbi1kcybaH1NXJO4ZojC6q7ChklCxKKz6PIs/edit#heading=h.3zhagw19tayv 
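
A rough sketch of the kind of privileged, host-mounting DaemonSet the tests could create; the name, namespace, image and mount path are assumptions for illustration:

package utils

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeInspectorDaemonSet builds a privileged daemonset whose pods share the
// host PID namespace and mount the host filesystem, so the tests can run
// commands on every node without the machine-config-daemon pod.
func nodeInspectorDaemonSet(namespace, image string) *appsv1.DaemonSet {
	privileged := true
	labels := map[string]string{"app": "node-inspector"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "node-inspector", Namespace: namespace},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					HostPID: true,
					Containers: []corev1.Container{{
						Name:            "inspector",
						Image:           image,
						Command:         []string{"sleep", "infinity"},
						SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
						VolumeMounts:    []corev1.VolumeMount{{Name: "host", MountPath: "/rootfs"}},
					}},
					Volumes: []corev1.Volume{{
						Name:         "host",
						VolumeSource: corev1.VolumeSource{HostPath: &corev1.HostPathVolumeSource{Path: "/"}},
					}},
				},
			},
		},
	}
}

The daemonset would be created at the start of the test run and deleted at the end, keeping the command-execution API as close as possible to the existing one.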

We should add a `--hypershift` flag to ppc to generate a performance profile that is adjusted for the Hypershift platform.

This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.

Here are some examples of tasks that this "catch all" epic can capture

  • dependency update maintenance tasks
  • ci and testing changes/fixes
  • investigation spikes

The hypershift/kubevirt platform requires usage of RWX volumes for both the VM root volumes and any kubevirt-csi provisioned volumes in order to be live migratable. There are other situations where a VMI might not be live migratable as well.

We should report via a condition when a VMI is not live migratable (there's a condition on the VMI that indicates this) and warn that live migration is not possible on the NodePool (and likely HostedCluster as well). This warning shouldn't block the cluster creation/update

Issue CNV-30407 gives the user the ability to choose whether or not to utilize multiqueue for the VMs in a NodePool. This setting is known to improve performance when jumbo frames (MTU 9000 or larger) are in use, but had the issue that performance was degraded with smaller MTUs.

The issue that impacted performance on smaller MTUs is being resolved in KMAINT-145. Once KMAINT-145 lands, the default in both the CLI and the CEL defaulting should be to enable multiqueue.

This epic contains all the Dynamic Plugins related stories for OCP release-4.16 and implementing Core SDK utils.

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

In 4.18 we plan to remove the react-router v5 shared modules. For that reason we need to deprecate these shared modules now, in 4.16.

 

We should announce to all the Dynamic Plugin owners that we are planning to update react-router from v5 to v6 and that we have already started the process. We also need to explain to them what our migration plan and timeline are:

  1. Make console compatible with both versions of react-router, by introducing the `react-router-dom-v5-compat` package in 4.14 and updating the router components to v6.
  2. Remove the `react-router-dom-v5-compat` shared module by the end of 4.18.

 

We should give them some grace period (2 releases - end of 4.18). The change on the plugin's side needs to happen as one change.

 

List of Dynamic Plugin owners

 

AC:

  • Announce our migration plan to all the Dynamic Plugin owners, mainly to team leads and PMs (contacts are part of the List of Dynamic Plugin owners doc).
  • Establish a communication channel (slack/mail/regular meetings) to stay in sync in case of any questions or concerns.

Since console is aiming to adopt PF6 in 4.18, we need to start the deprecation process in 4.16 due to the N+1 deprecation policy.

This will give us time in 4.17 to prepare console for adopting PF6.

This will give plugins time to move to PF5.

 

AC:

  • Deprecate PF4 in plugin docs in console repo
  • Address the deprecation in release notes
  • Make announcement to #forum-ui-extensibility and on xyz mailing list

This epic tracks the rebase of openshift/etcd to 3.5.14

This update includes the following changes:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v3514-2024-05-29

Most notably this includes the experimental flag to stop serving requests on an etcd member that is undergoing defragmentation which would help address https://issues.redhat.com/browse/OCPSTRAT-319

tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.18
this also includes CI configuration, tools, and documentation updates

all the configuration bits need to happen at least one sprint BEFORE 4.18 branching (current target August 9)
docs tasks can be completed after the configuration tasks
the CI tasks need to be completed RIGHT AFTER 4.18 branching happens

tag creation is now automated during OCP tags creation

builder creation still needs to be coordinated with RHOS delivery

Epic Goal

  • Reduce the resource footprint and simplify the metal3 pod.
  • Reduce the maintenance burden on the upstream ironic team.

Why is this important?

  • Inspector is a separate service for purely historical reasons. It integrates pretty tightly with Ironic, but has to maintain its own database, which is synchronized with Ironic's via the API. This hurts performance and debuggability.
  • One fewer service will have a positive effect on the resource footprint of Metal3.

Scenarios

This is not a user-visible change.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No ironic-inspector service is running in the Metal3 pod. In-band inspection functionality is retained.

Dependencies (internal and external)

METAL-119

Previous Work (Optional):

METAL-119 provides the upstream ironic functionality

tracking here all the work that needs to be done to configure the ironic container images for OCP 4.17
this also includes CI configuration, tools, and documentation updates

Epic Goal

  • To support multipath disks for installation

Why is this important?

  • Some partners are using SANs for their installation disks

Scenarios

  1. User installs on a multipath Fibre Channel (FC) SAN volume

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This document will include two approaches to configure an environment with multipath:

  1. Create VMs and configure them
  2. Using test-infra

The doc will be under assisted-service/docs/dev

Description of the problem:

Assisted Service currently skips formatting of FC and iSCSI drives, but doesn't skip multipath drives over FC/iSCSI.  These drives aren't really part of the server and shouldn't be formatted like direct-attached storage.
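
A hypothetical sketch of extending the skip-format rule; the Disk fields used below are stand-ins for whatever the inventory model actually exposes:

package hostutil

type Disk struct {
	Name      string
	DriveType string // e.g. "FC", "iSCSI", "Multipath"
}

// shouldSkipFormat returns true for FC and iSCSI disks, and now also for a
// multipath device whose member paths are FC or iSCSI, so SAN-backed storage
// is never treated like direct-attached storage.
func shouldSkipFormat(d Disk, members []Disk) bool {
	switch d.DriveType {
	case "FC", "iSCSI":
		return true
	case "Multipath":
		for _, m := range members {
			if m.DriveType == "FC" || m.DriveType == "iSCSI" {
				return true
			}
		}
	}
	return false
}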

Description of the problem:

The Multipath object in the inventory does not contain a "wwn" value, so a user cannot use the "wwn" rootDiskHint in the config.

How reproducible:

Always

Steps to reproduce:

1.

2.

3.

Actual results:

Example:

{ "bootable": true, "by_id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "drive_type": "Multipath", "has_uuid": true, "id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "installation_eligibility": \{ "eligible": true, "not_eligible_reasons": null }

, "name": "dm-0", "path": "/dev/dm-0", "size_bytes": 214748364800 },

Expected results:

{ "bootable": true, "by_id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "drive_type": "Multipath", "has_uuid": true, "id": "/dev/disk/by-id/wwn-0x60002ac00000000000000ef700026667", "installation_eligibility": { "eligible": true, "not_eligible_reasons": null }, "name": "dm-0", "path": "/dev/dm-0", "size_bytes": 214748364800, "wwn": "0x60002ac00000000000000ef700026667" }

The F5 Global Server Load Balancing offers a DNS-based load balancer (called a Wide IP) that can be used alone or in front of local (proxying) load balancers. Essentially it does round-robin DNS. However, crucially, if there are no healthy listeners available then the DNS name will not resolve to an IP. Even if there is a second layer of local load balancer, if these are part of the F5 then the listener status will propagate through to the DNS load balancer (unless the user commits the error of creating a TCP monitor for the Wide IP).

This means that in assisted installations with a Wide IP as the first layer of load balancing, the DNS validations will fail even when everything is configured correctly because there are not yet any listeners available (the API and Ingress are not yet up).

At least in the case of the customer who brought this to light, there was a CNAME record created that pointed to the global load balancer, it just didn't resolve further to an IP. This presents an opportunity to stop disallowing this configuration in assisted installations.

Since golang 1.20 (which is used in the agent since 4.15), the net.LookupCNAME() function is able to look up the CNAME record for a domain name. If we were to fall back to doing this lookup when no IP addresses are found for the host, then we could treat evidence of having set up the CNAME record as success. (We will have to take care to make this robust against likely future changes to LookupCNAME(), since it remains broken in a number of ways that we don't care about here.) In theory this is relevant only when UserManagedNetworking is enabled.

A small downside is that we would no longer catch configuration problems caused by a mis-spelled CNAME record. However, since we don't validate the load balancer configuration from end-to-end anyway, I think we can discount the importance of this. The most important thing is to catch the case where users have neglected to take any action at all to set up DNS records, which this would continue to do.
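
A minimal sketch of the proposed fallback, assuming a hypothetical validation helper rather than the actual agent code:

package validations

import "net"

// dnsConfigured returns true when the domain resolves to at least one IP, or,
// failing that, when a CNAME record exists for it. The CNAME is taken as
// evidence that the user set up DNS even though no listeners are healthy yet.
func dnsConfigured(domain string) bool {
	if ips, err := net.LookupIP(domain); err == nil && len(ips) > 0 {
		return true
	}
	cname, err := net.LookupCNAME(domain)
	// LookupCNAME can return the queried name itself when no CNAME exists,
	// so only accept an answer that points somewhere else.
	return err == nil && cname != "" && cname != domain+"."
}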

There were several issues found at customer sites concerning connectivity checks:

  • On the none platform, there is connectivity to only a subset of the network interfaces.
  • On the none platform we need to sync the node-ip and the kubelet-ip.
  • We need to handle virtual interfaces such as bridges better. This calls for discovering whether an IP is external or internal in order to decide if the interface needs to be included in the connectivity check.

 
When the none platform is in use, if there is ambiguity in node-ip assignment, an incorrect assignment might lead to installation failure. This happens when etcd detects that the socket address from an etcd node does not match the expected address in the peer certificate; in this case etcd rejects the connection.

Example: assume two networks, net1 and net2.
Master node 1 has 1 address that belongs to net1.
Master node 2 has 2 addresses: one that belongs to net1, and another that belongs to net2.
Master node 3 has 1 address that belongs to net1.
If the selected node-ip of master node 2 belongs to net2, then when it creates a connection with any other master node, the socket address will be the address that belongs to net1. Since etcd expects it to be the same as the node-ip, it will reject the connection.

This can be solved by node-ip selection that will not cause such a conflict.
Node-ip assignment should be done through ignition.
To correctly set the bootstrap IP, the machine-network for the cluster must be set to match the selected node-ip for that host.
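
A minimal sketch of conflict-free selection under the constraint above (pick the address inside the machine network); illustrative only, since the real assignment is rendered through ignition:

package main

import "net"

// pickNodeIP returns the first host address contained in machineCIDR, e.g.
// pickNodeIP([]string{"10.1.0.5", "192.168.0.5"}, "192.168.0.0/24") yields
// "192.168.0.5", so every master presents the address etcd expects.
func pickNodeIP(hostIPs []string, machineCIDR string) (string, bool) {
	_, network, err := net.ParseCIDR(machineCIDR)
	if err != nil {
		return "", false
	}
	for _, ip := range hostIPs {
		if parsed := net.ParseIP(ip); parsed != nil && network.Contains(parsed) {
			return ip, true
		}
	}
	return "", false
}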

A user that wants to install clusters in FIPS-mode must run the corresponding installer in a matching runtime environment to the one the installer was built against.

In 4.16 the installer started linking against the RHEL9 crypto library versions which means that in a given container in a FIPS-enabled environment only either >=4.16 or <4.16 version installers may be run.

To solve this we will limit users to only one of these sets of versions for a given assisted service deployment.

This epic should implement the "Publish multiple assisted-service images" alternative in https://github.com/openshift/assisted-service/pull/6290 which has much more detail about the justification for this approach.
The main enhancement will be implemented as a followup in https://issues.redhat.com/browse/MGMT-17314

Create a dockerfile in the assisted-service repo that will build the current service with an el8 (centos8 stream in this case) base.

Manage the effort for adding jobs for release-ocm-2.11 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - day before (make sure images were created)
  2. Add Assisted Installer fast forwards for ocm-2.x release <depends on #1> - need approval from test-platform team at https://coreos.slack.com/archives/CBN38N3MW 
  3. Branch-out assisted-installer components for ACM 2.(x-1) - <depends on #1, #2> - At the day of the FF
  4. Prevent merging into release-ocm-2.x - <depends on #3> - At the day of the FF
  5. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - At the day of the FF
  6. ClusterServiceVersion for release 2.(x-1) branch references "latest" tag <depends on #5> - After  #5
  7. Update external components to AI 2.x <depends on #3> - After a week, if there are no issues update external branches
  8. Remove unused jobs - after 2 weeks

 

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it's important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn't exist anywhere
  • Related data - the feature doesn't exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

Currently the Infrastructure operator (AgentServiceConfig controller) will only install assisted service properly on an OpenShift cluster.

There is a requirement from the Sylva project to also be able to install on other Kubernetes distributions, so the same operator should support this.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Infrastructure operator correctly installs assisted service and related components in a kubernetes cluster.

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

No

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s) Orange as a part of the Sylva project
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it's important?)

  • Please describe why this feature is important

This allows assisted installer to integrate properly with the Sylva upstream project which will be used by Orange.

  • How does this feature help the product?

While deployed in the Sylva upstream assisted installer should still deploy supported OpenShift clusters which will expand our install base to users who wouldn't previously have easy access to OpenShift.

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference

Yes, our competitors are also contributors to the Sylva project and can install their container orchestration platforms through it.

    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn't exist anywhere
  • Related data - the feature doesn't exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

N/A

Route is an openshift-specific type which is not available in non-OCP kubernetes clusters.

Ensure we create an Ingress instead of a route when we're running in non-OCP kubernetes clusters.
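
One hedged way to detect this at runtime is to check whether the Route API is served at all and fall back to an Ingress otherwise; the helper below is illustrative, not the operator's actual code:

package main

import "k8s.io/client-go/discovery"

// routeAPIAvailable reports whether the cluster serves route.openshift.io/v1.
// When it does not (plain Kubernetes), the operator should expose
// assisted-service through an Ingress instead of a Route.
func routeAPIAvailable(dc discovery.DiscoveryInterface) bool {
	_, err := dc.ServerResourcesForGroupVersion("route.openshift.io/v1")
	return err == nil
}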
 

The webhook server used by assisted service will only run if tls certs and keys are provided.

This means that we first need to find a maintainable way to create and rotate certificates then use that solution to create a cert/key that can be used by the webhook server.

Currently the operator requires certain OCP-specific kinds to be present on startup.
It will fail without these.

Find a way to detect the type of cluster in which the operator is running and only watch the applicable types.

Feature goal (what are we trying to solve here?)

registry.ci.openshift.org/openshift/release:golang-* images are based on CentOS Linux 7, which is EOL. Therefore, dnf is not capable of installing packages anymore, as its repositories in CentOS Linux 7 are no longer supported. We want to replace these images.

DoD (Definition of Done)

All of our builds are based on Centos Stream 9, with the exception of builds for FIPS.

Does it need documentation support?

No

Feature origin (who asked for this feature?)

  • Centos Linux 7 EOL

Reasoning (why it’s important?)

 

  • CentOS Linux 7 is EOL; we don't want to use an unsupported base image

The golang 1.20 image is based on CentOS 7.

Since CentOS 7 is EOL, installing rpms on it is impossible (there are no suitable repositories).

This task is about fixing all the broken jobs (jobs that are based on the golang 1.20 image and try to install rpms on it) across all the AI-relevant repos.

More info can be found in this discussion: https://redhat-internal.slack.com/archives/C035X734RQB/p1719907286531789

Currently, we use CI Golang base image (registry.ci.openshift.org/openshift/release:golang-<version>) in a lot of assisted installer components builds. These images are based on Centos Linux 7, which is EOL. We want to replace these images with maintained images e.g.  `registry.access.redhat.com/ubi9/go-toolset:<version>`

 

  • assisted-service
    • master
    • ocm-2.11
    • ocm-2.10
    • ocm-2.9
    • ocm-2.8
  • assisted-image-service
    • main
    • ocm-2.11
    • ocm-2.10
    • ocm-2.9
    • ocm-2.8
  • assisted-installer
    • master
    • ocm-2.11
    • ocm-2.10
    • ocm-2.9
    • ocm-2.8
  • assisted-installer-agent
    • master
    • ocm-2.11
    • ocm-2.10
    • ocm-2.9
    • ocm-2.8
  • cluster api provider agent
    • master
    • ocm-2.11
    • ocm-2.10
    • ocm-2.9
    • ocm-2.8

Feature goal (what are we trying to solve here?)

Multipath + iSCSI is not supported by AI. We should make sure installing from such a disk is blocked for the user (day 1 + day 2).

 

Description of the problem:

When iSCSI multipath is enabled, the disks are detected as expected, plus an mpath device.
In the UI we see the mpatha device and under it two disks.
The issue here is that we allow the user to set a member disk as the installation disk, which triggers a format and breaks the mpath.

sda        8:0    0  120G  0 disk  
sdb        8:16   0   30G  0 disk  
└─mpatha 253:0    0   30G  0 mpath 
sdc        8:32   0   30G  0 disk  
└─mpatha 253:0    0   30G  0 mpath
In case the user picks sdb as the installation disk (which is wrong), cluster installation will fail.

This host failed its installation.
Failed - failed after 3 attempts, last error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/install-dir/worker-3bd2cf09-01c6-48ed-9a44-25b4dbb3633d.ign --append-karg ip=ens3:dhcp --append-karg rd.iscsi.firmware=1 /dev/sdf], Error exit status 1, LastOutput "Error: getting sector size of /dev/sdf Caused by: 0: opening "/dev/sdf" 1: No such device or address (os error 6)".

How to address this issue:
From what I saw, I would expect:
  - The installation disk should be handled from the parent, meaning we should enable selecting the mpath device as the installation disk.
(I tried handling it from fdisk, and it looks like any change on the mpath device is reflected to both member disks bound to the mpath; both "local" disks are actually the same disk on the target.)

 

How reproducible:

always

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Epic Goal

Moving forward with changing the way we test assisted-installer. We should change the way assisted-test-infra and subsystem tests on assisted-service are deploying assisted-service.

Why is this important?

There are lots of issues when running minikube with kvm2 driver, most of them are because of the complex setup (downloading the large ISO image, setting up the libvirt VM, defining registry addon, etc.)

Scenarios

  1. e2e tests on assisted-test-infra
  2. subsystem tests on assisted-service repository

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Currently we are in the process of moving our test deployments to kind instead of minikube. Assisted-Service is already compatible with kind, but not in debug mode. We want to enable debugging of assisted-service in subsystem tests, e2e tests, and all other deployments on kind.

DoD: developers can run a command like `make subsystem` that deploys the service on kind and runs all subsystem tests.

Right now, hub-cluster creation is only documented in the README and is not actually automated in our runs. We should have an automated way to set up everything related to running subsystem tests.

Also, we should make sure we're able to build the service and update the image in the kind cluster, to allow for a continuous development experience.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

CMO should inject the following environment variables into Alertmanager containers:

  • HTTP_PROXY
  • HTTPS_PROXY
  • NO_PROXY

The values are retrieved from the cluster-wide Proxy resource.
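
A small sketch of the intended behaviour; the function and parameter names are illustrative rather than the actual CMO code:

package main

import corev1 "k8s.io/api/core/v1"

// alertmanagerProxyEnv builds the proxy environment for the Alertmanager
// container from the cluster-wide Proxy values, omitting unset entries.
func alertmanagerProxyEnv(httpProxy, httpsProxy, noProxy string) []corev1.EnvVar {
	var env []corev1.EnvVar
	if httpProxy != "" {
		env = append(env, corev1.EnvVar{Name: "HTTP_PROXY", Value: httpProxy})
	}
	if httpsProxy != "" {
		env = append(env, corev1.EnvVar{Name: "HTTPS_PROXY", Value: httpsProxy})
	}
	if noProxy != "" {
		env = append(env, corev1.EnvVar{Name: "NO_PROXY", Value: noProxy})
	}
	return env
}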

Epic Goal

  • Allow user-defined monitoring administrators to define PrometheusRules objects spanning multiple/all user namespaces.

Why is this important?

  • There's often a need to define similar alerting rules for multiple user namespaces (typically when the rule works on platform metrics such as kube-state-metrics or kubelet metrics).
  • In the current situation, such a rule would have to be duplicated in each user namespace, which doesn't scale well:
    • 100 expressions selecting 1 namespace each are more expensive than 1 expression selecting 100 namespaces.
    • updating 100 PrometheusRule resources is more time-consuming and error-prone than updating 1 PrometheusRule object.

Scenarios

  1. A user-defined monitoring admin can provision a PrometheusRules object for which the PromQL expressions aren't scoped to the namespace where the object is defined.
  2. A cluster admin can forbid user-defined monitoring admins to use cross-namespace rules.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Follow FeatureGate Guidelines
  • ...

Dependencies (internal and external)

  1. None (Prometheus-operator supports defining namespace-enforcement exceptions for PrometheusRules).

Previous Work (Optional):

  1.  

Open questions::

In terms of risks:

  • UWM admins may configure rules which overload the platform Prometheus and Thanos Querier.
    • This is not very different from the current situation where ThanosRuler can run many UWM rules.
    • All requests go first through the Thanos Querier which should "protect" Prometheus from DoS queries (there's a hard limit of 4 in-flight queries per Thanos Querier pod).
  • UWM admins may configure rules that access platform metrics unavailable for application owners (e.g. without a namespace label or for an openshift-* label).
    • In practice, UWM admins already have access to these metrics so it isn't a big change.
    • It also enables use cases such as ROSA admin customers that can't deploy their platform alerts to openshift-monitoring today. With this new feature, the limitation will be lifted.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Proposed title of this feature request

Add scrape time jitter tolerations to UWM prometheus

What is the nature and description of the request?

Change the configuration of the UWM Prometheus instances to tolerate scrape time jitters.

Why does the customer need this? (List the business requirements)

Prometheus chunk compression relies on scrape times being accurately aligned to the scrape interval. Due to the nature of delta of delta encoding, a small delay from the configured scrape interval can cause tsdb data to occupy significantly more space.

We have observed a 50% difference in on disk tsdb storage for a replicated HA pair.

The downside is a reduction in sample accuracy and potential impact to derivatives of the time series. Allowing a jitter toleration will trade off improved chunk compression for reduced accuracy of derived data like the running average of a time series.

List any affected packages or components.

UWM Prometheus

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR orf GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Currently the transit gateway uses global routing by default.

Since global routing costs double what local routing costs, we need to use local routing when the Power VS and VPC regions are the same.

Also introduce a new flag so the user can pass the global routing configuration.
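
A minimal sketch of the decision, assuming a hypothetical helper and a pre-resolved region comparison (in practice Power VS and VPC region names differ and need the installer's own region mapping):

package main

// useGlobalRouting picks the transit gateway routing mode: local routing (the
// cheaper option) when the Power VS workspace and the VPC are in the same
// region, unless the user explicitly forces global routing via the new flag.
func useGlobalRouting(sameRegion, userForcesGlobal bool) bool {
	if userForcesGlobal {
		return true
	}
	return !sameRegion
}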

Epic Goal

  • Improve IPI on Power VS in the 4.16 cycle
    • Switch to CAPI provisioned bootstrap and control plane resources

Description of problem:

UDP aggregation on s390x is disabled. This was done because changing the interface feature did not work in the past (https://issues.redhat.com/browse/OCPBUGS-2532).

We have implemented a change in openshift/ovn-kubernetes (https://issues.redhat.com/browse/OCPBUGS-18935) that makes the Go library we use, safchain/ethtool, able to change the interface features on s390x, so enabling UDP aggregation should now be possible for s390x nodes. Our proposed change is https://github.com/openshift/cluster-network-operator/pull/2331 - this fix still needs to be verified end-to-end with a payload that includes all fixes.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always, it is disabled in code:
https://github.com/openshift/cluster-network-operator/blob/a5b5de5098592867f39ac513dc62024b5b076044/pkg/network/ovn_kubernetes.go#L671-L674

Steps to Reproduce:

    1. Install OpenShift cluster on s390x with OVN-Kubernetes

Actual results:

Interface feature rx-udp-gro-forwarding is disabled.

Expected results:

Interface feature rx-udp-gro-forwarding is enabled.    

Additional info:

There is a config map for disabling the interface feature if needed (namespace openshift-network-operator, config map udp-aggregation-config, value disable-udp-aggregation = "true" disables UDP aggregation). This should be kept.

 

Epic Goal

  • The goal of this epic is to continue the work on CI job delivery, some bug fixes related to multipath, and some updates to the assisted-image service.
  • With 4.16 we want to support LPAR on s390x
  • LPAR comes in two flavors: LPAR classic (iPXE boot) and LPAR DPM (ISO and iPXE boot)

As an IBM Z OpenShift user, I would like to install OpenShift on LPAR (classic and DPM) for the s390x architecture. Because the Assisted Installer should ensure a good user experience, difficulties and configuration file handling should be limited to a minimum. With LPAR support, additional files are needed to boot an LPAR.

  • INS File:
    An INS file is part of the boot process of an LPAR. For classic mode the content of the file is:
  • Name of the kernel.img and the load address of the kernel. E.g.:
    kernel.img 0x00000000
  • Name of the initrd.img and the load address. The load address depends on the size of the kernel and needs to be rounded up to the next MB boundary. E.g.:
    initrd.img 0x00900000
  • Name of the initrd addrsize file and the offset. This file contains a 16-byte binary value describing the load address (first 8 bytes) and the size of the initrd.img file (second 8 bytes). E.g.:
    initrd.img.addrsize 0x00010408
  • Name of the parameter file and the offset. This file contains the required kernel arguments to boot the LPAR.

With this story a new API call will be introduced to create and download the INS file based on the current infra-env.
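
A rough sketch of assembling a classic-mode INS file from the pieces listed above; the parameter-file name and its offset are assumptions, while the other values come from the examples in this story:

package main

import "fmt"

const mb = 0x100000

// insFile renders the INS entries: kernel at 0x0, initrd rounded up to the
// next MB boundary after the kernel, the addrsize blob at its offset, and the
// parameter file carrying the kernel arguments.
func insFile(kernelSize int64) string {
	initrdAddr := ((kernelSize + mb - 1) / mb) * mb // next 1 MB boundary
	return fmt.Sprintf(
		"kernel.img 0x00000000\n"+
			"initrd.img 0x%08x\n"+
			"initrd.img.addrsize 0x00010408\n"+
			"generic.prm 0x00010480\n", // name and offset are assumptions
		initrdAddr)
}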

The current version of openshift/coredns vendors Kubernetes 1.28 packages. OpenShift 4.17 is based on Kubernetes 1.30. We also want to bump CoreDNS to v1.11.3, which contains the most recent updates and fixes. We met on Mar 27, 2024, and reviewed the updates to CoreDNS 1.11.3; notes are here: https://docs.google.com/document/d/1xMMi5_lZRzclqe8pI2P9U6JlcggfSRvrEDj7cOfMDa0

Using old Kubernetes API and client packages brings risk of API compatibility issues. We also want to stay up-to-date with CoreDNS fixes and improvements.

Goal

  • The goal of this epic is to adjust HighOverallControlPlaneCPU alert thresholds when Workload Partitioning is enabled.

Why is this important?

  • On SNO clusters this might lead to false positives. It also makes sense to have such a mechanism because the control plane CPU alert currently assumes all available CPU can be used for the control plane, while the user can allocate fewer cores to it.

Scenarios

  1. As a user I want to enable workload partitioning and have my alert values adjusted accordingly

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Open questions:

  1. Do we want to bump the alert threshold for SNO clusters because they are running workloads on master nodes, rather than worker nodes?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal

  • Create an SNO kernel-rt CI lane that's running on non-virtualized metal infrastructure (Equinix, IBM Cloud, etc)

Why is this important?

  • The existing kernel-rt lane uses HA OpenShift, which misses an important category of CI testing for OpenShift: Single Node.
  • Anything beyond functional testing cannot happen without SNO running on metal. We do not want to run latency testing as part of something like a presubmit, but it should be an option for periodics and release-gating jobs, as any failure would block Telco customers.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement

Dependencies (internal and external)

  1. AWS Metal access in CI

Previous Work (Optional):

Open questions:

None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Create the RT kernel lane that runs the parallel conformance tests against an SNO cluster running the realtime kernel on AWS Metal

Problem:

Future PatternFly upgrades shouldn't break our integration tests. Based on the experience of the upgrade to PF 5, we should avoid relying on PatternFly classnames in our tests.

Goal:

Remove all usage of pf-* classname selectors in our test cases.

As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.

AC:

  • discover which e2e tests exist that call out PatternFly classnames as selectors (search for '.pf')
  • remove and replace any PatternFly classnames from `frontend/packages/gitops-plugin/integration-tests`
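
A quick way to find the affected tests, sketched with plain grep using the directory and search string from the AC above (the pattern may need tuning to avoid false positives):

# List every selector that references a PatternFly class (".pf") in the gitops-plugin integration tests
grep -rn "\.pf" frontend/packages/gitops-plugin/integration-tests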
     

Some of the work done to produce a build for arm64 and to produce custom builds in https://github.com/okd-project/okd-centos9-rebuild required Dockerfiles and similar assets from the cluster operators repositories to be forked. 

 

This story tracks the backport that should be done soon to get rid of most of the forks in the repo by merging the changes "upstream".

Description

Provide the ability to export data in a CSV format from the various Observability pages in the OpenShift console.

 

Initially this will include exporting data from any tables that we use.

Goals & Outcomes

Product Requirements:

A user will have the ability to click a button which will download the data in the current table in a CSV format. The user can then take the downloaded file and import it into their system.

This epic will own all of the usual update, rebase and release chores which must be done during the OpenShift 4.17 timeframe for Custom Metrics Autoscaler, Vertical Pod Autoscaler and Cluster Resource Override Operator

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story:

As an SRE monitoring ROSA HCP, I want to be able to:

  • monitor the external-dns operator 

so that I can achieve

  • have better observability of our service (specifically for cluster creation)

Acceptance Criteria:

Description of criteria:

  • A ServiceMonitor is created alongside the external-dns deployment (see the sketch below)
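
A minimal sketch of such a ServiceMonitor. The namespace, labels, and port name here are illustrative assumptions, not the actual manifest; they assume the external-dns Deployment is fronted by a Service labeled app=external-dns that exposes a port named "metrics".

cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: external-dns
  namespace: hypershift          # illustrative namespace
spec:
  selector:
    matchLabels:
      app: external-dns          # illustrative label on the external-dns Service
  endpoints:
    - port: metrics              # assumes a Service port named "metrics"
      interval: 30s
EOF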

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal*

OCP storage components (operators + CSI drivers) should not use environment variables for cloud credentials. This is discouraged by the OCP hardening guide and reported by the compliance operator. Our customers have noticed it: https://issues.redhat.com/browse/OCPBUGS-7270 

 
Why is this important? (mandatory)

We should honor our own recommendations.

 
Scenarios (mandatory) 

  1. Users/cluster admins should not notice any change when this epic is implemented.
  2. Storage operators + CSI drivers read credentials from Secrets mounted as files (see the sketch after this list).
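
As an illustration of the pattern only (not the operators' actual manifests), credentials would be consumed from a Secret mounted as files rather than injected through environment variables:

cat << EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: csi-driver-example                      # illustrative; real drivers are managed by their operators
  namespace: openshift-cluster-csi-drivers
spec:
  containers:
  - name: csi-driver
    image: example.invalid/csi-driver:latest    # placeholder image
    volumeMounts:
    - name: cloud-credentials
      mountPath: /etc/cloud-credentials         # driver reads credential files from here, no env vars involved
      readOnly: true
  volumes:
  - name: cloud-credentials
    secret:
      secretName: ebs-cloud-credentials         # example credential Secret name
EOF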

Dependencies (internal and external) (mandatory)

none

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:

[AWS EBS CSI Driver] cannot provision EBS volumes successfully on CCO manual-mode private clusters

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-07-20-191204   

How reproducible:

Always    

Steps to Reproduce:

    1. Install a private cluster with manual mode ->
       https://docs.openshift.com/container-platform/4.16/authentication/managing_cloud_provider_credentials/cco-short-term-creds.html#cco-short-term-creds-format-aws_cco-short-term-creds     
    2. Create one pvc and pod consume the pvc.
    

Actual results:

  In step 2 the pod and PVC are stuck at Pending.
$ oc logs aws-ebs-csi-driver-controller-75cb7dd489-vvb5j -c csi-provisioner|grep new-pvc
I0723 15:25:49.072662       1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started
I0723 15:25:49.073701       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc"
I0723 15:25:49.656889       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain
I0723 15:25:50.657418       1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started
I0723 15:25:50.658112       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc"
I0723 15:25:51.182476       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain

Expected results:

   In step 2 the PV should become Bound (volume provisioning succeeds) and the pod should be Running.

Additional info:

    

Description of problem:

The operator passes credentials to the CSI driver using environment variables, which is discouraged. This has already been changed in 4.18; let's backport it to 4.17 too.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Epic Goal*

Our AWS EBS CSI driver operator is missing some nice-to-have functionality. This Epic is meant to track it so we can finish it in a future OCP release.

 
Why is this important? (mandatory)

In general, AWS EBS CSI driver controller should be a good citizen  in HyperShift's hosted control plane. It should scale appropriately, report metrics and not use kubeadmin privileges in the guest cluster.

Scenarios (mandatory) 

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • QE -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Our operators use an Unstructured client to read HostedControlPlane. HyperShift has published their API types, which don't require many dependencies, so we could import their types.go.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update OCP release number in OLM metadata manifests of:

  • local-storage-operator
  • aws-efs-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • secrets-store-csi-driver-operator
  • smb-csi-driver-operator

OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56 

We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator
  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • github.com/openshift/alibaba-disk-csi-driver-operator

The following operators were migrated to csi-operator, do not update these obsolete repos:

  • github.com/openshift/aws-efs-csi-driver-operator
  • github.com/openshift/azure-disk-csi-driver-operator
  • github.com/openshift/azure-file-csi-driver-operator

tools/library-bump.py  and tools/bump-all  may be useful. For 4.16, this was enough:

mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16" 

4.17 perhaps needs an older prometheus:

../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17" 

Epic Goal

  • Align to OKR 2024 for OCP, having a blocking job for MicroShift.
  • Ensure that issues that break MicroShift are reported against the OCP release as blocking by preventing the payload from being promoted.

Why is this important?

  • Ensures stability of MicroShift when there are OCP changes.
  • Ensures stability against MicroShift changes.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • To have PAO working in a Hypershift environment with feature parity with SNO (as much as possible) - GA the feature
  • Compatible functioning hypershift CI
  • Full QE coverage for hypershift in d/s

Why is this important?

  • Hypershift is a very interesting platform for clients, and PAO has a key role in node tuning, so making it work in Hypershift is a good way to ease the migration to this new platform, as clients will not lose their tuning capabilities.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No regressions are introduced

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CNF-11959 [TP] PAO operations in a Hypershift hosted cluster

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ref:
https://github.com/openshift/must-gather/pull/345 

MG is using the NTO image, which contains all the tools needed for collecting sysinfo data.

On the hosted cluster we do not have the NTO image, because it resides on the MNG cluster.

The scripts should detect the NTO image somehow as a fallback.

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

The Kubevirt team should assess when they no longer need the static implementation of their plugin after their migration to dynamic plugins is complete, and should remove the legacy static plugin code from the console repo.

AC:

  • find all kubevirt related code and remove kubevirt from the console repo

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Enable console users to use the OpenShift LightSpeed plugin

Why is this important?

  • Console users should have access to LightSpeed plugin and its functionality

Scenarios

Pre Lightspeed Install

  • LightSpeed button becomes visible to all users no matter what the RBAC is
    • If the user has access to install, we show them how and/or link to the Operator
    • If the user doesn't have access to install, we tell them to contact an Admin and request that they install the LightSpeed Service
  • A Console setting will be added so Admins can disable the button for everyone

Post Lightspeed Install

  • User Preference added to hide LightSpeed button

Acceptance Criteria

Dependencies (internal and external)

  1. https://github.com/openshift/lightspeed-console

Previous Work (Optional):

Open questions::

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a cluster admin I want to set a cluster-wide setting for hiding the Lightspeed button from the OCP Console for all console users. This change will need to consume the api and console-operator changes done in CONSOLE-4161. The console needs to read the new `LightspeedButtonState` field and set it as a SERVER_FLAG so the frontend will be able to digest it. 

 

AC:

  • Extend the bridge flags with the `LightspeedButtonState` flag
  • Add ability to read the field directly from console-config.yaml 
  • Pass the field to the SERVER_FLAGS
  • Add a proxy to the console backend through which we would check the availability of the lightspeed-operator PackageManifest (see the manual check sketched after this list)
  • Update Console's RBAC to be able to GET the lightspeed-operator PackageManifest [operator change]
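
A manual equivalent of the availability check the backend proxy would perform, sketched with oc. PackageManifests are normally served from the openshift-marketplace namespace; adjust if the catalog location differs.

# Does the cluster's catalog expose the Lightspeed operator?
oc get packagemanifest lightspeed-operator -n openshift-marketplace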

Create Lightspeed hover button as part of core but extensible by Lightspeed Dynamic Plugin

 

  • We need the lightspeed hover button to become part of the core; this will help users discover Lightspeed.
  • Once a user clicks the hover button, the chat window will pop up with instructions on how to enable/install lightspeed on the cluster.
  • We then need to make this extensible so that only the lightspeed team can override the existing button and chat window.
  • We will not advertise this extension point as we don't want others using it.

AC:

  • Check if the PackageManifest for the Lightspeed operator is available on the cluster AND check if the user has permissions to install it
    • If YES, then render the btn
    • If NO, then render the btn. Once the user clicks on the btn, the chat box will tell the user to contact an Admin and request that they install the Lightspeed Operator
  • Add a user-preference to show/hide the hover button. If the 'LightSpeedButtonState' SERVER_FLAG is set to hide, we should not show the user setting.
  • The code for the hover btn should be part of the core console
  • The hover btn should be an extension point
  • LightSpeed plugin will need to use this new extension point

 

ccing Andrew Pickering 

As a cluster admin I want to set a cluster-wide setting for hiding the Lightspeed button from the OCP Console for all console users. For this change, an additional field will need to be introduced into the console-operator's config API.

 

AC: 

  • Add a new field to the console-operator's config, in its 'spec.customization' section, which will control the Lightspeed button in the console. The new field should be named 'LightspeedButtonState' and should be an enum with states "Show" and "Hide". This change will need to be done in the openshift/api repo (a hedged example of the resulting setting is sketched after this list).
  • By default the state should be "Show"
  • Pass the state variable to the console-config CM
  • Add e2e and unit test
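
Once the field exists, the admin-facing setting could look roughly like this. The YAML field name and casing are assumptions derived from the Go field name above, not the final API shape.

# Hide the Lightspeed button for all console users (field name is a hypothetical rendering of LightspeedButtonState)
oc patch console.operator.openshift.io cluster --type merge \
  -p '{"spec":{"customization":{"lightspeedButtonState":"Hide"}}}'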

NOT FOR CUSTOMER USE

Summary

To save costs in OpenShift CI, we would like to be able to create test clusters using spot instances for both workers and masters.

Background

AWS spot instances are:

  • way cheaper than normal ("on-demand") instances;
  • less reliable than on-demand instances; but have been observed to have a high short-term survival rate.

They are thus deemed to be ideal for CI use cases because:

  • CI jobs last O(4h) and are thus unlikely to have spot instances yanked.
  • If a job is torpedoed by losing a spot instance, meh, it's just another flake, /retest.

Spot instances can be requested by editing manifests (generated via openshift-install create manifests) and injecting spotMarketOptions into the appropriate AWS provider-specific path. Today this works for workers via both the terraform and CAPI code paths, but for masters only via CAPI. The lack of support for spot masters via terraform is due to an omission in translating the relevant field from the master machine manifest.
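
A hedged sketch of the manifest edit described above, using yq on the generated master Machine manifests. The file glob and field path follow the usual installer asset layout and AWS provider spec, but should be verified against the installer version in use; run it from the installer asset directory after "openshift-install create manifests".

# Request spot capacity for each master Machine (empty spotMarketOptions = no max-price override)
for f in openshift/99_openshift-cluster-api_master-machines-*.yaml; do
  yq -i '.spec.providerSpec.value.spotMarketOptions = {}' "$f"
done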

Proposal

This RFE proposes to:
1. Add that missing translation to openshift-install so that terraform-created master machines honor requests for spot instances.
2. Add some way to detect that this enablement exists in a given binary so that wrappers can perform validation (i.e. reject requests for spot masters if the support is absent – otherwise such requests are silently ignored).
3. Backport the above to all releases still in support to maximize the opportunity for cost savings.

NOT FOR CUSTOMER USE

We do not propose to officially support spot masters for customers, as their unreliability makes them unsuitable for general production use cases. As such, we will not add:

  • install-config affordance
  • documentation

(We may even want to consider adding words to the openshift API field for spotMarketOptions indicating that, even though it can be made to work, we don't support spot masters.)

User Story:

As a (user persona), I want to be able to:

  • detect support for spot instances in the installer

so that I can achieve

  • enabling the feature and saving costs in CI when an Installer version with such support is used

Acceptance Criteria:

Description of criteria:

  • some way to detect that the feature is enabled in the installer without exposing it to customers

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Using Spot instances might bring significant cost savings in CI. There is support in CAPA already and we should enable it for terraform too.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
: [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal]

failed
job link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-mce-agent-connected-ovn-ipv4-metal3-conformance/1801525321292320768 

The reason for the failure is the incorrect configuration of the proxy.

failed log

  Will run 1 of 1 specs
  ------------------------------
  [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal]
  github.com/openshift/origin/test/extended/router/idle.go:49
    STEP: Creating a kubernetes client @ 06/14/24 10:24:21.443
  Jun 14 10:24:21.752: INFO: configPath is now "/tmp/configfile3569155902"
  Jun 14 10:24:21.752: INFO: The user is now "e2e-test-router-idling-8pjjg-user"
  Jun 14 10:24:21.752: INFO: Creating project "e2e-test-router-idling-8pjjg"
  Jun 14 10:24:21.958: INFO: Waiting on permissions in project "e2e-test-router-idling-8pjjg" ...
  Jun 14 10:24:22.039: INFO: Waiting for ServiceAccount "default" to be provisioned...
  Jun 14 10:24:22.149: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
  Jun 14 10:24:22.271: INFO: Waiting for ServiceAccount "builder" to be provisioned...
  Jun 14 10:24:22.400: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
  Jun 14 10:24:22.419: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
  Jun 14 10:24:22.440: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
  Jun 14 10:24:22.740: INFO: Project "e2e-test-router-idling-8pjjg" has been fully provisioned.
    STEP: creating test fixtures @ 06/14/24 10:24:22.809
    STEP: Waiting for pods to be running @ 06/14/24 10:24:23.146
  Jun 14 10:24:24.212: INFO: Waiting for 1 pods in namespace e2e-test-router-idling-8pjjg
  Jun 14 10:24:26.231: INFO: All expected pods in namespace e2e-test-router-idling-8pjjg are running
    STEP: Getting a 200 status code when accessing the route @ 06/14/24 10:24:26.231
  Jun 14 10:24:28.315: INFO: GET#1 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host

  Jun 14 10:25:05.256: INFO: GET#38 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:04.256: INFO: GET#877 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:05.256: INFO: GET#878 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:06.257: INFO: GET#879 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:07.256: INFO: GET#880 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:08.256: INFO: GET#881 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:09.256: INFO: GET#882 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:10.256: INFO: GET#883 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:11.256: INFO: GET#884 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:12.256: INFO: GET#885 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:13.257: INFO: GET#886 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:14.256: INFO: GET#887 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  ...
  ...
  ...
  Jun 14 10:39:19.256: INFO: GET#892 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
  Jun 14 10:39:20.256: INFO: GET#893 "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org" error=Get "http://idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org": dial tcp: lookup idle-test-e2e-test-router-idling-8pjjg.apps.2c2012a4ac53d705fb5e.ostest.test.metalkube.org on 172.30.0.10:53: no such host
    [INTERRUPTED] in [It] - github.com/openshift/origin/test/extended/router/idle.go:49 @ 06/14/24 10:39:20.461
    ------------------------------
    Interrupted by User
    First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs.  Interrupt again to skip cleanup.
    Here's a current progress report:
      [sig-network-edge][Conformance][Area:Networking][Feature:Router] The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it [Skipped:Disconnected] [Suite:openshift/conformance/parallel/minimal] (Spec Runtime: 14m59.024s)
        github.com/openshift/origin/test/extended/router/idle.go:49
        In [It] (Node Runtime: 14m57.721s)
          github.com/openshift/origin/test/extended/router/idle.go:49
          At [By Step] Getting a 200 status code when accessing the route (Step Runtime: 14m54.229s)
            github.com/openshift/origin/test/extended/router/idle.go:175

          Spec Goroutine
          goroutine 307 [select]
            k8s.io/apimachinery/pkg/util/wait.waitForWithContext({0x95f5188, 0xda30720}, 0xc004cfbcf8, 0x30?)
              k8s.io/apimachinery@v0.29.0/pkg/util/wait/wait.go:205
            k8s.io/apimachinery/pkg/util/wait.poll({0x95f5188, 0xda30720}, 0x1?, 0xc0045c2a80?, 0xc0045c2a87?)
              k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:260
            k8s.io/apimachinery/pkg/util/wait.PollWithContext({0x95f5188?, 0xda30720?}, 0xc004cfbd90?, 0x88699b3?, 0x7?)
              k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:85
            k8s.io/apimachinery/pkg/util/wait.Poll(0xc004cfbd00?, 0x88699b3?, 0x1?)
              k8s.io/apimachinery@v0.29.0/pkg/util/wait/poll.go:66
          > github.com/openshift/origin/test/extended/router.waitHTTPGetStatus({0xc003d8fbc0, 0x5a}, 0xc8, 0x0?)
              github.com/openshift/origin/test/extended/router/idle.go:306
          > github.com/openshift/origin/test/extended/router.glob..func7.2.1()
              github.com/openshift/origin/test/extended/router/idle.go:178
            github.com/onsi/ginkgo/v2/internal.extractBodyFunction.func3({0x2e24138, 0xc0014f2d80})
              github.com/onsi/ginkgo/v2@v2.13.0/internal/node.go:463
            github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func3()
              github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:896
            github.com/onsi/ginkgo/v2/internal.(*Suite).runNode in goroutine 1
              github.com/onsi/ginkgo/v2@v2.13.0/internal/suite.go:883
    -----------------------------

This is a clone of issue OCPBUGS-38713. The following is the description of the original issue:

: [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

failed
job link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-mce-e2e-agent-connected-ovn-dualstack-metal3-conformance/1822988278547091456 

failed log

  [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/dns/dns.go:499
    STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255
    STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257
    STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517
    STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581
    STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646
  Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007"
  Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user"
  Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm"
  Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ...
  Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned...
  Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
  Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned...
  Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
  Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
  Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
  Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned.
    STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775
    STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done
     @ 08/12/24 15:55:05.935
    STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935
    STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935
    STEP: deleting the pod @ 08/12/24 16:00:06.034
    [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074
    STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074
    STEP: Found 0 events. @ 08/12/24 16:00:06.207
  Aug 12 16:00:06.239: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
  Aug 12 16:00:06.239: INFO: 
  Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large
  Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-dns-dualstack-9bgpm-user}, err: <nil>
  Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil>
  Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil>
    STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. @ 08/12/24 16:00:06.544
    STEP: dump namespace information after failure @ 08/12/24 16:00:06.58
    STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58
    STEP: Found 2 events. @ 08/12/24 16:00:06.615
  Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30
  Aug 12 16:00:06.648: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
  Aug 12 16:00:06.648: INFO: 
  Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large
    STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743
  • [FAILED] [304.528 seconds]
  [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/dns/dns.go:499

    [FAILED] Failed: timed out waiting for the condition
    In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074
  ------------------------------

  Summarizing 1 Failure:
    [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
    github.com/openshift/origin/test/extended/dns/dns.go:251

  Ran 1 of 1 Specs in 304.528 seconds
  FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped
fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition
Ginkgo exit error 1: exit with code 1

failure reason
TODO

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Today, when we create an AKS cluster, we provide the catalog images like so:

--annotations hypershift.openshift.io/certified-operators-catalog-image=registry.redhat.io/redhat/certified-operator-index@sha256:fc68a3445d274af8d3e7d27667ad3c1e085c228b46b7537beaad3d470257be3e \
--annotations hypershift.openshift.io/community-operators-catalog-image=registry.redhat.io/redhat/community-operator-index@sha256:4a2e1962688618b5d442342f3c7a65a18a2cb014c9e66bb3484c687cfb941b90 \
--annotations hypershift.openshift.io/redhat-marketplace-catalog-image=registry.redhat.io/redhat/redhat-marketplace-index@sha256:ed22b093d930cfbc52419d679114f86bd588263f8c4b3e6dfad86f7b8baf9844 \
--annotations hypershift.openshift.io/redhat-operators-catalog-image=registry.redhat.io/redhat/redhat-operator-index@sha256:59b14156a8af87c0c969037713fc49be7294401b10668583839ff2e9b49c18d6 \

We need to fix this so that we don't need to override those images on the create command when we are in AKS. 

The current reason we are annotating the catalog images when we create an AKS cluster is that the HCP controller will try to take the images from an ImageStream if there are no overrides here - https://github.com/openshift/hypershift/blob/64149512a7a1ea21cb72d4473f46210ac1d3efe0/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L3672. In AKS, ImageStreams are not available.

This will be home to stories that aren't required for GA.

We have decided that each platform will have its own feature gate as we work on adding per-platform support. This story involves ensuring that the boot image controller only runs when a valid combination of feature gate and cluster platform is found.

There are use cases for using the baremetal platform and having the baremetal capability while disabling MachineAPI (for example to use CAPI).
Currently, there are a few validations preventing this in the installer and in openshift/api; these validations exist because CBO will crash if the Machine CRD doesn't exist on the cluster.
CBO is usable on many platforms and depends on MAPI on all of them, when it only really needs it for baremetal IPI.

 

Feature goal (what are we trying to solve here?)

Allow using the baremetal platform and having the baremetal capability while disabling MachineAPI

Does it need documentation support?

yes, we should update the docs to say this configuration is valid.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s): Orange telco cloud (this is required for enabling CAPI to install multi node OCP)
    • How many customers asked for it? Other customers can benefit from this
    • Can we have a follow-up meeting with the customer(s)? sure.

 

  • Internal request

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
    There are use cases for using the baremetal platform and having the baremetal capability while disabling MachineAPI (for example to use CAPI).
  • How does this feature help the product?
    It makes more sense to enable CBO to work without MAPI, since CBO is usable on many platforms and has the same dependence on MAPI on all of them, when it only really needs it for baremetal IPI.

 

Overall this should allow users to install OCP with baremetal platform using ABI or assisted installer while disabling the MachineAPI capability.
Here is an example patch for configuring this when installing with assisted installer:

curl --request PATCH --header "Content-Type: application/json"     --data '"{\"capabilities\": {\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [\"openshift-samples\", \"marketplace\", \"Console\", \"baremetal\", \"Insights\", \"Storage\", \"NodeTuning\", \"CSISnapshot\", \"OperatorLifecycleManager\", \"Ingress\"]}}"' "http://$ASSISTED_SERVICE_IP:$ASSISTED_SERVICE_PORT/api/assisted-install/v2/clusters/$CLUSTER_ID/install-config" 

Feature goal (what are we trying to solve here?)

CI improvements in terms of cost, efficiency, UX, stability, etc.

Does it need documentation support?

No.

Reasoning (why it’s important?)

  • Improving CI performance

  • Reducing CI costs

  • Improve the accessibility for users less familiar with the system.

  • Improve the CI stability

Currently the subsystem tests don't clean up all resources at the end of the tests, resulting in errors when running more than once. We want to clean up these resources so the tests can be run repeatedly.

Currently, we generate the python client for assisted-service inside its primary Dockerfile. In order to generate the python client, we need the repository git tags. This behavior collides with Konflux behavior because Konflux builds the image searching for the tags in the fork instead of the upstream repository. As a result, developers currently need to periodically push git tags to their fork for Konflux to build successfully. We want to explore the option of removing the client generation from the Dockerfile.

 

reference - https://redhat-internal.slack.com/archives/C035X734RQB/p1719496182644409

 

It has been decided to move the client generation to test-infra

ACM 2.10/MCE 2.5 rolled back the Assisted Installer/Service images to RHEL 8 (issue ACM-10022) due to incompatibility of various dependencies when running FIPS.

 

In order to support RHEL 9 in assisted installer/service

  1. All blocker dependencies must be completed
  2. Use the RHEL 9 base image
    1. Need to re-add util-linux-core and remove util-linux from the image
  3. Test in both FIPS and non-FIPS mode to verify everything is working correctly

 

 

Slack thread discussion

For this epic we will clearly need to make major changes to the code that runs the installer binary. Right now this is all mixed in with the discovery ignition code as well as some generic ignition parsing and merging functions.

Split this logic into separate files so the changes for this epic will be easier to reason about.

Current file for reference: https://github.com/openshift/assisted-service/blob/9a4ece99d181459927586ac5105ca606ddc058fe/internal/ignition/ignition.go

Ensure there's a clear line between what is coming from assisted service for a particular request to generate the manifests and what needs to be provided during deployment through env vars.

Ideally this interface would be described by the pkg/generator package.

Clean up this package to remove unused parameters and ensure it is self contained (env vars are not coming from main, for example).

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable it for OCP versions >= 4.15.

DoD (Definition of Done)

iscsi boot is enabled for ocp version >= 4.15 both in the UI and the backend. 

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` kernel argument during install to enable iSCSI booting.
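
One way to add that argument, sketched as a MachineConfig. The name and role label are illustrative; in practice the installer/agent flow may inject the kernel argument differently.

cat << EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-iscsi-kargs              # illustrative name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
  - rd.iscsi.firmware=1                    # enable booting from the firmware-configured iSCSI target
EOF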

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle
    • NetApp
    • Cisco

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support and we want to allow customers to use them

In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:

  • an interface connected to the iSCSI volume
  • an interface used as default gateway that will be used by OCP

This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes using the default interface impracticable for iSCSI traffic because we lose the root volume and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071

In the scope of this issue we need to:

  • report iSCSI host IP address from the assisted agent
  • check that the network interface used for the iSCSI boot volume is not the default one (the default gateway goes through one of the other interfaces), which implies 2 network interfaces
  • ensure that the network interface connected to the iSCSI network is configured with DHCP in the kernel args in order to mount the root volume over iSCSI
  • workaround https://issues.redhat.com/browse/OCPBUGS-26580 by dropping a script in a MachineConfig manifest that will reload the network interfaces on first boot

Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.

Why this matters

  • We can add [cross]validation rules to CRD fields to avoid misconfigurations
  • End users get a much faster feedback loop. No more applying the config and scanning logs if things don't look right. The API server will give immediate feedback
  • Organizational users (such as ACM) can manage a single resource and observe its status

To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.

One proposal for a minimal DoD was:

  • We have a feature gate
  • We have outlined our idea and approach in an enhancement proposal. This does not have to be complete, just outline how we intend to implement this. OpenShift members have reviewed this and given their general approval. The OEP does not need to be complete or merged.
  • We have CRD scaffolding that CVO creates and CMO watches
  • We have a clear idea for a migration path. Even with a feature gate in place we may not simply switch config mechanisms, i.e. we must have a mechanism to merge settings from the config maps and CR, with the CR taking precedence.
  • We have at least one or more fields, CMO can act upon. For example
    • a bool field telling CMO to use the config map for configuration
    • ...

Feature parity should be planned in one or more separate epics.

Behind a feature flag we can start experimenting with a CRD and explore migration and upgrades.
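
For local experiments, the standard OpenShift feature-gate mechanism can be used to turn on a not-yet-GA feature set. This is the generic knob, not necessarily the exact gate this epic will introduce, and it is irreversible on a cluster:

# Enable the TechPreviewNoUpgrade feature set (cannot be undone on this cluster)
oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'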

Epic Goal

  • CMO currently has several ServiceMonitor and Rule objects that belong to other components
  • We should migrate these away from the CMO code base to the owning teams

Why is this important?

  • The respective component teams are the experts for their components and can more accurately decide on how to alert and what metrics to expose.
  • Issues with these artifacts get routed to the correct teams.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

(originated from https://issues.redhat.com/browse/TRT-55)

Currently the KubePersistentVolumeErrors alert is deployed by the cluster-monitoring operator and lives in the openshift-monitoring namespace. The metric involved in the alert (kube_persistentvolume_status_phase) comes from kube-state-metrics, but it would be clearer if https://github.com/openshift/cluster-storage-operator/ owned the alert.

Also relevant https://coreos.slack.com/archives/C01CQA76KMX/p1637075395269400

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Warn users about Prometheus<->Kubernetes API failures (unreachable API, permissions issue etc.) which can lead into silent Service Discovery failures.
  • Add an alert based on the newly added metric https://github.com/prometheus/prometheus/pull/13554 that keeps track of these failures.
  • Maybe a runbook explaining the main reasons behind the failures and how to fix them.

Why is this important?

  • Warnings are only available as logs; logs can easily be missed and are not regularly checked.
  • Even with logs, sometimes, users don't know what they need to do, a runbook will be helpful.

Scenarios

  1. I wanted Prometheus to scrape new targets but I didn't give it the needed permissions (many Slack threads about that; search for "Failed to watch" in #forum-openshift-monitoring)
  2. I mis-configured the Kube SD.
  3. Prometheus cannot reach the Kube API due to some DNS changes, connectivity/network issue.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • An alert with minimal false positives.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The 4.16 dev cycle showed CMO e2e test timeouts more frequently. This hinders our development process and might indicate an issue in our code.

We should spend some time to analyze these failures and improve CMO e2e test reliability.

Most of the Kube API requests executed during CMO e2e tests wait a few seconds before actually issuing the request. We could save a fraction of time per action if they didn't wait.

This epic is to track stories that are not completed in MON-3378

There are a few places in CMO where we need to remove code after the release-4.16 branch is cut.

To find them, look for the "TODO" comments.
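
For example (the exact grep pattern is an assumption; adjust it to match the wording of the actual TODO comments):

$ grep -rn "TODO" --include='*.go' . | grep -i '4.16'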

After we have replaced all oauth-proxy occurrences in the monitoring stack, we need to make sure that all references to oauth-proxy are removed from the cluster monitoring operator. Examples:

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Stop setting `--cloud-provider` and `--cloud-config` arguments on KAS, KCM and MCO
  • Remove `CloudControllerOwner` condition from CCM and KCM ClusterOperators
  • Remove feature gating reliance in library-go IsCloudProviderExternal
  • Remove CloudProvider feature gates from openshift/api

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

KCM and KAS previously relied on having the `--cloud-provider` and `--cloud-config` flags set. However, these are no longer required as there is no cloud provider code in either binary.

Both operators rely on the config observer in library-go to set these flags.

In the future, if these values are set, even to the empty string, then startup will fail.

The config observer sets keys and values for a map; we need to make sure the keys for these two flags are deleted rather than set to a specific value.

Steps

  • Update the logic in the config observer to remove the `--cloud-config` and `--cloud-provider` flags; neither should be set going forward
  • Update KAS and KCM operators to include the new logic.

Stakeholders

  • Cluster Infra
  • API team
  • Workloads team

Definition of Done

  • Clusters do not set `--cloud-provider` or `--cloud-config` on KAS and KCM
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Goal

  • Stabilize the new Kernel RT CI lanes (go green)
  • Transfer Kernel RT blocking lanes from GCP to the new EC2 Metal lanes

Why is this important?

  • The new EC2 metal-based lanes are a better representation of how customers will use the real-time kernel and as such should replace the outdated lanes that were built on virtualized GCP hardware. 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully (Green)

Dependencies (internal and external)

  1.  

Previous Work (Optional):

Open questions:

None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Problem:

ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17. In the console UI, we have a ClusterTask list page, and ClusterTasks are also listed in the Tasks quick search in the Pipeline builder form.

Goal:

Remove ClusterTask and references from the console UI and use Tasks from `openshift-pipelines` namespace.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Remove the ClusterTasks tab and the list page
  2. Remove ClusterTasks from Tasks quick search
  3. List Tasks from the `openshift-pipelines` namespace in the Tasks quick search
  4. Users should be able to create pipelines using the tasks from `openshift-pipelines` namespace in the Pipeline builder.
  5. Remove the ClusterTasks tab, list page, and Task quick search entries from the static plugin only if the 1.17 Pipelines Operator is installed.
  6. Backport the static plugin changes to the previous OCP version supported by 1.17

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Resolver in Tekton https://tekton.dev/docs/pipelines/resolution-getting-started/ 

Task resolution: https://tekton.dev/docs/pipelines/cluster-resolver/#task-resolution 
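
For reference, a Pipeline can reference a Task from the openshift-pipelines namespace via the Tekton cluster resolver roughly like this (the task name is just an example):

taskRef:
  resolver: cluster
  params:
  - name: kind
    value: task
  - name: name
    value: git-clone
  - name: namespace
    value: openshift-pipelines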

Note:

Description

ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17

We have to use Tasks from the `openshift-pipelines` namespace. This change will happen in the console-plugin repo (dynamic plugin), so in the console repository we have to remove all ClusterTask dependencies if the Pipelines Operator is 1.17 or above.

Acceptance Criteria

  1. Remove ClusterTask list page in search menu
  2. Remove ClusterTask list page tab in Tasks navigation menu
  3. ClusterTask to be removed from quick search in Pipelines builder
  4. Update the test cases (can we remove the ClusterTask tests for Pipelines 1.17 and above?)

Additional Details:

Problem:

Goal:

Acceptance criteria:

  1. Move the PipelinesBuilder to the dynamic plugin
  2. The Pipeline Builder should work without and with the new dynamic plugin

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

This is a clone of issue OCPBUGS-43752. The following is the description of the original issue:

Description of problem:

    Add a disallowed flag to hide the pipelines-plugin Pipeline builder route, add action, and catalog provider extension as they are migrated to the Pipelines console-plugin, so that there is no duplicate action in the console.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently `pkg/operator/bootstrap.go` is quite a mess, as vSphere explicitly ignores the `getPlatformManifests` function and creates the manifest list manually.

As the logic used there is different from every other platform, this creates some confusion about when CoreDNS and keepalived are included: for every platform except vSphere we always deploy CoreDNS and sometimes skip keepalived, but for vSphere whenever keepalived is skipped, CoreDNS is skipped too.

As the code does not document the reasons for this, it should be refactored.

List of flakes:

 

Spec (file location):

  • creates v1 CRDs with a v1 schema successfully (test/e2e/crd_e2e_test.go)
  • should have removed the old configmap and put the new configmap in place (test/e2e/gc_e2e_test.go)
  • can satisfy an associated ClusterServiceVersion's ownership requirement (test/e2e/csv_e2e_test.go)
  • Is updated when the CAs expire (test/e2e/webhook_e2e_test.go)
  • upgrade CRD with deprecated version (test/e2e/installplan_e2e_test.go)
  • consistent generation (test/e2e/installplan_e2e_test.go)
  • should clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created (test/e2e/installplan_e2e_test.go)
  • OperatorCondition Upgradeable type and overrides (test/e2e/operator_condition_e2e_test.go)
  • eventually reports a successful state when using skip ranges (test/e2e/fail_forward_e2e_test.go)
  • eventually reports a successful state when using replaces (test/e2e/fail_forward_e2e_test.go)
  • intersection (test/e2e/operator_groups_e2e_test.go)
  • OLM applies labels to Namespaces that are associated with an OperatorGroup (test/e2e/operator_groups_e2e_test.go)
  • updates multiple intermediates (test/e2e/subscription_e2e_test.go)
  • creation with dependencies (test/e2e/subscription_e2e_test.go)
  • choose the dependency from the right CatalogSource based on lexicographical name ordering of catalogs (test/e2e/subscription_e2e_test.go)
  • should report only package and channel deprecation conditions when bundle is no longer deprecated (test/e2e/subscription_e2e_test.go)

 

$ grep -ir "[FLAKE]" test/e2e

 

Description of problem:

The e2e test "upgrade CRD with deprecated version" in the test/e2e/installplan_e2e_test.go suite is flaking

Version-Release number of selected component (if applicable):

    

How reproducible:

Hard to reproduce, could be related to other tests running at the same time, or any number of things. 

Steps to Reproduce:

It might be worthwhile trying to re-run the test multiple times against a ClusterBot or OpenShift Local cluster

Actual results:

    

Expected results:

    

Additional info:

    

Goal

Deploy ODF with only SSDs

Problem

Some customers (especially those using VMware) are deploying ODF with HDDs

Why is this important?

Some customers (especially those using VMware) are deploying ODF with HDDs

Prioritized Scenarios

In Scope

Deployment - Add a warning and block deployment in case HDD disks are in use with LSO

Out of Scope

Add capacity warning and block 

Documentation Requirements

No documentation requirements.

Customers

No, this request is coming from support

Customer Facing Story

Per [1], ODF does not support HDD in internal mode. I would like to request we add a feature to the console during install that either stops, or warns the customer that they're installing an unsupported cluster if HDDs are detected and selected as the osd devices. I know that we can detect the rotational flag of all locally attached devices since we currently have the option to filter by ssd vs. hdd when picking the backing disks during install. This bz is a request to take it a step further and present the customer with the information explicitly during console install that hdds are unsupported. [1] https://access.redhat.com/articles/5001441
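
For context, the rotational flag mentioned above can be inspected directly on a node; ROTA=1 indicates a rotational (HDD) device, ROTA=0 a non-rotational (SSD/NVMe) one:

$ lsblk -d -o NAME,ROTA,SIZE,TYPE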

Reporter and Point of Contact:

bmcmurra@redhat.com

Add an ODF use-case-specific warning to the LSO UI indicating that "HDD devices are not supported for ODF", in case users plan to use those devices for ODF "StorageSystem" creation later.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

In some cases, it is desirable to set the control plane size of a cluster regardless of the number of workers. 

Introduce an annotation hypershift.openshift.io/cluster-size-override to set the value of the cluster size label regardless of number of workers on the hosted cluster.
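
For example (the HostedCluster name, namespace, and size value are placeholders; valid size values depend on the cluster sizing configuration in use):

$ oc annotate hostedcluster my-cluster -n clusters \
    hypershift.openshift.io/cluster-size-override=large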

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we recommend installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest of using Openshift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance Criteria

  • Since api and library-go are the last projects for removal, remove only alibaba specific code and vendoring

<!--

Please make sure to fill all story details here with enough information so
that it can be properly sized and is immediately actionable. Our Definition
of Ready for user stories is detailed in the link below:

https://docs.google.com/document/d/1Ps9hWl6ymuLOAhX_-usLmZIP4pQ8PWO15tMksh0Lb_A/

As much as possible, make sure this story represents a small chunk of work
that could be delivered within a sprint. If not, consider the possibility
of splitting it or turning it into an epic with smaller related stories.

Before submitting it, please make sure to remove all comments like this one.

-->

*USER STORY:*

<!--

One sentence describing this story from an end-user perspective.

-->

As a [type of user], I want [an action] so that [a benefit/a value].

*DESCRIPTION:*

<!--

Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.

-->

*Required:*

...

*Nice to have:*

...

*ACCEPTANCE CRITERIA:*

<!--

Describe the goals that need to be achieved so that this story can be
considered complete. Note this will also help QE to write their acceptance
tests.

-->

*ENGINEERING DETAILS:*

<!--

Any additional information that might be useful for engineers: related
repositories or pull requests, related email threads, GitHub issues or
other online discussions, how to set up any required accounts and/or
environments if applicable, and so on.

-->

We have a consistent complication where developers miss or ignore job failures on presubmits, because they don't trust the jobs, which sometimes have overall pass rates under 30%.

We have a systemic problem with flaky tests and jobs. Few pay attention anymore, and even fewer people know how to distinguish serious failures from the noise.

Just fixing the tests and jobs is infeasible; piece by piece maybe, but we do not have the time to invest in what would be a massive effort.

Sippy now has presubmit data throughout the history of a PR.

Could Sippy analyze the presubmits for every PR, check test failures against their current pass rate, filter out noise from ongoing incidents, and then comment on PRs letting developers know what's really going on?

As an example:

job foo - failure severity: LOW

  • test a failed x times, current pass rate 40%, flake rate 20%

job bar - failure severity: HIGH

  • test b failed 2 times, current pass rate 99%

job zoo - failure severity: UNKNOWN

  • on-going incident: Azure Install Failures (TRT-XXX)

David requests that this also be published in the job as a Spyglass panel, which gives us a historical artifact. We'd likely do both so we know developers see the comments.

This epic will cover TRTs project to enhance Sippy to categorize the likely severity of test failures in a bad job run, store this as a historical artifact on the job run, and communicate it directly to developers in their PRs via a comment.

We want to work on enhancing the analytics for Risk Analysis. Currently we comment when we see repeated failures that have high historical pass rates; however, when a regression comes in from another PR, we will flag that regression as a risk for each PR that sees the failure.

 

For this story we want to persist potential regressions detected by risk analysis in BigQuery. A potential place to insert this logic is in pr_commenting_processor.

We want to make sure we can track the repo, PR, test name, potentially the test ID, and the associated risk.

Future work will include querying this data when building the risk summary to see if any tests flagged as risky within the current PR are also failing in other PRs indicating this PR is not the contributing factor and a regression has potentially entered the main branches / payloads.

The intervals charts displayed at the top of all prow job runs have become a critical tool for TRT and OpenShift engineering in general, allowing us to determine what happened when, and in relation to other events. The tooling, however, is falling short in a number of areas we'd like to improve.

Goals:

  • sharable links
  • improved filtering
  • a live service rather than chart HTML as an artifact, which is difficult to improve on and regenerate for old jobs

Stretch goals:

  • searchable intervals across many / all jobs

See linked Jira which must be completed before this can be done.

We want this to be dynamic, so origin code can effectively decide what the intervals should show. This is done via the new "display" field on Intervals. Grouping should likewise be automatic, and the JS / react UI should no longer have to decide which groups to show. Possible origin changes will be needed here.

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/157

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/54

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/118

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38557. The following is the description of the original issue:

Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:

  1. warn users when they are on a GCP cluster that supports GCP's Workload Identity Management and the operator they will be installing supports it
  2. allow subscribing to an operator that supports it to be customized in the UI by adding, to the Subscription's config field, the fields that need to be provided to the operator at install time.

CONSOLE-3776 added filtering for the GCP WIF case to the OperatorHub tile view. Part of the change was also to check for the annotation which indicates that the operator supports GCP's WIF:

features.operators.openshift.io/token-auth-gcp: "true"

 

AC:

  • Add warning alert to the operator-hub-item-details component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • Add warning alert to the operator-hub-subscribe component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • If the cluster is in GCP WIF mode and the operator claims support for it, the subscription page provides configuration of 4 additional fields, which will be set on the Subscription's spec.config.env field (see the sketch after this list):
    • POOL_ID
    • PROVIDER_ID
    • SERVICE_ACCOUNT_EMAIL
  • Default the subscription approval to Manual for installs on WIF-mode clusters for operators that support it.
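
A rough sketch of the resulting Subscription (the operator name, namespace, channel, and values are placeholders):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  name: example-operator
  channel: stable
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
  config:
    env:
    - name: POOL_ID
      value: my-workload-identity-pool
    - name: PROVIDER_ID
      value: my-workload-identity-provider
    - name: SERVICE_ACCOUNT_EMAIL
      value: operator-sa@my-project.iam.gserviceaccount.com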

 

Design docs

Added a new CLI to autogenerate the updates needed for the deployment.yaml, create a branch, push the changes to GitLab, and create the MR

Description of problem:

Starting with OCP 4.14, we have decided to start using OCP's own "bridge" CNI build instead of our "cnv-bridge" rebuild. To make sure that current users of "cnv-bridge" don't have to change their configuration, we kept "cnv-bridge" as a symlink to "bridge". While the old name still functions, we should make an effort to move users to "bridge". To do that, we can start by changing UI so it generates NADs of the type "bridge" instead of "cnv-bridge".

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Always

Steps to Reproduce:

1. Use the NetworkAttachmentDefinition dialog to create a network of type bridge
2. Read the generated yaml

Actual results:

It has "type": "cnv-bridge"

Expected results:

It should have "type": "bridge"

Additional info:

The same should be done to any instance of "cnv-tuning" by changing it to "tuning".
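
For reference, a minimal NAD of the expected form (name, namespace, and bridge name are placeholders):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-bridge-network
  namespace: example-ns
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "example-bridge-network",
      "type": "bridge",
      "bridge": "br1",
      "ipam": {}
    }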

currently azure capi-provider crashes with following error 

E0529 10:44:25.385040 1 main.go:430] "unable to create controller" err="failed to create mapper for Cluster to AzureMachines: failed to get restmapping: no matches for kind \"AzureMachinePoolList\" in group \"infrastructure.cluster.x-k8s.io\"" logger="setup" controller="AzureMachinePool"  

This is caused by the MachinePool feature gate now being enabled by default in cluster-api-provider-azure.

Description of problem:

Version-Release number of selected component (if applicable):
Build the cluster with PR openshift/ovn-kubernetes#2223,openshift/cluster-network-operator#2433, enable TechPreview feature gate

How reproducible:
Always
Steps to Reproduce:

1. Create namespace ns1,ns2,ns3

2. Create NAD under ns1,ns2,ns3 with

 % oc get net-attach-def -n ns1 -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2024-07-11T08:35:13Z"
    generation: 1
    name: l3-network-ns1
    namespace: ns1
    resourceVersion: "165141"
    uid: 8eca76bf-ee30-4a0e-a892-92a480086aa1
  spec:
    config: |
      {
              "cniVersion": "0.3.1",
              "name": "l3-network-ns1",
              "type": "ovn-k8s-cni-overlay",
              "topology":"layer3",
              "subnets": "10.200.0.0/16/24",
              "mtu": 1300,
              "netAttachDefName": "ns1/l3-network-ns1",
              "role": "primary"
      }
kind: List
metadata:
  resourceVersion: ""

% oc get net-attach-def -n ns2 -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2024-07-11T08:35:19Z"
    generation: 1
    name: l3-network-ns2
    namespace: ns2
    resourceVersion: "165183"
    uid: 944b50b1-106f-4683-9cea-450521260170
  spec:
    config: |
      {
              "cniVersion": "0.3.1",
              "name": "l3-network-ns2",
              "type": "ovn-k8s-cni-overlay",
              "topology":"layer3",
              "subnets": "10.200.0.0/16/24",
              "mtu": 1300,
              "netAttachDefName": "ns2/l3-network-ns2",
              "role": "primary"
      }
kind: List
metadata:
  resourceVersion: ""

% oc get net-attach-def -n ns3 -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2024-07-11T08:35:26Z"
    generation: 1
    name: l3-network-ns3
    namespace: ns3
    resourceVersion: "165257"
    uid: 93683aac-7f8a-4263-b0f6-ed9182c5c47c
  spec:
    config: |
      {
              "cniVersion": "0.3.1",
              "name": "l3-network-ns3",
              "type": "ovn-k8s-cni-overlay",
              "topology":"layer3",
              "subnets": "10.200.0.0/16/24",
              "mtu": 1300,
              "netAttachDefName": "ns3/l3-network-ns3",
              "role": "primary"
      }
kind: List
metadata:

3. Create test pods under ns1,ns2,ns3
Using below yaml to create pods under ns1

% cat data/udn/list-for-pod.json 
{
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {
                "labels": {
                    "name": "test-rc"
                },
                "name": "test-rc"
            },
            "spec": {
                "replicas": 2,
                "template": {
                    "metadata": {
                        "labels": {
                            "name": "test-pods"
                        },
                      "annotations": { "k8s.v1.cni.cncf.io/networks": "l3-network-ns1"}
                    },
                    "spec": {
                        "containers": [
                            {
                                "image": "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4",
                                "name": "test-pod",
                                "imagePullPolicy": "IfNotPresent"
                                }
                        ]
                    }
                }
            }
        },
        {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "labels": {
                    "name": "test-service"
                },
                "name": "test-service"
            },
            "spec": {
                "ports": [
                    {
                        "name": "http",
                        "port": 27017,
                        "protocol": "TCP",
                        "targetPort": 8080
                    }
                ],
                "selector": {
                    "name": "test-pods"
                }
            }
        }
    ]
}
 oc get pods -n ns1 
NAME            READY   STATUS    RESTARTS   AGE
test-rc-5ns7z   1/1     Running   0          3h7m
test-rc-bxf2h   1/1     Running   0          3h7m

Using below yaml to create a pod in ns2
% cat data/udn/podns2.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: hello-pod-ns2
  namespace: ns2
  annotations:
    k8s.v1.cni.cncf.io/networks: l3-network-ns2
  labels:
    name: hello-pod-ns2
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - image: "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4"
    name: hello-pod-ns2
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]

Using below yaml to create a pod in ns3
% cat data/udn/podns3.yaml 
kind: Pod
apiVersion: v1
metadata:
  name: hello-pod-ns3
  namespace: ns3
  annotations:
    k8s.v1.cni.cncf.io/networks: l3-network-ns3
  labels:
    name: hello-pod-ns3
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - image: "quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4"
    name: hello-pod-ns3
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]

4. Test the pods connection in primary network in ns1, it worked well

% oc rsh -n ns1 test-rc-5ns7z  
~ $ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if157: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:80:02:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.30/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe80:21e/64 scope link 
       valid_lft forever preferred_lft forever
3: net1@if158: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:c8:01:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.1.3/24 brd 10.200.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fec8:103/64 scope link 
       valid_lft forever preferred_lft forever
~ $ exit
 % oc rsh -n ns1 test-rc-bxf2h  
~ $ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if123: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:0c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.12/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:c/64 scope link 
       valid_lft forever preferred_lft forever
3: net1@if124: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:c8:02:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.2.3/24 brd 10.200.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fec8:203/64 scope link 
       valid_lft forever preferred_lft forever
~ $ ping 10.200.1.3
PING 10.200.1.3 (10.200.1.3) 56(84) bytes of data.
64 bytes from 10.200.1.3: icmp_seq=1 ttl=62 time=3.20 ms
64 bytes from 10.200.1.3: icmp_seq=2 ttl=62 time=1.06 ms
^C
--- 10.200.1.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.063/2.131/3.199/1.068 ms

5. Restart all ovn pods
% oc delete pods --all -n openshift-ovn-kubernetes
pod "ovnkube-control-plane-97f479fdc-qxh2g" deleted
pod "ovnkube-control-plane-97f479fdc-shkcm" deleted
pod "ovnkube-node-b4crf" deleted
pod "ovnkube-node-k2lzs" deleted
pod "ovnkube-node-nfnhn" deleted
pod "ovnkube-node-npltt" deleted
pod "ovnkube-node-pgz4z" deleted
pod "ovnkube-node-r9qbl" deleted

% oc get pods -n openshift-ovn-kubernetes
NAME READY STATUS RESTARTS AGE
ovnkube-control-plane-97f479fdc-4cxkc 2/2 Running 0 43s
ovnkube-control-plane-97f479fdc-prpcn 2/2 Running 0 43s
ovnkube-node-g2x5q 8/8 Running 0 41s
ovnkube-node-jdpzx 8/8 Running 0 40s
ovnkube-node-jljrd 8/8 Running 0 41s
ovnkube-node-skd9g 8/8 Running 0 40s
ovnkube-node-tlkgn 8/8 Running 0 40s
ovnkube-node-v9qs2 8/8 Running 0 39s

Check pods connection in primary network in ns1 again

Actual results:
The connection was broken in primary network

% oc rsh -n ns1 test-rc-bxf2h 
~ $ ping  10.200.1.3
PING 10.200.1.3 (10.200.1.3) 56(84) bytes of data.
From 10.200.2.3 icmp_seq=1 Destination Host Unreachable
From 10.200.2.3 icmp_seq=2 Destination Host Unreachable
From 10.200.2.3 icmp_seq=3 Destination Host Unreachable

Expected results:
The connection was not broken in primary network.

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

https://github.com/openshift/api/pull/1829 needs to be backported to 4.15 and 4.14. The API team asked (https://redhat-internal.slack.com/archives/CE4L0F143/p1715024118699869) to have a test before they can review and approve a backport. This bug's goal is to implement an e2e test which would use the connect timeout tuning option.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

Always

Steps to Reproduce:

N/A

Actual results:

 

Expected results:

 

Additional info:

The e2e test could have been a part of the initial implementation PR (https://github.com/openshift/cluster-ingress-operator/pull/1035).

Description of problem:

  This bug is created for tracking Automation of OCPBUGS-35347 (https://issues.redhat.com/browse/OCPBUGS-35347)

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-31367. The following is the description of the original issue:

Description of problem:

Alerts that have been silenced are still seen on the Console overview page.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    

Steps to Reproduce:

    1. For a cluster installed on version 4.15
    2. Silence an alert that is firing by going to Console --> Observe --> Alerting --> Alerts
    3. Check that the alert is added to the silenced alerts under Console --> Observe --> Alerting --> Silences
    4. Go back to the Console (Overview page); the silenced alert is still seen there

Actual results:

    The silenced alert can be seen on the OCP Overview page

Expected results:

    The silenced alert should not be seen on the Overview page

Additional info:

    

Description of problem:

The new-in-4.16 oc adm prune renderedmachineconfigs is dry-run by default. But as of 4.16.0-rc.9, the wording can be a bit alarming:

$ oc adm prune renderedmachineconfigs
Dry run enabled - no modifications will be made. Add --confirm to remove rendered machine configs.
Error deleting rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e: deleting rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e failed: machineconfigs.machineconfiguration.openshift.io "rendered-master-3fff60688940de967f8aa44e5aa0e87e" is forbidden: User "wking" cannot delete resource "machineconfigs" in API group "machineconfiguration.openshift.io" at the cluster scope 
Error deleting rendered MachineConfig rendered-master-c4d5b90a040ed6026ccc5af8838f7031: deleting rendered MachineConfig rendered-master-c4d5b90a040ed6026ccc5af8838f7031 failed: machineconfigs.machineconfiguration.openshift.io "rendered-master-c4d5b90a040ed6026ccc5af8838f7031" is forbidden: User "wking" cannot delete resource "machineconfigs" in API group "machineconfiguration.openshift.io" at the cluster scope
...

Those are actually dry-run requests, as you can see with --v=8:

$ oc --v=8 adm prune renderedmachineconfigs
...
I0625 10:49:36.291173    7200 request.go:1212] Request Body: {"kind":"DeleteOptions","apiVersion":"machineconfiguration.openshift.io/v1","dryRun":["All"]}
I0625 10:49:36.291209    7200 round_trippers.go:463] DELETE https://api.build02.gcp.ci.openshift.org:6443/apis/machineconfiguration.openshift.io/v1/machineconfigs/rendered-master-3fff60688940de967f8aa44e5aa0e87e
...

But Error deleting ... failed isn't explicit about it being a dry-run deletion that failed. Even with appropriate privileges:

$ oc --as system:admin adm prune renderedmachineconfigs
Dry run enabled - no modifications will be made. Add --confirm to remove rendered machine configs.
DRY RUN: Deleted rendered MachineConfig rendered-master-3fff60688940de967f8aa44e5aa0e87e
...

could be more clear that it was doing a dry-run deletion.

Version-Release number of selected component

4.16 and 4.17.

How reproducible

Every time.

Steps to Reproduce

1. Install a cluster.
2. Get some outdated rendered MachineConfig by bumping something, unless your cluster has some by default.
3. Run oc adm prune renderedmachineconfigs, both with and without permission to do the actual deletion.

Actual results

Wording like Error deleting ... failed and Deleted rendered... can spook folks who don't understand that it was an attempted dry-run deletion.

Expected results

Soothing wording that makes it very clear that the API request was a dry-run deletion.

This is a clone of issue OCPBUGS-41532. The following is the description of the original issue:

The cns-migration tool should check for supported versions of vCenter before starting migration of CNS volumes.

Description of problem:

    When virtualHostedStyle is enabled with regionEndpoint set in config.imageregistry/cluster, the image registry fails to run. Errors thrown:

time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime" 

Version-Release number of selected component (if applicable):

    4.14.18

How reproducible:

    always

Steps to Reproduce:

    1.
$ oc get config.imageregistry/cluster -ojsonpath="{.status.storage}"|jq 
{
  "managementState": "Managed",
  "s3": {
    "bucket": "ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc",
    "encrypt": true,
    "region": "us-west-1",
    "regionEndpoint": "https://s3-fips.us-west-1.amazonaws.com",
    "trustedCA": {
      "name": ""
    },
    "virtualHostedStyle": true
  }
}     
    2. Check registry pod
$ oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.5    True        True          True       79m     Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-b6c58998d" has timed out progressing
    
    

Actual results:

$ oc get pods image-registry-b6c58998d-m8pnb -oyaml| yq '.spec.containers[0].env'
- name: REGISTRY_STORAGE_S3_REGIONENDPOINT
  value: https://s3-fips.us-west-1.amazonaws.com
[...]
- name: REGISTRY_STORAGE_S3_VIRTUALHOSTEDSTYLE
  value: "true"
[...]

$ oc logs image-registry-b6c58998d-m8pnb
[...]
time="2024-04-22T14:14:31.057192227Z" level=error msg="s3aws: RequestError: send request failed\ncaused by: Get \"https://s3-fips.us-west-1.amazonaws.com/ci-ln-67zbmzk-76ef8-4n6wb-image-registry-us-west-1-xjyfbabyboc?list-type=2&max-keys=1&prefix=\": dial tcp: lookup s3-fips.us-west-1.amazonaws.com on 172.30.0.10:53: no such host" go.version="go1.20.12 X:strictfipsruntime"     

Expected results:

    virtual hosted-style should work

Additional info:

    

Description of problem:

Additional IBM Cloud Services require the ability to override their service endpoints within the Installer. The list of available services provided in openshift/api must be expanded to account for this.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%

Steps to Reproduce:

    1. Create an install-config for IBM Cloud
    2. Define serviceEndpoints, including one for "resourceCatalog"
    3. Attempt to run IPI
    

Actual results:

 

Expected results:

Successful IPI installation, using additional IBM Cloud Service endpoint overrides.

Additional info:

IBM Cloud is working on multiple patches to incorporate these additional services. The full list is still a work in progress, but currently includes:
- Resource (Global) Catalog endpoint
- COS Config endpoint
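
An illustrative install-config snippet (the service names follow the reproducer steps above and the final accepted names are defined by the in-progress openshift/api changes; the URLs are placeholders):

platform:
  ibmcloud:
    region: us-east
    serviceEndpoints:
    - name: resourceCatalog        # name taken from the reproducer steps; subject to the final API
      url: https://<resource-catalog-endpoint>
    - name: cosConfig              # assumed name for the COS Config endpoint
      url: https://<cos-config-endpoint>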

Changes are currently required in the following components. Separate Jiras may be opened (if required) to track their progress.
- openshift/api
- openshift-installer
- openshift/cluster-image-registry-operator

Please review the following PR: https://github.com/openshift/csi-operator/pull/242

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When running must-gather, a DaemonSet is created to collect performance-related information for nodes in a cluster. If a node is tainted (for example with well-defined OpenShift taints for infra nodes, ODF nodes, master nodes, etc.), then the DaemonSet does not create Pods on these nodes and the information is not collected.

Version-Release number of selected component (if applicable):

    4.14.z

How reproducible:

    Reproducible

Steps to Reproduce:

1. Taint a node in a cluster with a custom taint i.e. "oc adm taint  node <node_name> node-role.kubernetes.io/infra=reserved:NoSchedule node-role.kubernetes.io/infra=reserved:NoExecute". Ensure at least one node is not tainted.    

2.Run `oc adm must-gather` to generate report to local filesystem
  
    

Actual results:

    The performance stats collected under directory <must_gather_dir>/nodes/ only contains results for nodes without taints.

Expected results:

    The performance stats collected under directory <must_gather_dir>/nodes/ should contain entries for all nodes in the cluster.

Additional info:

    This issue has been identified by using the Performance Profile Creator. This tool requires the output of must-gather as its input (as described in the instructions here: https://docs.openshift.com/container-platform/4.14/scalability_and_performance/cnf-create-performance-profiles.html#running-the-performance-profile-profile-cluster-using-podman_cnf-create-performance-profiles). When following this guide, the missing performance information for tainted nodes results in the error "failed to load node's worker's GHW snapshot: can't obtain the path: <node_name>" when running the tool in discovery mode.
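
One possible direction (a sketch only, not necessarily the agreed fix) is to give the collection DaemonSet a blanket toleration so its pods can also be scheduled onto tainted nodes:

spec:
  template:
    spec:
      tolerations:
      - operator: Exists   # tolerate any taint, including NoSchedule/NoExecute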

 

 

Please review the following PR: https://github.com/openshift/bond-cni/pull/64

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The ACM perf/scale hub OCP has 3 baremetal nodes, each with 480GB for the installation disk. The metal3 pod uses too much disk space for logs, which puts the node under disk pressure and starts evicting pods, which in turn makes ACM stop provisioning clusters.
Below is the log size of the metal3 pods:
# du -h -d 1 /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
4.0K	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/machine-os-images
276M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-httpd
181M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic
384G	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
77M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic-inspector
385G	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83

# ls -l -h /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
total 384G
-rw-------. 1 root root 203G Jun 10 12:44 0.log
-rw-r--r--. 1 root root 6.5G Jun 10 09:05 0.log.20240610-084807.gz
-rw-r--r--. 1 root root 8.1G Jun 10 09:27 0.log.20240610-090606.gz
-rw-------. 1 root root 167G Jun 10 09:27 0.log.20240610-092755

the logs are too huge to be attached. Please contact me if you need access to the cluster to check.

 

 

 

Version-Release number of selected component (if applicable):

the one has the issue is 4.16.0-rc4. 4.16.0.rc3 does not have the issue

How reproducible:

 

Steps to Reproduce:

1.Install latest ACM 2.11.0 build on OCP 4.16.0-rc4 and deploy 3500 SNOs on baremetal hosts
2.
3.

Actual results:

ACM stops deploying the rest of the SNOs after 1913 SNOs are deployed because ACM pods are being evicted.

Expected results:

3500 SNOs are deployed.

Additional info:

 

Description of problem

To reduce QE load, we've decided to block up the hole drilled in OCPBUGS-24535. We might not want a pure revert, if some of the changes are helpful (e.g. more helpful error messages).

We also want to drop the oc adm upgrade rollback subcommand which was the client-side tooling associated with the OCPBUGS-24535 hole.

Version-Release number of selected component

Both 4.16 and 4.17 currently have the rollback subcommand and associated CVO-side hole.

How reproducible

Every time.

Steps to Reproduce

Try to perform the rollbacks that OCPBUGS-24535 allowed.

Actual results

They work, as verified in OCPBUGS-24535.

Expected results

They stop working, with reasonable ClusterVersion conditions explaining that even those rollback requests will not be accepted.

Note, we're tracking the blocking fix for this via the workaround in OCPBUGS-35971, therefore moving this to blocker rejected. This will track picking up the fixed kernel and potentially disabling the workaround.

 

Description of problem:

Since 4.16.0, pods with memory limits tend to OOM very frequently when writing files larger than the memory limit to a PVC.

Version-Release number of selected component (if applicable):

4.16.0-rc.4

How reproducible:

100% on certain types of storage
(AWS FSx, certain LVMS setups, see additional info)

Steps to Reproduce:

1. Create pod/pvc that writes a file larger than the container memory limit (attached example)
2.
3.

Actual results:

OOMKilled

Expected results:

Success

Additional info:

Reproducer in OpenShift terms:
https://gist.github.com/akalenyu/949200f48ec89c42429ddb177a2a4dee

The following is relevant for eliminating the OpenShift layer from the issue.
For simplicity, I will focus on BM setup that produces this with LVM storage.
This is also reproducible on AWS clusters with NFS backed NetApp ONTAP FSx.

Further reduced to exclude the OpenShift layer, LVM on a separate (non root) disk:

Prepare disk
lvcreate -T vg1/thin-pool-1 -V 10G -n oom-lv
mkfs.ext4 /dev/vg1/oom-lv 
mkdir /mnt/oom-lv
mount /dev/vg1/oom-lv /mnt/oom-lv

Run container
podman run -m 600m --mount type=bind,source=/mnt/oom-lv,target=/disk --rm -it quay.io/centos/centos:stream9 bash
[root@2ebe895371d2 /]# curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-x86_64-9-20240527.0.x86_64.qcow2 -o /disk/temp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 47 1157M   47  550M    0     0   111M      0  0:00:10  0:00:04  0:00:06  111MKilled
(Notice the process gets killed, I don't think podman ever whacks the whole container over this though)

The same process on the same hardware on a 4.15 node (9.2) does not produce an OOM
(vs 4.16 which is RHEL 9.4)

For completeness, I will provide some details about the setup behind the LVM pool, though I believe it should not impact the decision about whether this is an issue:
sh-5.1# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               vg1
  PV Size               446.62 GiB / not usable 4.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              114335
  Free PE               11434
  Allocated PE          102901
  PV UUID               <UUID>
Hardware:
SSD (INTEL SSDSC2KG480G8R) behind a RAID 0 of a PERC H330 Mini controller

At the very least, this seems like a change in behavior but tbh I am leaning towards an outright bug.

Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/108

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Recent metal-ipi serial jobs are taking a lot longer than they previously had been.

e.g.

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift-metal3_dev-scripts/1668/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-serial-ipv4/1808978586380537856
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift-metal3_dev-scripts/1668/pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-serial-ipv4/1808978586380537856/

 

Sometimes the tests are timing out after 3 hours.

They seem to be spending a lot of time in these 3 etcd tests (roughly 40 minutes in total):

passed: (10m13s) 2024-07-05T00:01:50 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to \"\" [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (20m53s) 2024-07-05T01:12:19 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to Slower [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (10m13s) 2024-07-05T01:25:33 "[sig-etcd][OCPFeatureGate:HardwareSpeed][Serial] etcd is able to set the hardware speed to Standard [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

This is a clone of issue OCPBUGS-37945. The following is the description of the original issue:

Description of problem:

    openshift-install create cluster leads to error:
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'. 

vSphere standard port group

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. openshift-install create cluster
    2. Choose vSphere
    3. Fill in the blanks
    4. Have a standard port group
    

Actual results:

    error

Expected results:

    cluster creation

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/419

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CPO unit tests fail

Version-Release number of selected component (if applicable):

4.17

How reproducible:

https://github.com/openshift/cloud-provider-openstack/pull/282

Description of problem:

documentationBaseURL still points to 4.16

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-20-005211

How reproducible:

Always

Steps to Reproduce:

1. check documentationBaseURL on a 417 cluster
$ oc get cm console-config -n openshift-console -o yaml | grep documentation
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.16/
2.
3.

Actual results:

documentationBaseURL still links to 4.16

Expected results:

documentationBaseURL should link to 4.17

Additional info:

 

Description of problem

I haven't traced out the trigger-pathway yet, but 4.13 and 4.16 machine-config controllers seem to react to Node status updates with makeMasterNodeUnSchedulable calls that result in Kubernetes API PATCH calls, even when the patch being requested is empty. This creates unnecessary API volume, loading the control plane and resulting in distracting Kube-API audit log activity.

Version-release number of selected component (if applicable)

Seen in 4.13.34, and reproduced in 4.16 CI builds, so likely all intervening versions. Possibly all versions.

How reproducible

Every time.

Steps to reproduce

mco#4277 reproduces this by reverting OCPBUGS-29713 to get frequent Node status updates. Then, in presubmit CI, under e2e-aws-ovn > Artifacts > ... > gather-extra pod logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | tail

and:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4277/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn/1770970949110206464/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-bdcdf554f-ct5hh_machine-config-controller.log | grep -o 'makeMasterNodeUnSchedulable\|UpdateNodeRetry' | sort | uniq -c

give...

Actual results

I0322 02:10:13.027938       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:13.029671       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:13.669568       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:13.671023       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}
I0322 02:10:21.095260       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-114-0.us-east-2.compute.internal
I0322 02:10:21.098410       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-114-0.us-east-2.compute.internal: {}
I0322 02:10:23.215168       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-34-167.us-east-2.compute.internal
I0322 02:10:23.219672       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-34-167.us-east-2.compute.internal: {}
I0322 02:10:24.049456       1 node_controller.go:310] makeMasterNodeUnSchedulable ip-10-0-84-206.us-east-2.compute.internal
I0322 02:10:24.050939       1 kubeutils.go:48] UpdateNodeRetry patch ip-10-0-84-206.us-east-2.compute.internal: {}

showing frequent, no-op patch attempts and:

   1408 makeMasterNodeUnSchedulable
   1414 UpdateNodeRetry

showing many attempts over the life of the MCC container.

Expected results

No need to PATCH on makeMasterNodeUnSchedulable unless the generated patch content contained more than a no-op patch.

Additional info

setDesiredMachineConfigAnnotation has a "DesiredMachineConfigAnnotationKey already matches what I want" no-op out here

setUpdateInProgressTaint seems to lack a similar guard here, and it should probably grow a check for "NodeUpdateInProgressTaint is already present". Same for removeUpdateInProgressTaint. But the hot loop in response to these Node status updates is makeMasterNodeUnSchedulable calling UpdateNodeRetry, and UpdateNodeRetry also lacks this kind of "no need to PATCH when no changes are requested" logic.

For this bug, we should:

  • Add that kind of guard somewhere to the makeMasterNodeUnSchedulable stack. I'd personally recommend putting it down at UpdateNodeRetry.
  • Optionally add similar guards to setUpdateInProgressTaint and removeUpdateInProgressTaint (or port those to use UpdateNodeRetry? If you port, you might want to port setDesiredMachineConfigAnnotation too, just for consistency).
  • Optionally add logging like I'm floating in mco#4277, so it's easier to understand why the MCC thinks it needs to take externally-visible action, to help debug future "these Kube API server audit logs have the MCC doing surprising stuff..." situations.
I0516 19:40:24.080597       1 controller.go:156] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling Machine
I0516 19:40:24.113866       1 controller.go:200] mbooth-psi-ph2q7-worker-0-9z9nn: reconciling machine triggers delete
I0516 19:40:32.487925       1 controller.go:115]  "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="machine-controller" "name"="mbooth-psi-ph2q7-worker-0-9z9nn" "namespace"="openshift-machine-api" "object"={"name":"mbooth-psi-ph2q7-worker-0-9z9nn","namespace":"openshift-machine-api"} "reconcileID"="f477312c-dd62-49b2-ad08-28f48c506c9a"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x242a275]

goroutine 317 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x29cfb00?, 0x40f1d50?})
        /usr/lib/golang/src/runtime/panic.go:914 +0x21f
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).constructPorts(0x3056b80?, 0xc00074d3d0, 0xc0004fe100)
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:188 +0xb5
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).DeleteInstance(0xc00074d388, 0xc000c61300?, {0x3038ae8, 0xc0008b7440}, 0xc00097e2a0, 0xc0004fe100)
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute/instance.go:678 +0x42d
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).Delete(0xc0001f2380, {0x304f708?, 0xc000c6df80?}, 0xc0008b7440)
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:341 +0x305
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc00045de50, {0x304f708, 0xc000c6df80}, {{{0xc00066c7f8?, 0x0?}, {0xc000dce980?, 0xc00074dd48?}}})
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:216 +0x1cfe
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x3052e08?, {0x304f708?, 0xc000c6df80?}, {{{0xc00066c7f8?, 0xb?}, {0xc000dce980?, 0x0?}}})
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0004eb900, {0x304f740, 0xc00045c500}, {0x2ac0340?, 0xc0001480c0?})
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004eb900, {0x304f740, 0xc00045c500})
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 269
        /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x565
> kc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.6   True        False         7d3h    Cluster version is 4.16.0-ec.6
> kc -n openshift-machine-api get machines.m mbooth-psi-ph2q7-worker-0-9z9nn -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: ERROR
    openstack-resourceId: dc08c2a2-cbda-4892-a06b-320d02ec0c6c
  creationTimestamp: "2024-05-16T16:53:16Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-05-16T19:23:44Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: mbooth-psi-ph2q7-worker-0-
  generation: 3
  labels:
    machine.openshift.io/cluster-api-cluster: mbooth-psi-ph2q7
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: mbooth-psi-ph2q7-worker-0
    machine.openshift.io/instance-type: ci.m1.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: ""
  name: mbooth-psi-ph2q7-worker-0-9z9nn
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: mbooth-psi-ph2q7-worker-0
    uid: f715dba2-b0b2-4399-9ab6-19daf6407bd7
  resourceVersion: "8391649"
  uid: 6d1ad181-5633-43eb-9b19-7c73c86045c3
spec:
  lifecycleHooks: {}
  metadata: {}
  providerID: openstack:///dc08c2a2-cbda-4892-a06b-320d02ec0c6c
  providerSpec:
    value:
      apiVersion: machine.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: ci.m1.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            tags: openshiftClusterID=mbooth-psi-ph2q7
      rootVolume:
        diskSize: 50
        sourceUUID: rhcos-4.16
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: mbooth-psi-ph2q7-worker
      serverGroupName: mbooth-psi-ph2q7-worker
      serverMetadata:
        Name: mbooth-psi-ph2q7-worker
        openshiftClusterID: mbooth-psi-ph2q7
      tags:
      - openshiftClusterID=mbooth-psi-ph2q7
      trunk: true
      userDataSecret:
        name: worker-user-data
status:
  addresses:
  - address: mbooth-psi-ph2q7-worker-0-9z9nn
    type: Hostname
  - address: mbooth-psi-ph2q7-worker-0-9z9nn
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2024-05-16T16:56:05Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2024-05-16T19:24:26Z"
    message: Node drain skipped
    status: "True"
    type: Drained
  - lastTransitionTime: "2024-05-16T17:14:59Z"
    status: "True"
    type: InstanceExists
  - lastTransitionTime: "2024-05-16T16:56:05Z"
    status: "True"
    type: Terminable
  lastUpdated: "2024-05-16T19:23:52Z"
  phase: Deleting

Description of problem: OCP doesn't resume from "hibernation" (shutdown/restart of cloud instances).

NB: This is not related to certs.

Version-Release number of selected component (if applicable): 4.16 nightlies, at least 4.16.0-0.nightly-2024-05-14-095225 through 4.16.0-0.nightly-2024-05-21-043355

How reproducible: 100%

Steps to Reproduce:

1. Install 4.16 nightly on AWS. (Other platforms may be affected, don't know.)
2. Shut down all instances. (I've done this via hive hibernation; Vadim Rutkovsky has done it via cloud console.)
3. Start instances. (Ditto.)

Actual results: OCP doesn't start. Per Vadim:
"kubelet says host IP unknown; known addresses: [] so etcd can't start."

Expected results: OCP starts normally.

Additional info: We originally thought this was related to OCPBUGS-30860, but reproduced with nightlies containing the updated AMIs.

Description of problem:

When building ODF Console Plugin, webpack issues tons of PatternFly dynamic module related warnings like:

<w> No dynamic module found for Button in @patternfly/react-core

Version-Release number of selected component (if applicable):

  • @openshift-console/dynamic-plugin-sdk version 1.3.0
  • @openshift-console/dynamic-plugin-sdk-webpack version 1.1.0

Steps to Reproduce:
1. git clone https://github.com/red-hat-storage/odf-console.git
2. cd odf-console
3. yarn install && yarn build-mco

Actual results: tons of warnings about missing dynamic modules.

Expected results: no warnings about missing dynamic modules.

Description of problem

The cluster-ingress-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-ingress-operator repository also vendors k8s.io/client-go v0.29.0. However, OpenShift 4.17 is based on Kubernetes 1.30.

Version-Release number of selected component (if applicable)

4.17.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.17/go.mod.

Actual results

The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/client-go package is at v0.29.0.

Expected results

The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and k8s.io/client-go is at v0.30.0 or newer. The k8s.io/client-go package version should match other k8s.io packages, such as k8s.io/api.

Additional info

https://github.com/openshift/cluster-ingress-operator/pull/1046 already bumped the k8s.io/* packages other than k8s.io/client-go to v0.29.0. Presumably k8s.io/client-go was missed by accident because of a replace rule in go.mod. In general, k8s.io/client-go should be at the same version as k8s.io/api and other k8s.io/* packages, and the controller-runtime package should be bumped to a version that uses the same minor version of the k8s.io/* packages.

The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
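A hedged sketch of the kind of bump described (module versions as stated above; the actual PR may differ in its details):

```bash
# Align controller-runtime and the k8s.io/* modules, then re-vendor.
# Any replace rule for k8s.io/client-go in go.mod would also need updating or removing.
go get sigs.k8s.io/controller-runtime@v0.18.0 \
       k8s.io/api@v0.30.0 k8s.io/apimachinery@v0.30.0 k8s.io/client-go@v0.30.0
go mod tidy
go mod vendor
```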

Our OCP Dockerfile currently uses the build target. However, this now also builds the React frontend and requires npm in turn. We already build the frontend during mirroring.

Switch our Dockerfile to use the common-build target. This should enable the bump to 0.27, tracked through this issue as well.

Description of problem:

    When selecting a runtime icon while deploying an image (e.g., when uploading a JAR file or importing from a container registry), the default icon is not checked in the dropdown menu.

However, when I select the same icon from the dropdown menu, it becomes checked. It should have already been checked when I first opened the dropdown menu.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. In the developer perspective, on the page sidebar, select "+Add", and then select either "Container images" or "Upload JAR file" 
    2. On the form body, open the runtime icon dropdown menu
    3. Scroll down until you see the default icon in the dropdown menu ("openshift" and "java" respectively)
    

Actual results:

    It is not checked

Expected results:

    The icon is checked in the dropdown menu

Additional info:

    

Description of problem:

The issue is found when QE testing the minimal Firewall list required by an AWS installation
(https://docs.openshift.com/container-platform/4.15/installing/install_config/configuring-firewall.html) for 4.16. The way we're verifying this is by setting all the URLs listed in the doc into the whitelist of a proxy server[1], adding the proxy to install-config.yaml, so addresses outside of the doc will be rejected by the proxy server during cluster installation. 
[1]https://steps.ci.openshift.org/chain/proxy-whitelist-aws

We're seeing this error on the masters' console:
``` 
[  344.982244] ignition[782]: GET https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master: attempt #73
[  344.985074] ignition[782]: GET error: Get "https://api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623/config/master": Forbidden
```

And the deny log from proxy server 
```
1717653185.468   0 10.0.85.91 TCP_DENIED/403 2252 CONNECT api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com:22623 - HIER_NONE/- text/html

```
So it looks like the master is using the proxy to reach the MCS address, and the internal API domain - api-int.ci-op-b2hcg02h-ce587.qe.devcluster.openshift.com - is not in the proxy whitelist, so the request is denied by the proxy. But this internal API address should already be in the noProxy list, so the master shouldn't use the proxy to send the internal request.

This is proxy info collected from another cluster; the api-int.<cluster_domain> entry is added to the no-proxy list by default.
```
[root@ip-10-0-11-89 ~]# cat /etc/profile.d/proxy.sh 
export HTTP_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export HTTPS_PROXY="http://ec2-3-16-83-95.us-east-2.compute.amazonaws.com:3128"
export NO_PROXY=".cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.gpei-dis3.qe.devcluster.openshift.com,localhost,test.no-proxy.com" 
```
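As a hedged cross-check on a live cluster, the api-int.<cluster_domain> entry is expected to appear in the rendered noProxy status of the cluster-wide proxy object:

```bash
# The generated noProxy list should already include api-int.<cluster_domain>.
oc get proxy cluster -o jsonpath='{.status.noProxy}{"\n"}'
```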


Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-06-02-202327

How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

If infrastructure or machine provisioning is slow, the installer may wait several minutes before declaring provisioning successful due to the exponential backoff.

For instance, if DNS resolution from load balancers is slow to propagate and we …

Version-Release number of selected component (if applicable):

 

How reproducible:

Sometimes, it depends on provisioning being slow. 

Steps to Reproduce:

1. Provision a cluster in an environment that has slow dns resolution (unclear how to set this up)
2.
3.

Actual results:

The installer will only check for infrastructure or machine readiness at intervals of several minutes after a certain threshold (say 10 minutes).

Expected results:

Installer should just check regularly, e.g. every 15 seconds.

Additional info:

It may not be possible to definitively test this. We may want to just check ci logs for an improvement in provisioning time and check for lack of regressions.

This is a clone of issue OCPBUGS-39081. The following is the description of the original issue:

If the network to the bootstrap VM is slow, extract-machine-os.service can time out (after 180s). If this happens, it will be restarted, but services that depend on it (like ironic) will never be started, even once it succeeds. systemd added support for Restart=on-failure for Type=oneshot services, but they still don't behave the same way as other types of services.

This can be simulated in dev-scripts by doing:

sudo tc qdisc add dev ostestbm root netem rate 33Mbit
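A hedged way to then inspect the failure mode on the bootstrap VM (only extract-machine-os.service is named above; the reverse-dependency listing shows whichever units are blocked on it):

```bash
# Did the oneshot service time out, and which units are waiting on it?
systemctl status extract-machine-os.service
systemctl list-dependencies --reverse extract-machine-os.service
```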

Description of problem:
VIPs are on a different network than the machine network on a 4.14 cluster.

Failing cluster: 4.14

Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.83
apiServerInternalIPs: 10.8.0.83
ingressIP: 10.8.0.84
ingressIPs: 10.8.0.84

All internal IP addresses of all nodes match the Machine Network.

Machine Network: 10.8.42.0/23

Node name IP Address Matches CIDR
..............................................................................................................
sv1-prd-ocp-int-bn8ln-master-0 10.8.42.24 YES
sv1-prd-ocp-int-bn8ln-master-1 10.8.42.35 YES
sv1-prd-ocp-int-bn8ln-master-2 10.8.42.36 YES
sv1-prd-ocp-int-bn8ln-worker-0-5rbwr 10.8.42.32 YES
sv1-prd-ocp-int-bn8ln-worker-0-h7fq7 10.8.42.49 YES
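A hedged way to pull the same VIP and node-address data straight from a live cluster for comparison:

```bash
# Configured VIPs as recorded in the Infrastructure status, plus node internal IPs.
oc get infrastructure cluster -o jsonpath='{.status.platformStatus.vsphere}{"\n"}'
oc get nodes -o wide
```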

logs from one of the haproxy pods

oc logs -n openshift-vsphere-infra haproxy-sv1-prd-ocp-int-bn8ln-master-0 haproxy-monitor
.....
2024-04-02T18:48:57.534824711Z time="2024-04-02T18:48:57Z" level=info msg="An error occurred while trying to read master nodes details from api-vip:kube-apiserver: failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.534849744Z time="2024-04-02T18:48:57Z" level=info msg="Trying to read master nodes details from localhost:kube-apiserver"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=error msg="Failed to retrieve API members information" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:48:57.544507441Z time="2024-04-02T18:48:57Z" level=info msg="GetLBConfig failed, sleep half of interval and retry" kubeconfigPath=/var/lib/kubelet/kubeconfig
2024-04-02T18:49:00.572652095Z time="2024-04-02T18:49:00Z" level=error msg="Could not retrieve subnet for IP 10.8.0.83" err="failed find a interface for the ip 10.8.0.83"

There is a kcs that addresses this:
https://access.redhat.com/solutions/7037425

However, this same configuration works in production on 4.12.

working cluster:
Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 10.8.0.73
apiServerInternalIPs: 10.8.0.73
ingressIP: 10.8.0.72
ingressIPs: 10.8.0.72

All internal IP addresses of all nodes match the Machine Network.

Machine Network: 10.8.38.0/23

Node name IP Address Matches CIDR
..............................................................................................................
sb1-prd-ocp-int-qls2m-cp4d-4875s 10.8.38.29 YES
sb1-prd-ocp-int-qls2m-cp4d-phczw 10.8.38.19 YES
sb1-prd-ocp-int-qls2m-cp4d-ql5sj 10.8.38.43 YES
sb1-prd-ocp-int-qls2m-cp4d-svzl7 10.8.38.27 YES
sb1-prd-ocp-int-qls2m-cp4d-x286s 10.8.38.18 YES
sb1-prd-ocp-int-qls2m-cp4d-xk48m 10.8.38.40 YES
sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 YES
sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 YES
sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 YES
sb1-prd-ocp-int-qls2m-worker-njzdx 10.8.38.15 YES
sb1-prd-ocp-int-qls2m-worker-rhqn5 10.8.38.39 YES

logs from one of the haproxy pods

2023-08-18T21:12:19.730010034Z time="2023-08-18T21:12:19Z" level=info msg="API is not reachable through HAProxy"
2023-08-18T21:12:19.755357706Z time="2023-08-18T21:12:19Z" level=info msg="Config change detected" configChangeCtr=1 curConfig="{6443 9445 29445 [

{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
2023-08-18T21:12:19.782529185Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:19.794532220Z time="2023-08-18T21:12:19Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:25.816406455Z time="2023-08-18T21:12:25Z" level=info msg="Config change detected" configChangeCtr=2 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443}

{sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443} {sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}] }"
2023-08-18T21:12:25.919248671Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat PREROUTING rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT"
2023-08-18T21:12:25.965663811Z time="2023-08-18T21:12:25Z" level=info msg="Removing existing nat OUTPUT rule" spec="--dst 10.8.0.73 -p tcp --dport 6443 -j REDIRECT --to-ports 9445 -m comment --comment OCP_API_LB_REDIRECT -o lo"
2023-08-18T21:12:32.005310398Z time="2023-08-18T21:12:32Z" level=info msg="Config change detected" configChangeCtr=3 curConfig="{6443 9445 29445 [{sb1-prd-ocp-int-qls2m-master-1 10.8.38.24 6443} {sb1-prd-ocp-int-qls2m-master-0 10.8.38.25 6443}

{sb1-prd-ocp-int-qls2m-master-2 10.8.38.30 6443}

] }"

The data is being redirected.

This was found in the sos report, under sos_commands/firewall_tables/:

nft_-a_list_ruleset

table ip nat { # handle 2
  chain PREROUTING { # handle 1
    type nat hook prerouting priority dstnat; policy accept;
    meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 66
    counter packets 82025408 bytes 5088067290 jump OVN-KUBE-ETP # handle 30
    counter packets 82025421 bytes 5088068062 jump OVN-KUBE-EXTERNALIP # handle 28
    counter packets 82025439 bytes 5088069114 jump OVN-KUBE-NODEPORT # handle 26
  }

  chain INPUT { # handle 2
    type nat hook input priority 100; policy accept;
  }

  chain POSTROUTING { # handle 3
    type nat hook postrouting priority srcnat; policy accept;
    counter packets 245475292 bytes 16221809463 jump OVN-KUBE-EGRESS-SVC # handle 25
    oifname "ovn-k8s-mp0" counter packets 58115015 bytes 4184247096 jump OVN-KUBE-SNAT-MGMTPORT # handle 16
    counter packets 187360548 bytes 12037581317 jump KUBE-POSTROUTING # handle 10
  }

  chain OUTPUT { # handle 4
    type nat hook output priority -100; policy accept;
    oifname "lo" meta l4proto tcp ip daddr 10.8.0.73 tcp dport 6443 counter packets 0 bytes 0 redirect to :9445 # handle 67
    counter packets 245122162 bytes 16200621351 jump OVN-KUBE-EXTERNALIP # handle 29
    counter packets 245122163 bytes 16200621411 jump OVN-KUBE-NODEPORT # handle 27
    counter packets 245122166 bytes 16200621591 jump OVN-KUBE-ITP # handle 24
  }

... many more lines ...

These rules were not added by the customer.

None of the redirect statements are present in the same file for 4.14 (the failing cluster).

Version-Release number of selected component (if applicable): ocp 4.14

How reproducible: 100%

Steps to Reproduce:
This is the install script that our Ansible job uses to install 4.12.

If you need it cleared up, let me know; all the items in {{ }} are just variables for file paths.

cp -r {{  item.0.cluster_name }}/install-config.yaml {{ openshift_base }}{{  item.0.cluster_name }}/
./openshift-install create manifests --dir {{ openshift_base }}{{  item.0.cluster_name }}/
cp -r machineconfigs/* {{ openshift_base }}{{  item.0.cluster_name }}/openshift/
cp -r {{  item.0.cluster_name }}/customizations/* {{ openshift_base }}{{  item.0.cluster_name }}/openshift/
./openshift-install create ignition-configs --dir {{ openshift_base }}{{  item.0.cluster_name }}/
./openshift-install create cluster --dir {{ openshift_base }}{{  item.0.cluster_name }} --log-level=debug

We are installing IPI on vmware

API and Ingress VIPs are configured on our external load balancer appliance. (Citrix ADCs if that matters)


    

Actual results:


The haproxy pods crashloop and do not work.
In 4.14, following the same install workflow, neither the API nor the Ingress VIP binds to masters or workers, and we see haproxy crashlooping.
    

Expected results:


For 4.12:
After a 4.12 install completes, if we look in VMware at our master and worker nodes, we see all of them have an IP address from the machine network assigned, and one node among the masters and one among the workers has the respective VIP bound as well.
 

    

Additional info:


    

Description of problem:

If one attempts to create more than one MachineOSConfig at the same time, each requiring a canonicalized secret, only one will build. The rest will not.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

Always

Steps to Reproduce:

    1. Create multiple MachineConfigPools. Wait for the MachineConfigPool to get a rendered config.
    2. Create multiple MachineOSConfigs at the same time, one for each of the newly-created MachineConfigPools, each using a legacy Docker pull secret. A legacy Docker pull secret is one which does not have its registry entries under a top-level auths key (see the sketch after these steps). One can use the builder-dockercfg secret in the MCO namespace for this purpose.
    3. Wait for the machine-os-builder pod to start.
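As an illustration of the legacy-versus-canonical layouts (a hedged sketch; the secret name suffix is hypothetical, and this is not the MCO's internal code path), a legacy .dockercfg-style secret can be wrapped under a top-level auths key with jq:

```bash
# Extract the legacy-format secret and wrap its registry entries under "auths".
oc -n openshift-machine-config-operator get secret builder-dockercfg-example \
  -o go-template='{{index .data ".dockercfg" | base64decode}}' \
  | jq '{auths: .}' > canonical-pull-secret.json
```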

    

Actual results:

Only one of the MachineOSBuilds begins building. The remaining MachineOSBuilds do not build, nor do they get a status assigned to them. The root cause is that when they all attempt to use the same legacy Docker pull secret, one creates the canonicalized version of it, and concurrent requests then fail because the canonicalized secret already exists.

Expected results:

Each MachineOSBuild should run whenever it is created, and it should also have some kind of status assigned to it.

Additional info:

    

Multiple PRs are failing with this error:

Deploy git workload with devfile from topology page: A-04-TC01: Create the different workloads from Add page

CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements.
https://on.cypress.io/focus

 

https://search.dptools.openshift.org/?search=Deploy+git+workload+with+devfile+from+topology+page&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This is a clone of issue OCPBUGS-44162. The following is the description of the original issue:

Description of problem:

We were told that adding connections to a Transit Gateway also costs an exorbitant amount of money. So the create option tgName now means that we will not clean up the connections during cluster destroy.
    

Description of problem:

Cluster-ingress-operator logs an update when one didn't happen.

% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -m 1 -- ingress-operator.log       
2024-05-17T14:46:01.434Z	INFO	operator.ingress_controller	ingress/controller.go:326	successfully updated Infra CR with Ingress Load Balancer IPs

% grep -e 'successfully updated Infra CR with Ingress Load Balancer IPs' -c -- ingress-operator.log 
142

https://github.com/openshift/cluster-ingress-operator/pull/1016 has a logic error, which causes the operator to log this message even when it didn't do an update:

https://github.com/openshift/cluster-ingress-operator/blob/009644a6b197b67f074cc34a07868ef01db31510/pkg/operator/controller/ingress/controller.go#L1135-L1145

// If the lbService exists for the "default" IngressController, then update Infra CR's PlatformStatus with the Ingress LB IPs.
if haveLB && ci.Name == manifests.DefaultIngressControllerName {
    if updated, err := computeUpdatedInfraFromService(lbService, infraConfig); err != nil {
        errs = append(errs, fmt.Errorf("failed to update Infrastructure PlatformStatus: %w", err))
    } else if updated {
        if err := r.client.Status().Update(context.TODO(), infraConfig); err != nil {
            errs = append(errs, fmt.Errorf("failed to update Infrastructure CR after updating Ingress LB IPs: %w", err))
        }
    }
    log.Info("successfully updated Infra CR with Ingress Load Balancer IPs")
}

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Create a LB service for the default Ingress Operator
    2. Watch ingress operator logs for the search strings mentioned above
    

Actual results:

    Lots of these log entries will be seen even though no further updates are made to the default ingress operator:

2024-05-17T14:46:01.434Z INFO operator.ingress_controller ingress/controller.go:326 successfully updated Infra CR with Ingress Load Balancer IPs

Expected results:

    Only see this log entry when an update to the Infra CR is made. Perhaps just once, the first time you add an LB service to the default ingress controller.

Additional info:

     https://github.com/openshift/cluster-ingress-operator/pull/1016 was backported to 4.15, so it would be nice to fix it and backport the fix to 4.15. It is rather noisy, and it's trivial to fix.

Description of problem: The ovnkube-node and multus DaemonSets have hostPath volumes which prevent clean unmounting of CSI volumes, because of a missing "mountPropagation: HostToContainer" parameter in the volumeMount.

Version-Release number of selected component (if applicable):  OpenShift 4.14

How reproducible:  Always

Steps to Reproduce:

1. On a node, mount a file system underneath /var/lib/kubelet/, simulating the mount of a CSI driver PersistentVolume.

2. Restart the ovnkube-node pod running on that node.

3. Unmount the filesystem from step 1. The mount is then removed from the host's list of mounted devices; however, a copy of the mount is still active in the mount namespace of the ovnkube-node pod.
This blocks some CSI drivers that rely on multipath from properly deleting a block device, since mounts are still registered on the block device.
 

Actual results:
The CSI volume mount is uncleanly unmounted (a copy remains in the ovnkube-node pod's mount namespace).
 

Expected results:
The CSI volume mount is cleanly unmounted.
 

Additional info:

The mountPropagation parameter is already implemented in the volumeMount for the host rootFS:

            - name: host-slash
              readOnly: true
              mountPath: /host
              mountPropagation: HostToContainer

However, the same parameter is missing for the volumeMount of /var/lib/kubelet.
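A quick hedged check of the current setting on a live cluster (assuming jq is available); it prints every /var/lib/kubelet volumeMount in the DaemonSet so the missing mountPropagation field is visible:

```bash
oc -n openshift-ovn-kubernetes get ds ovnkube-node -o json \
  | jq '.spec.template.spec.containers[].volumeMounts[]?
        | select(.mountPath == "/var/lib/kubelet")'
```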

It is possible to workaround the issue with a kubectl patch command like this:

$ kubectl patch daemonset ovnkube-node --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/template/spec/containers/7/volumeMounts/1",
    "value": {
      "name": "host-kubelet",
      "mountPath": "/var/lib/kubelet",
      "mountPropagation": "HostToContainer",
      "readOnly": true
   }
 }
]'

 

Affected Platforms: Platform Agnostic UPI

Description of problem:

Using "accessTokenInactivityTimeoutSeconds: 900" for "OAuthClient" config. 

One inactive or idle tab causes session expiry for all other tabs. 

Following are the tests performed: 
Test 1 - a single window with a single tab no activity would time out after 15 minutes. 
 
Test 2 - a single window two tabs. No activity in the first tab, but was active in the second tab. Timeout occurred for both tabs after 15 minutes.

Test 3 - a single window with a single tab and activity, does not time out after 15 minutes.

Hence a single idle tab causes the user to be logged out of the rest of the tabs.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Set the OAuthClient.accessTokenInactivityTimeoutSeconds to 300(or any value)
    2. Login using to OCP web console and open multiple tabs.
    3. Keep one tab idle and work on the other open tabs.
    4. After 5 minutes the session expires for all tabs.
    

Actual results:

    One inactive or idle tab causes session expiry for all other tabs. 

Expected results:

    Session should not be expired if any tab is not idle. 

Additional info:

    

Description of problem:

    The TechPreviewNoUpgrade feature set could be disabled on a 4.16 cluster after enabling it. But according to the official doc, `Enabling this feature set cannot be undone and prevents minor version updates`, so it should not be possible to disable it.

# ./oc get featuregate cluster -ojson|jq .spec
{  "featureSet": "TechPreviewNoUpgrade"}

# ./oc patch featuregate cluster --type=json -p '[{"op":"remove", "path":"/spec/featureSet"}]'
featuregate.config.openshift.io/cluster patched

# ./oc get featuregate cluster -ojson|jq .spec
{}

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-06-03-060250

How reproducible:

    always

Steps to Reproduce:

    1. enable the TechPreviewNoUpgrade fs on a 4.16 cluster
    2. then remove it 
    3.
    

Actual results:

    TechPreviewNoUpgrade featureset was disabled

Expected results:

    Enabling this feature set cannot be undone

Additional info:

https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L43-L44

This is a clone of issue OCPBUGS-42412. The following is the description of the original issue:

Description of problem:

When running the 4.17 installer QE full-function test, the following amd64 instance types were detected and tested successfully, so they should be appended to the installer doc[1]: 
* standardBasv2Family
* StandardNGADSV620v1Family 
* standardMDSHighMemoryv3Family
* standardMIDSHighMemoryv3Family
* standardMISHighMemoryv3Family
* standardMSHighMemoryv3Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md 

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/120

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After successfully mirroring the ibm-ftm-operator via the latest oc-mirror command to an internal registry and applying the newly generated IBM CatalogSource YAML file, the created catalog pod in the openshift-marketplace namespace enters CrashLoopBackOff.

The customer is trying to mirror operators and list the catalog; the command itself reports no issues, but the catalog pod is crashing with the following error:
~~~
time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\""
~~~

    

Version-Release number of selected component (if applicable):

oc-mirror 4.16
OCP 4.14.z

    

How reproducible:



    

Steps to Reproduce:

    1. Create catalog image with the following imagesetconfiguration:
~~~
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: <internal-registry>:Port/oc-mirror-metadata/12july24
    skipTLS: false
mirror:
  platform:
    architectures:
      - "amd64"
    channels:
    - name: stable-4.14
      minVersion: 4.14.11
      maxVersion: 4.14.30
      type: ocp
      shortestPath: true
    graph: true
  operators:
  - catalog: icr.io/cpopen/ibm-operator-catalog:v1.22
    packages:
    - name: ibm-ftm-operator
      channels:
      - name: v4.4
~~~
    2.  Run the following command:
~~~
/oc-mirror --config=./imageset-config.yaml docker://Internal-registry:Port --rebuild-catalogs
~~~
    3. Create the CatalogSource in the openshift-marketplace namespace:
~~~
 cat oc-mirror-workspace/results-1721222945/catalogSource-cs-ibm-operator-catalog.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-ibm-operator-catalog
  namespace: openshift-marketplace
spec:
  image: Internal-registry:Port/cpopen/ibm-operator-catalog:v1.22
  sourceType: grpc
~~~
    

Actual results:


catalog pod is crashing with the following error:
~~~
time="2024-07-10T13:43:07Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-07-10T13:43:08Z" level=fatal msg="cache requires rebuild: cache reports digest as \"e891bfd5a4cb5702\", but computed digest is \"1922475dc0ee190c\""
~~~

    

Expected results:

The pod should run without any issue. 

    

Additional info:

1. The issue is reproducible with the OCP 4.14.14 and OCP 4.14.29
2. Customer is already using oc-mirror 4.16:
~~~
./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407030803.p0.g394b1f8.assembly.stream.el9-394b1f8", GitCommit:"394b1f814f794f4f01f473212c9a7695726020bf", GitTreeState:"clean", BuildDate:"2024-07-03T10:18:49Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.module+el8.10.0+21986+2112108a) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
~~~
3. The customer tried the workaround described in KB[1] https://access.redhat.com/solutions/7006771, but with no luck.
4. The customer also tried setting OPM_BINARY, but it didn't work. They downloaded OPM for the respective arch from https://github.com/operator-framework/operator-registry/releases, renamed the downloaded binary to opm, and set the variable below before executing oc-mirror:
OPM_BINARY=/path/to/opm

    

Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/97

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

- Pods that reside in a namespace utilizing EgressIP are experiencing intermittent TCP IO timeouts when attempting to communicate with external services.

  • Connection response while connecting external service from one of the pods:
    ❯ oc exec gitlab-runner-aj-02-56998875b-n6xxb -- bash -c 'while true; do timeout 3 bash -c "</dev/tcp/10.135.108.56/443" && echo "Connection success" || echo "Connection timeout"; sleep 0.5; done'
    Connection success
    Connection timeout
    Connection timeout
    Connection timeout
    Connection timeout
    Connection timeout
    Connection success
    Connection timeout
    Connection success 
  • The customer followed this solution https://access.redhat.com/solutions/7005481 and noticed an IP address in logical_router_policy nexthops that is not associated with any node.
    # Get pod node and podIP variable for the problematic pod 
    ❯ oc get pod gitlab-runner-aj-02-56998875b-n6xxb -ojson 2>/dev/null | jq -r '"\(.metadata.name) \(.spec.nodeName) \(.status.podIP)"' | read -r pod node podip
    
    # Find the ovn-kubernetes pod running on the same node as  gitlab-runner-aj-02-56998875b-n6xxb
    ❯ oc get pods -n openshift-ovn-kubernetes -lapp=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node)| .metadata.name' | read -r ovn_pod
    
    # Collect each possible logical switch port address into variable LSP_ADDRESSES
    ❯ LSP_ADDRESSES=$(oc -n openshift-ovn-kubernetes exec ${ovn_pod} -it -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read guid name; do printf "%s " "${name}"; ovn-nbctl lsp-get-addresses "${guid}"; done')
    
    # List the logical router policy for the problematic pod
    ❯ oc -n openshift-ovn-kubernetes exec ${ovn_pod} -c northd -- ovn-nbctl find logical_router_policy match="\"ip4.src == ${podip}\""
    _uuid               : c55bec59-6f9a-4f01-a0b1-67157039edb8
    action              : reroute
    external_ids        : {name=gitlab-runner-caasandpaas-egress}
    match               : "ip4.src == 172.40.114.40"
    nexthop             : []
    nexthops            : ["100.88.0.22", "100.88.0.57"]
    options             : {}
    priority            : 100
    
    # Check whether each nexthop entry exists in the LSP addresses table
    ❯ echo $LSP_ADDRESSES | grep 100.88.0.22
    (tstor-c1nmedi01-9x2g9-worker-cloud-paks-m9t6b) 0a:58:64:58:00:16 100.88.0.22/16
    ❯ echo $LSP_ADDRESSES | grep 100.88.0.57 

     

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

  • Pods configured to use EgressIP face intermittent connection timeout while connecting to external services.

Expected results:

  • The connection timeout should not happen.

Additional info:


Description of problem:

    When setting up an openshift-sdn (multitenant) cluster, live migration to OVN should be blocked; however, the network operator still proceeds and migrates to OVN.

    Likewise, when setting up an openshift-sdn cluster and creating an egress router pod, live migration to OVN should be blocked; however, the network operator still proceeds with the migration.

Version-Release number of selected component (if applicable):

    pre-merge https://github.com/openshift/cluster-network-operator/pull/2392

How reproducible:

    always

Steps to Reproduce:

    1. setup openshift-sdn(multitenant) cluster 
    2. do migration with
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'    

 3.
    

Actual results:

    After showing the error, the migration still proceeds.

Expected results:

    network operator should block the migration

Additional info:

    

This is a clone of issue OCPBUGS-38274. The following is the description of the original issue:

Description of problem:

When the vSphere CSI driver is removed (using managementState: Removed), it leaves all existing conditions in the ClusterCSIDriver. IMO it should delete all of them and keep only something like "Disabled: true", as we use for the Manila CSI driver operator.
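A hedged reproduction sketch (the ClusterCSIDriver object for the vSphere driver is named csi.vsphere.vmware.com):

```bash
# Move the driver to Removed, then inspect the leftover conditions.
oc patch clustercsidriver csi.vsphere.vmware.com --type merge \
  -p '{"spec": {"managementState": "Removed"}}'
oc get clustercsidriver csi.vsphere.vmware.com -o json | jq '.status.conditions'
```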

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-09-031511

How reproducible: always

Steps to Reproduce:

  1. Edit ClusterCSIDriver and set `managementState: Removed`.
  2. See the CSI driver deployment + DaemonSet are removed.
  3. Check ClusterCSIDriver conditions

Actual results: All Deployment + DaemonSet conditions are present

Expected results: The conditions are pruned.

https://redhat-internal.slack.com/archives/C01CQA76KMX/p1717069106405899

Joseph Callen reported this test is failing a fair bit on vSphere, and it looks like it's usually the only thing failing. Thomas has some etcd comments in the thread; we need to decide what to do here.

Also, new vSphere hardware is being phased in, which doesn't seem to show the problem.

Move to a flake on vsphere? Kill the test?

Please review the following PR: https://github.com/openshift/telemeter/pull/532

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

  ConsoleYAMLSample CRD
      redirect to home
      ensure perspective switcher is set to Administrator
    1) creates, displays, tests and deletes a new ConsoleYAMLSample instance


  0 passing (2m)
  1 failing

  1) ConsoleYAMLSample CRD
       creates, displays, tests and deletes a new ConsoleYAMLSample instance:
     AssertionError: Timed out retrying after 30000ms: Expected to find element: `[data-test-action="View instances"]:not([disabled])`, but never found it.
      at Context.eval (webpack:///./support/selectors.ts:47:5)

 

console flakes

console-operator

Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-41283.

Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-41622.

Description of problem:

Intermittent error during the installation process when enabling Cluster API (CAPI) in the install-config for OCP 4.16 tech preview IPI installation on top of OSP. The error occurs during the post-machine creation hook, specifically related to Floating IP association.

Version-Release number of selected component (if applicable):

OCP: 4.16.0-0.nightly-2024-05-16-092402 TP enabled
on top of
OSP: RHOS-17.1-RHEL-9-20240123.n.1

How reproducible:

The issue occurs intermittently: sometimes the installation succeeds, and other times it fails.

Steps to Reproduce:

    1. Install OSP
    2. Initiate OCP installation with TP and CAPI enabled
    3. Observe the installation logs of the failed installation.

Actual results:

    The installation fails intermittently with the following error message:
...
2024-05-17 23:37:51.590 | level=debug msg=E0517 23:37:29.833599  266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="985ba50c-2a1d-41f6-b494-f5af7dca2e7b"
2024-05-17 23:37:51.597 | level=debug msg=E0517 23:37:39.838706  266622 controller.go:329] "Reconciler error" err="failed to create cluster accessor: error creating http client and mapper for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error creating client for remote cluster \"openshift-cluster-api-guests/ostest-4qrz2\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://api.ostest.shiftstack.com:6443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ostest-4qrz2-master-0" namespace="openshift-cluster-api-guests" name="ostest-4qrz2-master-0" reconcileID="dfe5f138-ac8e-4790-948f-72d6c8631f21"
2024-05-17 23:37:51.603 | level=debug msg=Machine ostest-4qrz2-master-0 is ready. Phase: Provisioned
2024-05-17 23:37:51.610 | level=debug msg=Machine ostest-4qrz2-master-1 is ready. Phase: Provisioned
2024-05-17 23:37:51.615 | level=debug msg=Machine ostest-4qrz2-master-2 is ready. Phase: Provisioned
2024-05-17 23:37:51.619 | level=info msg=Control-plane machines are ready
2024-05-17 23:37:51.623 | level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during post-machine creation hook: Resource not found: [POST https://10.46.44.159:13696/v2.0/floatingips], error message: {"NeutronError": {"type": "ExternalGatewayForFloatingIPNotFound", "message": "External network 654792e9-dead-485a-beec-f3c428ef71da is not reachable from subnet d9829374-f0de-4a41-a1c0-a2acdd4841da.  Therefore, cannot associate Port 01c518a9-5d5f-42d8-a090-6e3151e8af3f with a Floating IP.", "detail": ""}}
2024-05-17 23:37:51.629 | level=info msg=Shutting down local Cluster API control plane...
2024-05-17 23:37:51.637 | level=info msg=Stopped controller: Cluster API
2024-05-17 23:37:51.643 | level=warning msg=process cluster-api-provider-openstack exited with error: signal: killed
2024-05-17 23:37:51.653 | level=info msg=Stopped controller: openstack infrastructure provider
2024-05-17 23:37:51.659 | level=info msg=Local Cluster API system has completed operations

Expected results:

The installation should complete successfully

Additional info: CAPI is enabled by adding the following to the install-config: 

featureSet: 'CustomNoUpgrade'
featureGates: ['ClusterAPIInstall=true']
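
For clarity, these are top-level keys in install-config.yaml; a minimal sketch of where they sit (every other field shown is illustrative and abbreviated):

apiVersion: v1
baseDomain: example.com          # illustrative
metadata:
  name: ostest                   # illustrative
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true
platform:
  openstack: {}                  # illustrative; the real platform section carries the OSP settings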

Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/37

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2186

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-34800. The following is the description of the original issue:

The APIRemovedInNextReleaseInUse and APIRemovedInNextEUSReleaseInUse need to be updated for kube 1.30 in OCP 4.17.

This is a clone of issue OCPBUGS-37491. The following is the description of the original issue:

Description of problem:

co/ingress is always good even operator pod log error:

2024-07-24T06:42:09.580Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
    

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-07-20-191204

How reproducible:

    100%

Steps to Reproduce:

    1. Install an AWS cluster
    2. Update ingresscontroller/default and add "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.

spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32

    3. The above setting drops most traffic to the LB, so some operators become degraded
    

Actual results:

    co/authentication and console degraded but co/ingress is still good

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 
console                                    4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 
ingress                                    4.17.0-0.nightly-2024-07-20-191204   True        False         False      3h58m   


check the ingress operator log and see:

2024-07-24T06:59:09.588Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

Expected results:

    co/ingress status should reflect the real condition timely

Additional info:

    Even though the co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console, so we end up relying on authentication/console to know whether the routes are healthy; this defeats the purpose of the ingress canary route.

 

Description of the problem:
It is allowed to create a patch file with the name:
".yaml.patch"
which effectively means a patch file targeting a manifest named ".yaml" (an empty base name).
 

How reproducible:

 

Steps to reproduce:

1. Create any cluster

2. Try to add a patch manifest file with the name ".yaml.patch"

3.

Actual results:
No exception is raised when trying to add it
 

Expected results:
It should be blocked: a patch for ".yaml" does not make sense because the base manifest name is empty, and it is not possible to create a manifest file named just ".yaml".
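
For illustration, a minimal sketch of the expected naming convention, assuming a patch file targets the manifest whose name is obtained by stripping the ".patch" suffix (the file name and patch body below are hypothetical):

# Hypothetical file name: 50-workers-chrony.yaml.patch
# This would target a manifest named 50-workers-chrony.yaml. A file named just
# ".yaml.patch" would target a manifest with an empty base name, which cannot
# exist, so it should be rejected.
spec:
  config:
    storage:
      files:
      - path: /etc/chrony.conf
        overwrite: true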

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After enabling separate alertmanager instance for user-defined alert routing, the alertmanager-user-workload pods are initialized but the configmap alertmanager-trusted-ca-bundle is not injected in the pods.
See: https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects

Version-Release number of selected component (if applicable):

RHOCP 4.13, 4.14 and 4.15

How reproducible:

100%

Steps to Reproduce:

1. Enable user-workload monitoring using [a]
2. Enable a separate Alertmanager instance for user-defined alert routing using [b] (a minimal example of both configurations is shown at the end of this entry)
3. Check if alertmanager-trusted-ca-bundle configmap is injected in alertmanager-user-workload pods which are running in openshift-user-workload-monitoring project.
$ oc describe pod alertmanager-user-workload-0 -n openshift-user-workload-monitoring | grep alertmanager-trusted-ca-bundle

[a] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects

[b] https://docs.openshift.com/container-platform/4.15/observability/monitoring/enabling-alert-routing-for-user-defined-projects.html#enabling-a-separate-alertmanager-instance-for-user-defined-alert-routing_enabling-alert-routing-for-user-defined-projects

Actual results:

alertmanager-user-workload pods are NOT injected with alertmanager-trusted-ca-bundle configmap.

Expected results:

alertmanager-user-workload pods should be injected with alertmanager-trusted-ca-bundle configmap.

Additional info:

Similar configmap is injected fine in alertmanager-main pods which are running in openshift-monitoring project.
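
For reference, a minimal sketch of the configuration used in steps 1 and 2, following the linked documentation (only the fields needed to enable the separate Alertmanager are shown):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true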

For a while now we have a "nasty" carry titled "UPSTREAM: <carry>: don't fail integration due to too many goroutines" which only prints information about leaking goroutines but doesn't fail.

See https://github.com/openshift/kubernetes/commit/501f19354bb79f0566039907b179974444f477a2 as an example of that commit.

Upstream went a major refactoring reported under https://github.com/kubernetes/kubernetes/issues/108483 which was meant to prevent those leaks, unfortunately in our case we still are a subject to this problem.

Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/547

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Would like to be able to use an annotation to set verbosity of kube-apiserver for Hypershift. By default the verbosity is locked to a level of 2. Allowing this to be configurable would enable better debugging when it is desired. This is configurable in Openshift and would like to extend it to Hypershift as well.    
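
A hypothetical sketch of what such an annotation could look like on a HostedCluster; the annotation key below is invented for illustration only, and the real key (if implemented) would be defined by the HyperShift API:

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
  annotations:
    # Hypothetical key, for illustration only.
    hypershift.openshift.io/kube-apiserver-verbosity: "4"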

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

   n/a
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41936. The following is the description of the original issue:

Description of problem:

IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.    

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Create an IPI cluster on IBM Cloud
    2. Watch the IBM Cloud CCM pod; restarts increase every 5 mins (liveness probe timeout)
    

Actual results:

    # oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS          AGE
ibm-cloud-controller-manager-58f7747d75-j82z8   0/1     CrashLoopBackOff   262 (39s ago)     23h
ibm-cloud-controller-manager-58f7747d75-l7mpk   0/1     CrashLoopBackOff   261 (2m30s ago)   23h



  Normal   Killing     34m (x2 over 40m)    kubelet            Container cloud-controller-manager failed liveness probe, will be restarted
  Normal   Pulled      34m (x2 over 40m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine
  Normal   Created     34m (x3 over 45m)    kubelet            Created container cloud-controller-manager
  Normal   Started     34m (x3 over 45m)    kubelet            Started container cloud-controller-manager
  Warning  Unhealthy   29m (x8 over 40m)    kubelet            Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
  Warning  ProbeError  3m4s (x22 over 40m)  kubelet            Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
body:

Expected results:

    CCM runs continuously, as it does on 4.15

# oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS   AGE
ibm-cloud-controller-manager-66d4779cb8-gv8d4   1/1     Running   0          63m
ibm-cloud-controller-manager-66d4779cb8-pxdrs   1/1     Running   0          63m

Additional info:

    IBM Cloud has a PR open to fix the liveness probe:
https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360
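
For reference, a minimal sketch of a liveness probe pointed at loopback, which is the kind of change the fix needs to make; the port, path, and scheme follow the probe errors in the events above, and the timing values are illustrative:

livenessProbe:
  httpGet:
    host: 127.0.0.1   # probe over loopback, matching the bind address
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 30   # illustrative
  timeoutSeconds: 5         # illustrative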

Please review the following PR: https://github.com/openshift/oc/pull/1779

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-42231. The following is the description of the original issue:

Description of problem:

    OCP Conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The ServiceAccount, ClusterRole, and ClusterRoleBinding should likely be applied before the deployment/pods.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    60%

Steps to Reproduce:

    1. Create IPI cluster on IBM Cloud
    2. Run OCP Conformance w/ MonitorTests
    

Actual results:

    : [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

{  fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times

Ginkgo exit error 1: exit with code 1}

Expected results:

    No pod creation failures from pods being admitted with the wrong SCC because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.

Additional info:

Sorry, I did not see an IBM Cloud Storage component listed in the targeted Component for this bug, so I selected the generic Storage component. Please forward as necessary/possible.


Items to consider:

ClusterRole:  https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml

ClusterRoleBinding:  https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml

The ibm-vpc-block-csi-node-* pods eventually reach Running using the privileged SCC. I do not know whether it is possible to stage the resources so that the RBAC objects get created first, within the CSI Driver Operator:
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99
applying them prior to the CSI driver DaemonSet (`node.yaml`); perhaps the order within the list matters.

Example of failure in CI:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984

 

Description of problem:

Can't access the openshift namespace images without auth after granting public access to the openshift namespace

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-05-102537 

How reproducible:

    always

Steps to Reproduce:

    1.   $ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
  $ HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
    2. $ oc adm policy add-role-to-group system:image-puller system:unauthenticated --namespace openshift
  Warning: Group 'system:unauthenticated' not found
clusterrole.rbac.authorization.k8s.io/system:image-puller added: "system:unauthenticated"

    3. Try to fetch image metadata:
    $ oc image info --insecure "${HOST}/openshift/cli:latest"

Actual results:

   $ oc image info default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest  --insecure
error: unable to read image default-route-openshift-image-registry.apps.wxj-a41659.qe.azure.devcluster.openshift.com/openshift/cli:latest: unauthorized: authentication required

Expected results:

    Could get the public image info without auth

Additional info:

   This is a regression in 4.16; this feature works on 4.15 and below.
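
For reference, the grant in step 2 is equivalent to creating a RoleBinding like the following in the openshift namespace (the binding name is assumed; oc generates its own):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: image-puller-unauthenticated   # assumed name
  namespace: openshift
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:image-puller
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:unauthenticated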

This is a clone of issue OCPBUGS-42100. The following is the description of the original issue:

Description of problem:

    HyperShift currently runs 3 replicas of active/passive HA deployments such as kube-controller-manager, kube-scheduler, etc. In order to reduce the overhead of running a HyperShift control plane, we should be able to run these deployments with 2 replicas.

In a 3 zone environment with 2 replicas, we can still use a rolling update strategy, and set the maxSurge value to 1, as the new pod would schedule into the unoccupied zone.
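
A minimal sketch of the rollout settings described above, assuming a standard Deployment spec (only the values being discussed are shown):

spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # the surged pod can schedule into the unoccupied third zone
      maxUnavailable: 0  # keep both existing replicas until the new pod is ready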

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The release-4.17 branch of openshift/cloud-provider-openstack is missing some commits that were backported upstream into the release-1.30 branch.
We should import them into our downstream fork.

Description of problem:


Installation of 4.16 fails with an AWS AccessDenied error when trying to attach a bootstrap S3 bucket policy.

Version-Release number of selected component (if applicable):


4.16+

How reproducible:


Every time

Steps to Reproduce:

1. Create an installer policy with the permissions listed in the installer here: https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go
2. Run an AWS IPI install

Actual results:


Install fails attempting to attach a policy to the bootstrap s3 bucket

time="2024-06-11T14:58:15Z" level=debug msg="I0611 14:58:15.485718     132 s3.go:256] \"Created bucket\" controller=\"awscluster\" controllerGroup=\"infrastru
cture.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" namespace=\"openshift-cluster-api-guests\"
 name=\"jamesh-sts-8tl72\" reconcileID=\"c390f027-a2ee-4d37-9e5d-b6a11882c46b\" cluster=\"openshift-cluster-api-guests/jamesh-sts-8tl72\" bucket_name=\"opensh
ift-bootstrap-data-jamesh-sts-8tl72\""
time="2024-06-11T14:58:15Z" level=debug msg="E0611 14:58:15.643613     132 controller.go:329] \"Reconciler error\" err=<"
time="2024-06-11T14:58:15Z" level=debug msg="\tfailed to reconcile S3 Bucket for AWSCluster openshift-cluster-api-guests/jamesh-sts-8tl72: ensuring bucket pol
icy: creating S3 bucket policy: AccessDenied: Access Denied"

Expected results:

Install completes successfully

Additional info:


As far as I can tell, the installer did not attach an S3 bootstrap bucket policy in the past (see https://github.com/openshift/installer/blob/release-4.15/data/data/aws/cluster/main.tf#L133-L148); this new permission is required because of new functionality.

CAPA is placing a policy that denies non-SSL-encrypted traffic to the bucket. This shouldn't have an effect on installs; adding the IAM permission that allows the bucket policy to be attached results in a successful install.

S3 bootstrap bucket policy:


{
    "Statement": [
        {
            "Sid": "ForceSSLOnlyAccess",
            "Principal": {
                "AWS": [
                    "*"
                ]
            },
            "Effect": "Deny",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::openshift-bootstrap-data-jamesh-sts-2r5f7/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": false
                }
            }
        }
    ]
}
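
A minimal sketch of the extra installer permission that allows the bucket policy above to be attached; s3:PutBucketPolicy is the action that was being denied, and the resource pattern is an assumption based on the bucket names in the logs:

{
    "Effect": "Allow",
    "Action": [
        "s3:PutBucketPolicy"
    ],
    "Resource": [
        "arn:aws:s3:::openshift-bootstrap-data-*"
    ]
}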

This is a clone of issue OCPBUGS-44068. The following is the description of the original issue:

Description of problem:

When the user provides an existing VPC, the IBM CAPI will not add ports 443, 5000, and 6443 to the VPC's security group. It is safe to always check for these ports since we only add them if they are missing.
    

Description of the problem:

Trying to create a cluster from the UI fails.

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

This is a clone of issue OCPBUGS-42660. The following is the description of the original issue:

There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:

Description of problem:

    [vSphere] network.devices, template, and workspace are cleared when deleting the controlplanemachineset; updating these fields does not trigger an update

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-04-23-032717

How reproducible:

    Always

Steps to Reproduce:

    1. Install a vSphere 4.16 cluster; we use the automated template ipi-on-vsphere/versioned-installer
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-04-23-032717   True        False         24m     Cluster version is 4.16.0-0.nightly-2024-04-23-032717     

    2. Check the controlplanemachineset; you can see that network.devices, template, and workspace have values.
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset     
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   51m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2024-04-25T02:52:11Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "18273"
  uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Active
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: VSphere
        vsphere:
        - name: generated-failure-domain
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices:
              - networkName: devqe-segment-221
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone
            userDataSecret:
              name: master-user-data
            workspace:
              datacenter: DEVQEdatacenter
              datastore: /DEVQEdatacenter/datastore/vsanDatastore
              folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl
              resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources
              server: vcenter.devqe.ibmc.devcluster.openshift.com
status:
  conditions:
  - lastTransitionTime: "2024-04-25T02:59:37Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Error
  - lastTransitionTime: "2024-04-25T03:03:45Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-25T03:03:45Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-04-25T03:01:04Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasUpdated
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3     

    3. Delete the controlplanemachineset; it will recreate a new one, but those three fields that had values before are now cleared.

liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster
controlplanemachineset.machine.openshift.io "cluster" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
cluster   3         3         3       3                       Inactive   6s
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2024-04-25T03:45:51Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "46172"
  uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Inactive
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: VSphere
        vsphere:
        - name: generated-failure-domain
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices: null
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: ""
            userDataSecret:
              name: master-user-data
            workspace: {}
status:
  conditions:
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Error
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasUpdated
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3     

    4. I activate the controlplanemachineset and it does not trigger an update. I then add these field values back and it does not trigger an update. I then edit these fields to add a second network device and it still does not trigger an update.


            network:
              devices:
              - networkName: devqe-segment-221
              - networkName: devqe-segment-222


By the way, I can create worker machines with another network device or with two network devices.
huliu-vs425c-f5tfl-worker-0a-ldbkh    Running                          81m
huliu-vs425c-f5tfl-worker-0aa-r8q4d   Running                          70m

Actual results:

    network.devices, template, and workspace are cleared when deleting the controlplanemachineset; updating these fields does not trigger an update

Expected results:

    The field values should not be changed when deleting the controlplanemachineset.
    Updating these fields should trigger an update; or, if these fields are not meant to be modified, then modifying them in the controlplanemachineset should not take effect. Such an inconsistency is confusing.

Additional info:

    Must gather:  https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing 

This is a clone of issue OCPBUGS-43048. The following is the description of the original issue:

Description of problem:

When deploying 4.16, a customer identified an inbound-rule security risk: the "node" security group allows access from 0.0.0.0/0 to the node port range 30000-32767.
This issue did not exist in versions prior to 4.16, so we suspect this may be a regression. It seems to be related to the use of CAPI, which could have changed the behavior.
We are trying to understand why this was allowed.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

  

Steps to Reproduce:

    1. Install 4.16 cluster

*** On 4.12 installations, this is not the case ***
    

Actual results:

The installer configures an inbound rule for the node security group allowing access from 0.0.0.0/0 for port range 30000-32767.     

Expected results:

The installer should *NOT* create an inbound security rule allowing access to node port range 30000-32767 from any CIDR range (0.0.0.0/0)

Additional info:

#forum-ocp-cloud slack discussion:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1728484197441409

Relevant Code :

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v2.4.0/pkg/cloud/services/securitygroup/securitygroups.go#L551

This is a clone of issue OCPBUGS-42717. The following is the description of the original issue:

Description of problem:

    When using an internal publishing strategy, the client is not properly initialized, which causes a code path to be hit that dereferences a nil pointer.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy a private cluster
    2. segfault
    3. 
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38177. The following is the description of the original issue:

Description of problem:

When adding nodes, the status of agent-register-cluster.service and start-cluster-installation.service should not be checked; instead, agent-import-cluster.service and agent-add-node.service should be checked.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The console message shows that the start-cluster-installation and agent-register-cluster services have not started

Expected results:

    The console message should show that the agent-import-cluster and agent-add-node services have started

Additional info:

    

This is a clone of issue OCPBUGS-43417. The following is the description of the original issue:

Description of problem:

4.17: [VSphereCSIDriverOperator] [Upgrade] VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference 

A UPI-installed vSphere cluster upgrade failed because the Cluster Storage Operator (CSO) degraded.
Upgrade path: 4.8 -> 4.17

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-10-12-174022      

How reproducible:

 Always   

Steps to Reproduce:

    1. Install the OCP cluster on vSphere by UPI with version 4.8.
    2. Upgrade the cluster to 4.17 nightly.
    

Actual results:

    In Step 2: The upgrade failed from path 4.16 to 4.17.    

Expected results:

    In Step 2: The upgrade should be successful.

Additional info:

$ omc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-10-12-102620   True        True          1h8m    Unable to apply 4.17.0-0.nightly-2024-10-12-174022: wait has exceeded 40 minutes for these operators: storage
$ omc get co storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.17.0-0.nightly-2024-10-12-174022   True        True          True       15h  
$  omc get co storage -oyaml   
...
status:
  conditions:
  - lastTransitionTime: "2024-10-13T17:22:06Z"
    message: |-
      VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught:
      VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference
    reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_SyncError
    status: "True"
    type: Degraded
...

$ omc logs vmware-vsphere-csi-driver-operator-5c7db457-nffp4|tail -n 50
2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?})
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d
2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2()
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65
2024-10-13T19:00:02.531545739Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9
2024-10-13T19:00:02.534308382Z I1013 19:00:02.532858       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"e44ce388-4878-4400-afae-744530b62281", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Vmware-Vsphere-Csi-Driver-OperatorPanic' Panic observed: runtime error: invalid memory address or nil pointer dereference
2024-10-13T19:00:03.532125885Z E1013 19:00:03.532044       1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors:
2024-10-13T19:00:03.532125885Z   line 1: cannot unmarshal !!seq into config.CommonConfigYAML
2024-10-13T19:00:03.532498631Z I1013 19:00:03.532460       1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
2024-10-13T19:00:03.532708025Z I1013 19:00:03.532571       1 config.go:283] Config initialized
2024-10-13T19:00:03.533270439Z E1013 19:00:03.533160       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2024-10-13T19:00:03.533270439Z goroutine 701 [running]:
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf3100, 0x54fd210})
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0014c54e8, 0x1, 0xc000e7e1c0?})
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b
2024-10-13T19:00:03.533270439Z panic({0x2cf3100?, 0x54fd210?})
2024-10-13T19:00:03.533270439Z     runtime/panic.go:770 +0x132
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).createVCenterConnection(0xc0008b2788, {0xc0022cf600?, 0xc0014c57c0?}, 0xc0006a3448)
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:491 +0x94
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).loginToVCenter(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, 0x3377a7c?)
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:446 +0x5e
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).sync(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, {0x38ee700, 0xc0011d08d0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:240 +0x6fc
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}, {0x38ee700?, 0xc0011d08d0?})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:201 +0x43
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:260 +0x1ae
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x3900f30, 0xc0000b9ae0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:192 +0x89
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x1f
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002bb1e80?)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014c5f10, {0x38cf7e0, 0xc00142b470}, 0x1, 0xc0013ae960)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00115bf10, 0x3b9aca00, 0x0, 0x1, 0xc0013ae960)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x3900f30, 0xc0000b9ae0}, 0xc00115bf70, 0x3b9aca00, 0x0, 0x1)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x93
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:170
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2()
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65
2024-10-13T19:00:03.533270439Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9

Description of problem:

Long cluster names are trimmed by the installer. Warn the user before this happens, because if the user intended to distinguish clusters by a suffix at the end of a long name, the suffix will get chopped off. If some resources are created on the basis of the cluster name alone (rare), there could even be conflicts.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Use a "long cluster name" (at the moment > 27 characters)
    2. Deploy a cluster
    3. Look at the names of resources, the name will have been trimmed.
    

Actual results:

Cluster resources with trimmed names are created.

Expected results:

The same as Actual results, but a warning should be shown.

Additional info:

    

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2176

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/148

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15    

Version-Release number of selected component (if applicable):

Upgrade from 4.1 to 4.15
4.1.41-x86_64, 4.2.36-x86_64, 4.3.40-x86_64, 4.4.33-x86_64, 4.5.41-x86_64, 4.6.62-x86_64, 4.7.60-x86_64, 4.8.57-x86_64, 4.9.59-x86_64, 4.10.67-x86_64, 4.11 nightly, 4.12 nightly, 4.13 nightly, 4.14 nightly, 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest    

How reproducible:

Seems to always happen; the issue was found in our Prow CI, and I also reproduced it.

Steps to Reproduce:

1. Create an AWS IPI 4.1 cluster, then upgrade it one version at a time to 4.14
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2024-01-19-110702   True        True          26m     Working towards 4.12.0-0.nightly-2024-02-04-062856: 654 of 830 done (78% complete), waiting on authentication, openshift-apiserver, openshift-controller-manager
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2024-02-04-062856   True        False         5m12s   Cluster version is 4.12.0-0.nightly-2024-02-04-062856
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2024-02-04-062856   True        True          61m     Working towards 4.13.0-0.nightly-2024-02-04-042638: 713 of 841 done (84% complete), waiting up to 40 minutes on machine-config
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2024-02-04-042638   True        False         10m     Cluster version is 4.13.0-0.nightly-2024-02-04-042638
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2024-02-04-042638   True        True          17m     Working towards 4.14.0-0.nightly-2024-02-02-173828: 233 of 860 done (27% complete), waiting on control-plane-machine-set, machine-api
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-02-02-173828   True        False         18m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     

2. When it has upgraded to 4.14, check that machines scale successfully
liuhuali@Lius-MacBook-Pro huali-test %  oc create -f ms1.yaml 
machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa created
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           14h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa   0         0                             3s
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           14h
liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=1
machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                  PHASE     TYPE         REGION      ZONE         AGE
ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running   m6a.xlarge   us-east-1   us-east-1a   15h
ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running   m6a.xlarge   us-east-1   us-east-1a   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa-mt9kh   Running   m6a.xlarge   us-east-1   us-east-1a   15m
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running   m6a.xlarge   us-east-1   us-east-1f   15h
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-128-51.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
ip-10-0-143-198.ec2.internal   Ready    worker   14h     v1.27.10+28ed2d7
ip-10-0-143-64.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
ip-10-0-143-80.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
ip-10-0-144-123.ec2.internal   Ready    master   15h     v1.27.10+28ed2d7
ip-10-0-147-94.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
ip-10-0-158-61.ec2.internal    Ready    worker   3m40s   v1.27.10+28ed2d7
liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=0
machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
liuhuali@Lius-MacBook-Pro huali-test % oc get node                                                                   
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-51.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
ip-10-0-143-198.ec2.internal   Ready    worker   15h   v1.27.10+28ed2d7
ip-10-0-143-64.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
ip-10-0-143-80.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
ip-10-0-144-123.ec2.internal   Ready    master   15h   v1.27.10+28ed2d7
ip-10-0-147-94.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                                
NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   15h
ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   15h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   15h
liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa 
machineset.machine.openshift.io "ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-02-02-173828   True        False         43m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     

3. Upgrade to 4.15
Because the upgrade to the 4.15 nightly got stuck on operator-lifecycle-manager-packageserver, which is a known bug (https://issues.redhat.com/browse/OCPBUGS-28744), I built an image with the fix PR (job build openshift/operator-framework-olm#679 succeeded) and upgraded to that image; the upgrade was successful.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2024-02-02-173828   True        True          7s      Working towards 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest: 10 of 875 done (1% complete)
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         23m     Cluster version is 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
baremetal                                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      11h     
cloud-controller-manager                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      8h      
cloud-credential                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
cluster-autoscaler                         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
config-operator                            4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
console                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      3h19m   
control-plane-machine-set                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      5h      
csi-snapshot-controller                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      7h10m   
dns                                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
etcd                                       4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
image-registry                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      33m     
ingress                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
insights                                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
kube-apiserver                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
kube-controller-manager                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
kube-scheduler                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
kube-storage-version-migrator              4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      34m     
machine-api                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
machine-approver                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
machine-config                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
marketplace                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
monitoring                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
network                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
node-tuning                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      56m     
openshift-apiserver                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
openshift-controller-manager               4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      4h56m   
openshift-samples                          4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      58m     
operator-lifecycle-manager                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
operator-lifecycle-manager-catalog         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
operator-lifecycle-manager-packageserver   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      57m     
service-ca                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
storage                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   16h
ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   16h 

4. Scale up new MachineSets; the new machines get stuck in Provisioned and no CSR is pending

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1   0         0                             6s
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           16h
liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 --replicas=1
machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 scaled
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                  PHASE          TYPE         REGION      ZONE         AGE
ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running        m6a.xlarge   us-east-1   us-east-1a   16h
ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running        m6a.xlarge   us-east-1   us-east-1a   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioning   m6a.xlarge   us-east-1   us-east-1a   4s
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running        m6a.xlarge   us-east-1   us-east-1f   16h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running        m6a.xlarge   us-east-1   us-east-1f   16h
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                  PHASE         TYPE         REGION      ZONE         AGE
ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running       m6a.xlarge   us-east-1   us-east-1a   18h
ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running       m6a.xlarge   us-east-1   us-east-1a   18h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioned   m6a.xlarge   us-east-1   us-east-1a   97m
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running       m6a.xlarge   us-east-1   us-east-1f   18h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running       m6a.xlarge   us-east-1   us-east-1f   18h
ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f1-4ln47   Provisioned   m6a.xlarge   us-east-1   us-east-1f   50m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-51.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
ip-10-0-143-198.ec2.internal   Ready    worker   18h   v1.28.6+a373c1b
ip-10-0-143-64.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
ip-10-0-143-80.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
ip-10-0-144-123.ec2.internal   Ready    master   18h   v1.28.6+a373c1b
ip-10-0-147-94.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
liuhuali@Lius-MacBook-Pro huali-test % oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                  REQUESTEDDURATION   CONDITION
csr-596n7   21m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
csr-7nr9m   42m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
csr-bc9n7   16m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
csr-dmk27   18m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
csr-ggkgd   64m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-143-198.ec2.internal   <none>              Approved,Issued
csr-rs9cz   70m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-143-80.ec2.internal    <none>              Approved,Issued
liuhuali@Lius-MacBook-Pro huali-test %     

Actual results:

 The new machines are stuck in the Provisioned phase and never register as nodes.

Expected results:

  The machines should reach the Running phase.

Additional info:

Must gather: https://drive.google.com/file/d/1TrZ_mb-cHKmrNMsuFl9qTdYo_eNPuF_l/view?usp=sharing 
I can see the provisioned machine on AWS console: https://drive.google.com/file/d/1-OcsmvfzU4JBeGh5cil8P2Hoe5DQsmqF/view?usp=sharing
System log of ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877: https://drive.google.com/file/d/1spVT_o0S4eqeQxE5ivttbAazCCuSzj1e/view?usp=sharing 
Some log on the instance: https://drive.google.com/file/d/1zjxPxm61h4L6WVHYv-w7nRsSz5Fku26w/view?usp=sharing 
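
For anyone hitting the same symptom, a rough diagnostic sketch (commands assume cluster-admin access; the deployment and container names are the usual machine-api ones, and the approve step only applies if a Pending CSR actually shows up, which was not the case in this reproduction):

# List CSRs, newest last, and look for a Pending client CSR from the new node.
oc get csr --sort-by=.metadata.creationTimestamp

# Approve it only if one is actually Pending.
oc adm certificate approve <csr-name>

# Check the machine controller for provider-side errors about the stuck machine.
oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller | tail -n 50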
    

Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/419

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-image/pull/501

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Query the CAPI provider for the timeouts needed during provisioning. This is optional to support.

The current default of 15 minutes is sufficient for normal CAPI installations. However, given how the current PowerVS CAPI provider waits for some resources to be created before creating the load balancers, it is possible that the LBs will not create before the 15 minute timeout. An issue was created to track this [1].

[1] kubernetes-sigs/cluster-api-provider-ibmcloud#1837

This is a clone of issue OCPBUGS-44049. The following is the description of the original issue:

Description of problem:

When the machineconfig tab is opened on the console the below error is displayed.

Oh no! Something went wrong
Type Error
Description:
Cannot read properties of undefined (reading 'toString')

Version-Release number of selected component (if applicable):

    OCP version 4.17.3

How reproducible:

    Every time at the customer's end.

Steps to Reproduce:

    1. Go on console.
    2. Under compute tab go to machineconfig tab.
    
    

Actual results:

     Oh no! Something went wrong 

Expected results:

     Should be able to see all the available mc.

Additional info:

    

Description of problem:

Maybe the same error as OCPBUGS-37232: an admin user logs in to the admin console, goes to "Observe -> Alerting", and checks alert details (for example the Watchdog alert); the graph shows "S is not a function", see picture: https://drive.google.com/file/d/1FxHz0yk1w_8Np3Whm-qAhBTSt3VXGG8j/view?usp=drive_link. The same error appears under "Observe -> Metrics" when querying any metric. There is no such error in the dev console.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-07-17-183402

How reproducible:

always

Steps to Reproduce:

1. See the description above.
    

Actual results:

"S is not a function" in the admin console graph

Expected results:

no error

Additional info:

 

Description of problem:

oc-mirror crane export fails with latest docker registry/2 on s390x

Version-Release number of selected component (if applicable):

 

How reproducible:

    Every time

Steps to Reproduce:

    1. git clone https://github.com/openshift/oc-mirror/
    2. cd oc-mirror
    3. mkdir -p bin
    4. curl -o bin/oc-mirror.tar.gz https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.16.0-rc.2/oc-mirror.tar.gz
    5. cd bin
    6. tar xvf oc-mirror.tar.gz oc-mirror
    7. chmod +x oc-mirror
    8. cd ..
    9. podman build -f Dockerfile -t local/go-toolset:latest
    10. podman run -it -v $(pwd):/build:z --env ENV_CATALOGORG="powercloud" --env ENV_CATALOGNAMESPACE="powercloud/oc-mirror-dev-s390x" --env ENV_CATALOG_ID="17282f4c" --env ENV_OCI_REGISTRY_NAMESPACE="powercloud" --entrypoint /bin/bash local/go-toolset:latest ./test/e2e/e2e-simple.sh bin/oc-mirror 2>&1 | tee ../out.log

Actual results:

    /build/test/e2e/operator-test.18664 /build
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 ---- ---- ----     0
100 52.1M  100 52.1M    0     0   779k      0  0:01:08  0:01:08 ----  301k
go: downloading github.com/google/go-containerregistry v0.19.1
go: downloading github.com/docker/cli v24.0.0+incompatible
go: downloading github.com/spf13/cobra v1.7.0
go: downloading github.com/opencontainers/image-spec v1.1.0-rc3
go: downloading github.com/mitchellh/go-homedir v1.1.0
go: downloading golang.org/x/sync v0.2.0
go: downloading github.com/opencontainers/go-digest v1.0.0
go: downloading github.com/docker/distribution v2.8.2+incompatible
go: downloading github.com/containerd/stargz-snapshotter/estargz v0.14.3
go: downloading github.com/google/go-cmp v0.5.9
go: downloading github.com/klauspost/compress v1.16.5
go: downloading github.com/spf13/pflag v1.0.5
go: downloading github.com/vbatts/tar-split v0.11.3
go: downloading github.com/pkg/errors v0.9.1
go: downloading github.com/docker/docker v24.0.0+incompatible
go: downloading golang.org/x/sys v0.15.0
go: downloading github.com/sirupsen/logrus v1.9.1
go: downloading github.com/docker/docker-credential-helpers v0.7.0
Error: pulling Image s390x/registry:2: no child with platform linux/amd64 in index s390x/registry:2
/build/test/e2e/lib/util.sh: line 17: PID_DISCONN: unbound variable

Expected results:

  Should not give any error

Additional info:

    

This is a clone of issue OCPBUGS-38620. The following is the description of the original issue:

Our e2e jobs fail with:

pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError" 

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/55652/rehearse-55652-periodic-ci-openshift-csi-operator-release-4.19-periodic-e2e-aws-efs-csi/1824483696548253696

The jobs should succeed.
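
For a manual spot check of the same condition the e2e job enforces, something like the following can list offending containers (a sketch; the openshift-cluster-csi-drivers namespace is an assumption, and it requires oc and jq):

# Print pod/container pairs whose terminationMessagePolicy is not FallbackToLogsOnError,
# including init containers.
oc get pods -n openshift-cluster-csi-drivers -o json \
  | jq -r '.items[]
           | .metadata.name as $pod
           | (.spec.containers[]?, .spec.initContainers[]?)
           | select(.terminationMessagePolicy != "FallbackToLogsOnError")
           | "\($pod)/\(.name)"'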

This is a clone of issue OCPBUGS-43674. The following is the description of the original issue:

Description of problem:

The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-19-045205

How reproducible:

Always

Steps to Reproduce:

    1. Prepare install-config and agent-config for external OCI platform.
      example of install-config configuration
.......
.......
platform:
  external:
    platformName: oci
    cloudControllerManager: External
.......
.......
    2. Create agent ISO for external OCI platform     
    3. Boot up nodes using created agent ISO     

Actual results:

Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e

Expected results:

The cluster installation should be successful. 

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This story tracks the routine i18n upload/download tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when they are ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Description of problem:

Setting a non-existent network ID as a control-plane additionalNetworkID in install-config makes CAPO panic with a nil pointer dereference, and the Installer does not give a more explicit error.

The Installer should run a pre-flight check on the network ID, and CAPO should not panic.

Version-Release number of selected component (if applicable):


How reproducible:


install-config:

apiVersion: v1
controlPlane:
  name: master
  platform:
    openstack:
      type: ${CONTROL_PLANE_FLAVOR}
      additionalNetworkIDs: [43e553c2-9d45-4fdc-b29e-233231faf46e]

Steps to Reproduce:

1. Add a non-existent network ID in controlPlane.platform.openstack.additionalNetworkIDs
2. openshift-install create cluster
3. enjoy

Actual results:

DEBUG I0613 15:32:14.683137  314433 machine_controller_noderef.go:60] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api-guests/ocp1-f4dwz-bootstrap" namespace="openshift-cluster-api-guests" name="ocp1-f4dwz-bootstrap" reconcileID="f89c5c84-4832-44ae-b522-bdfc8e1b0fdf" Cluster="openshift-cluster-api-guests/ocp1-f4dwz" Cluster="openshift-cluster-api-guests/ocp1-f4dwz" OpenStackMachine="openshift-cluster-api-guests/ocp1-f4dwz-bootstrap"
DEBUG panic: runtime error: invalid memory address or nil pointer dereference [recovered]
DEBUG   panic: runtime error: invalid memory address or nil pointer dereference
DEBUG [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1b737b5]
DEBUG
DEBUG goroutine 326 [running]:
DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1e5
DEBUG panic({0x1db4540?, 0x367bd90?})
DEBUG   /var/home/pierre/sdk/go1.22.3/src/runtime/panic.go:770 +0x132
DEBUG sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking.(*Service).CreatePort(0xc0003c71a0, {0x24172d0, 0xc000942008}, 0xc000a4e688)
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking/port.go:195 +0xd55
DEBUG sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking.(*Service).CreatePorts(0xc0003c71a0, {0x24172d0, 0xc000942008}, {0xc000a4e5a0, 0x2, 0x1b9b265?}, 0xc0008595f0)
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking/port.go:336 +0x66
DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.getOrCreateMachinePorts(0xc000c53d10?, 0x242ebd8?)
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:759 +0x59
DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).reconcileNormal(0xc00052e480, {0x242ebd8, 0xc000af5c50}, 0xc000c53d10, {0xc000f27c50, 0x27}, 0xc000943908, 0xc000943188, 0xc000942008)
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:602 +0x307
DEBUG sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).Reconcile(0xc00052e480, {0x242ebd8, 0xc000af5c50}, {{{0xc00064b280?, 0x0?}, {0xc000f3ecd8?, 0xc00076bd50?}}})
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/cluster-api-provider-openstack/controllers/openstackmachine_controller.go:162 +0xb6d
DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x2434e10?, {0x242ebd8?, 0xc000af5c50?}, {{{0xc00064b280?, 0xb?}, {0xc000f3ecd8?, 0x0?}}})
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xb7
DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005961e0, {0x242ec10, 0xc000988b40}, {0x1e7eda0, 0xc0009805e0})
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3bc
DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005961e0, {0x242ec10, 0xc000988b40})
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1c9
DEBUG sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x79
DEBUG created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 213
DEBUG   /var/home/pierre/code/src/github.com/openshift/installer.git/master/cluster-api/providers/openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x50c
DEBUG Checking that machine ocp1-f4dwz-bootstrap has provisioned...
DEBUG Machine ocp1-f4dwz-bootstrap has not yet provisioned: Pending
DEBUG Checking that machine ocp1-f4dwz-master-0 has provisioned...
DEBUG Machine ocp1-f4dwz-master-0 has not yet provisioned: Pending
DEBUG Checking that machine ocp1-f4dwz-master-1 has provisioned...
DEBUG Machine ocp1-f4dwz-master-1 has not yet provisioned: Pending
DEBUG Checking that machine ocp1-f4dwz-master-2 has provisioned...
DEBUG Machine ocp1-f4dwz-master-2 has not yet provisioned: Pending
DEBUG Checking that machine ocp1-f4dwz-bootstrap has provisioned...
DEBUG Machine ocp1-f4dwz-bootstrap has not yet provisioned: Pending
DEBUG Checking that machine ocp1-f4dwz-master-0 has provisioned...
DEBUG Machine ocp1-f4dwz-master-0 has not yet provisioned: Pending
[...]

Expected results:

ERROR "The additional network $ID was not found in OpenStack."

Additional info:

A separate report will be filed against CAPO.
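
As a manual pre-flight check in the meantime, each additionalNetworkID can be verified against OpenStack before running the installer (a sketch, assuming the openstack CLI and a configured clouds.yaml; the UUID is the one from the install-config above):

# Exits non-zero if the UUID does not resolve to an existing network.
openstack network show 43e553c2-9d45-4fdc-b29e-233231faf46e -f value -c id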

This is a clone of issue OCPBUGS-41136. The following is the description of the original issue:

Description of problem:

The customer is unable to scale a DeploymentConfig in an RHOCP 4.14.21 cluster.

If they scale a DeploymentConfig they get the error: "New size: 4; reason: cpu resource utilization (percentage of request) above target; error: Internal error occurred: converting (apps.DeploymentConfig) to (v1beta1.Scale): unknown conversion"  

Version-Release number of selected component (if applicable):

4.14.21    

How reproducible:

N/A    

Steps to Reproduce:

    1. Deploy an app using a DeploymentConfig
    2. Create an HPA targeting the DeploymentConfig (see the sketch below)
    3. Observe that the pods do not scale; manual scaling also fails
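
A minimal reproducer sketch for the steps above (the DeploymentConfig name and thresholds are placeholders):

# 2. Create an HPA for the DeploymentConfig.
oc autoscale dc/test --min=1 --max=4 --cpu-percent=80

# 3. Per the report, manual scaling fails with the same conversion error.
oc scale dc/test --replicas=4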
    

Actual results:

Pods are not getting scaled    

Expected results:

Pods should be scaled using HPA    

Additional info:

    

This is a clone of issue OCPBUGS-38392. The following is the description of the original issue:

Description of problem:

For CFE-920: Update GCP userLabels and userTags configs description

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The Azure Disk CSI driver operator runs a node DaemonSet that exposes CSI driver metrics on loopback, but there is no kube-rbac-proxy in front of it and there is no Service / ServiceMonitor for it. Therefore OCP doesn't collect these metrics.
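
A quick way to confirm the gap described above (a sketch; the openshift-cluster-csi-drivers namespace and the azure-disk-csi-driver-node naming pattern are assumptions):

oc get service,servicemonitor -n openshift-cluster-csi-drivers | grep -i azure-disk-csi-driver-node \
  || echo "no Service/ServiceMonitor for the node DaemonSet metrics"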

Description of problem:

When creating an IPI cluster, the following unexpected traceback occasionally appears in the terminal. It does not cause any failure and the install eventually succeeds.

# ./openshift-install create cluster --dir cluster --log-level debug
...
INFO Importing OVA sgao-nest-ktqck-rhcos-generated-region-generated-zone into failure domain generated-failure-domain.
[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
	>  goroutine 131 [running]:
	>  runtime/debug.Stack()
	>  	/usr/lib/golang/src/runtime/debug/stack.go:24 +0x5e
	>  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	>  	/go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:60 +0xcd
	>  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc000e37200, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0})
	>  	/go/src/github.com/openshift/installer/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:139 +0x5d
	>  github.com/go-logr/logr.Logger.Error({{0x270398d8?, 0xc000e37200?}, 0x0?}, {0x26fd23c0, 0xc0016b4270}, {0x77d22d3, 0x3d}, {0x0, 0x0, 0x0})
	>  	/go/src/github.com/openshift/installer/vendor/github.com/go-logr/logr/logr.go:301 +0xda
	>  sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.func1({0x26fd6c40?, 0xc0021f0160?})
	>  	/go/src/github.com/openshift/installer/vendor/sigs.k8s.io/cluster-api-provider-vsphere/pkg/session/session.go:265 +0xda
	>  sigs.k8s.io/cluster-api-provider-vsphere/pkg/session.newClient.KeepAliveHandler.func2()
	>  	/go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keep_alive.go:36 +0x22
	>  github.com/vmware/govmomi/session/keepalive.(*handler).Start.func1()
	>  	/go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:124 +0x98
	>  created by github.com/vmware/govmomi/session/keepalive.(*handler).Start in goroutine 1
	>  	/go/src/github.com/openshift/installer/vendor/github.com/vmware/govmomi/session/keepalive/handler.go:116 +0x116

 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-13-213831

How reproducible:

sometimes

Steps to Reproduce:

1. Create IPI cluster on vSphere multiple times
2. Check the output in the terminal

Actual results:

unexpected log traceback appears in terminal

Expected results:

unexpected log traceback should not appear in terminal

Additional info:

 

This is a clone of issue OCPBUGS-38717. The following is the description of the original issue:

The Telemetry userPreference added to the General tab in https://github.com/openshift/console/pull/13587 results in empty nodes being output to the DOM.  This results in extra spacing any time a new user preference is added to the bottom of the General tab.

Description of problem

The openshift/router repository vendors k8s.io/* v0.29.1. OpenShift 4.17 is based on Kubernetes 1.30.

Version-Release number of selected component (if applicable)

4.17.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/router/blob/release-4.17/go.mod.
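
For reference, a quick way to check the vendored versions locally (a sketch):

git clone --branch release-4.17 https://github.com/openshift/router
grep -E 'k8s.io/(api|apimachinery|client-go) v' router/go.mod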

Actual results

The k8s.io/* packages are at v0.29.1.

Expected results

The k8s.io/* packages are at v0.30.0 or newer.

This is a clone of issue OCPBUGS-44305. The following is the description of the original issue:

Description of problem:

The finally tasks do not get removed and remain in the pipeline.    

Version-Release number of selected component (if applicable):

    In all supported OCP version

How reproducible:

    Always

Steps to Reproduce:

1. Create a finally task in a pipeline in pipeline builder
2. Save pipeline
3. Edit pipeline and remove finally task in pipeline builder
4. Save pipeline
5. Observe that the finally task has not been removed

Actual results:

The finally tasks do not get removed and remain in the pipeline.    

Expected results:

Finally task gets removed from pipeline when removing the finally tasks and saving the pipeline in the "pipeline builder" mode.    

Additional info:

    

In all releases tested, in particular 4.16.0-0.okd-scos-2024-08-21-155613, the Samples operator uses incorrect templates, resulting in the following alert:

Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql

For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).

Works correctly on OKD FCOS builds.

Description of problem:

Customers are reporting TelemeterClientFailures warnings in multiple clusters. Multiple cases were opened in roughly the last 36 hours.

Version-Release number of selected component (if applicable):

OCP 4.13.38, OCP 4.12.40     

How reproducible:

As per the latest update from one of the customers: "After 5 May 18:00 (HKT), this alert resolved by itself on all clusters. The 'gateway error' also no longer appears after 5 May 18:00 (HKT)."

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Telemeter-client container logs report the below errors:
2024-05-05T06:40:32.162068012Z level=error caller=forwarder.go:276 ts=2024-05-05T06:40:32.161990057Z component=forwarder/worker msg="unable to forward results" err="gateway server reported unexpected error code: 503: <html>\r\n  <head>\r\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\r\n\r\n    <style type=\"text/css\">\r\n      body {\r\n        font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;\r\n        line-height: 1.66666667;\r\n        font-size: 16px;\r\n        color: #333;\r\n        background-color: #fff;\r\n        margin: 2em 1em;\r\n      }\r\n      h1 {\r\n        font-size: 28px;\r\n        font-weight: 400;\r\n      }\r\n      p {\r\n        margin: 0 0 10px;\r\n      }\r\n      .alert.alert-info {\r\n        background-color: #F0F0F0;\r\n        margin-top: 30px;\r\n        padding: 30px;\r\n      }\r\n      .alert p {\r\n        padding-left: 35px;\r\n      }\r\n      ul {\r\n        padding-left: 51px;\r\n        position: relative;\r\n      }\r\n      li {\r\n        font-size: 14px;\r\n        margin-bottom: 1em;\r\n      }\r\n      p.info {\r\n        position: relative;\r\n        font-size: 20px;\r\n      }\r\n      p.info:before, p.info:after {\r\n        content: \"\";\r\n        left: 0;\r\n        position: absolute;\r\n        top: 0;\r\n      }\r\n"  

Expected results:

TelemeterClientFailures alerts should not be seen    

Additional info:

What could be the reason behind the TelemeterClientFailures alerts firing all of a sudden and then disappearing after a while?

Please review the following PR: https://github.com/openshift/cluster-api/pull/208

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During IPI CAPI cluster creation, it is possible that the load balancer is busy at the time of the call, so the AddIPToLoadBalancerPool call should be wrapped in a PollUntilContextCancel retry loop. This is optional to support.
    

Description of problem:

The version info is useful; however, I couldn't get the cluster-olm-operator's version. See below:

    jiazha-mac:~ jiazha$ oc rsh cluster-olm-operator-7cc6c89999-hql9m
Defaulted container "cluster-olm-operator" out of: cluster-olm-operator, copy-catalogd-manifests (init), copy-operator-controller-manifests (init), copy-rukpak-manifests (init)
sh-5.1$ ps -elf|cat
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S 1000790+       1       0  0  80   0 - 490961 futex_ 04:39 ?       00:00:28 /cluster-olm-operator start -v=2
4 S 1000790+      15       0  0  80   0 -  1113 do_wai 07:33 pts/0    00:00:00 /bin/sh
4 R 1000790+      22      15  0  80   0 -  1787 -      07:33 pts/0    00:00:00 ps -elf
4 S 1000790+      23      15  0  80   0 -  1267 pipe_r 07:33 pts/0    00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=cat /usr/bin/cat
sh-5.1$ ./cluster-olm-operator -h
OpenShift Cluster OLM Operator


Usage:
  cluster-olm-operator [command]


Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  start       Start the Cluster OLM Operator


Flags:
  -h, --help                           help for cluster-olm-operator
      --log-flush-frequency duration   Maximum number of seconds between log flushes (default 5s)
  -v, --v Level                        number for the log level verbosity
      --vmodule moduleSpec             comma-separated list of pattern=N settings for file-filtered logging (only works for the default text log format)


Use "cluster-olm-operator [command] --help" for more information about a command.

sh-5.1$ ./cluster-olm-operator start -h
Start the Cluster OLM Operator


Usage:
  cluster-olm-operator start [flags]


Flags:
      --config string                    Location of the master configuration file to run from.
  -h, --help                             help for start
      --kubeconfig string                Location of the master configuration file to run from.
      --listen string                    The ip:port to serve on.
      --namespace string                 Namespace where the controller is running. Auto-detected if run in cluster.
      --terminate-on-files stringArray   A list of files. If one of them changes, the process will terminate.


Global Flags:
      --log-flush-frequency duration   Maximum number of seconds between log flushes (default 5s)
  -v, --v Level                        number for the log level verbosity
      --vmodule moduleSpec             comma-separated list of pattern=N settings for file-filtered logging (only works for the default text log format)

Version-Release number of selected component (if applicable):

    jiazha-mac:~ jiazha$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-07-20-191204   True        False         9h      Cluster version is 4.17.0-0.nightly-2024-07-20-191204

How reproducible:

    always

Steps to Reproduce:

    1. build an OCP cluster and Enable TP.
$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge

     2. Check cluster-olm-operator version info.
  

Actual results:

Couldn't get it.

    

Expected results:

The cluster-olm-operator should have a global flag to output the version info.

  

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/73

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

 The value box in the ConfigMap Form view is no longer resizable. It is resizable as expected in  OCP version 4.14.

Version-Release number of selected component (if applicable):

    

How reproducible:

 

Steps to Reproduce:

OCP Console -> Administrator -> Workloads -> ConfigMaps -> Create ConfigMap -> Form view -> value     
    

Actual results:

    The value box is not resizable anymore in 4.15 OpenShift clusters.

Expected results:

    The value box should be resizable (as it is in 4.14).

Additional info:

    

Description of problem:

    Removing imageContentSources from HostedCluster does not update IDMS for the cluster.

Version-Release number of selected component (if applicable):

    Tested with 4.15.14

How reproducible:

    100%

Steps to Reproduce:

    1. add imageContentSources to HostedCluster
    2. verify it is applied to IDMS
    3. remove imageContentSources from HostedCluster
    

Actual results:

    IDMS is not updated to remove imageDigestMirrors contents

Expected results:

    IDMS is updated to remove imageDigestMirrors contents

Additional info:

    Workaround: set imageContentSources to [].
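
A minimal verification sketch for steps 1-3 (the HostedCluster name and namespace are placeholders, and the exact IDMS object name in the guest cluster is an assumption):

# On the management cluster: what the HostedCluster currently declares.
oc get hostedcluster -n clusters <name> -o jsonpath='{.spec.imageContentSources}{"\n"}'

# Inside the hosted cluster: whether the generated IDMS still carries the old mirrors.
oc get imagedigestmirrorset -o yaml | grep -A3 imageDigestMirrors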

Description of problem:
Multiple failures with this error

{Timed out retrying after 30000ms: Expected to find element: `#page-sidebar`, but never found it. AssertionError AssertionError: Timed out retrying after 30000ms: Expected to find element: `#page-sidebar`, but never found it.
    at Context.eval (webpack:////go/src/github.com/openshift/console/frontend/packages/integration-tests-cypress/support/index.ts:48:5)}
    

Test failures

https://search.dptools.openshift.org/?search=Expected+to+find+element%3A+%60%23page-sidebar%60&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

 

Additional findings:

Initial investigation of the test failure artifacts traces the failure to the following element not being found:

get [data-test="catalogSourceDisplayName-red-hat"] 

test code https://github.com/openshift/console/blob/master/frontend/packages/operator-lifecycle-manager/integration-tests-cypress/tests/operator-hub.cy.ts#L26

 

Based on the following screenshots from the failure video, the Red Hat catalog source is not available on the test cluster.

https://drive.google.com/file/d/18xV5wviekcS6KJ4ObBNQdtwsnfkSpFxl/view?usp=drive_link

https://drive.google.com/file/d/17yMDb42CM2Mc3z-DkLKiz1P4HEjqAr-k/view?usp=sharing

 

 

 

 

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/311

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Snapshot support is being delivered for kubevirt-csi in 4.16, but the cli used to configure snapshot support did not expose the argument that makes using snapshots possible.


The cli arg [--infra-volumesnapshot-class-mapping] was added to the developer cli [hypershift] but never made it to the productized cli [hcp] that end users will use. 

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%

Steps to Reproduce:

1. hcp create cluster kubevirt -h | grep infra-volumesnapshot-class-mapping
2.
3.

Actual results:

no value is found

Expected results:

the infra-volumesnapshot-class-mapping cli arg should be found

Additional info:

 

Description of problem:

I see that if a release does not contain the KubeVirt CoreOS container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
    

Version-Release number of selected component (if applicable):

     [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

     Always
    

Steps to Reproduce:

    1. use imageSetConfig.yaml as shown below
    2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2
    3.
    

Actual results:

    fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2

2024/08/03 09:24:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/03 09:24:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/08/03 09:24:38  [INFO]   : ⚙️  setting up the environment for you...
2024/08/03 09:24:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/08/03 09:24:38  [INFO]   : 🕵️  going to discover the necessary images...
2024/08/03 09:24:38  [INFO]   : 🔍 collecting release images...
2024/08/03 09:24:44  [INFO]   : kubeVirtContainer set to true [ including :  ]
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty
2024/08/03 09:24:44  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty 

    

Expected results:

    If the KubeVirt CoreOS container does not exist in a release, oc-mirror should skip it and continue mirroring the rest, but it should not fail.
    

Additional info:

    [fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
      - name: stable-4.12
        minVersion: 4.12.61
        maxVersion: 4.12.61
    kubeVirtContainer: true
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
      minVersion: "0.26.0"
    - name: nfd
      maxVersion: "4.15.0-202402210006"
    - name: cluster-logging
      minVersion: 5.8.3
      maxVersion: 5.8.4
    - name: quay-bridge-operator
      channels:
      - name: stable-3.9
        minVersion: 3.9.5
    - name: quay-operator
      channels:
      - name: stable-3.9
        maxVersion: "3.9.1"
    - name: odf-operator
      channels:
      - name: stable-4.14
        minVersion: "4.14.5-rhodf"
        maxVersion: "4.14.5-rhodf"
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308

    

Description of problem:

    Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting.

The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens.

We need to understand whether the etcd operator is correctly rolling out, or whether these changes should have rolled out before the final machine went away, and whether there is a way to add more stability to our checks so that all of the operators stabilise and remain stable for at least some period (for example one minute).
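
A rough sketch of the kind of "stable for at least a minute" check described above (assumes oc and jq are available; this is illustrative, not the actual test code):

# Consider the cluster settled only once no ClusterOperator has reported
# Progressing=True or Degraded=True for 60 consecutive seconds.
deadline=$((SECONDS + 1200))   # give up after 20 minutes
stable_since=$SECONDS
while [ $((SECONDS - stable_since)) -lt 60 ] && [ $SECONDS -lt $deadline ]; do
  oc get clusteroperators -o json \
    | jq -e '[.items[].status.conditions[]
              | select((.type=="Progressing" or .type=="Degraded") and .status=="True")]
             | length == 0' >/dev/null \
    || stable_since=$SECONDS   # something is still moving; reset the stability window
  sleep 10
done
if [ $((SECONDS - stable_since)) -ge 60 ]; then
  echo "all clusteroperators stable for at least 60s"
else
  echo "gave up waiting for clusteroperators to settle"
fi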

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

knative-ci.feature test is failing with:

  Logging in as kubeadmin
      Installing operator: "Red Hat OpenShift Serverless"
      Operator Red Hat OpenShift Serverless was not yet installed.
      Performing Serverless post installation steps
      User has selected namespace knative-serving
  1) "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)"

  0 passing (3m)
  1 failing

  1) Perform actions on knative service and revision
       "before all" hook for "Create knative workload using Container image with extrenal registry on Add page: KN-05-TC05 (example #1)":
     AssertionError: Timed out retrying after 40000ms: Expected to find element: `[title="knativeservings.operator.knative.dev"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping all of the remaining tests.

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
      at createKnativeServing (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/knativeSubscriptions.ts:15:5)
      at performPostInstallationSteps (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:176:26)
      at verifyAndInstallOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:221:2)
      at verifyAndInstallKnativeOperator (webpack:////go/src/github.com/openshift/console/frontend/packages/dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts:231:27)
      at Context.eval (webpack:///./support/commands/hooks.ts:7:33)



[mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_knative.json


  (Results)

  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Tests:        16                                                                               │
  │ Passing:      0                                                                                │
  │ Failing:      1                                                                                │
  │ Pending:      0                                                                                │
  │ Skipped:      15                                                                               │
  │ Screenshots:  1                                                                                │
  │ Video:        true                                                                             │
  │ Duration:     3 minutes, 8 seconds                                                             │
  │ Spec Ran:     knative-ci.feature                                                               │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘


  (Screenshots)

  -  /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree     (1280x720)
     nshots/knative-ci.feature/Create knative workload using Container image with ext               
     renal registry on Add page KN-05-TC05 (example #1) -- before all hook (failed).p               
     ng                                                                                             

 


Description of problem:

The `oc set env` command changes apiVersion for route and deploymentconfig

Version-Release number of selected component (if applicable):

4.12, 4.13, 4.14

How reproducible:

100%

Steps to Reproduce:

With oc client 4.10
 
$ oc410 set env -e FOO="BAR" -f process.json --local -o json
{
    "kind": "Service",
    "apiVersion": "v1",
    "metadata": {
        "name": "test",
        "creationTimestamp": null,
        "labels": {
            "app_name": "test",
            "template": "immutable"
        }
    },
    "spec": {
        "ports": [
            {
                "name": "8080-tcp",
                "protocol": "TCP",
                "port": 8080,
                "targetPort": 8080
            }
        ],
        "selector": {
            "app_name": "test",
            "deploymentconfig": "test"
        },
        "type": "ClusterIP",
        "sessionAffinity": "None"
    },
    "status": {
        "loadBalancer": {}
    }
}
{
    "kind": "Route",
    "apiVersion": "route.openshift.io/v1",
    "metadata": {
        "name": "test",
        "creationTimestamp": null,
        "labels": {
            "app_name": "test",
            "template": "immutable"
        }
    },

With oc client 4.12, 4.13 and 4.14

$ oc41245 set env -e FOO="BAR" -f process.json --local -o json

{
    "kind": "Service",
    "apiVersion": "v1",
    "metadata": {
        "name": "test",
        "creationTimestamp": null,
        "labels": {
            "app_name": "test",
            "template": "immutable"
        }
    },
    "spec": {
        "ports": [
            {
                "name": "8080-tcp",
                "protocol": "TCP",
                "port": 8080,
                "targetPort": 8080
            }
        ],
        "selector": {
            "app_name": "test",
            "deploymentconfig": "test"
        },
        "type": "ClusterIP",
        "sessionAffinity": "None"
    },
    "status": {
        "loadBalancer": {}
    }
}
{
    "kind": "Route",
    "apiVersion": "v1",
    "metadata": {
        "name": "test"

.....
.....
    "kind": "DeploymentConfig",
    "apiVersion": "v1",


Actual results:

The oc clients for 4.12, 4.13 and 4.14 change the apiVersion.

Expected results:

The apiVersion for Route and DeploymentConfig should not be changed.

Additional info:

    

Description of problem:

    The audit-logs container for the kas, oapi and oauth apiservers does not terminate within the `TerminationGracePeriodSeconds` timer. This is because the container does not terminate when a `SIGTERM` signal is issued.

When testing without the audit-logs container, the oapi and oauth-apiserver terminate gracefully within a 90-110 second range. The kas still does not terminate with the container gone, and I have a hunch that it is the konnectivity container that also does not honor `SIGTERM` (I waited 10 minutes and it still did not time out).

So this issue is to change the logic for audit-logs to terminate gracefully and increase the TerminationGracePeriodSeconds from the default of 30s to 120s.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a hypershift cluster with auditing enabled
    2. Try deleting apiserver pods and watch the pods being force deleted after 30 seconds (95 for kas) instead of gracefully terminated.
    3.
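
A rough sketch of step 2, timing how long the pods actually take to go away (the hosted control plane namespace and label selectors are assumptions, not verified values):

# Per the report, the pods are force-deleted after ~30s (95s for kas) instead of terminating gracefully.
time oc delete pod -n clusters-<cluster-name> -l app=openshift-oauth-apiserver --wait=true
time oc delete pod -n clusters-<cluster-name> -l app=kube-apiserver --wait=true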
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:
After enabling the feature gate, the EgressFirewall dnsName feature does not work.

Version-Release number of selected component (if applicable):
4.16

How reproducible:
Always

Steps to Reproduce:

1. Setup 4.16 ovn cluster
2. Follow the documentation to enable the feature gate: https://docs.openshift.com/container-platform/4.15/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-cli_nodes-cluster-enabling

3. Configure egressfirewall with dnsName
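
A minimal sketch of step 3 and the follow-up check (the namespace and DNS name are placeholders; the EgressFirewall object is named default as required per namespace):

# Step 3: create an EgressFirewall rule that uses dnsName in an ordinary namespace (assumes namespace "test" exists).
cat << EOF | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.example.com
EOF

# With the feature enabled, DNSNameResolver objects are expected here; in this bug none appear.
oc get dnsnameresolver -n openshift-ovn-kubernetes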

Actual results:
No DNSNameResolver objects are created under the openshift-ovn-kubernetes namespace.
Expected results:
The feature is enabled, so DNSNameResolver objects should be created under openshift-ovn-kubernetes.

Additional info:


Description of problem:

HCP has audit log configuration for the Kube API server, OpenShift API server, and OAuth API server (like OCP), but does not have audit logging for oauth-openshift (the OAuth server). As discussed with Standa in https://redhat-internal.slack.com/archives/CS05TR7BK/p1714124297376299, oauth-openshift needs audit logging in HCP as well.

Version-Release number of selected component (if applicable):

4.11 ~ 4.16

How reproducible:

Always

Steps to Reproduce:

1. Launch HCP env.
2. Check audit log configuration:
$ oc get deployment -n clusters-hypershift-ci-279389 kube-apiserver openshift-apiserver openshift-oauth-apiserver oauth-openshift -o yaml | grep -e '^    name:' -e 'audit\.log'

Actual results:

2. The output shows that oauth-openshift (the OAuth server) has no audit log configured:
    name: kube-apiserver
          - /var/log/kube-apiserver/audit.log
    name: openshift-apiserver
          - /var/log/openshift-apiserver/audit.log
    name: openshift-oauth-apiserver
          - --audit-log-path=/var/log/openshift-oauth-apiserver/audit.log
          - /var/log/openshift-oauth-apiserver/audit.log
    name: oauth-openshift

Expected results:

2. oauth-openshift (the OAuth server) should have audit logging configured as well.

Additional info:

OCP has had audit logging for the OAuth server since 4.11 (AUTH-6); https://docs.openshift.com/container-platform/4.11/security/audit-log-view.html states: "You can view the logs for the OpenShift API server, Kubernetes API server, OpenShift OAuth API server, and OpenShift OAuth server".

Description of problem:

    The console UI Alerting pages show as `Not found`

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-07-13-042606

How reproducible:

    100%

Steps to Reproduce:

    1. Open the console UI and navigate to Observe -> Alerting

Actual results:

    The Alerts, Silences, and Alerting rules pages display as "Not found"

Expected results:

    The alert details should be visible on the Alerts, Silences, and Alerting rules pages

Additional info:

Request URL:https://console-openshift-console.apps.tagao-417.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/
Request Method:GET
Status Code:404 Not Found
Remote Address:10.68.5.32:3128
Referrer Policy:strict-origin-when-cross-origin

This is a clone of issue OCPBUGS-37534. The following is the description of the original issue:

Description of problem:

Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13.

Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs.

The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28

We have reproduced the issue and we found an ordering cycle error in the journal log

Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
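A hedged sketch of how the cycle can be inspected on an affected node (unit names taken from the log above):

$ journalctl -b | grep -i "ordering cycle"
$ systemctl cat machine-config-daemon-pull.service | grep -E '^(After|Before|Wants|Requires)='
$ systemctl list-dependencies --after network-online.target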

    

Version-Release number of selected component (if applicable):

    Using IPI on Azure, these are the versions involved in the upgrade from 4.9 to 4.13:
    
      version: 4.13.0-0.nightly-2024-07-23-154444
      version: 4.12.0-0.nightly-2024-07-23-230744
      version: 4.11.59
      version: 4.10.67
      version: 4.9.59

    

How reproducible:

    Always
    

Steps to Reproduce:

    1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.

    

Actual results:


    Nodes become NotReady:
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e

On the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
    

    

Expected results:

No ordering cycle error should happen and the upgrade should be executed without problems.
    

Additional info:


    

Please review the following PR: https://github.com/openshift/prometheus/pull/203

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/console-operator/pull/906

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly

Version-Release number of selected component (if applicable):

    

How reproducible:

 Once   

Steps to Reproduce:

    1. Run prow CI job:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760 

     2. Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly from 4.15.13:
 Last Transition Time:  2024-05-16T09:35:05Z
    Message:               VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/04_clusterrole.yaml" (string): client rate limiter Wait returned an error: context canceled
VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/05_clusterrolebinding.yaml" (string): client rate limiter Wait returned an error: context canceled
VSphereProblemDetectorStarterStaticControllerDegraded: "vsphere_problem_detector/10_service.yaml" (string): client rate limiter Wait returned an error: context canceled
VSphereProblemDetectorStarterStaticControllerDegraded:
    Reason:                VSphereProblemDetectorStarterStaticController_SyncError
    Status:                True
    Type:                  Degraded

     3. must-gather is available: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.15-vsphere-ipi-disk-encryption-tang-fips-f28/1790991142867701760/artifacts/vsphere-ipi-disk-encryption-tang-fips-f28/gather-must-gather/

Actual results:

Storage degraded by VSphereProblemDetectorStarterStaticControllerDegraded during upgrade to 4.16.0-0.nightly from 4.15.13

Expected results:

 Upgrade should be successful

Additional info:

    

Description of problem:

    When specifying imageDigestSources (or the deprecated imageContentSources), SNAT should be disabled to prevent public internet traffic.

Version-Release number of selected component (if applicable):

    

How reproducible:

Easily    

Steps to Reproduce:

    1. Specify imageDigestSources or imageContentSources along with an Internal publish strategy (see the install-config fragment below)
    2. DHCP service will not have SNAT disabled
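For reference, a minimal install-config.yaml fragment that exercises this path might look like the following (the mirror registry hostname is illustrative):

publish: Internal
imageDigestSources:
- mirrors:
  - mirror.example.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - mirror.example.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev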
    

Actual results:

    DHCP service will not have SNAT disabled

Expected results:

    DHCP service will have SNAT disabled

Additional info:

    

Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/116

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   The customer is running OpenShift on Nutanix AHV, and their Tenable security scan reported the following vulnerability on the Nutanix Cloud Controller Manager deployment:
https://www.tenable.com/plugins/nessus/42873 - port 10258, SSL Medium Strength Cipher Suites Supported (SWEET32)
The Nutanix Cloud Controller Manager deployment runs two pods and exposes port 10258 on the host network.
sh-4.4# netstat -ltnp|grep -w '10258'
tcp6       0      0 :::10258                :::*                    LISTEN      10176/nutanix-cloud
sh-4.4# ps aux|grep 10176
root       10176  0.0  0.2 1297832 59764 ?       Ssl  Feb15   4:40 /bin/nutanix-cloud-controller-manager --v=3 --cloud-provider=nutanix --cloud-config=/etc/cloud/nutanix_config.json --controllers=* --configure-cloud-routes=false --cluster-name=trulabs-8qmx4 --use-service-account-credentials=true --leader-elect=true --leader-elect-lease-duration=137s --leader-elect-renew-deadline=107s --leader-elect-retry-period=26s --leader-elect-resource-namespace=openshift-cloud-controller-manager
root     1403663  0.0  0.0   9216  1100 pts/0    S+   14:17   0:00 grep 10176


[centos@provisioner-trulabs-0-230518-065321 ~]$ oc get pods -A -o wide | grep nutanix
openshift-cloud-controller-manager                 nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c            1/1     Running     0                4d18h   172.17.0.249   trulabs-8qmx4-master-1       <none>           <none>
openshift-cloud-controller-manager                 nutanix-cloud-controller-manager-5c4cdbb9c-vtrz5            1/1     Running     0                4d18h   172.17.0.121   trulabs-8qmx4-master-0       <none>           <none>


[centos@provisioner-trulabs-0-230518-065321 ~]$ oc describe pod -n openshift-cloud-controller-manager                 nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c
Name:                 nutanix-cloud-controller-manager-5c4cdbb9c-jnv7c
Namespace:            openshift-cloud-controller-manager
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      cloud-controller-manager
Node:                 trulabs-8qmx4-master-1/172.17.0.249
Start Time:           Thu, 15 Feb 2024 19:24:52 +0000
Labels:               infrastructure.openshift.io/cloud-controller-manager=Nutanix
                      k8s-app=nutanix-cloud-controller-manager
                      pod-template-hash=5c4cdbb9c
Annotations:          operator.openshift.io/config-hash: b3e08acdcd983115fe7a2b94df296362b20c35db781c8eec572fbe24c3a7c6aa
Status:               Running
IP:                   172.17.0.249
IPs:
  IP:           172.17.0.249
Controlled By:  ReplicaSet/nutanix-cloud-controller-manager-5c4cdbb9c
Containers:
  cloud-controller-manager:
    Container ID:  cri-o://f5c0f39e1907093c9359aa2ac364c5bcd591918b06103f7955b30d350c730a8a
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7f3e7b600d94d1ba0be1edb328ae2e32393acba819742ac3be5e6979a3dcbf4c
    Port:          10258/TCP
    Host Port:     10258/TCP
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -o allexport
      if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
        source /etc/kubernetes/apiserver-url.env
      fi
      exec /bin/nutanix-cloud-controller-manager \
        --v=3 \
        --cloud-provider=nutanix \
        --cloud-config=/etc/cloud/nutanix_config.json \
        --controllers=* \
        --configure-cloud-routes=false \
        --cluster-name=$(OCP_INFRASTRUCTURE_NAME) \
        --use-service-account-credentials=true \
        --leader-elect=true \
        --leader-elect-lease-duration=137s \
        --leader-elect-renew-deadline=107s \
        --leader-elect-retry-period=26s \
        --leader-elect-resource-namespace=openshift-cloud-controller-manager

    State:          Running
      Started:      Thu, 15 Feb 2024 19:24:56 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     200m
      memory:  128Mi
    Environment:
      OCP_INFRASTRUCTURE_NAME:   trulabs-8qmx4
      NUTANIX_SECRET_NAMESPACE:  openshift-cloud-controller-manager
      NUTANIX_SECRET_NAME:       nutanix-credentials
      POD_NAMESPACE:             openshift-cloud-controller-manager (v1:metadata.namespace)
    Mounts:
      /etc/cloud from nutanix-config (ro)
      /etc/kubernetes from host-etc-kube (ro)
      /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4ht28 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  nutanix-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cloud-conf
    Optional:  false
  trusted-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ccm-trusted-ca
    Optional:  false
  host-etc-kube:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes
    HostPathType:  Directory
  kube-api-access-4ht28:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:                      <none>


Medium Strength Ciphers (> 64-bit and < 112-bit key, or 3DES)

    Name                          Code             KEX           Auth     Encryption             MAC
    ----------------------        ----------       ---           ----     ---------------------  ---
    ECDHE-RSA-DES-CBC3-SHA        0xC0, 0x12       ECDH          RSA      3DES-CBC(168)          SHA1
    DES-CBC3-SHA                  0x00, 0x0A       RSA           RSA      3DES-CBC(168)          SHA1

The fields above are :

  {Tenable ciphername}
  {Cipher ID code}
  Kex={key exchange}
  Auth={authentication}
  Encrypt={symmetric encryption method}
  MAC={message authentication code}
  {export flag}


[centos@provisioner-trulabs-0-230518-065321 ~]$ curl -v telnet://172.17.0.2:10258
* About to connect() to 172.17.0.2 port 10258 (#0)
*   Trying 172.17.0.2...
* Connected to 172.17.0.2 (172.17.0.2) port 10258 (#0)
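One way to confirm whether the weak 3DES cipher flagged by Tenable is actually accepted on port 10258 (a sketch; reuses the node IP from above):

$ openssl s_client -connect 172.17.0.249:10258 -cipher 'DES-CBC3-SHA' < /dev/null
# a completed TLS handshake here means the SWEET32-affected cipher is accepted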

Version-Release number of selected component (if applicable):

    

How reproducible:

Always. The Nutanix CCM pod running in the OCP cluster does not set the "--tls-cipher-suites" option.

Steps to Reproduce:

Create an OCP Nutanix cluster.

Actual results:

Running the command below returns nothing:
$ oc describe pod -n openshift-cloud-controller-manager nutanix-cloud-controller-manager-... | grep "\--tls-cipher-suites"

Expected results:

   The Nutanix CCM deployment is expected to set the "--tls-cipher-suites" option appropriately.

Additional info:

    

Description of problem:

Navigate to the Node overview page and check the CPU and memory utilization. It shows something like "6.53 GiB available of 300 MiB total limit", which is very confusing.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Navigate to Node overview
2. Check the Utilization of CPU and memory
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-34647. The following is the description of the original issue:

Description of problem:

When we enable the OCB functionality and create a MC that configures an enforcing=0 kernel argument, the MCP is degraded reporting this message:

              {
                  "lastTransitionTime": "2024-05-30T09:37:06Z",
                  "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"",
                  "reason": "1 nodes are reporting degraded status on sync",
                  "status": "True",
                  "type": "NodeDegraded"
              },


    

Version-Release number of selected component (if applicable):

IPI on AWS

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-30-021120   True        False         97m     Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable techpreview
$ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'

    2. Configure a MSOC resource to enable OCB functionality in the worker pool

When we hit this problem we were using the mcoqe quay repository, with a copy of the pull-secret for both baseImagePullSecret and renderedImagePushSecret, and no currentImagePullSecret configured.

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
#  buildOutputs:
#    currentImagePullSecret:
#      name: ""
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: pull-copy 
    renderedImagePushSecret:
      name: pull-copy 
    renderedImagePushspec: "quay.io/mcoqe/layering:latest"

    3. Create a MC that sets the enforcing=0 kernel argument

{
    "kind": "List",
    "apiVersion": "v1",
    "metadata": {},
    "items": [
        {
            "apiVersion": "machineconfiguration.openshift.io/v1",
            "kind": "MachineConfig",
            "metadata": {
                "labels": {
                    "machineconfiguration.openshift.io/role": "worker"
                },
                "name": "change-worker-kernel-selinux-gvr393x2"
            },
            "spec": {
                "config": {
                    "ignition": {
                        "version": "3.2.0"
                    }
                },
                "kernelArguments": [
                    "enforcing=0"
                ]
            }
        }
    ]
}

    

Actual results:

The worker MCP is degraded reporting this message:

oc get mcp worker -oyaml
....

              {
                  "lastTransitionTime": "2024-05-30T09:37:06Z",
                  "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"",
                  "reason": "1 nodes are reporting degraded status on sync",
                  "status": "True",
                  "type": "NodeDegraded"
              },

    

Expected results:

The MC should be applied without problems and SELinux should be running with enforcing=0
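A quick way to verify once the pool settles (a sketch; node name reused from above):

$ oc debug node/ip-10-0-29-166.us-east-2.compute.internal -- chroot /host sh -c 'cat /proc/cmdline; getenforce'
# expect "enforcing=0" in the kernel command line and "Permissive" from getenforce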
    

Additional info:


    

Description of problem:

    When there is more than one password-based IDP (like htpasswd) whose name contains whitespace, the oauth-server panics if it was built with Go 1.22 or higher.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Create a cluster with OCP 4.17
    2. Create at least two password-based IDPs (like htpasswd) with whitespace in their names (see the sketch below).
    3. oauth-server panics.
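A minimal sketch of an OAuth spec that triggers the panic (IDP and secret names are illustrative):

apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: "my htpasswd one"
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret-1
  - name: "my htpasswd two"
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret-2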
    

Actual results:

    oauth-server panics (if Go is at version 1.22 or higher).

Expected results:

    NO REGRESSION, it worked with Go 1.21 and lower.

Additional info:

    

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/232

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Now that capi/aws is the default in 4.16+, the old terraform aws configs won't be maintained since there is no way to use them. Users interested in the configs can still access them in the 4.15 branch where they are still maintained as the installer still uses terraform.

Version-Release number of selected component (if applicable):

4.16+

How reproducible:

always

Steps to Reproduce:

1.
2.
3.

Actual results:

terraform aws configs are left in the repo.

Expected results:

Configs are removed.

Additional info:

 

This is a clone of issue OCPBUGS-38733. The following is the description of the original issue:

Description of problem:

In OpenShift 4.13-4.15, when a "rendered" MachineConfig in use is deleted, it's automatically recreated. In OpenShift 4.16, it's not recreated, and nodes and MCP becomes degraded due to the "rendered" not found error.

 

Version-Release number of selected component (if applicable):

4.16

 

How reproducible:

Always

 

Steps to Reproduce:

1. Create a MC to deploy any file in the worker MCP

2. Get the name of the new rendered MC, for example "rendered-worker-bf829671270609af06e077311a39363e" (see the lookup sketch below)

3. When the first node starts updating, delete the new rendered MC

    oc delete mc rendered-worker-bf829671270609af06e077311a39363e     
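To confirm which rendered MC the pool points at before and after the delete (a sketch):

$ oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'
$ oc get mcp worker -o jsonpath='{.status.configuration.name}{"\n"}'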

 

Actual results:

Node degraded with "rendered" not found error

 

Expected results:

In OCP 4.13 to 4.15, the "rendered" MC is automatically re-created, and the node continues updating to the MC content without issues. It should be the same in 4.16.

 

Additional info:

The behavior in 4.16 is now the same as in 4.12 and older. In 4.13-4.15, the "rendered" MC is re-created and no issues with the nodes/MCPs are shown.

Description of problem:

Configure a custom AMI for the cluster via:
platform.aws.defaultMachinePlatform.amiID
or
installconfig.controlPlane.platform.aws.amiID
installconfig.compute.platform.aws.amiID


Master machines still use the default AMI instead of the custom one.

aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*worker*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq
"ami-0f71147cab4dbfb61"


aws ec2 describe-instances --filters "Name=tag:kubernetes.io/cluster/yunjiang-cap6-qjc5t,Values=owned" "Name=tag:Name,Values=*master*" --output json | jq '.Reservations[].Instances[].ImageId' | sort | uniq
"ami-0ae9b509738034a2c" <- default ami

    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-08-222442
    

How reproducible:

    

Steps to Reproduce:

    1.See description
    2.
    3.
    

Actual results:

See description
    

Expected results:

master machines use custom AMI

    

Additional info:

    

This is a clone of issue OCPBUGS-43084. The following is the description of the original issue:

Description of problem:

While accessing a node terminal of the cluster from the web console, the below warning message is observed.
~~~
Admission Webhook WarningPod master-0.americancluster222.lab.psi.pnq2.redhat.com-debug violates policy 299 - "metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]"
~~~



Note: This does not impact the cluster; however, the warning message creates confusion among customers.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

    Everytime.

Steps to Reproduce:

    1. Install cluster of version 4.16.11 
    2. Upgrade the cluster from web-console to the next-minor version 4.16.13
    3. Try to access the node terminal from UI
    

Actual results:

    A warning is shown while accessing the node terminal.

Expected results:

    Does not show any warning.

Additional info:

    

Description of problem:

console-operator is fetching the organization ID from OCM on every sync call, which is too often. We need to reduce the fetch period.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41776. The following is the description of the original issue:

Description of problem:

The section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc

All tested ARM instances for 4.14+:
c6g.*
c7g.*
m6g.*
m7g.*
r8g.*

We need to ensure that the "Tested instance types for AWS on 64-bit ARM infrastructures" section is updated for 4.14+ in all relevant doc sections.

Additional info:

    

Description of problem:

Found a panic at the end of the catalog-operator/catalog-operator/logs/previous.log
2024-07-23T23:37:48.446406276Z panic: runtime error: invalid memory address or nil pointer dereference

    

Version-Release number of selected component (if applicable):

Cluster profile: aws with ipi installation with localzone and fips on
4.17.0-0.nightly-2024-07-20-191204

    

How reproducible:

once
    

Steps to Reproduce:

   Search for the panic in the must-gather log files.

    

Actual results:

A panic occurred in catalog-operator, which seems to have caused it to restart.

$ tail -20 namespaces/openshift-operator-lifecycle-manager/pods/catalog-operator-77c8dd875-d4dpf/catalog-operator/catalog-operator/logs/previous.log
2024-07-23T23:37:48.425902169Z time="2024-07-23T23:37:48Z" level=info msg="of 1 pods matching label selector, 1 have the correct images and matching hash" correctHash=true correctImages=true current-pod.name=certified-operators-rrm5v current-pod.namespace=openshift-marketplace
2024-07-23T23:37:48.440899013Z time="2024-07-23T23:37:48Z" level=error msg="error updating InstallPlan status" id=a9RUB ip=install-spcrz namespace=e2e-test-storage-lso-h9nqf phase=Installing updateError="Operation cannot be fulfilled on installplans.operators.coreos.com \"install-spcrz\": the object has been modified; please apply your changes to the latest version and try again"
2024-07-23T23:37:48.446406276Z panic: runtime error: invalid memory address or nil pointer dereference
2024-07-23T23:37:48.446406276Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ef8d9b]
2024-07-23T23:37:48.446406276Z 
2024-07-23T23:37:48.446406276Z goroutine 273 [running]:
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc000212480, {0x25504a0?, 0xc000328000?})
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:2012 +0xb9b
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.NewOperator.LegacySyncHandler.ToSyncer.LegacySyncHandler.ToSyncerWithDelete.func107({0x20?, 0x2383f40?}, {0x298e2d0, 0xc002cf0140})
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:181 +0xbc
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.SyncFunc.Sync(0x2383f40?, {0x29a60b0?, 0xc000719720?}, {0x298e2d0?, 0xc002cf0140?})
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:184 +0x37
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00072a0b0, {0x29a60b0, 0xc000719720}, 0xc0009829c0)
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:316 +0x59f
2024-07-23T23:37:48.446406276Z github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(...)
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:260
2024-07-23T23:37:48.446406276Z created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start in goroutine 142
2024-07-23T23:37:48.446406276Z 	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:250 +0x4e5

    

Expected results:

The catalog-operator should not panic.
    

Additional info:

From the e2e test log summary (https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-aws-ipi-localzone-fips-f2/1815851708257931264/artifacts/aws-ipi-localzone-fips-f2/openshift-extended-test/artifacts/extended.log), we can see that the catalog-operator container exited with a panic:

Jul 23 23:37:49.361 E ns/openshift-operator-lifecycle-manager pod/catalog-operator-77c8dd875-d4dpf node/ip-10-0-24-201.ec2.internal container=catalog-operator container exited with code 2 (Error): d memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1ef8d9b]\n\ngoroutine 273 [running]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc000212480, {0x25504a0?, 0xc000328000?})\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:2012 +0xb9b\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.NewOperator.LegacySyncHandler.ToSyncer.LegacySyncHandler.ToSyncerWithDelete.func107({0x20?, 0x2383f40?}, {0x298e2d0, 0xc002cf0140})\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:181 +0xbc\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.SyncFunc.Sync(0x2383f40?, {0x29a60b0?, 0xc000719720?}, {0x298e2d0?, 0xc002cf0140?})\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:184 +0x37\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00072a0b0, {0x29a60b0, 0xc000719720}, 0xc0009829c0)\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:316 +0x59f\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(...)\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:260\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start in goroutine 142\n	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:250 +0x4e5\n
    

Description of problem:

When performing a UPI installation, the installer fails with:
time="2024-05-29T14:38:59-04:00" level=fatal msg="failed to fetch Cluster API Machine Manifests: failed to generate asset \"Cluster API Machine Manifests\": unable to generate CAPI machines for vSphere unable to get network inventory path: unable to find network ci-vlan-896 in resource pool /cidatacenter/host/cicluster/Resources/ci-op-yrhjini6-9ef4a"

If I pre-create the resource pool(s), the installation proceeds.

Version-Release number of selected component (if applicable):

    4.16 nightly

How reproducible:

    consistently

Steps to Reproduce:

    1. Follow documentation to perform a UPI installation
    2. Installation will fail during manifest creation
    3.
    

Actual results:

    Installation fails

Expected results:

    Installation should proceed

Additional info:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/51894/rehearse-51894-periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-upi-zones/1795883271666536448    

Description of problem:

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    10-25%

Steps to Reproduce:

    1.Run TestMTLSWithCRLs     

Actual results:

    Fails with "failed to find host name"

Expected results:

    Shouldn't fail.

Additional info:

    It appears the logic around `getRouteHost` is incorrect: there is a poll loop that waits for the returned host to become non-empty, but `getRouteHost` calls Fatal if it cannot find the host, so the poll is useless.

This is a clone of issue OCPBUGS-39438. The following is the description of the original issue:

Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040

Version-Release number of selected component (if applicable): 4.16

How reproducible: Always

Steps to Reproduce:

1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off" (see the sketch after these steps)

2.

3.
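A sketch of the NMState interface configuration referenced in step 1 (the interface name is illustrative; the field path follows the one named in the description):

interfaces:
- name: eno1
  type: ethernet
  state: up
  ethtool:
    feature:
      esp-tx-csum-hw-offload: false   # i.e. "off"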

Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs

Expected results: The ethtool setting is present on the configure-ovs-created profile

Additional info:

Affected Platforms: VSphere. Probably baremetal too and possibly others.

Description of problem:

The router pod is in CrashLoopBackOff after a y-stream upgrade of the HC from 4.13 to 4.14.

Version-Release number of selected component (if applicable):

    

How reproducible:

always    

Steps to Reproduce:

    1. create a cluster with 4.13
    2. upgrade HC to 4.14
    3.
    

Actual results:

    router pod in CrashLoopBackoff

Expected results:

    The router pod should be running after upgrading the HC from 4.13 to 4.14.

Additional info:

images:
======
HO image: 4.15
upgrade HC from 4.13.0-0.nightly-2023-12-19-114348 to 4.14.0-0.nightly-2023-12-19-120138

router pod log:
==============
jiezhao-mac:hypershift jiezhao$ oc get pods router-9cfd8b89-plvtc -n clusters-jie-test
NAME          READY  STATUS       RESTARTS    AGE
router-9cfd8b89-plvtc  0/1   CrashLoopBackOff  11 (45s ago)  32m
jiezhao-mac:hypershift jiezhao$

Events:
 Type   Reason              Age          From        Message
 ----   ------              ----          ----        -------
 Normal  Scheduled            27m          default-scheduler Successfully assigned clusters-jie-test/router-9cfd8b89-plvtc to ip-10-0-42-36.us-east-2.compute.internal
 Normal  AddedInterface          27m          multus       Add eth0 [10.129.2.82/23] from ovn-kubernetes
 Normal  Pulling             27m          kubelet      Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3"
 Normal  Pulled              27m          kubelet      Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" in 14.309s (14.309s including waiting)
 Normal  Created             26m (x3 over 27m)   kubelet      Created container private-router
 Normal  Started             26m (x3 over 27m)   kubelet      Started container private-router
 Warning BackOff             26m (x5 over 27m)   kubelet      Back-off restarting failed container private-router in pod router-9cfd8b89-plvtc_clusters-jie-test(e6cf40ad-32cd-438c-8298-62d565cf6c6a)
 Normal  Pulled              26m (x3 over 27m)   kubelet      Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3d2acba15f69ea3648b3c789111db34ff06d9230a4371c5949ebe3c6218e6ea3" already present on machine
 Warning FailedToRetrieveImagePullSecret 2m38s (x131 over 27m) kubelet      Unable to retrieve some image pull secrets (router-dockercfg-q768b); attempting to pull the image may not succeed.
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ oc logs router-9cfd8b89-plvtc -n clusters-jie-test
[NOTICE]  (1) : haproxy version is 2.6.13-234aa6d
[NOTICE]  (1) : path to executable is /usr/sbin/haproxy
[ALERT]  (1) : config : [/usr/local/etc/haproxy/haproxy.cfg:52] : 'server ovnkube_sbdb/ovnkube_sbdb' : could not resolve address 'None'.
[ALERT]  (1) : config : Failed to initialize server(s) addr.
jiezhao-mac:hypershift jiezhao$

notes:
=====
not sure if it has the same root cause as https://issues.redhat.com/browse/OCPBUGS-24627

Description of problem:

When Cypress runs in CI, videos showing the test runs are missing (e.g., https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_console/14106/pull-ci-openshift-console-master-e2e-gcp-console/1820750269743370240/artifacts/e2e-gcp-console/test/).  I suspect changes in https://github.com/openshift/console/pull/13937 resulted in the videos not getting properly copied over.

I noticed this error in previous e2e-azure tests:
logger.go:146: 2024-06-05T15:38:14.058Z INFO Successfully created resource group {"name": "example-xwd7d-"}
 
which causes an issue when you go to create a subnet:

{"level":"info","ts":"2024-06-05T15:38:23Z","msg":"Creating new subnet for vnet creation"}

hypershift_framework.go:275: failed to create cluster, tearing down: failed to create infra: failed to create vnet: PUT https://management.azure.com/subscriptions/5f99720c-6823-4792-8a28-69efb0719eea/resourceGroups/example-xwd7d-/providers/Microsoft.Network/virtualNetworks/example-xwd7d-
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidResourceName
--------------------------------------------------------------------------------
{
  "error": {
    "code": "InvalidResourceName",
    "message": "Resource name example-xwd7d- is invalid. The name can be up to 80 characters long. It must begin with a word character, and it must end with a word character or with '_'. The name may contain word characters or '.', '-', '_'.",
    "details": []
  }
}
-------

 
Example - failure here.

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/151

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-42360. The following is the description of the original issue:

Description of problem:

    Due to https://issues.redhat.com/browse/API-1644, no token is generated for a service account automatically; a step needs to be added to create the token manually.

Version-Release number of selected component (if applicable):

After creating a new service account, one step should be added to create a long-lived API token

How reproducible:

    always

Steps to Reproduce:

secret yaml file example:

xzha@xzha1-mac OCP-24771 % cat secret.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: scoped
  annotations:
    kubernetes.io/service-account.name: scoped
type: kubernetes.io/service-account-token

Actual results:

  

Expected results:

   

Additional info:

  

Description of problem:

After we applied the Old tlsSecurityProfile to the HyperShift hosted cluster, the kube-apiserver ran into a CrashLoopBackOff failure; this blocked our test.
    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-03-13-061822   True        False         129m    Cluster version is 4.16.0-0.nightly-2024-03-13-061822

    

How reproducible:

    always
    

Steps to Reproduce:

    1. Specify KUBECONFIG with kubeconfig of the Hypershift management cluster
    2. hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name)
    3. oc patch hostedcluster $hostedcluster -n clusters --type=merge -p '{"spec": {"configuration": {"apiServer": {"tlsSecurityProfile":{"old":{},"type":"Old"}}}}}'
hostedcluster.hypershift.openshift.io/hypershift-ci-270930 patched
    4. Checked the tlsSecurityProfile,
    $ oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer
{
  "audit": {
    "profile": "Default"
  },
  "tlsSecurityProfile": {
    "old": {},
    "type": "Old"
  }
}
    

Actual results:

One of the hosted cluster's kube-apiserver pods ran into CrashLoopBackOff and is stuck in this status, unable to complete the Old tlsSecurityProfile configuration.

$ oc get pods -l app=kube-apiserver  -n clusters-${hostedcluster}
NAME                              READY   STATUS             RESTARTS      AGE
kube-apiserver-5b6fc94b64-c575p   5/5     Running            0             70m
kube-apiserver-5b6fc94b64-tvwtl   5/5     Running            0             70m
kube-apiserver-84c7c8dd9d-pnvvk   4/5     CrashLoopBackOff   6 (20s ago)   7m38s
    

Expected results:

    Applying the old tlsSecurityProfile should be successful.
    

Additional info:

   This can also be reproduced on 4.14 and 4.15. The most recent passing runs of the test case are listed below:
  passed      API_Server       2024-02-19 13:34:25(UTC)    aws 	4.14.0-0.nightly-2024-02-18-123855   hypershift 	
  passed      API_Server	  2024-02-08 02:24:15(UTC)   aws 	4.15.0-0.nightly-2024-02-07-062935 	hypershift
  passed      API_Server	  2024-02-17 08:33:37(UTC)   aws 	4.16.0-0.nightly-2024-02-08-073857 	hypershift

From the history of the test, it seems that some code changes were introduced in February that caused the bug.
    

This is a clone of issue OCPBUGS-38289. The following is the description of the original issue:

Description of problem:

The cluster-wide proxy URL is automatically injected into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project (which is expected), but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.

Version-Release number of selected component (if applicable):

RHOCP 4.16.4

How reproducible:

100%

Steps to Reproduce:

1. Configure proxy custom resource in RHOCP 4.16.4 cluster
2. Create cluster-monitoring-config configmap in openshift-monitoring project
3. Inject remote-write config (without specifically configuring proxy for remote-write)
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. It now contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
==============
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
[...]
  name: k8s
  namespace: openshift-monitoring
spec:
[...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080     <<<<<====== Injected Automatically but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
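For reference, the corresponding user-side entry in the cluster-monitoring-config ConfigMap looks roughly like this (endpoint matches the example above; no proxy settings are specified by the user):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: http://test-remotewrite.test.svc.cluster.local:9090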
    

Actual results:

The proxy URL from the proxy CR is automatically injected into the Prometheus k8s CR when configuring remoteWrite, but noProxy is not inherited from the cluster proxy resource.

Expected results:

The noProxy URL should get injected in Prometheus k8s CR as well.

Additional info:

 

We need to document properly:

  • How to expose the HostedCluster services
  • Which services are relevant for on-premises
  • Use a sample to expose it with MetalLB (a rough sketch follows)
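A rough sketch of publishing the services as LoadBalancer so that MetalLB can assign the address on-prem (field names follow the HostedCluster API; the service list is illustrative):

spec:
  services:
  - service: APIServer
    servicePublishingStrategy:
      type: LoadBalancer
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
  - service: Ignition
    servicePublishingStrategy:
      type: Route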

Description of problem:

    The current openshift/api version used by the registry operator does not include the recently added "ChunkSizeMiB" feature gate. We need to bump openshift/api to the latest version so that this feature gate becomes available, and initialize the "ChunkSizeMiB" feature behind the TechPreviewNoUpgrade feature gate.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://issues.redhat.com//browse/IR-471    

Description of problem:

When running the bootstrap e2e test, the featuregate does not have a value when the controllers' Run() is invoked, whereas in the actual (non-test) code path the featuregate is ready before the controllers Run().

Version-Release number of selected component (if applicable):

    

How reproducible:

The bootstrap test log of commit 2092c9e has an error fetching featuregate inside controller Run().

I0221 18:34:00.360752   17716 container_runtime_config_controller.go:235] imageverification sigstore FeatureGates: false, error: featureGates not yet observed    

Steps to Reproduce:

    1. Add a function call inside the containerruntimeconfig controller Run() function: featureGates, err := ctrl.featureGateAccess.CurrentFeatureGates(). Print out the error message.
    2. Run the e2e bootstrap test: ci/prow/bootstrap-unit 
         

Actual results:

    The function in step 1 returns error: featureGates not yet observed

Expected results:    

    featureGateAccess.CurrentFeatureGates() should not return not yet observed error and return the featuregates.

Additional info:

    

Description of problem:

A node was cordoned manually. After several days, the machine-config-controller uncordoned the same node after rendering a new machine-config.

Version-Release number of selected component (if applicable):

    4.13

Actual results:

The mco rolled out and the node was uncordoned by the mco

Expected results:

 MCO should treat an unschedulable node as not ready for performing the update. It may also halt the update on other nodes in the pool, based on the maxUnavailable setting for that pool.

Additional info:

    

This is a clone of issue OCPBUGS-41824. The following is the description of the original issue:

Description of problem:

    The kubeconfigs for the DNS Operator and the Ingress Operator are managed by HyperShift, but they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in cases where the cloud service provider further manages the kubeconfig (for example, CA rotation).

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/41

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When specifying N2D machine types for compute and controlPlane machines with "confidentialCompute: Enabled", "create cluster" fails with the error "Confidential Instance Config is only supported for compatible cpu platforms" [1], while the real cause is the missing "onHostMaintenance: Terminate" setting. That being said, the 4.16 error is misleading; we suggest making it consistent with the 4.15 [2] / 4.14 [3] error messages.

FYI Confidential VM is supported on N2D machine types (see https://cloud.google.com/confidential-computing/confidential-vm/docs/supported-configurations#machine-type-cpu-zone).

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-21-221942    

How reproducible:

Always    

Steps to Reproduce:

    1. Please refer to [1]    

Actual results:

    The error message is like "Confidential Instance Config is only supported for compatible cpu platforms", which is misleading.

Expected results:

    The 4.15 [2] / 4.14 [3] error messages, which are clearer.

Additional info:

    FYI it is about QE test case OCP-60212 scenario b.

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/68

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-39096. The following is the description of the original issue:

Description of problem:

    CNO does not report, as a metric, when there is a network overlap when live migration is initiated.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

This is a clone of issue OCPBUGS-39231. The following is the description of the original issue:

Description of problem:

   Feature: https://issues.redhat.com/browse/MGMT-18411
went into assisted-installer v2.34.0 but apparently is not included in any OpenShift version to be used in ABI installation.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Went through the different commits to verify whether this is delivered in any OCP version.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:


    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:

    Hiding the version is a good security practice

    

Additional info:


    

Description of the problem:
For LVMS we need an additional disk for each worker node (in case there are workers; otherwise, for each master node).

It is currently possible to attach a bootable disk with data to a worker node , select skip formatting and the lvms requirement is satisfied  and it is possible to start installation

How reproducible:

 

Steps to reproduce:

1.Create a cluster 3 masters 3 workers

2. attach for the worker nodes 1 additional disk

3. in one of the worker node make sure that the disk has a file system and contain data

4. for that disk select skip formatting
Actual results:
The issue here is that the disk which will be used for LVMS will not be formatted and will still contain the existing file system and data.
 

Expected results:
In that scenario, the LVMS requirement should turn to failed, since the disk which AI is planning to use for LVMS has a file system and may cause installation issues.

Description of problem:

Enabling KMS for IBM Cloud will result in the kube-apiserver failing with the following configuration error:

17:45:45 E0711 17:43:00.264407       1 run.go:74] "command failed" err="error while parsing file: resources[0].providers[0]: Invalid value: config.ProviderConfiguration{AESGCM:(*config.AESConfiguration)(nil), AESCBC:(*config.AESConfiguration)(nil), Secretbox:(*config.SecretboxConfiguration)(nil), Identity:(*config.IdentityConfiguration)(0x89b4c60), KMS:(*config.KMSConfiguration)(0xc000ff1900)}: more than one provider specified in a single element, should split into different list elements"
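
For context, the parser wants each provider to be its own list element. A minimal sketch of the expected shape (upstream EncryptionConfiguration format; the KMS values are placeholders, not the IBM Cloud generated config):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - kms:
      apiVersion: v2
      name: example-kms                    # placeholder name
      endpoint: unix:///var/run/kms.sock   # placeholder socket
      timeout: 3s
  - identity: {}                           # fallback provider as a separate list element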

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38436. The following is the description of the original issue:

Description of problem:

    e980 is a valid system type for the madrid region but it is not listed as such in the installer.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy to mad02 with SysType set to e980
    2. Fail
    3.
    

Actual results:

    Installer exits

Expected results:

    Installer should continue as it's a valid system type.

Additional info:

    

Description of problem:

Security baselines such as CIS do not recommend using secrets as environment variables, but using files.

5.4.1 Prefer using secrets as files over secrets as environmen... | Tenable®
https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_2_Master.audit:98de3da69271994afb6211cf86ae4c6b
Secrets in Kubernetes must not be stored as environment variables.
https://www.stigviewer.com/stig/kubernetes/2021-04-14/finding/V-242415

However, metal3 and metal3-image-customization Pods are using environment variables.

$ oc get pod -A -o jsonpath='{range .items[?(@..secretKeyRef)]} {.kind} {.metadata.name} {"\n"}{end}' | grep metal3
 Pod metal3-66b59bbb76-8xzl7 
 Pod metal3-image-customization-965f5c8fc-h8zrk 
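
For comparison, the pattern those baselines recommend is to mount the secret as a file instead of injecting it through secretKeyRef. A generic sketch (names are placeholders, not the actual metal3 manifests):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    volumeMounts:
    - name: creds
      mountPath: /etc/creds
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: example-secret             # consumed as files under /etc/creds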
    

Version-Release number of selected component (if applicable):

4.14, 4.13, 4.12    

How reproducible:

100%

Steps to Reproduce:

    1. Install a new cluster using baremetal IPI
    2. Run a compliance scan using compliance operator[1], or just look at the manifest of metal3 or metal3-image-customization pod
    
    [1] https://docs.openshift.com/container-platform/4.14/security/compliance_operator/co-overview.html   

Actual results:

Not compliant to CIS or other security baselines   

Expected results:

Compliant to CIS or other security baselines    

Additional info:

    

Description of problem:

When images have been skipped and no images have been mirrored, IDMS and ITMS files are still generated.
2024/05/15 15:38:25  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/05/15 15:38:25  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/05/15 15:38:25  [INFO]   : ⚙️  setting up the environment for you...
2024/05/15 15:38:25  [INFO]   : 🔀 workflow mode: mirrorToMirror 
2024/05/15 15:38:25  [INFO]   : 🕵️  going to discover the necessary images...
2024/05/15 15:38:25  [INFO]   : 🔍 collecting release images...
2024/05/15 15:38:25  [INFO]   : 🔍 collecting operator images...
2024/05/15 15:38:25  [INFO]   : 🔍 collecting additional images...
2024/05/15 15:38:25  [WARN]   : [AdditionalImagesCollector] mirroring skipped : source image quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc has both tag and digest
2024/05/15 15:38:25  [WARN]   : [AdditionalImagesCollector] mirroring skipped : source image quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9 has both tag and digest
2024/05/15 15:38:25  [INFO]   : 🚀 Start copying the images...
2024/05/15 15:38:25  [INFO]   : === Results ===
2024/05/15 15:38:25  [INFO]   : All release images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25  [INFO]   : All operator images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25  [INFO]   : All additional images mirrored successfully 0 / 0 ✅
2024/05/15 15:38:25  [INFO]   : 📄 Generating IDMS and ITMS files...
2024/05/15 15:38:25  [INFO]   : /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml file created
2024/05/15 15:38:25  [INFO]   : 📄 Generating CatalogSource file...
2024/05/15 15:38:25  [INFO]   : mirror time     : 715.644µs
2024/05/15 15:38:25  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
[fedora@preserve-fedora36 knarra]$ ls -l /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
-rw-r--r--. 1 fedora fedora 0 May 15 15:38 /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml
[fedora@preserve-fedora36 knarra]$ cat /app1/knarra/customertest1/working-dir/cluster-resources/idms-oc-mirror.yaml

    

Version-Release number of selected component (if applicable):

     4.16 oc-mirror
    

How reproducible:

     Always
    

Steps to Reproduce:

    1. Use the following imageSetConfig.yaml and run command `./oc-mirror --v2 -c /tmp/bug331961.yaml --workspace file:///app1/knarra/customertest1 docker://localhost:5000/bug331961 --dest-tls-verify=false`


   
cat /tmp/imageSetConfig.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
   additionalImages:
   - name: quay.io/cilium/cilium-etcd-operator:v2.0.7@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc
   - name: quay.io/coreos/etcd:v3.5.4@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9
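
For comparison, a digest-only variant of the same entries would not hit the tag-plus-digest skip (purely illustrative; not part of the original report and not a statement on which form should be preferred):

   additionalImages:
   - name: quay.io/cilium/cilium-etcd-operator@sha256:04b8327f7f992693c2cb483b999041ed8f92efc8e14f2a5f3ab95574a65ea2dc
   - name: quay.io/coreos/etcd@sha256:a67fb152d4c53223e96e818420c37f11d05c2d92cf62c05ca5604066c37295e9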

Actual results:

    Nothing is mirrored and the listed images are skipped because they have both a tag and a digest, but empty IDMS and ITMS files are still generated.
    

Expected results:

     If nothing is mirrored, idms and itms files should not be generated.
    

Additional info:

    https://issues.redhat.com/browse/OCPBUGS-33196
    

Description of problem:

There is a regression found with libreswan 4.9 and later versions which breaks the IPsec tunnel and makes pod-to-pod traffic fail intermittently. This issue is not seen with libreswan 4.5.

So we must provide flexibility for users to install their own IPsec machine config and choose their own libreswan version, instead of being tied to the CNO-managed IPsec machine config, which installs the libreswan version that ships with the RHCOS distro.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

We need to document properly:

  • How IDMS/ICSP should be configured on the Management cluster in order to allow a successful disconnected HostedCluster deployment.
  • Sample with hcp command
  • file format (a hedged IDMS sketch follows below)
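
As a starting point for the file-format item above, a minimal ImageDigestMirrorSet sketch (the mirror registry hostname is a hypothetical placeholder):

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: disconnected-mirrors
spec:
  imageDigestMirrors:
  - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
    mirrors:
    - mirror.registry.example.com:5000/openshift/release
  - source: quay.io/openshift-release-dev/ocp-release
    mirrors:
    - mirror.registry.example.com:5000/openshift/release-images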

This is a clone of issue OCPBUGS-38051. The following is the description of the original issue:

Description of problem:

Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41778. The following is the description of the original issue:

TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly over the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more than we'd expect given the data in 4.16 GA.

The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator degraded is probably the strongest symptom to pursue as it appears in most of the above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

Description of problem:

`make test` is failing in openshift/coredns repo due to TestImportOrdering failure.

This is due to the recent addition of the github.com/openshift/coredns-ocp-dnsnameresolver external plugin and the fact that CoreDNS doesn't generate zplugin.go formatted correctly so TestImportOrdering fails after generation.

Version-Release number of selected component (if applicable):

4.16-4.17    

How reproducible:

    100%

Steps to Reproduce:

    1. make test   

Actual results:

    TestImportOrdering failure

Expected results:

    TestImportOrdering should not fail

Additional info:

I created an upstream issue and PR: https://github.com/coredns/coredns/pull/6692 which recently merged.

We will just need to carry-patch this in 4.17 and 4.16.

The CoreDNS 1.11.3 rebase https://github.com/openshift/coredns/pull/118 is blocked on this.

Description of problem:

    The installer will not add some ports needed for private clusters.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Create a VPC with a default security group
    2. Deploy a private cluster
    3. Fail
    

Actual results:

    COs cannot use necessary ports

Expected results:

    cluster can fully deploy without manually adding ports

Additional info:

    

Description of problem:

Currently we show the debug container action for pods that are failing. We should also be showing the action for pods in the 'Succeeded' phase.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

    1. Log in into a cluster
    2. Create an example Job resource
    3. Check the job's pod and wait till it is in 'Succeeded' phase

Actual results:

Debug container action is not available on the pod's Logs page

Expected results:

Debug container action is available on the pod's Logs page

Additional info:

Since users are looking for this feature for pods in any phase, we are treating this issue as a bug.
Related stories:
RFE - https://issues.redhat.com/browse/RFE-1935
STORY - https://issues.redhat.com/browse/CONSOLE-4057

Code that needs to be removed - https://github.com/openshift/console/blob/ae115a9e8c72f930a67ee0c545d36f883cd6be34/frontend/public/components/utils/resource-log.tsx#L149-L151

Description of problem:

    When publish: internal, bootstrap SSH rules are still open to the public internet (0.0.0.0/0) instead of being restricted to the machine CIDR.

Version-Release number of selected component (if applicable):

    

How reproducible:

    all private clusters

Steps to Reproduce:

    1. set publish: internal in installconfig
    2. inspect the bootstrap SSH security group rule (a hedged aws CLI sketch follows under Actual results)
    3.
    

Actual results:

    ssh is open to public internet
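
A hedged way to confirm this with the aws CLI (the group-name filter pattern is an assumption about how the installer names the bootstrap security group):

aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=*bootstrap*" \
  --query 'SecurityGroups[].IpPermissions[?FromPort==`22`]'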

Expected results:

    should be restricted to machine network

Additional info:

    

This is a clone of issue OCPBUGS-38918. The following is the description of the original issue:

Description of problem:

   When installing OpenShift 4.16 on vSphere using the IPI method with a template, it fails with the error below:
2024-08-07T09:55:51.4052628Z             "level=debug msg=  Fetching Image...",
2024-08-07T09:55:51.4054373Z             "level=debug msg=  Reusing previously-fetched Image",
2024-08-07T09:55:51.4056002Z             "level=debug msg=  Fetching Common Manifests...",
2024-08-07T09:55:51.4057737Z             "level=debug msg=  Reusing previously-fetched Common Manifests",
2024-08-07T09:55:51.4059368Z             "level=debug msg=Generating Cluster...",
2024-08-07T09:55:51.4060988Z             "level=info msg=Creating infrastructure resources...",
2024-08-07T09:55:51.4063254Z             "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'",
2024-08-07T09:55:51.4065349Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4066994Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4068612Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4070676Z             "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"
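
A hedged way to check whether the 403 comes from the image endpoint itself (the URL is the one reported in the log above; in a fully disconnected environment this check would need to run from a host with outbound access):

curl -sI 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73' | head -n 1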

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    All the time in user environment

Steps to Reproduce:

    1.Try to install disconnected IPI install on vSphere using a template.
    2.
    3.
    

Actual results:

    No cluster installation

Expected results:

    Cluster installed with indicated template

Additional info:

    - 4.14 works as expected in customer environment
    - 4.15 works as expected in customer environment

This is a clone of issue OCPBUGS-38006. The following is the description of the original issue:

Description of problem:

    sometimes cluster-capi-operator pod stuck in CrashLoopBackOff on osp

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-01-213905    

How reproducible:

    Sometimes

Steps to Reproduce:

    1.Create an osp cluster with TechPreviewNoUpgrade
    2.Check cluster-capi-operator pod
    3.
    

Actual results:

cluster-capi-operator pod in CrashLoopBackOff status
$ oc get po                               
cluster-capi-operator-74dfcfcb9d-7gk98          0/1     CrashLoopBackOff   6 (2m54s ago)   41m

$ oc get po         
cluster-capi-operator-74dfcfcb9d-7gk98          1/1     Running   7 (7m52s ago)   46m

$ oc get po                                                               
cluster-capi-operator-74dfcfcb9d-7gk98          0/1     CrashLoopBackOff   7 (2m24s ago)   50m

E0806 03:44:00.584669       1 kind.go:66] "kind must be registered to the Scheme" err="no kind is registered for the type v1alpha7.OpenStackCluster in scheme \"github.com/openshift/cluster-capi-operator/cmd/cluster-capi-operator/main.go:86\"" logger="controller-runtime.source.EventHandler"
E0806 03:44:00.685539       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for clusteroperator caches to sync: timed out waiting for cache to be synced for Kind *v1alpha7.OpenStackCluster" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator"
I0806 03:44:00.685610       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0806 03:44:00.685620       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0806 03:44:00.685646       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685706       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster"
I0806 03:44:00.685712       1 controller.go:242] "All workers finished" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster"
I0806 03:44:00.685717       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685722       1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685718       1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685720       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator"
I0806 03:44:00.685823       1 recorder_in_memory.go:80] &Event{ObjectMeta:{dummy.17e906d425f7b2e1  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:CustomResourceDefinitionUpdateFailed,Message:Failed to update CustomResourceDefinition.apiextensions.k8s.io/openstackclusters.infrastructure.cluster.x-k8s.io: Put "https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/openstackclusters.infrastructure.cluster.x-k8s.io": context canceled,Source:EventSource{Component:cluster-capi-operator-capi-installer-apply-client,Host:,},FirstTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,LastTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0806 03:44:00.719743       1 capi_installer_controller.go:309] "CAPI Installer Controller is Degraded" logger="CapiInstallerController" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"
E0806 03:44:00.719942       1 controller.go:329] "Reconciler error" err="error during reconcile: failed to set conditions for CAPI Installer controller: failed to sync status: failed to update cluster operator status: client rate limiter Wait returned an error: context canceled" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"

Expected results:

    cluster-capi-operator pod is always Running

Additional info:

    

Description of problem:

Pseudolocalization is not working in console.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Go to any console's page and add '?pseudolocalization=true' suffix to the URL
    2.
    3.
    

Actual results:

    The page stays in the same language

Expected results:

    The page should be shown in the pseudolocalized language

Additional info:

    Looks like this is the issue https://github.com/MattBoatman/i18next-pseudo/issues/4

Description of problem:

Following the steps described in https://github.com/openshift/installer/pull/8350 to destroy the bootstrap server manually failed with the error `FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found`


# ./openshift-install version
./openshift-install 4.16.0-0.nightly-2024-05-15-001800
built from commit 494b79cf906dc192b8d1a6d98e56ce1036ea932f
release image registry.ci.openshift.org/ocp/release@sha256:d055d117027aa9afff8af91da4a265b7c595dc3ded73a2bca71c3161b28d9d5d
release architecture amd64

On AWS:
# ./openshift-install create cluster --dir ipi-aws
INFO Credentials loaded from the "default" profile in file "/root/.aws/credentials" 
WARNING failed to find default instance type: no instance type found for the zone constraint 
WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint 
INFO Consuming Install Config from target directory 
WARNING failed to find default instance type: no instance type found for the zone constraint 
WARNING FeatureSet "TechPreviewNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Creating IAM roles for control-plane and compute nodes 
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:44379 --webhook-port=44331 --webhook-cert-dir=/tmp/envtest-serving-certs-1391600832] 
INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:42725 --webhook-port=45711 --webhook-cert-dir=/tmp/envtest-serving-certs-1758849099 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] 
INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests 
INFO Created manifest *v1beta2.AWSClusterControllerIdentity, namespace= name=default 
INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh 
INFO Created manifest *v1beta2.AWSCluster, namespace=openshift-cluster-api-guests name=jima16a-2xszh 
INFO Waiting up to 15m0s (until 11:01PM EDT) for network infrastructure to become ready... 
INFO Network infrastructure is ready              
INFO Creating private Hosted Zone                 
INFO Creating Route53 records for control plane load balancer 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 
INFO Created manifest *v1beta2.AWSMachine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-0 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-1 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master-2 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-bootstrap 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima16a-2xszh-master 
INFO Waiting up to 15m0s (until 11:07PM EDT) for machines to provision... 
INFO Control-plane machines are ready             
INFO Cluster API resources have been created. Waiting for cluster to become ready... 
INFO Waiting up to 20m0s (until 11:12PM EDT) for the Kubernetes API at https://api.jima16a.qe.devcluster.openshift.com:6443... 
INFO API v1.29.4+4a87b53 up                       
INFO Waiting up to 30m0s (until 11:25PM EDT) for bootstrapping to complete... 
^CWARNING Received interrupt signal                    
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: aws infrastructure provider 
INFO Local Cluster API system has completed operations 

# ./openshift-install destroy bootstrap --dir ipi-aws
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-aws/auth/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45869 --webhook-port=43141 --webhook-cert-dir=/tmp/envtest-serving-certs-3670728979] 
INFO Running process: aws infrastructure provider with args [-v=4 --diagnostics-address=0 --health-addr=127.0.0.1:46111 --webhook-port=35061 --webhook-cert-dir=/tmp/envtest-serving-certs-3674093147 --feature-gates=BootstrapFormatIgnition=true,ExternalResourceGC=true] 
FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jima16a-2xszh-bootstrap" not found 
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: aws infrastructure provider 
INFO Local Cluster API system has completed operations 

Same issue on vSphere:
# ./openshift-install create cluster --dir ipi-vsphere/
INFO Consuming Install Config from target directory 
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:39945 --webhook-port=36529 --webhook-cert-dir=/tmp/envtest-serving-certs-3244100953] 
INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:45417 --webhook-port=37503 --webhook-cert-dir=/tmp/envtest-serving-certs-3224060135 --leader-elect=false] 
INFO Created manifest *v1.Namespace, namespace= name=openshift-cluster-api-guests 
INFO Created manifest *v1beta1.Cluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx 
INFO Created manifest *v1beta1.VSphereCluster, namespace=openshift-cluster-api-guests name=jimatest-5sjqx 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=vsphere-creds 
INFO Waiting up to 15m0s (until 10:47PM EDT) for network infrastructure to become ready... 
INFO Network infrastructure is ready              
INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap 
INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 
INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 
INFO Created manifest *v1beta1.VSphereMachine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-0 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-1 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master-2 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-bootstrap 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jimatest-5sjqx-master 
INFO Waiting up to 15m0s (until 10:47PM EDT) for machines to provision... 
INFO Control-plane machines are ready             
INFO Cluster API resources have been created. Waiting for cluster to become ready... 
INFO Waiting up to 20m0s (until 10:57PM EDT) for the Kubernetes API at https://api.jimatest.qe.devcluster.openshift.com:6443... 
INFO API v1.29.4+4a87b53 up                       
INFO Waiting up to 1h0m0s (until 11:37PM EDT) for bootstrapping to complete... 
^CWARNING Received interrupt signal                    
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: vsphere infrastructure provider 
INFO Local Cluster API system has completed operations 
 
# ./openshift-install destroy bootstrap --dir ipi-vsphere/
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi-vsphere/auth/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:34957 --webhook-port=34511 --webhook-cert-dir=/tmp/envtest-serving-certs-94748118] 
INFO Running process: vsphere infrastructure provider with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42073 --webhook-port=46721 --webhook-cert-dir=/tmp/envtest-serving-certs-4091171333 --leader-elect=false] 
FATAL error destroying bootstrap resources failed to delete bootstrap machine: machines.cluster.x-k8s.io "jimatest-5sjqx-bootstrap" not found 
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: vsphere infrastructure provider 
INFO Local Cluster API system has completed operations 


Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-15-001800

How reproducible:

    Always

Steps to Reproduce:

    1. Create cluster
    2. Interrupt the installation when waiting for bootstrap completed
    3. Run command "openshift-install destroy bootstrap --dir <dir>" to destroy bootstrap manually
    

Actual results:

    Failed to destroy bootstrap through command 'openshift-install destroy bootstrap --dir <dir>'

Expected results:

    Bootstrap host is destroyed successfully

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The customer's cloud-credential operator generates millions of the messages below per day in the GCP cluster.

They want to reduce/stop these logs as they are consuming more disk space. Also, their cloud-credential operator runs in manual mode.

time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found"
time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update"
time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials
time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials
time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status"
time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator
time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds

Description of problem:

When running ose-tests conformance suites against hypershift clusters, they error due to the `openshift-oauth-apiserver` namespace not existing. 
    

Version-Release number of selected component (if applicable):

4.15.13
    

How reproducible:

Consistent
    

Steps to Reproduce:

    1. Create a hypershift cluster
    2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
    3. Note errors in logs
    

Actual results:

ERRO[0352]   Finished CollectData for [Jira:"kube-apiserver"] monitor test apiserver-availability collection with not-supported error  error="not supported: namespace openshift-oauth-apiserver not present"
error running options: failed due to a MonitorTest failureerror: failed due to a MonitorTest failure
    

Expected results:

No errors
    

Additional info:


    

As it happened for the ironic container, the ironic-agent container build script needs to be updated for FIPS before we can enable the IPA FIPS option

This is a clone of issue OCPBUGS-39133. The following is the description of the original issue:

Description of problem:

Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.

According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)

> 	"Aug 16 *16:43:17.672* - 2s    E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"

The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:

> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 *16:43:17.650083*    2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...

The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had replaced it some minutes before (the Metrics API backend runs on 2 Pods) was already available.
According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:

> I0816 16:40:24.711048    4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}]

*I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"*

AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.

For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40

bad decisions may be made if the services and/or endpoints caches are stale.

Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:

> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:02.005051Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:35.085330Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:35.128519Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:19:41.148148Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:19:47.797420Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.051594Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.100761Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.938927Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:21:01.699722Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:39:00.328312Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:39:41.124449Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:43:23.701015Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:43:47.108903Z	system:serviceaccount:kube-system:endpoint-controller	update

We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod

Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side.

To summarize, I can see two problems here (maybe one is the consequence of the other):

    A Pod was deleted and an Endpoints entry pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues: https://github.com/kubernetes/kubernetes/issues/125638
    The apiserver resolver had an endpoints cache with one stale and one fresh entry, but it kept trying to reach the stale entry 4-5 times in a row, OR
    The Endpoints object was updated ("at around 16:39:XX the first Pod was rolled out", see above), but the apiserver resolver cache missed that and ended up with 2 stale entries, and had to wait until "at around 16:43:23, the new Pod that replaced the second one was created" (see above) to sync and replace them with 2 fresh entries.


    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. See "Description of problem"
    2.
    3.
    

Actual results:

    

Expected results:

the kube-aggregator should detect stale Apiservice endpoints.
    

Additional info:

the kube-aggregator proxies requests to a stale Endpoints/Pod which makes Metrics API requests falsely fail.
    

Description of problem:

While extracting the cluster's release image from JSON output with the jq tool, an uninitialized variable was mistakenly used in place of the version-number string. As a result, jq could not match any version number, the extraction returned nothing, and the expected image came back empty.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always.

Steps to Reproduce:

    1.Prepare an Azure OpenShift cluster.
    2.Migration to Azure AD workload Identity using procedure https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md#steps-to-in-place-migrate-an-openshift-cluster-to-azure-ad-workload-identity.
    3.Failed on step8: Extract CredentialsCrequests from the cluster's release image for the given version.

Actual results:

Could not extract the expected image, actually it is an empty image.

$ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'` 
$ RELEASE_IMAGE=`oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "VERSION_FROM_PREVIOUS_COMMAND") | .image'`     

Expected results:

The variable should be properly initialized with the correct version number string. This ensures that jq can accurately match the version number and extract the correct image information.

$ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'`
$ RELEASE_IMAGE=`oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "'$CLUSTER_VERSION'") | .image'`

Additional info:

# Obtain release image from the cluster version. Will not work with pre-release versions.

$ CLUSTER_VERSION=`oc get clusterversion version -o json | jq -r '.status.desired.version'`

(Error)$ oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "VERSION_FROM_PREVIOUS_COMMAND") | .image'

$ oc get clusterversion version -o json | jq -r '.status.history[] | select(.version == "'$CLUSTER_VERSION'") | .image'
registry.ci.openshift.org/ocp-arm64/release-arm64@sha256:c605269e51d60b18e6c7251c92355b783bac7f411e137da36b031a1c6f21579b

This is a clone of issue OCPBUGS-39118. The following is the description of the original issue:

Description of problem:

For light theme, the Lightspeed logo should use the multi-color version.

For dark theme, the Lightspeed logo should use the single color version for both the button and the content.

Description of problem:

While testing the backport of Azure Reserved Capacity Group support, a customer observed that they lack the permissions required when operating with Workload Identity (token- and role-based auth). This one is tricky: the target capacity reservation group may not necessarily live in the resource group in which the cluster runs, so in some use cases it would require input from the admin.

Version-Release number of selected component (if applicable):

4.17.0    

How reproducible:

100%    

Steps to Reproduce:

    1. Install 4.17 in Azure with Workload Identity configured
    2. Create an Azure Reserved Capacity Group, for simplicity in the same resource group as the cluster
    3. Update a machineset to use that reserved group
    

Actual results:

    Permissions errors, missing Microsoft.Compute/capacityReservationGroups/deploy/action

Expected results:

    Machines created successfully

Additional info:

As mentioned in the description, it is likely that the reserved capacity group is in another resource group, and thus creating the role requires admin input, i.e. it cannot be computed 100% of the time. Therefore that specific use case may be a documentation concern unless we can come up with a novel solution.

Further, it also brings up the question of whether or not we should include this permission in the default cred request. Customers may not use that default for the reasons mentioned above, or they may not even use the feature at all. But I think for now it may be worth adding it to the 4.17 CredRequest and treating use of this feature in backported versions as a documentation concern, since we wouldn't want to expand permissions on all 4.16 or older clusters.
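
For illustration, the machineset change in step 3 above amounts to pointing the provider spec at the reservation group's full resource ID, roughly along these lines (the exact field name is an assumption, and the ID is a placeholder that may well sit in a different resource group than the cluster):

providerSpec:
  value:
    capacityReservationGroupID: /subscriptions/<subscription-id>/resourceGroups/<other-rg>/providers/Microsoft.Compute/capacityReservationGroups/<group-name>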

 

This bug is to track the initial triage of our low install success rates on vSphere. I couldn't find a duplicate but I could have missed one. Feel free to close it as such if there's a deeper investigation happening elsewhere.

 

Component Readiness has found a potential regression in the following test:

install should succeed: overall

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-07-11T00:00:00Z
End Time: 2024-07-17T23:59:59Z
Success Rate: 60.00%
Successes: 12
Failures: 8
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 60
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=vsphere&Platform=vsphere&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Installer%20%2F%20openshift-installer&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20vsphere%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-07-17%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-07-11%2000%3A00%3A00&testId=cluster%20install%3A0cb1bb27e418491b1ffdacab58c5c8c0&testName=install%20should%20succeed%3A%20overall

This is a clone of issue OCPBUGS-39396. The following is the description of the original issue:

Description of problem:

    When using an amd64 release image and setting the multi-arch flag to false, HCP CLI cannot create a HostedCluster. The following error happens:
/tmp/hcp create cluster aws --role-arn arn:aws:iam::460538899914:role/cc1c0f586e92c42a7d50 --sts-creds /tmp/secret/sts-creds.json --name cc1c0f586e92c42a7d50 --infra-id cc1c0f586e92c42a7d50 --node-pool-replicas 3 --base-domain origin-ci-int-aws.dev.rhcloud.com --region us-east-1 --pull-secret /etc/ci-pull-credentials/.dockerconfigjson --namespace local-cluster --release-image registry.build01.ci.openshift.org/ci-op-0bi6jr1l/release@sha256:11351a958a409b8e34321edfc459f389058d978e87063bebac764823e0ae3183
2024-08-29T06:23:25Z	ERROR	Failed to create cluster	{"error": "release image is not a multi-arch image"}
github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1
	/remote-source/app/product-cli/cmd/cluster/aws/create.go:35
github.com/spf13/cobra.(*Command).execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/remote-source/app/product-cli/main.go:59
runtime.main
	/usr/lib/golang/src/runtime/proc.go:271
Error: release image is not a multi-arch image
release image is not a multi-arch image

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Try to create a HC with an amd64 release image and multi-arch flag set to false
    

Actual results:

   HC does not create and this error is displayed:
Error: release image is not a multi-arch image release image is not a multi-arch image 

Expected results:

    HC should create without errors

Additional info:

  This bug seems to have occurred as a result of HOSTEDCP-1778 and this line:  https://github.com/openshift/hypershift/blob/e2f75a7247ab803634a1cc7f7beaf99f8a97194c/cmd/cluster/aws/create.go#L520

Description of problem:

For troubleshooting OSUS cases, the default must-gather doesn't collect OSUS information, and an inspect of the openshift-update-service namespace is missing several OSUS-related resources like UpdateService, ImageSetConfiguration, and maybe more.

 

Version-Release number of selected component (if applicable):

4.14, 4.15, 4.16, 4.17

 

Actual results:

No OSUS information in must-gather

 

Expected results:

OSUS data in must-gather
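
Until that lands, a hedged manual collection sketch (resource names as referenced above; exact API groups may vary by version):

oc adm inspect ns/openshift-update-service
oc get updateservice -n openshift-update-service -o yaml > updateservices.yaml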

 

Additional info: OTA-1177

PR for 4.17 in [1]

 

[1] https://github.com/openshift/must-gather/pull/443

This is a clone of issue OCPBUGS-43757. The following is the description of the original issue:

Description of problem:

    If the node-joiner container encounters an error, the "oc adm node-image create" command does not show it. It currently returns an error but should also display the node-joiner container's logs so that we can see the underlying issue.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    always

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    The node-image create command returns a container error.

Expected results:

    The node-image create command returns a container error and displays the container's log to aid in diagnosing the issue.

Additional info:

    

Description of problem:

Builds from a BuildConfig are failing on OCP 4.12.48. Developers are impacted since large files can't be cloned anymore within a BuildConfig.

Version-Release number of selected component (if applicable):

4.12.48

How reproducible:

Always

Steps to Reproduce:

The issue was fixed in version 4.12.45 as per https://issues.redhat.com/browse/OCPBUGS-23419 but it still persists in 4.12.48.

Actual results:

The build is failing.

Expected results:

The build should work without any issues.

Additional info:

Build fails with error:
```
Adding cluster TLS certificate authority to trust store
Cloning "https://<path>.git" ...
error: Downloading <github-repo>/projects/<path>.mp4 (70 MB)
Error downloading object: <github-repo>/projects/<path>.mp44 (a11ce74): Smudge error: Error downloading <github-repo>/projects/<path>.mp4 (a11ce745c147aa031dd96915716d792828ae6dd17c60115b675aba75342bb95a): batch request: missing protocol: "origin.git/info/lfs"
Errors logged to /tmp/build/inputs/.git/lfs/logs/20240430T112712.167008327.log
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: <github-repo>/projects/<path>.mp4: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
```


Description of problem:

FeatureGate accepts an unknown value

Version-Release number of selected component (if applicable):

4.16 and 4.17

How reproducible:

Always

Steps to Reproduce:

 oc patch featuregate cluster --type=json -p '[{"op": "replace", "path": "/spec/featureSet", "value": "unknownghfh"}]'
featuregate.config.openshift.io/cluster patched
 oc get  featuregate cluster -o yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2024-06-21T07:20:25Z"
  generation: 2
  name: cluster
  resourceVersion: "56172"
  uid: c900a975-78ea-4076-8e56-e5517e14b55e
spec:
  featureSet: unknownghfh 

Actual results:

featuregate.config.openshift.io/cluster patched
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2024-06-21T07:20:25Z"
  generation: 2
  name: cluster
  resourceVersion: "56172"
  uid: c900a975-78ea-4076-8e56-e5517e14b55e
spec:
  featureSet: unknownghfh 

 

Expected results:

Should not accept an invalid value and should return an error:

oc patch featuregate cluster --type=json -p '[{"op": "replace", "path": "/spec/featureSet", "value": "unknownghfh"}]'
The FeatureGate "cluster" is invalid: spec.featureSet: Unsupported value: "unknownghfh": supported values: "", "CustomNoUpgrade", "LatencySensitive", "TechPreviewNoUpgrade"

Additional info:

https://github.com/openshift/kubernetes/commit/facd3b18622d268a4780de1ad94f7da763351425

 

Add a recording rule, acm_capacity_effective_cpu_cores, on the telemeter server side for ACM subscription usage, with two labels: _id and managed_cluster_id.

The rule is built from these three metrics:

  • acm_managed_cluster_info
  • acm_managed_cluster_worker_cores:sum
  • cluster:capacity_effective_cpu_cores
    The metric value is measured in virtual CPU cores. That means physical CPU cores will be normalized to virtual CPU cores (1 physical CPU core = 2 virtual CPU cores).

Here is the logic for the recording rule:

  • Self managed OpenShift clusters
      acm_capacity_effective_cpu_cores = 2 * cluster:capacity_effective_cpu_cores
  • Managed OpenShift clusters and non-OpenShift clusters
      acm_capacity_effective_cpu_cores = acm_managed_cluster_worker_cores.

Note: If the metric cluster:capacity_effective_cpu_cores is not available for a self managed OpenShift cluster, the value of the metric acm_capacity_effective_cpu_cores will fall back to the metric acm_managed_cluster_worker_cores:sum.
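A minimal sketch of how this could be expressed as a Prometheus recording rule, using the PromQL or operator for the fallback; this is an illustration only, not the actual telemeter-server rule, and the handling of the _id and managed_cluster_id labels is omitted:

groups:
- name: acm-subscription.rules
  rules:
  - record: acm_capacity_effective_cpu_cores
    # Prefer 2 * cluster:capacity_effective_cpu_cores (self-managed OpenShift);
    # fall back to acm_managed_cluster_worker_cores:sum when that metric is absent.
    expr: |
      (2 * cluster:capacity_effective_cpu_cores)
        or
      acm_managed_cluster_worker_cores:sum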

 

See the DDR for more details: https://docs.google.com/document/d/1WbQyaY3C6MxfsebJrV_glX8YvqqzS5fwt3z8Z8SQ0VY/edit?usp=sharing

Description of problem:

Trying to install the AWS EFS CSI Driver operator 4.15 on a 4.16 OCP cluster; the driver node pods get stuck with the error below:
$ oc get pods
NAME                                             READY   STATUS    RESTARTS   AGE
aws-ebs-csi-driver-controller-5f85b66c6-5gw8n    11/11   Running   0          80m
aws-ebs-csi-driver-controller-5f85b66c6-r5lzm    11/11   Running   0          80m
aws-ebs-csi-driver-node-4mcjp                    3/3     Running   0          76m
aws-ebs-csi-driver-node-82hmk                    3/3     Running   0          76m
aws-ebs-csi-driver-node-p7g8j                    3/3     Running   0          80m
aws-ebs-csi-driver-node-q9bnd                    3/3     Running   0          75m
aws-ebs-csi-driver-node-vddmg                    3/3     Running   0          80m
aws-ebs-csi-driver-node-x8cwl                    3/3     Running   0          80m
aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m     1/1     Running   0          80m
aws-efs-csi-driver-controller-6c4c6f8c8c-725f4   4/4     Running   0          11m
aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7   4/4     Running   0          12m
aws-efs-csi-driver-node-2frs7                    0/3     Pending   0          6m29s
aws-efs-csi-driver-node-5cpb8                    0/3     Pending   0          6m26s
aws-efs-csi-driver-node-bchg5                    0/3     Pending   0          6m28s
aws-efs-csi-driver-node-brndb                    0/3     Pending   0          6m27s
aws-efs-csi-driver-node-qcc4m                    0/3     Pending   0          6m27s
aws-efs-csi-driver-node-wpk5d                    0/3     Pending   0          6m27s
aws-efs-csi-driver-operator-6b54c78484-gvxrt     1/1     Running   0          13m

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  6m58s                  default-scheduler  0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  3m42s (x2 over 4m24s)  default-scheduler  0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.

 

 

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    all the time

Steps to Reproduce:

    1. Install AWS EFS CSI driver 4.15 in 4.16 OCP
    2.
    3.
    

Actual results:

    EFS CSI driver node pods are stuck in the Pending state

Expected results:

    All pods should be running.

Additional info:

    More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639
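    A few hedged commands that can help narrow this down; the namespace is assumed to be openshift-cluster-csi-drivers, where the EFS operator normally installs the driver:

    # Why is the node pod unschedulable?
    oc -n openshift-cluster-csi-drivers describe pod aws-efs-csi-driver-node-2frs7 | grep -A 10 Events
    # Do the EBS and EFS node DaemonSets collide on hostPort or node selection?
    oc -n openshift-cluster-csi-drivers get ds aws-ebs-csi-driver-node aws-efs-csi-driver-node -o yaml | grep -iE 'hostPort|nodeSelector|affinity'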

Please review the following PR: https://github.com/openshift/oc/pull/1785

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/474

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-39209. The following is the description of the original issue:

Description of problem:
Attempting to migrate from OpenShiftSDN to OVNKubernetes, but the error below occurs once the Limited Live Migration is started.

+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
I0829 14:06:20.313928   82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf
I0829 14:06:20.314202   82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}}
F0829 14:06:20.315468   82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"

The OpenShift Container Platform 4 - Cluster has been installed with the below configuration and therefore has a conflict because of the clusterNetwork with the Join Subnet of OVNKubernetes.

$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: sandbox1730.opentlc.com
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 3
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 3
    metadata:
      creationTimestamp: null
      name: nonamenetwork
    networking:
      clusterNetwork:
      - cidr: 100.64.0.0/15
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.241.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 198.18.0.0/16
    platform:
      aws:
        region: us-east-2
    publish: External
    pullSecret: ""

Following the documented procedure, the steps below were executed, but the problem is still reported.

oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'

Checking whether change was applied and one can see it being there/configured.

$ oc get network.operator cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-08-29T10:05:36Z"
  generation: 376
  name: cluster
  resourceVersion: "135345"
  uid: 37f08c71-98fa-430c-b30f-58f82142788c
spec:
  clusterNetwork:
  - cidr: 100.64.0.0/15
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      enableUnidling: true
      mode: NetworkPolicy
      mtu: 8951
      vxlanPort: 4789
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: Disabled
      ipv4:
        internalJoinSubnet: 100.68.0.0/16
      mtu: 8901
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OpenShiftSDN
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  kubeProxyConfig:
    bindAddress: 0.0.0.0
  logLevel: Normal
  managementState: Managed
  migration:
    mode: Live
    networkType: OVNKubernetes
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 198.18.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false

After applying the above, the Limited Live Migration is triggered, which then stops because of the error shown.

oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the join subnet of OVNKubernetes: oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller
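Before triggering the migration, it may also help to confirm that the new join subnet actually reached the rendered OVN-Kubernetes configuration; a hedged check (exactly where the value surfaces can vary by version):

oc -n openshift-ovn-kubernetes get cm ovnkube-config -o yaml | grep -iE 'join|100\.68'
oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipv4.internalJoinSubnet}'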

Actual results:

+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
I0829 14:06:20.313928   82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf
I0829 14:06:20.314202   82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}}
F0829 14:06:20.315468   82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"

Expected results:
The OVNKubernetes Limited Live Migration should recognize the internalJoinSubnet change and not report any CIDR/subnet overlap during the migration.

Additional info:
N/A

Affected Platforms:
OpenShift Container Platform 4.16 on AWS

Please review the following PR: https://github.com/openshift/azure-kubernetes-kms/pull/7

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Sometimes the unit tests fail in CI; this appears to be related to how the tests are structured in combination with random test ordering.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Intermittent; when the tests are executed in random order, some of them can fail because stale objects from earlier tests remain in the test suite.
    

Steps to Reproduce:

    1. run `make test`
    2.
    3.
    

Actual results:

Sometimes the following appears in the build output:
    
=== RUN   TestApplyConfigMap/skip_on_extra_label
    resourceapply_test.go:177: 
        Expected success, but got an error:
            <*errors.StatusError | 0xc0002dc140>: 
            configmaps "foo" already exists
            {
                ErrStatus: {
                    TypeMeta: {Kind: "", APIVersion: ""},
                    ListMeta: {
                        SelfLink: "",
                        ResourceVersion: "",
                        Continue: "",
                        RemainingItemCount: nil,
                    },
                    Status: "Failure",
                    Message: "configmaps \"foo\" already exists",
                    Reason: "AlreadyExists",
                    Details: {Name: "foo", Group: "", Kind: "configmaps", UID: "", Causes: nil, RetryAfterSeconds: 0},
                    Code: 409,
                },
            }

Expected results:

tests pass
    

Additional info:

It looks like we are also missing the proper test-env remote-bucket flag on the make command; it should include something like:

"--remote-bucket openshift-kubebuilder-tools"
    

Description of problem:

    The cluster-api-operator https://github.com/openshift/cluster-api-operator is missing the latest update release from upstream cluster-api-operator https://github.com/kubernetes-sigs/cluster-api-operator/tree/main

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In 4.17, users no longer need this job in their cluster. The migration the job performs has already been handled for a couple of releases (4.15, 4.16).

Acceptance Criteria

  • azure-path-fix job is not created in new 4.17 clusters
  • azure-path-fix job is deleted if it exists in the cluster (the job will exist for users upgrading from 4.14, 4.15, and 4.16)
  • the image registry continues to work for users upgrading from 4.13 to 4.17 (directly or through 4.14, 4.15 and 4.16)
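A hedged way to verify these criteria on a cluster, assuming the job lives in the openshift-image-registry namespace:

    # Should return NotFound on new 4.17 clusters and after upgrade cleanup
    oc -n openshift-image-registry get job azure-path-fix
    # The registry operator should stay healthy for upgraded clusters
    oc get co image-registry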

Description of the problem:

Right now, when a patch manifest is placed in the manifests folder, it is ignored.

Users should be able to upload patch manifests only to the openshift folder.

How reproducible:

 

Steps to reproduce:

1. Create a cluster

2. Try to create a patch manifest in the manifests folder

3.

Actual results:
The patch manifest is created, the installation starts, and the patch is ignored.
Expected results:

The UI should block creating the patch manifest in the manifests folder.

This is a clone of issue OCPBUGS-38482. The following is the description of the original issue:

Description of problem:

The PR for "AGENT-938: Enhance console logging to display node ISO expiry date during addNodes workflow" landed in the master branch after the release branch for 4.17 was cut due to delays in tide merge pool.

Need to backport this commit to 4.17 https://github.com/openshift/installer/commit/8c381ff6edbbc9885aac7ce2d6dedc055e01c70d

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When running oc-mirror against a YAML config that includes the community-operator-index, the process terminates prematurely.

Version-Release number of selected component (if applicable):

$ oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202404221110.p0.g0e2235f.assembly.stream.el9-0e2235f", GitCommit:"0e2235f4a51ce0a2d51cfc87227b1c76bc7220ea", GitTreeState:"clean", BuildDate:"2024-04-22T16:05:56Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

$ cat imageset-config.yaml 
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
archiveSize: 4
mirror:
  platform:
    channels:
    - name: stable-4.15
      type: ocp
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    full: false
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    full: false
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
    full: false
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  helm: {}


$ oc-mirror --v2 -c imageset-config.yaml  --loglevel debug --workspace file:////data/oc-mirror/workdir/ docker://registry.local.momolab.io:8443


Last 10 lines:

2024/04/29 06:01:40  [DEBUG]  : source docker://public.ecr.aws/aws-controllers-k8s/apigatewayv2-controller:1.0.7
2024/04/29 06:01:40  [DEBUG]  : destination docker://registry.local.momolab.io:8443/aws-controllers-k8s/apigatewayv2-controller:1.0.7
2024/04/29 06:01:40  [DEBUG]  : source docker://quay.io/openshift-community-operators/ack-apigatewayv2-controller@sha256:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099
2024/04/29 06:01:40  [DEBUG]  : destination docker://registry.local.momolab.io:8443/openshift-community-operators/ack-apigatewayv2-controller:c6844909fa2fdf8aabf1c6762a2871d85fb3491e4c349990f46e4cd1e7ecc099
2024/04/29 06:01:40  [DEBUG]  : source docker://quay.io/openshift-community-operators/openshift-nfd-operator@sha256:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451
2024/04/29 06:01:40  [DEBUG]  : destination docker://registry.local.momolab.io:8443/openshift-community-operators/openshift-nfd-operator:880517267f12e0ca4dd9621aa196c901eb1f754e5ec990a1459d0869a8c17451
2024/04/29 06:01:40  [DEBUG]  : source docker://quay.io/openshift/origin-cluster-nfd-operator:4.10
2024/04/29 06:01:40  [DEBUG]  : destination docker://registry.local.momolab.io:8443/openshift/origin-cluster-nfd-operator:4.10
2024/04/29 06:01:40  [ERROR]  : [OperatorImageCollector] unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly
2024/04/29 06:01:40  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
error closing log file registry.log: close /data/oc-mirror/workdir/working-dir/logs/registry.log: file already closed
2024/04/29 06:01:40  [ERROR]  : unable to parse image registry.redhat.io/openshift4/ose-kube-rbac-proxy correctly 

 

Steps to Reproduce:

    1. Run oc-mirror command as above with debug enabled
    2. Wait a few minutes
    3. oc-mirror fails
    

Actual results:

    oc-mirror fails when openshift-community-operator is included

Expected results:

    oc-mirror should complete successfully

Additional info:

I have the debug logs, which I can attach.    

Please review the following PR: https://github.com/openshift/installer/pull/8456

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

The MAC mapping validation added in MGMT-17618 caused a regression on ABI.

To avoid this regression, the validation should be relaxed to validate only non-predictable interface names.

We should still make sure at least one MAC address exists in the MAC map, so that the relevant host can be detected.

slack discussion.

 

 

How reproducible:

100%

 

Steps to reproduce:

  1. Install on a node with two (statically configured via nmstate yaml) interfaces with a predictable name format (not eth*).
  2. Add the MAC address of only one of the interfaces to the MAC map.

 

Actual results:
error 'mac-interface mapping for interface xxxx is missing'
Expected results:

Installation succeeds and the interfaces are correctly configured.
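For reference, a hedged agent-config.yaml fragment along these lines (host name, interface names, MACs, and IPs are placeholders) maps only one of the host's predictable-name interfaces while the nmstate config statically configures both:

hosts:
- hostname: worker-0
  interfaces:                  # MAC-to-interface map; only one entry is provided
  - name: ens3
    macAddress: 52:54:00:aa:bb:01
  networkConfig:               # nmstate YAML configuring both NICs statically
    interfaces:
    - name: ens3
      type: ethernet
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
        - ip: 192.0.2.10
          prefix-length: 24
    - name: ens4
      type: ethernet
      state: up
      ipv4:
        enabled: true
        dhcp: false
        address:
        - ip: 192.0.2.11
          prefix-length: 24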

Please review the following PR: https://github.com/openshift/csi-operator/pull/241

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/configmap-reload/pull/61

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The troubleshooting panel's global trigger is not displayed in the application launcher. This blocks users from discovering the panel and from troubleshooting problems correctly.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Install COO
    2. Install the troubleshooting panel using the UIPlugin CR
    

Actual results:

The "Signal Correlation" item does not appear in the application launcher when the troubleshooting panel is installed

Expected results:

     The "Signal Correlation" item appears in the application launcher when the troubleshooting panel is installed

Additional info:

https://github.com/openshift/console/pull/14097 to 4.17
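For reference, a hedged sketch of the UIPlugin CR used in step 2; the apiVersion and type value are assumptions based on the Cluster Observability Operator's CRD and may differ between COO versions:

cat << EOF | oc apply -f -
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: troubleshooting-panel
spec:
  type: TroubleshootingPanel
EOF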

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/121

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1056

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-35048. The following is the description of the original issue:

Description of problem:

This is the same issue as the admin console bug OCPBUGS-31931, but in the developer console. On a 4.15.17 cluster, the kubeadmin user goes to the developer console UI, clicks "Observe", selects a project (for example: openshift-monitoring), selects the Silences tab, and clicks "Create silence". The Creator field is not auto-filled with the user name; after adding a label name/value and a comment to create the silence,

the following error is shown on the page:

An error occurred
createdBy in body is required 

see picture: https://drive.google.com/file/d/1PR64hvpYCC-WOHT1ID9A4jX91LdGG62Y/view?usp=sharing

this issue exists in 4.15/4.16/4.17/4.18, no issue with 4.14

Version-Release number of selected component (if applicable):

4.15.17

How reproducible:

Always

Steps to Reproduce:

see the description

Actual results:

The Creator field is not auto-filled with the user name.

Expected results:

no error

Additional info:

    

Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/85

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

We need to disable the migration feature, including setting of migration-datastore-url if multiple vcenters are enabled in the cluster, because currently CSI migration doesn't work in that environment.

Description of problem:

Navigation:
           Pipelines -> Pipelines -> Click on kebab menu -> Add Trigger -> Select Git provider type

Issue:     "Show variables" "Hide variables" are in English.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-01-063526

How reproducible:

Always

Steps to Reproduce:

1. Log into web console and set language as non en_US
2. Navigate to Pipelines -> Pipelines -> Click on kebab menu -> Add Trigger -> Select Git provider type 
3. "Show variables" "Hide variables" are in English

Actual results:

Content is in English

Expected results:

Content should be in set language.

Additional info:

Reference screenshot attached.

This is a clone of issue OCPBUGS-42120. The following is the description of the original issue:

Description of problem:

    After upgrading OCP and LSO to version 4.14, elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO managed PVs are bound to the PVCs. A test pod using a newly created test PV managed by the LSO is able to schedule correctly however.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    Consistently

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.

Expected results:

    That pods would be able to consume LSO managed PVs and schedule correctly to nodes.

Additional info:
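A few hedged checks that usually narrow down why a pod bound to an LSO-managed PV stays Pending; resource names are placeholders:

    # Scheduler's reason for keeping the pod Pending
    oc -n openshift-logging describe pod <elasticsearch-pod> | grep -A 10 Events
    # Node affinity recorded on the LSO-managed PV vs. the labels on the intended node
    oc get pv <local-pv-name> -o yaml | grep -A 10 nodeAffinity
    oc get node <node-name> --show-labels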

    

The upcoming OpenShift Pipelines release, which will be deployed shortly, has stricter validation of Pipeline and Task manifests. The clamav task would fail the new validation.

Description of problem:

In OCPBUGS-30951, we modified a check used in the Cinder CSI Driver Operator to relax the requirements for enabling topology support. Unfortunately, in doing this we introduced a bug: we now attempt to access the volume AZ for each compute AZ, which isn't valid if there are more compute AZs than volume AZs. This needs to be addressed.

Version-Release number of selected component (if applicable):

This affects 4.14 through to master (unreleased 4.17).

How reproducible:

Always.

Steps to Reproduce:

1. Deploy OCP-on-OSP on a cluster with fewer storage AZs than compute AZs
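To confirm the AZ mismatch up front, the zone counts can be compared with the OpenStack CLI; a hedged check run against the same cloud the cluster uses:

openstack availability zone list --compute -f value -c "Zone Name" | sort -u
openstack availability zone list --volume -f value -c "Zone Name" | sort -u
# The bug triggers when the first (compute) list is longer than the second (volume) list.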

Actual results:

Operator fails due to out-of-range error.

Expected results:

Operator should not fail.

Additional info:

None.

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/70

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

A non-lowercase hostname received over DHCP breaks assisted installation.

How reproducible:

100%

Steps to reproduce:

  1. https://issues.redhat.com/browse/AITRIAGE-10248
  2. User did ask for a valid requested_hostname

Actual results:

bootkube fails

Expected results:

bootkube should succeed

 

slack thread

Description of problem:

    Alignment issue with the Breadcrumbs in the Task Selection QuickSearch

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    Always

Steps to Reproduce:

    1. Install the Pipelines Operator
    2. Use the Quick Search in the Pipeline Builder page
    3. Type "git-clone"
    

Actual results:

    Alignment issue with the Breadcrumbs in the Task Selection QuickSearch

Expected results:

    Proper alignment

Additional info:

Screenshot: https://drive.google.com/file/d/1qGWLyfLBHAzfhv8Bnng3IyEJCx8hdMEo/view?usp=drive_link

This is a clone of issue OCPBUGS-38241. The following is the description of the original issue:

Component Readiness has found a potential regression in the following test:

operator conditions control-plane-machine-set

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-09%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-03%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set

Please review the following PR: https://github.com/openshift/ovirt-csi-driver-operator/pull/134

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/153

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Proxy settings configured via buildDefaults are preserved in the built container image.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

I have a customer whose developers need proxy access during builds.
For this they have configured buildDefaults on their cluster as described here: https://docs.openshift.com/container-platform/4.10/cicd/builds/build-configuration.html.
The problem is that buildDefaults.defaultProxy sets the proxy environment variables in uppercase only.
Several Red Hat S2I images use tools that depend on curl, and curl only supports lowercase proxy environment variables, so the defaultProxy settings are not taken into account. To work around this behavior defect, they have configured:
- buildDefaults.env.http_proxy
- buildDefaults.env.https_proxy
- buildDefaults.env.no_proxy
The side effect is that these lowercase environment variables are preserved in the container image. At runtime the proxy settings are therefore still active, and the customer constantly has to help developers unset them again (when using non-FQDN hostnames, for example). This is causing frustration for them and their developers.
Two questions for engineering:
1. Why can't buildDefaults.defaultProxy set the proxy variables in both lowercase and uppercase?
2. Why are the buildDefaults.env values preserved in the container image, while buildDefaults.defaultProxy is correctly unset/removed from it? As the name implies, "buildDefaults" should only apply during the build, and the settings should be removed before the image is pushed to the registry.
We also shared the KCS article https://access.redhat.com/solutions/1575513, but the customer was not satisfied and responded: the article does not provide a solution to the problem; it describes the same issue and gives a workaround that developers would have to apply to each individual BuildConfig, which is not wanted. Setting these envs via buildDefaults is the same workaround, and the core problem remains: the envs are preserved in the container image.
This needs to be addressed by engineering so it is fixed properly.
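For reference, a minimal sketch of the cluster-wide build configuration being discussed; proxy values are placeholders:

apiVersion: config.openshift.io/v1
kind: Build
metadata:
  name: cluster
spec:
  buildDefaults:
    defaultProxy:          # injected as uppercase HTTP_PROXY/HTTPS_PROXY/NO_PROXY and stripped from the output image
      httpProxy: http://proxy.example.com:3128
      httpsProxy: http://proxy.example.com:3128
      noProxy: .cluster.local,.svc
    env:                   # lowercase workaround; these end up preserved in the built image
    - name: http_proxy
      value: http://proxy.example.com:3128
    - name: https_proxy
      value: http://proxy.example.com:3128
    - name: no_proxy
      value: .cluster.local,.svc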

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Azure HC fails to create AzureMachineTemplate if a MachineIdentityID is not provided. 

E0705 19:09:23.783858       1 controller.go:329] "Reconciler error" err="failed to parse ProviderID : invalid resource ID: id cannot be empty" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate" AzureMachineTemplate="clusters-hostedcp-1671-hc/hostedcp-1671-hc-f412695a" namespace="clusters-hostedcp-1671-hc" name="hostedcp-1671-hc-f412695a" reconcileID="74581db2-0ac0-4a30-abfc-38f07b8247cc"

https://github.com/openshift/hypershift/blob/84f594bd2d44e03aaac2d962b0d548d75505fed7/hypershift-operator/controllers/nodepool/azure.go#L52 does not check first to see if a MachineIdentityID was provided before adding the UserAssignedIdentity field.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Create an Azure HC without a MachineIdentityID
    

Actual results:

    Azure HC fails to create AzureMachineTemplate properly, nodes aren't created, and HC is in a failed state.

Expected results:

     Azure HC creates AzureMachineTemplate properly, nodes are created, and HC is in a completed state.

Additional info:

    

This is a clone of issue OCPBUGS-38722. The following is the description of the original issue:

Description of problem:

    We should add validation in the Installer when public-only subnets are enabled, to make sure that:

	1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set.
	2. If this flag is only applicable to public clusters, we could consider exiting earlier if publish: Internal.
	3. If this flag is only applicable to BYO-VPC configurations, we could consider exiting earlier if no subnets are provided in install-config.

Version-Release number of selected component (if applicable):

    all versions that support public-only subnets

How reproducible:

    always

Steps to Reproduce:

    1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY
    2. Do a cluster install without specifying a VPC.
    3.
    

Actual results:

    No warning about the invalid configuration.

Expected results:

    

Additional info:

    This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.

Please review the following PR: https://github.com/openshift/machine-os-images/pull/38

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
Apps exposed via NodePort do not return responses to client requests if the client's ephemeral port is 22623 or 22624.
When testing with curl command specifying the local port as shown below, a response is returned if the ephemeral port is 22622 or 22626, but it times out if the ephemeral port is 22623 or 22624.

[root@bastion ~]# for i in {22622..22626}; do echo localport:${i}; curl -m 10 -I 10.0.0.20:32325 --local-port ${i}; done
localport:22622
HTTP/1.1 200 OK
Server: nginx/1.22.1
Date: Thu, 25 Jul 2024 07:44:22 GMT
Content-Type: text/html
Content-Length: 37451
Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT
Connection: keep-alive
ETag: "66a0f183-924b"
Accept-Ranges: bytes
localport:22623
curl: (28) Connection timed out after 10001 milliseconds
localport:22624
curl: (28) Connection timed out after 10000 milliseconds
localport:22625
HTTP/1.1 200 OK
Server: nginx/1.22.1
Date: Thu, 25 Jul 2024 07:44:42 GMT
Content-Type: text/html
Content-Length: 37451
Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT
Connection: keep-alive
ETag: "66a0f183-924b"
Accept-Ranges: bytes
localport:22626
HTTP/1.1 200 OK
Server: nginx/1.22.1
Date: Thu, 25 Jul 2024 07:44:42 GMT
Content-Type: text/html
Content-Length: 37451
Last-Modified: Wed, 24 Jul 2024 12:20:19 GMT
Connection: keep-alive
ETag: "66a0f183-924b"
Accept-Ranges: bytes

This issue has been occurring since upgrading to version 4.16. Confirmed that it does not occur in versions 4.14 and 4.12.

Version-Release number of selected component (if applicable):
OCP 4.16

How reproducible:
100%

Steps to Reproduce:
1. Prepare a 4.16 cluster.
2. Launch any web app pod (nginx, httpd, etc.).
3. Expose the application externally using NodePort.
4. Access the URL using curl --local-port option to specify 22623 or 22624.

Actual results:
No response is returned from the exposed application when the ephemeral port is 22623 or 22624.

Expected results:
A response is returned regardless of the ephemeral port.

Additional info:
This issue started occurring from version 4.16, so it is possible that this is due to changes in RHEL 9.4, particularly those related to nftables.

Description of problem:

Since we aim to remove PatternFly 4 (PF4) and React Router 5 in 4.18, we need to deprecate these shared modules in 4.16 to give plugin creators time to update their plugins.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/134

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

kube-apiserver was stuck updating versions when upgrading from 4.1 to 4.16 with an AWS IPI installation.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-01-111315
    

How reproducible:

    always
    

Steps to Reproduce:

    1. IPI Install an AWS 4.1 cluster, upgrade it to 4.16
    2. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating
    
    

Actual results:

   1. Upgrade was stuck in 4.15 to 4.16, waiting on etcd, kube-apiserver updating
   $ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-05-16-091947   True        True          39m     Working towards 4.16.0-0.nightly-2024-05-16-092402: 111 of 894 done (12% complete)

    

Expected results:

Upgrade should be successful.
    

Additional info:

Must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.1-aws-ipi-f30/1791391925467615232/artifacts/aws-ipi-f30/gather-must-gather/artifacts/must-gather.tar

Checked the must-gather logs, 
$ omg get clusterversion -oyaml
...
conditions:
  - lastTransitionTime: '2024-05-17T09:35:29Z'
    message: Done applying 4.15.0-0.nightly-2024-05-16-091947
    status: 'True'
    type: Available
  - lastTransitionTime: '2024-05-18T06:31:41Z'
    message: 'Multiple errors are preventing progress:

      * Cluster operator kube-apiserver is updating versions

      * Could not update flowschema "openshift-etcd-operator" (82 of 894): the server
      does not recognize this resource, check extension API servers'
    reason: MultipleErrors
    status: 'True'
    type: Failing

$ omg get co | grep -v '.*True.*False.*False'
NAME                                      VERSION                             AVAILABLE  PROGRESSING  DEGRADED  SINCE
kube-apiserver                            4.15.0-0.nightly-2024-05-16-091947  True       True         False     10m

$ omg get pod -n openshift-kube-apiserver
NAME                                               READY  STATUS     RESTARTS  AGE
installer-40-ip-10-0-136-146.ec2.internal          0/1    Succeeded  0         2h29m
installer-41-ip-10-0-143-206.ec2.internal          0/1    Succeeded  0         2h25m
installer-43-ip-10-0-154-116.ec2.internal          0/1    Succeeded  0         2h22m
installer-44-ip-10-0-154-116.ec2.internal          0/1    Succeeded  0         1h35m
kube-apiserver-guard-ip-10-0-136-146.ec2.internal  1/1    Running    0         2h24m
kube-apiserver-guard-ip-10-0-143-206.ec2.internal  1/1    Running    0         2h24m
kube-apiserver-guard-ip-10-0-154-116.ec2.internal  0/1    Running    0         2h24m
kube-apiserver-ip-10-0-136-146.ec2.internal        5/5    Running    0         2h27m
kube-apiserver-ip-10-0-143-206.ec2.internal        5/5    Running    0         2h24m
kube-apiserver-ip-10-0-154-116.ec2.internal        4/5    Running    17        1h34m
revision-pruner-39-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         2h44m
revision-pruner-39-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         2h50m
revision-pruner-39-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         2h52m
revision-pruner-40-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         2h29m
revision-pruner-40-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         2h29m
revision-pruner-40-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         2h29m
revision-pruner-41-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         2h26m
revision-pruner-41-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         2h26m
revision-pruner-41-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         2h26m
revision-pruner-42-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         2h24m
revision-pruner-42-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         2h23m
revision-pruner-42-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         2h23m
revision-pruner-43-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         2h23m
revision-pruner-43-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         2h23m
revision-pruner-43-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         2h23m
revision-pruner-44-ip-10-0-136-146.ec2.internal    0/1    Succeeded  0         1h35m
revision-pruner-44-ip-10-0-143-206.ec2.internal    0/1    Succeeded  0         1h35m
revision-pruner-44-ip-10-0-154-116.ec2.internal    0/1    Succeeded  0         1h35m

Checked the kube-apiserver-ip-10-0-154-116.ec2.internal logs; something seems wrong with the informers:
$ grep 'informers not started yet' current.log  | wc -l
360

$ grep 'informers not started yet' current.log 
2024-05-18T06:34:51.888804183Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.Secret *v1.FlowSchema *v1.ConfigMap]
2024-05-18T06:34:51.889350484Z [-]informer-sync failed: 4 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema *v1.Secret *v1.ConfigMap]
2024-05-18T06:34:52.004808401Z [-]informer-sync failed: 2 informers not started yet: [*v1.FlowSchema *v1.PriorityLevelConfiguration]
2024-05-18T06:34:52.095516498Z [-]informer-sync failed: 2 informers not started yet: [*v1.PriorityLevelConfiguration *v1.FlowSchema]
...


    

Please review the following PR: https://github.com/openshift/installer/pull/8459

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38622. The following is the description of the original issue:

Description of problem:

    See https://github.com/prometheus/prometheus/issues/14503 for more details
    

Version-Release number of selected component (if applicable):

    4.16
    

How reproducible:


    

Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:

# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF

2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps alert will fire.
Actual results:


    

Expected results: all the samples should be ingested (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).


    

Additional info:

     Regression introduced in Prometheus 2.52.
    Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685 
    

This is a clone of issue OCPBUGS-38636. The following is the description of the original issue:

Description of problem:

Version-Release number of selected component (if applicable):

When navigating from Lightspeed's "Don't show again" link, it can be hard to know which element is relevant.  We should look at utilizing Spotlight to highlight the relevant user preference.

Also, there is an undesirable gap before the Lightspeed user preference caused by an empty div from data-test="console.telemetryAnalytics".

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-42514. The following is the description of the original issue:

Description of problem:

When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.

[1] Configuring registry storage in Azure user infrastructure

Version-Release number of selected component (if applicable):

   4.14.33, 4.15.33

How reproducible:

  1. Set up ARO:
    • Deploy an ARO or OpenShift cluster on Azure, version 4.14.x.
  2. Configure Image Registry:
    • Follow the official documentation [1] to configure the image registry to use a custom Azure storage account located in a different resource group.
    • Ensure that the image-registry-private-configuration-user secret is created in the openshift-image-registry namespace.
    • Do not modify the installer-cloud-credentials secret.
  3. Check the image registry CO status.
  4. Initiate Upgrade:
    • Attempt to upgrade the cluster to OpenShift version 4.15.x.

Steps to Reproduce:

  1. If we have the image-registry-private-configuration-user secret in place and installer-cloud-credentials not modified,

we get the error:

    NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s) 

The operator will also generate a new secret image-registry-private-configuration with the same content as image-registry-private-configuration-user:

$ oc get secret  image-registry-private-configuration -o yaml
apiVersion: v1
data:
  REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx
kind: Secret
metadata:
  annotations:
    imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee
  creationTimestamp: "2024-09-26T19:52:17Z"
  name: image-registry-private-configuration
  namespace: openshift-image-registry
  resourceVersion: "126426"
  uid: e2064353-2511-4666-bd43-29dd020573fe
type: Opaque 

 

2. Then we delete the secret image-registry-private-configuration-user.

Now the secret image-registry-private-configuration still exists with the same content, but the image-registry CO reports a new error:

 

NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix" 

3. Apply the workaround of manually changing the installer-cloud-credentials secret's azure_resourcegroup key to the custom storage account's resource group (a sketch of the patch command follows the secret output below):

$ oc get secret installer-cloud-credentials -o yaml
apiVersion: v1
data:
  azure_client_id: xxxxxxxxxxxxxxxxx
  azure_client_secret: xxxxxxxxxxxxxxxxx
  azure_region: xxxxxxxxxxxxxxxxx
  azure_resource_prefix: xxxxxxxxxxxxxxxxx
  azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS
  azure_subscription_id: xxxxxxxxxxxxxxxxx
  azure_tenant_id: xxxxxxxxxxxxxxxxx
kind: Secret
metadata:
  annotations:
    cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure
  creationTimestamp: "2024-09-26T16:49:57Z"
  labels:
    cloudcredential.openshift.io/credentials-request: "true"
  name: installer-cloud-credentials
  namespace: openshift-image-registry
  resourceVersion: "133921"
  uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383
type: Opaque 
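
For reference, a minimal sketch of that manual patch; the resource group name is a placeholder for the resource group that actually contains the custom storage account:

$ RG_B64=$(echo -n '<custom-storage-account-resource-group>' | base64 -w0)
$ oc -n openshift-image-registry patch secret installer-cloud-credentials \
    --type merge -p "{\"data\":{\"azure_resourcegroup\":\"$RG_B64\"}}"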

 

The image-registry CO reports healthy again, which allows the upgrade to continue.

 

Actual results:

    The image registry still appears to use the service principal method for Azure storage account authentication.

Expected results:

    We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only credential the image registry operator needs for storage account authentication when the customer provides it.
  • The image registry continues to function using the custom Azure storage account in the different resource group.

Additional info:

  • Reproducibility: The issue is consistently reproducible by following the official documentation to configure the image registry with a custom storage account in a different resource group and then attempting an upgrade.
  • Related Issues:
    • Similar problems have been reported in previous incidents, suggesting a systemic issue with the image registry operator's handling of Azure storage credentials.
  • Critical Customer Impact: Customers are required to perform manual interventions after every upgrade for each cluster, which is not sustainable and leads to operational overhead.

 

Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789

Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/28

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38794. The following is the description of the original issue:

Description of problem:

HCP cluster is being updated but the nodepool is stuck updating:
~~~
NAME                   CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
nodepool-dev-cluster   dev       2               2               False         False        4.15.22   True              True
~~~

Version-Release number of selected component (if applicable):

Hosting OCP cluster 4.15
HCP 4.15.23

How reproducible:

N/A

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Nodepool stuck in upgrade

Expected results:

Upgrade success

Additional info:

I have found this error repeating continually in the ignition-server pods:
~~~
{"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

{"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"}
{"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"}

{"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"}

{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"}
~~~

This is a clone of issue OCPBUGS-42553. The following is the description of the original issue:

Context thread.

Description of problem:

     Monitoring the 4.18 agent-based installer CI job for s390x (https://github.com/openshift/release/pull/50293) I discovered unexpected behavior once the installation triggers the reboot-into-disk step for the 2nd and 3rd control plane nodes. (The first control plane node is rebooted last because it is also the bootstrap node.) Instead of rebooting successfully as expected, each node fails to find the OSTree and drops to dracut, stalling the installation.

Version-Release number of selected component (if applicable):

    OpenShift 4.18 on s390x only; discovered using agent installer

How reproducible:

    Try to install OpenShift 4.18 using agent-based installer on s390x

Steps to Reproduce:

    1. Boot nodes with XML (see attached)
    2. Wait for installation to get to reboot phase.
    

Actual results:

    Control plane nodes fail to reboot.

Expected results:

    Control plane nodes reboot and installation progresses.

Additional info:

    See attached logs.

This is a clone of issue OCPBUGS-38368. The following is the description of the original issue:

After the multi-VC changes were merged, the following warnings get logged when we use this tool:

E0812 13:04:34.813216   13159 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors:
line 1: cannot unmarshal !!seq into config.CommonConfigYAML
I0812 13:04:34.813376   13159 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.

This looks a bit scarier than it should.

Description of the problem:

The error "Setting Machine network CIDR is forbidden when cluster is not in vip-dhcp-allocation mode" is only ever seen on cluster updates. This means that users may see this issue only after a cluster is fully installed which would prevent day2 node additions.

How reproducible:

100%

Steps to reproduce:

1. Create an AgentClusterInstall with both VIPs and the machineNetwork CIDR set (a trimmed sketch follows the steps)

2. Observe SpecSynced condition
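
For illustration, a trimmed sketch of the combination from step 1 (values are placeholders and unrelated fields are omitted; only the VIP and machineNetwork parts matter here):

apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: example-aci
  namespace: example-ns
spec:
  apiVIP: 192.168.111.5
  ingressVIP: 192.168.111.4
  networking:
    machineNetwork:
    - cidr: 192.168.111.0/24
  # clusterDeploymentRef, imageSetRef, clusterNetwork, serviceNetwork,
  # provisionRequirements, etc. omitted for brevity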

Actual results:

No error is seen

Expected results:

An error is presented saying this is an invalid combination.

Additional information:

This was originally seen as part of https://issues.redhat.com/browse/ACM-10853 and I was only able to see SpecSynced success for a few seconds before I saw the mentioned error. Somehow, though, this user was able to install a cluster with this configuration, so maybe we should block it with a webhook rather than a condition?

This is a clone of issue OCPBUGS-41538. The following is the description of the original issue:

Description of problem:

    When the user selects a shared vpc install, the created control plane service account is left over. To verify, after the destruction of the cluster check the principals in the host project for a remaining name XXX-m@some-service-account.com
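
One way to verify the leftover principal after destroy (the host project ID and infra ID are placeholders; requires IAM read access on the host project):

$ gcloud projects get-iam-policy <host-project-id> \
    --flatten="bindings[].members" \
    --format="table(bindings.role, bindings.members)" \
    --filter="bindings.members:<infra-id>-m@"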

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    No principal remaining

Additional info:

    

This is a clone of issue OCPBUGS-42961. The following is the description of the original issue:

With the rapid recommendations feature (enhancement), one can request various messages from Pods matching various Pod name regular expressions.

The problem is when there is a Pod (e.g. foo-1 from the example below) matching more than one requested Pod name regex:

{
    'namespace': 'test-namespace',
    'pod_name_regex': 'foo-.*',
    'messages': ['regex1', 'regex2']
},
{
    'namespace': 'test-namespace',
    'pod_name_regex': 'foo-1',
    'messages': ['regex3', 'regex4']
}

Assume Pods with names foo-1 and foo-bar. Currently, all the regexes (regex1, regex2, regex3, regex4) are applied to both Pods.

The desired behavior is that foo-1 is filtered with all the regexes, but foo-bar is filtered only with regex1 and regex2.

Please review the following PR: https://github.com/openshift/must-gather/pull/423

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

I was seeing the following error running `build.sh` with go v1.19.5 until I upgraded to v1.22.4:

```
❯ ./build.sh
pkg/auth/sessions/server_session.go:7:2: cannot find package "." in:
/Users/rhamilto/Git/console/vendor/slices
```

Description of problem:

    The MAPI for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%, though dependent on the order of subnets returned by the IBM Cloud APIs

Steps to Reproduce:

    1. Create 50+ IBM Cloud VPC Subnets
    2. Create a new IPI cluster (with or without BYON)
    3. MAPI will attempt to find Subnet details by name, likely failing as it only checks the first group (50)...depending on order returned by IBM Cloud API
    

Actual results:

    MAPI fails to find Subnet ID, thus cannot create/manage cluster nodes.

Expected results:

    Successful IPI deployment.

Additional info:

    IBM Cloud is working on a patch to MAPI to handle the ListSubnets API call and pagination results.

This is a clone of issue OCPBUGS-38225. The following is the description of the original issue:

Description of problem:

    https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Fallout of https://issues.redhat.com/browse/OCPBUGS-35371

We simply do not have enough visibility into why these kubelet endpoints are going down, outside of a reboot, while kubelet itself stays up.

A big step would be charting them with the intervals. Add a new monitor test to query prometheus at the end of the run looking for when these targets were down.

Prom query:

max by (node, metrics_path) (up{job="kubelet"}) == 0

Then perhaps a test to flake if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
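
For anyone wanting to run that query by hand, a sketch against the in-cluster thanos-querier route (assumes oc 4.11+ for `oc create token` and that the prometheus-k8s service account is allowed to query, which it is by default as far as I know):

$ TOKEN=$(oc -n openshift-monitoring create token prometheus-k8s)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -skG "https://$HOST/api/v1/query" \
    -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=max by (node, metrics_path) (up{job="kubelet"}) == 0'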

Description of problem:

Regression of OCPBUGS-12739

level=warning msg="Couldn't unmarshall OVN annotations: ''. Skipping." err="unexpected end of JSON input"
    

Upstream OVN changed the node annotation from "k8s.ovn.org/host-addresses" to "k8s.ovn.org/host-cidrs" in OpenShift 4.14

https://github.com/ovn-org/ovn-kubernetes/pull/3915

We might need to fix baremetal-runtimecfg

diff --git a/pkg/config/node.go b/pkg/config/node.go
index 491dd4f..078ad77 100644
--- a/pkg/config/node.go
+++ b/pkg/config/node.go
@@ -367,10 +367,10 @@ func getNodeIpForRequestedIpStack(node v1.Node, filterIps []string, machineNetwo
                log.Debugf("For node %s can't find address using NodeInternalIP. Fallback to OVN annotation.", node.Name)
 
                var ovnHostAddresses []string
-               if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-addresses"]), &ovnHostAddresses); err != nil {
+               if err := json.Unmarshal([]byte(node.Annotations["k8s.ovn.org/host-cidrs"]), &ovnHostAddresses); err != nil {
                        log.WithFields(logrus.Fields{
                                "err": err,
-                       }).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-addresses"])
+                       }).Warnf("Couldn't unmarshall OVN annotations: '%s'. Skipping.", node.Annotations["k8s.ovn.org/host-cidrs"])
                }
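
For reference, the new annotation can be inspected directly on a node (node name is a placeholder); note that host-cidrs carries CIDRs rather than bare IPs, so consumers may also need to strip the prefix length:

$ oc get node <node-name> -o json \
    | jq -r '.metadata.annotations["k8s.ovn.org/host-cidrs"]'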
 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-30-130713
 

How reproducible:

Frequent

Steps to Reproduce:

    1. Deploy vsphere IPv4 cluster
    2. Convert to Dualstack IPv4/IPv6
    3. Add machine network and IPv6 apiServerInternalIPs and ingressIPs
    4. Check keepalived.conf
for f in $(oc get pods -n openshift-vsphere-infra -l app=vsphere-infra-vrrp --no-headers -o custom-columns=N:.metadata.name  ) ; do oc -n openshift-vsphere-infra exec -c keepalived $f -- cat /etc/keepalived/keepalived.conf | tee $f-keepalived.conf ; done
    

Actual results:

IPv6 VIP is not in keepalived.conf

Expected results:
Something like:

vrrp_instance rbrattai_INGRESS_1 {
    state BACKUP
    interface br-ex
    virtual_router_id 129
    priority 20
    advert_int 1

    unicast_src_ip fd65:a1a8:60ad:271c::cc
    unicast_peer {
        fd65:a1a8:60ad:271c:9af:16a9:cb4f:d75c
        fd65:a1a8:60ad:271c:86ec:8104:1bc2:ab12
        fd65:a1a8:60ad:271c:5f93:c9cf:95f:9a6d
        fd65:a1a8:60ad:271c:bb4:de9e:6d58:89e7
        fd65:a1a8:60ad:271c:3072:2921:890:9263
    }
...
    virtual_ipaddress {
        fd65:a1a8:60ad:271c::1117/128
    }
...
}
    

This is a clone of issue OCPBUGS-42579. The following is the description of the original issue:

Hello Team,

When we deploy the HyperShift cluster with OpenShift Virtualization, specifying the NodePort strategy for services, the requests to ignition, oauth, connectivity (for oc rsh, oc logs, oc exec) and the virt-launcher-hypershift-node-pool pod fail, because by default the following netpols get created automatically and restrict the traffic on all other ports.

 

$ oc get netpol
NAME                      POD-SELECTOR           AGE
kas                       app=kube-apiserver     153m
openshift-ingress         <none>                 153m
openshift-monitoring      <none>                 153m
same-namespace            <none>                 153m 

I resolved this by creating the following NetworkPolicies manually:

$ cat ingress-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress
spec:
  ingress:
  - ports:
    - port: 31032
      protocol: TCP
  podSelector:
    matchLabels:
      kubevirt.io: virt-launcher
  policyTypes:
  - Ingress


$ cat oauth-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: oauth
spec:
  ingress:
  - ports:
    - port: 6443
      protocol: TCP
  podSelector:
    matchLabels:
      app: oauth-openshift
      hypershift.openshift.io/control-plane-component: oauth-openshift
  policyTypes:
  - Ingress


$ cat ignition-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nodeport-ignition-proxy
spec:
  ingress:
  - ports:
    - port: 8443
      protocol: TCP
  podSelector:
    matchLabels:
      app: ignition-server-proxy
  policyTypes:
  - Ingress


$ cat konn-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: konn
spec:
  ingress:
  - ports:
    - port: 8091
      protocol: TCP
  podSelector:
    matchLabels:
      app: kube-apiserver
      hypershift.openshift.io/control-plane-component: kube-apiserver
  policyTypes:
  - Ingress

The bug for ignition netpol has already been reported.

--> https://issues.redhat.com/browse/OCPBUGS-39158

--> https://issues.redhat.com/browse/OCPBUGS-39317

 

It would be helpful if these policies were created automatically as well, or if HyperShift offered an option to disable the automatic management of network policies so that we can take care of them manually.

 

User Story:

As an SRE, I want to aggregate the `cluster_proxy_ca_expiry_timestamp` metric to achieve feature parity with OSD/ROSA clusters as they are today.

Acceptance Criteria:

Description of criteria:

  • cluster_proxy_ca_expiry_timestamp is added as a metric to hypershift operator
  • cluster_proxy_ca_expiry_timestamp is exported to RHOBS for alerting and monitoring

Engineering Details:

The expiry in classic is calculated by looking at the user supplied CA bundle, running openssl command to extract the expiry, and calculating the number of days that the cert is valid for. We should use the same approach to calculate the expiry for HCP clusters.
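
A minimal sketch of that calculation (the bundle path is a placeholder; for a bundle with multiple certificates you would take the minimum across them, since `openssl x509` only reads the first one):

$ END=$(openssl x509 -enddate -noout -in user-ca-bundle.pem | cut -d= -f2)
$ echo $(( ( $(date -d "$END" +%s) - $(date +%s) ) / 86400 )) days until expiry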

Current implementation:

SRE Spike for this effort: https://issues.redhat.com/browse/OSD-15414

Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/306

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After installing the MCE operator, trying to create a MultiClusterEngine instance failed with the error:
 "error applying object Name: mce Kind: ConsolePlugin Error: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found"
Checking in openshift-console-operator, there is no webhook service, and the deployment "console-conversion-webhook" is also missing.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-25-103421
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check resources in openshift-console-operator, such as deployments and services.
    2.
    3.
    

Actual results:

1. There is no webhook-related deployment, pod, or service.
    

Expected results:

1. Webhook-related resources should exist.
    

Additional info:


    

Description of problem:

    When deploying a private cluster, if the VPC isn't a permitted network for the desired DNS zone, the installer will not add it as one.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Deploy a private cluster with CAPI in a VPC where the desired DNS zone is not permitted
    2. Fail
    3.
    

Actual results:

    Cluster cannot reach endpoints and deployment fails

Expected results:

    Network should be permitted and deployment should succeed

Additional info:

    

Description of problem:

Live migration gets stuck when the ConfigMap mtu is absent. The ConfigMap mtu has been created by the mtu-prober job at installation time since 4.11, but if the cluster was upgraded from a very early release, such as 4.4.4, the ConfigMap mtu may be absent.

Version-Release number of selected component (if applicable):

4.16.rc2

How reproducible:

 

Steps to Reproduce:

1. build a 4.16 cluster with OpenShiftSDN
2. remove the configmap mtu from the namespace cluster-network-operator.
3. start live migration.

Actual results:

Live migration gets stuck with error

NetworkTypeMigrationFailed
Failed to process SDN live migration (configmaps "mtu" not found)

Expected results:

Live migration finished successfully.

Additional info:

A workaround is to create the configmap mtu manually before starting live migration.
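
A hedged sketch of that workaround, assuming the ConfigMap follows the mtu-prober convention of a single mtu key in the network operator's namespace (verify the namespace, key name, and value against a healthy cluster on the same version before applying):

$ oc -n openshift-network-operator create configmap mtu --from-literal=mtu=1500   # use your nodes' actual MTU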

Converted to a bug so it could be backported.

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38994. The following is the description of the original issue:

Description of problem:

The library-sync.sh script may leave some files of the unsupported samples in the checkout. In particular, files that have been renamed are not deleted even though they should have been.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Run library-sync.sh

Actual results:

A couple of files under assets/operator/ocp-x86_64/fis are present.    

Expected results:

The directory should not be present at all, because it is not supported.    

Additional info:

    

Request for sending data via telemetry

The goal is to collect a recording rule about issues within the pods of CNV containers; at the moment the cnv_abnormal metric includes memory-exceeded values by container for the pod with the highest exceeded bytes.

The recording rules are attached in the screenshot.

Labels

  • container, possible values are `virt-api`, `virt-controller`, `virt-handler`, `virt-operator`
  • reason, possible values are `memory_working_set_delta_from_request` or `memory_rss_delta_from_request`

The cardinality of the metric is at most 8.
The end result contains 4 (containers) x 2 (memory types) = 8 records, with 2 labels for each record. In addition we use 2 additional kubevirt rules that calculate the end values by memory type: https://github.com/kubevirt/kubevirt/blob/main/pkg/monitoring/rules/recordingrules/operator.go. cnv_abnormal is reported in hco: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/pkg/monitoring/rules/recordingrules/operator.go#L28
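
As an example of what a consumer of this telemetry could query, using the labels listed above:

cnv_abnormal{container="virt-handler", reason="memory_rss_delta_from_request"}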

Description of problem: Missing dependency warning error in console UI dev env

[yapei@yapei-mac frontend (master)]$ yarn lint --fix
yarn run v1.22.15
$ NODE_OPTIONS=--max-old-space-size=4096 yarn eslint . --fix
$ eslint --ext .js,.jsx,.ts,.tsx,.json,.gql,.graphql --color . --fix
/Users/rhamilto/Git/console/frontend/packages/console-shared/src/components/close-button/CloseButton.tsx
  2:46  error  Unable to resolve path to module '@patternfly/react-component-groups'  import/no-unresolved
/Users/yapei/go/src/github.com/openshift/console/frontend/public/components/resource-dropdown.tsx
  109:6  warning  React Hook React.useEffect has missing dependencies: 'clearItems' and 'recentSelected'. Either include them or remove the dependency array  react-hooks/exhaustive-deps
✖ 2 problems (1 error, 1 warning)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

[yapei@yapei-mac frontend (master)]$ git log -1
commit 9478885a967f633ddc327ade1c0d552094db418b (HEAD -> master, origin/release-4.17, origin/release-4.16, origin/master, origin/HEAD)
Merge: 3e708b7df9 c3d89c5798
Author: openshift-merge-bot[bot] <148852131+openshift-merge-bot[bot]@users.noreply.github.com>
Date: Mon Mar 18 16:33:11 2024 +0000

    Merge pull request #13665 from cyril-ui-developer/add-locales-support-fr-es

Description of problem:

An infra machine goes into Failed status:

2024-05-18 07:26:49.815 | NAMESPACE               NAME                          PHASE     TYPE     REGION      ZONE   AGE
2024-05-18 07:26:49.822 | openshift-machine-api   ostest-wgdc2-infra-0-4sqdh    Running   master   regionOne   nova   31m
2024-05-18 07:26:49.826 | openshift-machine-api   ostest-wgdc2-infra-0-ssx8j    Failed                                31m
2024-05-18 07:26:49.831 | openshift-machine-api   ostest-wgdc2-infra-0-tfkf5    Running   master   regionOne   nova   31m
2024-05-18 07:26:49.841 | openshift-machine-api   ostest-wgdc2-master-0         Running   master   regionOne   nova   38m
2024-05-18 07:26:49.847 | openshift-machine-api   ostest-wgdc2-master-1         Running   master   regionOne   nova   38m
2024-05-18 07:26:49.852 | openshift-machine-api   ostest-wgdc2-master-2         Running   master   regionOne   nova   38m
2024-05-18 07:26:49.858 | openshift-machine-api   ostest-wgdc2-worker-0-d5cdp   Running   worker   regionOne   nova   31m
2024-05-18 07:26:49.868 | openshift-machine-api   ostest-wgdc2-worker-0-jcxml   Running   worker   regionOne   nova   31m
2024-05-18 07:26:49.873 | openshift-machine-api   ostest-wgdc2-worker-0-t29fz   Running   worker   regionOne   nova   31m 

Logs from the machine-controller show the error below:

2024-05-18T06:59:11.159013162Z I0518 06:59:11.158938       1 controller.go:156] ostest-wgdc2-infra-0-ssx8j: reconciling Machine
2024-05-18T06:59:11.159589148Z I0518 06:59:11.159529       1 recorder.go:104] events "msg"="Reconciled machine ostest-wgdc2-worker-0-jcxml" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-wgdc2-worker-0-jcxml","uid":"245bac8e-c110-4bef-ac11-3d3751a93353","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"18617"} "reason"="Reconciled" "type"="Normal"
2024-05-18T06:59:12.749966746Z I0518 06:59:12.749845       1 controller.go:349] ostest-wgdc2-infra-0-ssx8j: reconciling machine triggers idempotent create
2024-05-18T07:00:00.487702632Z E0518 07:00:00.486365       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-api-provider-openstack-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-openstack-leader": http2: client connection lost
2024-05-18T07:00:00.487702632Z W0518 07:00:00.486497       1 controller.go:351] ostest-wgdc2-infra-0-ssx8j: failed to create machine: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost
2024-05-18T07:00:00.487702632Z I0518 07:00:00.486534       1 controller.go:391] Actuator returned invalid configuration error: error creating bootstrap for ostest-wgdc2-infra-0-ssx8j: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/worker-user-data": http2: client connection lost
2024-05-18T07:00:00.487702632Z I0518 07:00:00.486548       1 controller.go:404] ostest-wgdc2-infra-0-ssx8j: going into phase "Failed"   

The openstack VM is not even created:

2024-05-18 07:26:50.911 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+
2024-05-18 07:26:50.917 | | ID                                   | Name                        | Status | Networks                                                                                                            | Image              | Flavor |
2024-05-18 07:26:50.924 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+
2024-05-18 07:26:50.929 | | 3a1b9af6-d284-4da5-8ebe-434d3aa95131 | ostest-wgdc2-worker-0-jcxml | ACTIVE | StorageNFS=172.17.5.187; network-dualstack=192.168.192.185, fd2e:6f44:5dd8:c956:f816:3eff:fe3e:4e7c                 | ostest-wgdc2-rhcos | worker |
2024-05-18 07:26:50.935 | | 5c34b78a-d876-49fb-a307-874d3c197c44 | ostest-wgdc2-infra-0-tfkf5  | ACTIVE | network-dualstack=192.168.192.133, fd2e:6f44:5dd8:c956:f816:3eff:fee6:4410, fd2e:6f44:5dd8:c956:f816:3eff:fef2:930a | ostest-wgdc2-rhcos | master |
2024-05-18 07:26:50.941 | | d2025444-8e11-409d-8a87-3f1082814af1 | ostest-wgdc2-infra-0-4sqdh  | ACTIVE | network-dualstack=192.168.192.156, fd2e:6f44:5dd8:c956:f816:3eff:fe82:ae56, fd2e:6f44:5dd8:c956:f816:3eff:fe86:b6d1 | ostest-wgdc2-rhcos | master |
2024-05-18 07:26:50.947 | | dcbde9ac-da5a-44c8-b64f-049f10b6b50c | ostest-wgdc2-worker-0-t29fz | ACTIVE | StorageNFS=172.17.5.233; network-dualstack=192.168.192.13, fd2e:6f44:5dd8:c956:f816:3eff:fe94:a2d2                  | ostest-wgdc2-rhcos | worker |
2024-05-18 07:26:50.951 | | 8ad98adf-147c-4268-920f-9eb5c43ab611 | ostest-wgdc2-worker-0-d5cdp | ACTIVE | StorageNFS=172.17.5.217; network-dualstack=192.168.192.173, fd2e:6f44:5dd8:c956:f816:3eff:fe22:5cff                 | ostest-wgdc2-rhcos | worker |
2024-05-18 07:26:50.957 | | f01d6740-2954-485d-865f-402b88789354 | ostest-wgdc2-master-2       | ACTIVE | StorageNFS=172.17.5.177; network-dualstack=192.168.192.198, fd2e:6f44:5dd8:c956:f816:3eff:fe1f:3c64                 | ostest-wgdc2-rhcos | master |
2024-05-18 07:26:50.963 | | d215a70f-760d-41fb-8e30-9f3106dbaabe | ostest-wgdc2-master-1       | ACTIVE | StorageNFS=172.17.5.163; network-dualstack=192.168.192.152, fd2e:6f44:5dd8:c956:f816:3eff:fe4e:67b6                 | ostest-wgdc2-rhcos | master |
2024-05-18 07:26:50.968 | | 53fe495b-f617-412d-9608-47cd355bc2e5 | ostest-wgdc2-master-0       | ACTIVE | StorageNFS=172.17.5.170; network-dualstack=192.168.192.193, fd2e:6f44:5dd8:c956:f816:3eff:febd:a836                 | ostest-wgdc2-rhcos | master |
2024-05-18 07:26:50.975 | +--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------------------------------------------------+--------------------+--------+ 

Version-Release number of selected component (if applicable):

RHOS-17.1-RHEL-9-20240123.n.1
4.15.0-0.nightly-2024-05-16-091947

Additional info:

   Must-gather link provided on private comment.

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/109

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The tech preview jobs can sometimes fail: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616

It seems early on the pinnedimageset controller can panic: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller_previous.log

Although it is fine on future syncs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial/1787262709813743616/artifacts/e2e-vsphere-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-66559c9856-58g4w_machine-config-controller.log

Version-Release number of selected component (if applicable):

4.16.0 techpreview only    

How reproducible:

Unsure

Steps to Reproduce:

See CI

Actual results:

 

Expected results:

Don't panic

Additional info:

    

This is a clone of issue OCPBUGS-43280. The following is the description of the original issue:

Description of problem:

NTO CI started failing with:
 • [FAILED] [247.873 seconds]
[rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0]
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
  [FAILED] Expected
      <cpuset.CPUSet>: {
          elems: {0: {}, 2: {}},
      }
  to equal
      <cpuset.CPUSet>: {
          elems: {0: {}, 1: {}, 2: {}, 3: {}},
      }
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436 

The failure happened because the test pod couldn't get admitted after the Kubelet restart.

Additionally, the failure is happening at this line:
https://github.com/openshift/kubernetes/blob/cec2232a4be561df0ba32d98f43556f1cad1db01/pkg/kubelet/cm/cpumanager/policy_static.go#L352 

something has changed with how Kubelet accounts for `availablePhysicalCPUs`

Version-Release number of selected component (if applicable):

    4.18 (started happening after OCP rebased on top of k8s 1.31)

How reproducible:

    Always

Steps to Reproduce:

    1. Set up a system with 4 CPUs and apply performance-profile with single-numa-policy
    2. Run pao-functests
    

Actual results:

    Tests failing with:
 • [FAILED] [247.873 seconds] [rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0] /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309 [FAILED] Expected <cpuset.CPUSet>: { elems: {0: {}, 2: {}}, } to equal <cpuset.CPUSet>: { elems: {0: {}, 1: {}, 2: {}, 3: {}}, } In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436 

Expected results:

    Tests should pass

Additional info:

    NOTE: The issue occurs only on systems with a small number of CPUs (4 in our case).

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/21

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    observed panic in kube-scheduler:

2024-05-29T07:53:40.874397450Z E0529 07:53:40.873820       1 runtime.go:79] Observed a panic: "integer divide by zero" (runtime error: integer divide by zero)
2024-05-29T07:53:40.874397450Z goroutine 2363 [running]:
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x215e8a0, 0x3c7c150})
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
2024-05-29T07:53:40.874397450Z panic({0x215e8a0?, 0x3c7c150?})
2024-05-29T07:53:40.874397450Z     runtime/panic.go:770 +0x132
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.874397450Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:505 +0xaf0
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.874397450Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x31f
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0xc0005b6900, {0x28f4618, 0xc002a97360}, 0xc002ac1a00, {0x291d688, 0xc00039f688}, 0xc002a96370, {0xc18dd5a13410037e, 0x72c11612c5e, 0x3d515e0}, ...)
2024-05-29T07:53:40.874397450Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0x115
2024-05-29T07:53:40.874397450Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0xc0005b6900, {0x28f4618, 0xc000df7ea0})
2024-05-29T07:53:40.874397450Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x698
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x1f
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00214bee0?)
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00214bf70, {0x28cfa20, 0xc00169e6c0}, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001256f70, 0x0, 0x0, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x28f4618, 0xc000df7ea0}, 0xc0009be200, 0x0, 0x0, 0x1)
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x93
2024-05-29T07:53:40.874397450Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-05-29T07:53:40.874397450Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:170
2024-05-29T07:53:40.874397450Z created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 2386
2024-05-29T07:53:40.874397450Z     k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x119
2024-05-29T07:53:40.876894723Z panic: runtime error: integer divide by zero [recovered]
2024-05-29T07:53:40.876894723Z     panic: runtime error: integer divide by zero
2024-05-29T07:53:40.876894723Z 
2024-05-29T07:53:40.876894723Z goroutine 2363 [running]:
2024-05-29T07:53:40.876894723Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
2024-05-29T07:53:40.876894723Z     k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
2024-05-29T07:53:40.876929875Z panic({0x215e8a0?, 0x3c7c150?})
2024-05-29T07:53:40.876929875Z     runtime/panic.go:770 +0x132
2024-05-29T07:53:40.876929875Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).findNodesThatFitPod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.876943106Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:505 +0xaf0
2024-05-29T07:53:40.876953277Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulePod(0xc0005b6900, {0x28f4618, 0xc002a97360}, {0x291d688, 0xc00039f688}, 0xc002ac1a00, 0xc0022fc488)
2024-05-29T07:53:40.876962958Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:402 +0x31f
2024-05-29T07:53:40.876973018Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).schedulingCycle(0xc0005b6900, {0x28f4618, 0xc002a97360}, 0xc002ac1a00, {0x291d688, 0xc00039f688}, 0xc002a96370, {0xc18dd5a13410037e, 0x72c11612c5e, 0x3d515e0}, ...)
2024-05-29T07:53:40.877000640Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:149 +0x115
2024-05-29T07:53:40.877000640Z k8s.io/kubernetes/pkg/scheduler.(*Scheduler).ScheduleOne(0xc0005b6900, {0x28f4618, 0xc000df7ea0})
2024-05-29T07:53:40.877011311Z     k8s.io/kubernetes/pkg/scheduler/schedule_one.go:111 +0x698
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-05-29T07:53:40.877028792Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x1f
2024-05-29T07:53:40.877028792Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00214bee0?)
2024-05-29T07:53:40.877028792Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
2024-05-29T07:53:40.877049294Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00214bf70, {0x28cfa20, 0xc00169e6c0}, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.877058805Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
2024-05-29T07:53:40.877068225Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001256f70, 0x0, 0x0, 0x1, 0xc000cfd9e0)
2024-05-29T07:53:40.877088457Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
2024-05-29T07:53:40.877088457Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x28f4618, 0xc000df7ea0}, 0xc0009be200, 0x0, 0x0, 0x1)
2024-05-29T07:53:40.877099448Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x93
2024-05-29T07:53:40.877099448Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-05-29T07:53:40.877109888Z     k8s.io/apimachinery/pkg/util/wait/backoff.go:170
2024-05-29T07:53:40.877109888Z created by k8s.io/kubernetes/pkg/scheduler.(*Scheduler).Run in goroutine 2386
2024-05-29T07:53:40.877119479Z     k8s.io/kubernetes/pkg/scheduler/scheduler.go:445 +0x119

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    there are a lot of instances; see https://search.dptools.openshift.org/?search=runtime+error%3A+integer+divide+by+zero&maxAge=24h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

$ podman run -it corbinu/alpine-w3m -dump -cols 200 "https://search.dptools.openshift.org/?search=runtime+error%3A+integer+divide+by+zero&maxAge=24h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job" | grep 'failures match' | sort
openshift-origin-28839-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-techpreview-serial (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-techpreview-serial (all) - 3 runs, 33% failed, 200% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-sdn-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview-serial (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.17-fips-payload-scan (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
pull-ci-openshift-api-master-e2e-aws-serial-techpreview (all) - 8 runs, 100% failed, 50% of failures match = 50% impact
pull-ci-openshift-hypershift-main-e2e-kubevirt-azure-ovn (all) - 27 runs, 70% failed, 5% of failures match = 4% impact
pull-ci-openshift-installer-master-e2e-openstack-dualstack-upi (all) - 6 runs, 83% failed, 20% of failures match = 17% impact

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    see https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial/1795684524709908480

you need to pull the must-gather and you will find the panic in the openshift-kube-scheduler pod

Description of problem:
We have an egressfirewall set in our build farm build09. Once a node is deleted, all ovnkube-node-* pods crash immediately.

Version-Release number of selected component (if applicable):
4.16.0-rc3

How reproducible:

Steps to Reproduce:

1. create an egressfirewall object in any namespace

2. delete an node on the cluster

3.

Actual results:
All ovnkube-node-* pods crash.

Expected results:
Nothing shall happen

Additional info:
https://redhat-internal.slack.com/archives/CDCP2LA9L/p1718210291108709

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

This is a clone of issue OCPBUGS-44099. The following is the description of the original issue:

Description of problem:

OCPBUGS-42772 is verified, but testing found an oauth-server panic with OAuth 2.0 IDP names that contain whitespace.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-31-190119    

How reproducible:

Always    

Steps to Reproduce:

1. Set up Google IDP with below:
$ oc create secret generic google-secret-1 --from-literal=clientSecret=xxxxxxxx -n openshift-config
$ oc edit oauth cluster
spec:
  identityProviders:
  - google:
      clientID: 9745..snipped..apps.googleusercontent.com
      clientSecret:
        name: google-secret-1
      hostedDomain: redhat.com
    mappingMethod: claim
    name: 'my Google idp'
    type: Google
...

Actual results:

oauth-server panic:

$ oc get po -n openshift-authentication
NAME                               READY   STATUS             RESTARTS
oauth-openshift-59545c6f5-dwr6s    0/1     CrashLoopBackOff   11 (4m10s ago)
...

$ oc logs -p -n openshift-authentication oauth-openshift-59545c6f5-dwr6s
Copying system trust bundle
I1101 03:40:09.883698       1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key"
I1101 03:40:09.884046       1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com"
I1101 03:40:10.335739       1 audit.go:340] Using audit backend: ignoreErrors<log>
I1101 03:40:10.347632       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method "/oauth2callback/my"

goroutine 1 [running]:
net/http.(*ServeMux).register(...)
        net/http/server.go:2738
net/http.(*ServeMux).Handle(0x29844c0?, {0xc0008886a0?, 0x2984420?}, {0x2987fc0?, 0xc0006ff4a0?})
        net/http/server.go:2701 +0x56
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthenticationHandler(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:407 +0x11ad
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthorizeAuthenticationHandlers(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:243 +0x65
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).WithOAuth(0xc0006c28c0, {0x2982500, 0xc0000aca80})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:108 +0x21d
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth(0xc0006c28c0, {0x2982500?, 0xc0000aca80?}, 0xc000785888)
        github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:342 +0x45
k8s.io/apiserver/pkg/server.completedConfig.New.func1({0x2982500?, 0xc0000aca80?})
        k8s.io/apiserver@v0.29.2/pkg/server/config.go:825 +0x28
k8s.io/apiserver/pkg/server.NewAPIServerHandler({0x252ca0a, 0xf}, {0x2996020, 0xc000501a00}, 0xc0005d1740, {0x0, 0x0})
        k8s.io/apiserver@v0.29.2/pkg/server/handler.go:96 +0x2ad
k8s.io/apiserver/pkg/server.completedConfig.New({0xc000785888?, {0x0?, 0x0?}}, {0x252ca0a, 0xf}, {0x29b41a0, 0xc000171370})
        k8s.io/apiserver@v0.29.2/pkg/server/config.go:833 +0x2a5
github.com/openshift/oauth-server/pkg/oauthserver.completedOAuthConfig.New({{0xc0005add40?}, 0xc0006c28c8?}, {0x29b41a0?, 0xc000171370?})
        github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:322 +0x6a
github.com/openshift/oauth-server/pkg/cmd/oauth-server.RunOsinServer(0xc000451cc0?, 0xc000810000?, 0xc00061a5a0)
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/server.go:45 +0x73
github.com/openshift/oauth-server/pkg/cmd/oauth-server.(*OsinServerOptions).RunOsinServer(0xc00030e168, 0xc00061a5a0)
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:108 +0x259
github.com/openshift/oauth-server/pkg/cmd/oauth-server.NewOsinServerCommand.func1(0xc00061c300?, {0x251a8c8?, 0x4?, 0x251a8cc?})
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:46 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000780008, {0xc00058d6c0, 0x7, 0x7})
        github.com/spf13/cobra@v1.7.0/command.go:944 +0x867
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a3b08)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/component-base/cli.run(0xc0001a3b08)
        k8s.io/component-base@v0.29.2/cli/run.go:146 +0x290
k8s.io/component-base/cli.Run(0xc00061a5a0?)
        k8s.io/component-base@v0.29.2/cli/run.go:46 +0x17
main.main()
        github.com/openshift/oauth-server/cmd/oauth-server/main.go:46 +0x2de

Expected results:

No panic

Additional info:

Tried in old env like 4.16.20 with same steps, no panic:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.20   True        False         95m     Cluster version is 4.16.20

$ oc get po -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE    
oauth-openshift-7dfcd8c8fd-77ltf   1/1     Running   0          116s   
oauth-openshift-7dfcd8c8fd-sr97w   1/1     Running   0          89s    
oauth-openshift-7dfcd8c8fd-tsrff   1/1     Running   0          62s
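
For context, the path in the panic ("/oauth2callback/my Google idp") suggests an identity provider whose name contains spaces. A minimal sketch of an OAuth configuration that would produce such a callback path (all values other than the provider name are placeholders):

apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: my Google idp            # name with spaces, matching the path in the panic
    mappingMethod: claim
    type: Google
    google:
      clientID: <client-id>                 # placeholder
      clientSecret:
        name: google-client-secret          # placeholder secret name
      hostedDomain: example.com             # placeholder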

This is a clone of issue OCPBUGS-38114. The following is the description of the original issue:

Description of problem:

Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.

Version-Release number of selected component (if applicable):

    

How reproducible:

The installation procedure fails systematically when using a predefined VPC.

Steps to Reproduce:

    1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC
    2. Run `openshift-install create cluster ...` with OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true set (see the sketch after this list)
    3. The procedure fails: `failed to create load balancer`
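
A sketch of step 2 under the reported conditions (the install directory name is a placeholder):

export OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true
openshift-install create cluster --dir ./public-only-cluster --log-level=debug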
    

Actual results:

The installation procedure fails.

Expected results:

An OCP cluster to be provisioned in AWS, with public subnets only.    

Additional info:

    

ControlPlaneReleaseProvider is modifying the cached release image directly, which means the userReleaseProvider is still picking up and using the registry overrides for data-plane components.

This is a clone of issue OCPBUGS-18007. The following is the description of the original issue:

Description of problem:

When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Check the TelemeterClientFailures alerting rule's annotations (see the sketch after this list)
2.
3.
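
A quick way to perform step 1, without assuming which namespace or PrometheusRule object ships the rule:

# look for the TelemeterClientFailures rule and its annotations across all PrometheusRule objects
oc get prometheusrules -A -o yaml | grep -A 10 'alert: TelemeterClientFailures'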

Actual results:

No runbook_url annotation.

Expected results:

runbook_url annotation is present.

Additional info:

This is a consequence of a telemeter server outage that triggered questions from customers about the alert:
https://issues.redhat.com/browse/OHSS-25947
https://issues.redhat.com/browse/OCPBUGS-17966
Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797

Description of the problem:
When running the infrastructure operator, the local cluster is not being imported in ACM as expected.
 

How reproducible:
Run the infrastructure operator in ACM

Steps to reproduce:

1. Install ACM
2. Run the infrastructure operator

Actual results:
Local cluster entities are not created
 

Expected results:
Local cluster entities should be created

Description of problem:

A ServiceAccount is not deleted due to a race condition in the controller manager. When deleting the SA, this is logged in the controller manager:

2024-06-17T15:57:47.793991942Z I0617 15:57:47.793942       1 image_pull_secret_controller.go:233] "Internal registry pull secret auth data does not contain the correct number of entries" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" expected=3 actual=0
2024-06-17T15:57:47.794120755Z I0617 15:57:47.794080       1 image_pull_secret_controller.go:163] "Refreshing image pull secret" ns="test-qtreoisu" name="sink-eguqqiwm-dockercfg-vh8mw" serviceaccount="sink-eguqqiwm"

As a result, the Secret is updated and the ServiceAccount owning the Secret is updated by the controller via a server-side apply operation, as can be seen in the managedFields:

{
   "apiVersion":"v1",
   "imagePullSecrets":[
      {
         "name":"default-dockercfg-vdck9"
      },
      {
         "name":"kn-test-image-pull-secret"
      },
      {
         "name":"sink-eguqqiwm-dockercfg-vh8mw"
      }
   ],
   "kind":"ServiceAccount",
   "metadata":{
      "annotations":{
         "openshift.io/internal-registry-pull-secret-ref":"sink-eguqqiwm-dockercfg-vh8mw"
      },
      "creationTimestamp":"2024-06-17T15:57:47Z",
      "managedFields":[
         {
            "apiVersion":"v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
               "f:imagePullSecrets":{
                  
               },
               "f:metadata":{
                  "f:annotations":{
                     "f:openshift.io/internal-registry-pull-secret-ref":{
                        
                     }
                  }
               },
               "f:secrets":{
                  "k:{\"name\":\"sink-eguqqiwm-dockercfg-vh8mw\"}":{
                     
                  }
               }
            },
            "manager":"openshift.io/image-registry-pull-secrets_service-account-controller",
            "operation":"Apply",
            "time":"2024-06-17T15:57:47Z"
         }
      ],
      "name":"sink-eguqqiwm",
      "namespace":"test-qtreoisu",
      "resourceVersion":"104739",
      "uid":"eaae8d0e-8714-4c2e-9d20-c0c1a221eecc"
   },
   "secrets":[
      {
         "name":"sink-eguqqiwm-dockercfg-vh8mw"
      }
   ]
}"Events":{
   "metadata":{
      
   },
   "items":null
} 

The ServiceAccount then hangs there and is NOT deleted.

We have seen this only on OCP 4.16 (not on older versions), but already several times, for example in this CI run, which also has must-gather logs that can be investigated.

Another run is here

The controller code is new in 4.16 and it seems to be a regression.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-14-130320

How reproducible:

It happens sometimes in our CI runs where we want to delete a ServiceAccount but it hangs there. The test tries to delete it only once; it does not retry.

Steps to Reproduce:

The following reproducer works for me. Some service accounts keep hanging there after running the script:

#!/usr/bin/env bash

kubectl create namespace test

# Create and delete 100 service accounts in parallel; deletion is attempted only
# after the controller has annotated the SA with its dockercfg secret reference.
for i in `seq 100`; do
	(
		kubectl create sa "my-sa-${i}" -n test
		# wait until the image-registry pull-secret controller has processed the SA
		kubectl wait --for=jsonpath="{.metadata.annotations.openshift\\.io/internal-registry-pull-secret-ref}" sa/my-sa-${i} -n test
		kubectl delete sa/my-sa-${i} -n test
		# some of these waits time out because the SA is never actually deleted
		kubectl wait --for=delete sa/my-sa-${i} -n test --timeout=60s
	)&
done

# wait for all background subshells
wait

Actual results:

ServiceAccount not deleted

Expected results:

ServiceAccount deleted

Additional info:

 

Monitor test for nodes should fail when nodes go ready=false unexpectedly.
Monitor test for nodes should fail when the unreachable taint is placed on them.

Getting this into release-4.17.

This is a clone of issue OCPBUGS-36479. The following is the description of the original issue:

Description of problem:

    As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate.

However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100%

Steps to Reproduce:

   $ oc get featuregates.config.openshift.io cluster -oyaml 
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>     

Actual results:

    Both RouteExternalCertificate and ExternalRouteCertificate were added in the API

Expected results:

We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html

Additional info:

 Git commits

https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3

https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930

Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219

This is a clone of issue OCPBUGS-38409. The following is the description of the original issue:

Update our CPO and HO dockerfiles to use appropriate base image versions.

Description of problem:

Gathering bootstrap log bundles has been failing in CI with:

     level=error msg=Attempted to gather debug logs after installation failure: must provide bootstrap host address 

Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-installer-8427-ci-4.17-e2e-aws-ovn/1792972823245885440

Version-Release number of selected component (if applicable):

    

How reproducible:

    Not reproducible; this is a race condition when serializing the machine manifests to disk.

Steps to Reproduce:

    Can't reproduce locally; needs to be verified in CI.
    

Actual results:

can't pull bootstrap log bundle    

Expected results:

    grabs bootstrap log bundle

Additional info:
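
For reference, when the installer cannot determine the bootstrap host automatically, the log bundle can still be gathered by passing the addresses explicitly; a sketch (the IPs and directory are placeholders):

openshift-install gather bootstrap --dir ./install-dir \
  --bootstrap 10.0.0.5 --master 10.0.0.6 --master 10.0.0.7 --master 10.0.0.8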

    

 

This is a clone of issue OCPBUGS-41617. The following is the description of the original issue:

TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly over the 90th percentile, we picked it up at P95 where it shows a consistent 5-8s more than we'd expect given the data in 4.16 ga.

The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

Description of problem:

Checked in 4.17.0-0.nightly-2024-09-18-003538: the default thanos-ruler retention time is 24h, not the 15d mentioned in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.17/Documentation/api.md#thanosrulerconfig. The issue exists in 4.12+.

$ for i in $(oc -n openshift-user-workload-monitoring get sts --no-headers | awk '{print $1}'); do echo $i; oc -n openshift-user-workload-monitoring get sts $i -oyaml | grep retention; echo -e "\n"; done
prometheus-user-workload
        - --storage.tsdb.retention.time=24h

thanos-ruler-user-workload
        - --tsdb.retention=24h

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-18-003538    

How reproducible:

always

Steps to Reproduce:

1. see the description

Actual results:

api.md documents the default thanos-ruler retention time as 15d.

Expected results:

api.md should document it as 24h, the actual default.

Additional info:

    

This is a clone of issue OCPBUGS-38558. The following is the description of the original issue:

Description of problem:

Remove the extra "." from the INFO message below when running the add-nodes workflow

INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z 

The INFO message is visible inside the container which runs the node joiner when using the oc adm command

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Run  oc adm node-image create command to create a node iso
    2. See the INFO message at the end
    3.
    

Actual results:

 INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z   

Expected results:

    INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z 

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1059

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During a security audit, questions were raised about why a number of our containers run privileged. The short answer is that they are doing things that require more permissions than a regular container, but what is not clear is whether we could accomplish the same thing by adding individual capabilities. If it is not necessary to run them fully privileged then we should stop doing that. If it is necessary for some reason we'll need to document why the container must be privileged.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Refactor name to Dockerfile.ocp as a better alternative to Dockerfile.rhel7 since contents are actually rhel9.

This is a clone of issue OCPBUGS-39111. The following is the description of the original issue:

Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster-scoped resources in the Insights data. These CRs are introduced by the NMState operator.

Please review the following PR: https://github.com/openshift/console/pull/13886

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Azure HostedClusters are failing in OCP 4.17 due to issues with the cluster-storage-operator.
- lastTransitionTime: "2024-05-29T19:58:39Z"
          message: 'Unable to apply 4.17.0-0.nightly-multi-2024-05-29-121923: the cluster operator storage is not available'
          observedGeneration: 2
          reason: ClusterOperatorNotAvailable
          status: "True"
          type: ClusterVersionProgressing  
I0529 20:05:21.547544       1 status_controller.go:218] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2024-05-29T20:02:00Z","message":"AzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: \"node_service.yaml\" (string): namespaces \"clusters-test-case4\" not found\nAzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: ","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverGuestStaticResourcesController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2024-05-29T20:04:15Z","message":"AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"True","type":"Progressing"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"False","type":"Available"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2024-05-29T19:59:00Z","reason":"NoData","status":"Unknown","type":"EvaluationConditionsDetected"}]}} I0529 20:05:21.566215       1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"azure-cloud-controller-manager", UID:"205a4307-67e4-481e-9fee-975b2c5c40fb", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nAzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"

 

On the HostedCluster itself, these errors with the csi pods not coming up are:

% k describe pod/azure-disk-csi-driver-node-5hb24 -n openshift-cluster-csi-drivers | grep fail
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Liveness:     http-get http://:rhealthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Warning  FailedMount  2m (x28 over 42m)  kubelet            MountVolume.SetUp failed for volume "metrics-serving-cert" : secret "azure-disk-csi-driver-node-metrics-serving-cert" not found  

There was an error with the CO as well:

storage                                    4.17.0-0.nightly-multi-2024-05-29-121923   False       True          True       49m     AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service  

 

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Every time

Steps to Reproduce:

    1. Create a HC with a 4.17 nightly
    

Actual results:

    Azure HC does not complete; nodes do join NodePool though

Expected results:

    Azure HC should complete

Additional info:

    

Description of problem:

Occasionally, the TestMCDGetsMachineOSConfigSecrets e2e test fails during the e2e-gcp-op-techpreview CI job run. The reason for this failure is that the MCD pod is restarted when the MachineOSConfig is created, because it must be made aware of the new secrets that the MachineOSConfig expects. The final portion of the test uses the MCD as a bridge to determine whether the expected actions have occurred. Without the MCD pod containers in a running / ready state, this operation fails.

Version-Release number of selected component (if applicable):

    

How reproducible:

Variable

Steps to Reproduce:

1. Run the e2e-gcp-op-techpreview job on 4.17+

Actual results:

The TestMCDGetsMachineOSConfigSecrets test fails because it cannot get the expected config file from the targeted node.    

Expected results:

The test should pass.

Additional info:

This was discovered and fixed in 4.16 during the backport of the PR that introduced this problem. Consequently, this bug only covers 4.17+.

This is a clone of issue OCPBUGS-43764. The following is the description of the original issue:

Description of problem:

IBM ROKS uses Calico as their CNI. In previous versions of OpenShift, OpenShiftSDN would create IPTable rules that would force local endpoint for DNS Service. 

Starting in OCP 4.17, with the removal of OpenShiftSDN, IBM ROKS is not using OVN-K and therefore the local endpoint for the DNS service is not working as expected. 

IBM ROKS is asking that the code block below be restored, to recover the functionality previously seen in OCP 4.16:

https://github.com/openshift/sdn/blob/release-4.16/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L979-L992

Without this functionality IBM ROKS is not able to GA OCP 4.17

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/174

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/356

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-35054. The following is the description of the original issue:

Description of problem:

Create VPC and subnets with following configs [refer to attached CF template]:
Subnets (subnets-pair-default) in CIDR 10.0.0.0/16
Subnets (subnets-pair-134) in CIDR 10.134.0.0/16
Subnets (subnets-pair-190) in CIDR 10.190.0.0/16

Create cluster into subnets-pair-134, the bootstrap process fails [see attached log-bundle logs]:

level=debug msg=I0605 09:52:49.548166 	937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}]
level=debug msg=I0605 09:52:49.909861 	937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com"
level=debug msg=I0605 09:52:53.483058 	937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync
level=debug msg=Fetching Bootstrap SSH Key Pair...

Checking security groups:
<infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623
<infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443)

are these settings correct?

    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-03-060250
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create subnets using the attached CF template
    2. Create cluster into subnets which CIDR is 10.134.0.0/16
    3.
    

Actual results:

Bootstrap process fails.
    

Expected results:

Bootstrap succeeds.
    

Additional info:

No issues if creating cluster into subnets-pair-default (10.0.0.0/16)
No issues if only one CIDR in VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml

    

Description of problem:

The ingress operator E2E tests are perma-failing with a prometheus service account issue:
=== CONT  TestAll/parallel/TestRouteMetricsControllerRouteAndNamespaceSelector
    route_metrics_test.go:86: prometheus service account not found
=== CONT  TestAll/parallel/TestRouteMetricsControllerOnlyNamespaceSelector
    route_metrics_test.go:86: prometheus service account not found
=== CONT  TestAll/parallel/TestRouteMetricsControllerOnlyRouteSelector
    route_metrics_test.go:86: prometheus service account not found

We need to bump openshift/library-go to pick up https://github.com/openshift/library-go/pull/1697, which updates the NewPrometheusClient function to switch from a legacy service account API to the TokenRequest API.

Version-Release number of selected component (if applicable):

    4.16, 4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Run e2e-[aws|gcp|azure]-operator E2E tests on cluster-ingress-operator

Actual results:

     route_metrics_test.go:86: prometheus service account not found 

Expected results:

    No failure

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-41184. The following is the description of the original issue:

Description of problem:

    The disk and instance types for gcp machines should be validated further. The current implementation provides validation for each individually, but the disk types and instance types should be checked against each other for valid combinations.

The attached spreadsheet displays the combinations of valid disk and instance types.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After destroying the cluster, some files are still left over in <install-dir>/.clusterapi_output:
$ ls -ltra
total 1516
drwxr-xr-x. 1 fedora fedora     596 Jun 17 03:46 ..
drwxr-x---. 1 fedora fedora      88 Jun 17 06:09 .clusterapi_output
-rw-r--r--. 1 fedora fedora 1552382 Jun 17 06:09 .openshift_install.log
drwxr-xr-x. 1 fedora fedora      80 Jun 17 06:09 . 
$ ls -ltr .clusterapi_output/
total 40
-rw-r--r--. 1 fedora fedora  2335 Jun 17 05:58 envtest.kubeconfig
-rw-r--r--. 1 fedora fedora 20542 Jun 17 06:03 kube-apiserver.log
-rw-r--r--. 1 fedora fedora 10656 Jun 17 06:03 etcd.log

Then continue installing new cluster within same install dir, installer exited with error as below:
$ ./openshift-install create cluster --dir ipi-aws
INFO Credentials loaded from the "default" profile in file "/home/fedora/.aws/credentials" 
INFO Consuming Install Config from target directory 
FATAL failed to fetch Cluster: failed to load asset "Cluster": local infrastructure provisioning artifacts already exist. There may already be a running cluster 


After removing .clusterapi_output/envtest.kubeconfig, and creating cluster again, installation is continued.

Version-Release number of selected component (if applicable):

4.16 nightly build

How reproducible:

always

Steps to Reproduce:

1. Launch capi-based installation
2. Destroy cluster
3. Launch new cluster within same install dir

Actual results:

Fail to launch new cluster within the same install dir, because .clusterapi_output/envtest.kubeconfig is still there.

Expected results:

Succeed to create a new cluster within the same install dir

Additional info:

 

Component Readiness has found a potential regression in the following test:

[sig-network] pods should successfully create sandboxes by adding pod to network

Probability of significant regression: 99.93%

Sample (being evaluated) Release: 4.17
Start Time: 2024-07-12T00:00:00Z
End Time: 2024-07-18T23:59:59Z
Success Rate: 74.29%
Successes: 25
Failures: 9
Flakes: 1

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.18%
Successes: 54
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=minor&Upgrade=minor&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform&columnGroupBy=Architecture&columnGroupBy=Network&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20unknown%20ha%20minor&ignoreDisruption=1&ignoreMissing=0&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-07-18%2023%3A59%3A59&samplePRNumber=&samplePROrg=&samplePRRepo=&sampleRelease=4.17&sampleStartTime=2024-07-12%2000%3A00%3A00&testId=openshift-tests%3A65e48733eb0b6115134b2b8c6a365f16&testName=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network

This test appears to be failing roughly 50% of the time on periodic-ci-openshift-release-master-nightly-4.17-upgrade-from-stable-4.16-e2e-metal-ipi-ovn-upgrade and the error looks workable:

 [sig-network] pods should successfully create sandboxes by adding pod to network expand_less 	0s
{  1 failures to create the sandbox

namespace/e2e-test-ns-global-srg5f node/worker-1 pod/test-ipv6-podtm8vn hmsg/da5d303f42 - never deleted - firstTimestamp/2024-07-18T11:26:41Z interesting/true lastTimestamp/2024-07-18T11:26:41Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-ipv6-podtm8vn_e2e-test-ns-global-srg5f_65c4722e-d832-4ec8-8209-39587a81d95d_0(d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f): error adding pod e2e-test-ns-global-srg5f_test-ipv6-podtm8vn to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f" Netns:"/var/run/netns/7bb7a08a-9352-49d6-a211-02046349dba6" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=e2e-test-ns-global-srg5f;K8S_POD_NAME=test-ipv6-podtm8vn;K8S_POD_INFRA_CONTAINER_ID=d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f;K8S_POD_UID=65c4722e-d832-4ec8-8209-39587a81d95d" Path:"" ERRORED: error configuring pod [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn] networking: [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn/65c4722e-d832-4ec8-8209-39587a81d95d:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-test-ns-global-srg5f/test-ipv6-podtm8vn d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f network default NAD default] [e2e-test-ns-global-srg5f/test-ipv6-podtm8vn d11ec24638e2d578486e57851a419e52ddd4367d48b33e46825f7c42687c9f7f network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:e3 [10.131.0.227/23]
'
': StdinData: {"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}}

Taken from: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-upgrade-from-stable-4.16-e2e-metal-ipi-ovn-upgrade/1813846390107803648

Description of problem:

    EC2 instances are failing to launch via MAPI because the instance profile set in the MachineSet config is invalid; it was not created by the installer.

~~~
  errorMessage: "error launching instance: Value (ci-op-ikqrdc6x-cc206-bmcnx-edge-profile) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name"
  errorReason: InvalidConfiguration
~~~

Version-Release number of selected component (if applicable):

    4.17+

How reproducible:

Always

Steps to Reproduce:

1. Set the edge compute pool in the installer, without setting a custom instance profile (see the install-config excerpt after this list)
2. Create a cluster
3. 
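
A sketch of the relevant install-config.yaml excerpt for step 1; the zone name is a placeholder Local Zone and no iamInstanceProfile is set:

compute:
- name: edge
  replicas: 1
  platform:
    aws:
      zones:
      - us-east-1-nyc-1a     # placeholder Local Zone
- name: worker
  replicas: 3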
    

Actual results:

    

Expected results:

instance created in edge zone    

Additional info:

- IAM Profile feature: https://github.com/openshift/installer/pull/8689/files#diff-e46d61c55e5e276e3c264d18cba0346777fe3e662d0180a173001b8282af7c6eR51-R54
- CI failures: https://sippy.dptools.openshift.org/sippy-ng/jobs/Presubmits/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones%22%7D%5D%7D
- Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-edge-zones-manifest-validation%22%7D%5D%7D&sortField=timestamp&sort=desc
- Slack thread: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1722964067282219

 

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/205

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/sdn/pull/623

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


The "0000_90_olm_00-service-monitor.yaml" manifest containing RBAC for Prometheus to scrape OLM namespace is omitted https://github.com/openshift/hypershift/blob/e9594f47570b557877009b607d26b9cb4a34f233/control-plane-operator/controllers/hostedcontrolplane/cvo/reconcile.go#L66

But "0000_50_olm_06-psm-operator.servicemonitor.yaml" containing a new OLM ServiceMonitor that was added in https://github.com/openshift/operator-framework-olm/pull/551/files is still deployed, which make Prometheus logs failures: see https://issues.redhat.com/browse/OCPBUGS-36299

    

Version-Release number of selected component (if applicable):


    

How reproducible:

    Check Prometheus logs in any 4.17 hosted cluster
    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:

    Prometheus shouldn't be asked to discover/scrape targets without giving it the appropriate RBAC
   
   If that new ServiceMonitor is needed, the appropriate RBAC should be deployed, if not, the ServiceMonitor should be omitted.

    Maybe Hypershift should use an opt-in approach instead of opt-out for OLM resources, to avoid such issues in the future. 
    

Additional info:


    

Description of problem:

Httpd icon does not show up in git import flow

Version-Release number of selected component (if applicable):

4.16

How reproducible:

always

Steps to Reproduce:

1. Navigate to import from git
2. Fill in any github url
3. Edit import strategy, notice that httpd icon is missing

Actual results:

Icon isn't there

Expected results:

It is there

Additional info:

 

 

Description of problem:

   4.16 NodePool CEL validation breaking existing/older NodePools

Version-Release number of selected component (if applicable):

   4.16.0

How reproducible:

   100%

Steps to Reproduce:

    1. Deploy 4.16 NodePool CRDs
    2. Create a NodePool resource without spec.replicas and spec.autoScaling (see the sketch after this list)
    3. 
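
A minimal sketch of step 2 with neither spec.replicas nor spec.autoScaling set; apart from the NodePool name taken from the error message below, the namespace and remaining field values are illustrative:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: 22276350-mynodepool
  namespace: clusters                # illustrative namespace
spec:
  clusterName: "22276350"            # illustrative hosted cluster name
  # neither replicas nor autoScaling is set, which the 4.16 CEL rule rejects
  management:
    upgradeType: Replace
  platform:
    type: AWS
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64   # illustrative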
    

Actual results:

    The NodePool "22276350-mynodepool" is invalid: spec: Invalid value: "object": One of replicas or autoScaling should be set but not both

Expected results:

    NodePool to apply successfully

Additional info:

    Breaking change: https://github.com/openshift/hypershift/pull/3786

This is a clone of issue OCPBUGS-42584. The following is the description of the original issue:

Description of problem:

Red Hat Camel K installation should be possible via the CLI
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. Check for operator installation through the CLI (see the sketch after this list)
    2. Check for any post-installation needed
    3.
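
A sketch of what a CLI-based installation check could look like, using a standard OLM Subscription; the package name and channel below are assumptions and should be verified against the catalog:

# install via a Subscription in openshift-operators (which has a global OperatorGroup by default)
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: red-hat-camel-k          # assumed package name
  namespace: openshift-operators
spec:
  channel: latest                # assumed channel
  name: red-hat-camel-k          # assumed package name
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF

# verify the operator installed and check for any post-installation steps
oc get csv -n openshift-operators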
    

Actual results:


    

Expected results:


    

Additional info:


    

This is a clone of issue OCPBUGS-39320. The following is the description of the original issue:

The original logic of the test is checking for a condition that can never happen. Moreover, the test is comparing the reserved CPUs with the entire content of the irqbalance config file, which is not great; there were accidental matches between comments and CPU no. 1.

Remove the checking of reserved CPUs in /etc/sysconfig/irqbalance, as in the current Performance Profile deployment reserved CPUs are never added to the irqbalance config file.

Description of problem:

When building https://github.com/kubevirt-ui/kubevirt-plugin from its release-4.16 branch, the following warnings are issued during the webpack build:

WARNING in shared module react

No required version specified and unable to automatically determine one. Unable to find required version for "react" in description file (/home/vszocs/work/kubevirt-plugin/node_modules/react/package.json). It need to be in dependencies, devDependencies or peerDependencies.

These warnings should not appear during the plugin build.

Root cause seems to be webpack module federation code which attempts to auto-detect actual build version of shared modules, but this code seems to be unreliable and warnings such as the one above are anything but helpful.

How reproducible: always on kubevirt-plugin branch release-4.16

Steps to Reproduce:

1. git clone https://github.com/kubevirt-ui/kubevirt-plugin
2. cd kubevirt-plugin
3. yarn && yarn dev

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/92

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    After upgrading to OpenShift 4.14, the must-gather took much longer than before.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    Always

Steps to Reproduce:

    1. Run oc adm must-gather
    2. Wait for it to complete
    3.
    

Actual results:

    For a cluster with around 50 nodes, the must-gather took about 30 minutes.

Expected results:

   For a cluster with around 50 nodes, the must-gather can finish in about 10 minutes.

Additional info:

    It seems the gather_ppc collection script is related here.

https://github.com/openshift/must-gather/blob/release-4.14/collection-scripts/gather_ppc

This is a clone of issue OCPBUGS-38349. The following is the description of the original issue:

Description of problem:

When configuring an OpenID idp that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the idp fails.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
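
A sketch of such an idp configuration on the HostedCluster (an excerpt; the provider name, client and secret values are placeholders):

spec:
  configuration:
    oauth:
      identityProviders:
      - name: keycloak
        mappingMethod: claim
        type: OpenID
        openID:
          clientID: my-client                      # placeholder
          clientSecret:
            name: keycloak-client-secret           # placeholder secret name
          issuer: https://keycloak.keycloak.svc    # resolvable only from the data plane
          claims:
            preferredUsername:
            - preferred_username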
    

Actual results:

    The oauth server fails to be reconciled

Expected results:

    The oauth server reconciles and functions properly

Additional info:

    Follow up to OCPBUGS-37753

Description of problem:

ARO cluster fails to install with disconnected networking.
We see master node bootup hang on the machine-config-daemon-pull.service unit. Logs from the service indicate it cannot reach the public IP of the image registry. In ARO, image registries need to go via a proxy. Dnsmasq is used to inject proxy DNS answers, but machine-config-daemon-pull.service starts before ARO's dnsmasq.service.

Version-Release number of selected component (if applicable):

4.14.16

How reproducible:

Always

Steps to Reproduce:

For Fresh Install:
1. Create the required ARO vnet and subnets
2. Attach a route table to the subnets with a blackhole route 0.0.0.0/0
3. Create 4.14 ARO cluster with --apiserver-visibility=Private --ingress-visibility=Private --outbound-type=UserDefinedRouting

[OR]

Post Upgrade to 4.14:
1. Create a ARO 4.13 UDR.
2. ClusterUpgrade the cluster 4.13-> 4.14 , upgrade was successful
3. Create a new node (scale up), we run into the same issue. 

Actual results:

For Fresh Install of 4.14:
ERROR: (InternalServerError) Deployment failed.

[OR]

Post Upgrade to 4.14:
Node doesn't come into a Ready State and Machine is stuck in Provisioned status.

Expected results:

Succeeded 

Additional info:
We see in the node logs that machine-config-daemon-pull.service is unable to reach the image registry. ARO's dnsmasq was not yet started.
Previously, systemd ordering was set for ovs-configuration.service to start after (ARO's) dnsmasq.service. Perhaps that should have gone on machine-config-daemon-pull.service.
See https://issues.redhat.com/browse/OCPBUGS-25406.
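
A sketch of the kind of ordering being suggested, expressed as a systemd drop-in for machine-config-daemon-pull.service (the drop-in path and how ARO would actually deliver it, e.g. via Ignition/MachineConfig, are assumptions):

# /etc/systemd/system/machine-config-daemon-pull.service.d/10-after-dnsmasq.conf
[Unit]
After=dnsmasq.service
Wants=dnsmasq.service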

This is a clone of issue OCPBUGS-44163. The following is the description of the original issue:

Description of problem:

We identified a regression where we can no longer get oauth tokens for HyperShift v4.16 clusters via the OpenShift web console. v4.16.10 works fine, but once clusters are patched to v4.16.16 (or are created at that version) they fail to get the oauth token. 

This is due to this faulty PR: https://github.com/openshift/hypershift/pull/4496.

The oauth openshift deployment was changed and affected the IBM Cloud code path.  We need this endpoint to change back to using `socks5`.

Bug:
<           value: socks5://127.0.0.1:8090
---
>           value: http://127.0.0.1:8092
98c98
<           value: socks5://127.0.0.1:8090
---
>           value: http://127.0.0.1:8092
Fix:
Change http://127.0.0.1:8092 to socks5://127.0.0.1:8090

 

 

Version-Release number of selected component (if applicable):

4.16.16

How reproducible:

Every time.

Steps to Reproduce:

    1. Create ROKS v4.16.16 HyperShift-based cluster.
    2. Navigate to the OpenShift web console.
    3. Click IAM#<username> menu in the top right.
    4. Click 'Copy login command'.
    5. Click 'Display token'.
    

Actual results:

Error getting token: Post "https://example.com:31335/oauth/token": http: server gave HTTP response to HTTPS client    

Expected results:

The oauth token should be successfully displayed.

Additional info:

    

Description of problem:
Multicast packets got 100% dropped

Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-02-202327
How reproducible:
Always

Steps to Reproduce:

1. Create a test namespace and enable multicast

oc  describe ns test   
Name:         test
Labels:       kubernetes.io/metadata.name=test
              pod-security.kubernetes.io/audit=restricted
              pod-security.kubernetes.io/audit-version=v1.24
              pod-security.kubernetes.io/enforce=restricted
              pod-security.kubernetes.io/enforce-version=v1.24
              pod-security.kubernetes.io/warn=restricted
              pod-security.kubernetes.io/warn-version=v1.24
Annotations:  k8s.ovn.org/multicast-enabled: true
              openshift.io/sa.scc.mcs: s0:c28,c27
              openshift.io/sa.scc.supplemental-groups: 1000810000/10000
              openshift.io/sa.scc.uid-range: 1000810000/10000
Status:       Active

No resource quota.

No LimitRange resource.

2. Created multicast pods

% oc get pods -n test -o wide
NAME             READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
mcast-rc-67897   1/1     Running   0          10s   10.129.2.42   ip-10-0-86-58.us-east-2.compute.internal    <none>           <none>
mcast-rc-ftsq8   1/1     Running   0          10s   10.128.2.61   ip-10-0-33-247.us-east-2.compute.internal   <none>           <none>
mcast-rc-q48db   1/1     Running   0          10s   10.131.0.27   ip-10-0-1-176.us-east-2.compute.internal    <none>           <none>

3. Test mulicast traffic with omping from two pods

% oc rsh -n test mcast-rc-67897  
~ $ 
~ $ omping -c10 10.129.2.42 10.128.2.61 
10.128.2.61 : waiting for response msg
10.128.2.61 : joined (S,G) = (*, 232.43.211.234), pinging
10.128.2.61 :   unicast, seq=1, size=69 bytes, dist=2, time=0.506ms
10.128.2.61 :   unicast, seq=2, size=69 bytes, dist=2, time=0.595ms
10.128.2.61 :   unicast, seq=3, size=69 bytes, dist=2, time=0.555ms
10.128.2.61 :   unicast, seq=4, size=69 bytes, dist=2, time=0.572ms
10.128.2.61 :   unicast, seq=5, size=69 bytes, dist=2, time=0.614ms
10.128.2.61 :   unicast, seq=6, size=69 bytes, dist=2, time=0.653ms
10.128.2.61 :   unicast, seq=7, size=69 bytes, dist=2, time=0.611ms
10.128.2.61 :   unicast, seq=8, size=69 bytes, dist=2, time=0.594ms
10.128.2.61 :   unicast, seq=9, size=69 bytes, dist=2, time=0.603ms
10.128.2.61 :   unicast, seq=10, size=69 bytes, dist=2, time=0.687ms
10.128.2.61 : given amount of query messages was sent

10.128.2.61 :   unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.506/0.599/0.687/0.050
10.128.2.61 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

% oc rsh -n test mcast-rc-ftsq8
~ $ omping -c10 10.128.2.61  10.129.2.42
10.129.2.42 : waiting for response msg
10.129.2.42 : waiting for response msg
10.129.2.42 : waiting for response msg
10.129.2.42 : waiting for response msg
10.129.2.42 : joined (S,G) = (*, 232.43.211.234), pinging
10.129.2.42 :   unicast, seq=1, size=69 bytes, dist=2, time=0.463ms
10.129.2.42 :   unicast, seq=2, size=69 bytes, dist=2, time=0.578ms
10.129.2.42 :   unicast, seq=3, size=69 bytes, dist=2, time=0.632ms
10.129.2.42 :   unicast, seq=4, size=69 bytes, dist=2, time=0.652ms
10.129.2.42 :   unicast, seq=5, size=69 bytes, dist=2, time=0.635ms
10.129.2.42 :   unicast, seq=6, size=69 bytes, dist=2, time=0.626ms
10.129.2.42 :   unicast, seq=7, size=69 bytes, dist=2, time=0.597ms
10.129.2.42 :   unicast, seq=8, size=69 bytes, dist=2, time=0.618ms
10.129.2.42 :   unicast, seq=9, size=69 bytes, dist=2, time=0.964ms
10.129.2.42 :   unicast, seq=10, size=69 bytes, dist=2, time=0.619ms
10.129.2.42 : given amount of query messages was sent

10.129.2.42 :   unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 0.463/0.638/0.964/0.126
10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

Actual results:
Mulicast packets loss is 100%
10.129.2.42 : multicast, xmt/rcv/%loss = 10/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

Expected results:
Should no 100% packet loss.

Additional info:
No such issue in 4.15, tested on same profile ipi-on-aws/versioned-installer-ci with 4.15.0-0.nightly-2024-05-31-131420, same operation with above steps.

The output for both mulicast pods:

10.131.0.27 :   unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.176/1.239/1.269/0.027
10.131.0.27 : multicast, xmt/rcv/%loss = 10/9/9% (seq>=2 0%), min/avg/max/std-dev = 1.227/1.304/1.755/0.170
and 
10.129.2.16 :   unicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.101/1.264/1.321/0.065
10.129.2.16 : multicast, xmt/rcv/%loss = 10/10/0%, min/avg/max/std-dev = 1.230/1.351/1.890/0.191

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

This is a clone of issue OCPBUGS-42563. The following is the description of the original issue:

Description of problem

During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.

When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:

  • master-0 in AZ *a
  • master-1 in AZ *b
  • master-2 in AZ *c

However, the Machine manifest for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).

When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.

This results in a cluster with three control plane nodes (master-0 and master-2 have no backing Machines), three control plane Machines (only master-1 has a Node link; the other two are listed in the Provisioned state with no Nodes), and five GCP VMs for these control plane nodes:

  • master-0 in AZ *a
  • master-0 in AZ *c
  • master-1 in AZ *b
  • master-2 in AZ *a
  • master-2 in AZ *c

This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.

4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.

Version-Release number of selected component

4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.

How reproducible

100%

Steps to Reproduce

I'm unsure how to replicate this in a vanilla cluster install, but via OSD:

  1. Create a multi-az cluster in one of the reported regions, with a supplied GCP project (not the core OSD shared project), i.e. CCS, or "Customer Cloud Subscription".

Example:

$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp

Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce.

Actual results

Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.

Expected results

A standard 3 control-plane-node cluster is created.

Additional info

We're unsure what it is about the two reported Zones or the difference between the primary OSD GCP project and customer-supplied Projects that has an effect.

The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:

{
  "controlPlane": [
    "us-west2-a",
    "us-west2-b",
    "us-west2-c"
  ],
  "compute": [
    "us-west2-c",     <--- inverted order.  Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow?
    "us-west2-b",
    "us-west2-a"
  ],
  "platform": {
    "defaultMachinePlatform": {  <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here
      "osDisk": {
        "DiskSizeGB": 0,
        "diskType": ""
      },
      "secureBoot": "Enabled",
      "type": ""
    },
    "projectID": "anishpatel",
    "region": "us-west2"
  }
}

Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests followed by a grep -r zones: over the output (sketched below), without having to wait for an actual install attempt to come up and fail.
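A minimal sketch of that manifest-level check (assuming the default manifest layout, where the control plane Machine manifests land under the openshift/ directory; the asset directory name is arbitrary):

$ openshift-install create manifests --dir ./assets
$ grep -r "zone" ./assets/openshift/99_openshift-cluster-api_master-machines-*.yaml

On an affected configuration, the zones for master-0 and master-2 would show the swapped *a/*c ordering described above.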

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/280

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The PowerVS CI uses the installer image to do some necessary setup.  The openssl binary was recently removed from that image.  So we need to switch to the upi-installer image.
    

Version-Release number of selected component (if applicable):

4.17
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Look at CI runs
    

This is a clone of issue OCPBUGS-38813. The following is the description of the original issue:

Description of problem:

  OLM 4.17 references 4.16 catalogs  

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. oc get pods -n openshift-marketplace -o yaml | grep "image: registry.redhat.io"
    

Actual results:

      image: registry.redhat.io/redhat/certified-operator-index:v4.16
      image: registry.redhat.io/redhat/certified-operator-index:v4.16
      image: registry.redhat.io/redhat/community-operator-index:v4.16
      image: registry.redhat.io/redhat/community-operator-index:v4.16
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
      image: registry.redhat.io/redhat/redhat-operator-index:v4.16
      image: registry.redhat.io/redhat/redhat-operator-index:v4.16

Expected results:

      image: registry.redhat.io/redhat/certified-operator-index:v4.17
      image: registry.redhat.io/redhat/certified-operator-index:v4.17
      image: registry.redhat.io/redhat/community-operator-index:v4.17
      image: registry.redhat.io/redhat/community-operator-index:v4.17
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
      image: registry.redhat.io/redhat/redhat-operator-index:v4.17
      image: registry.redhat.io/redhat/redhat-operator-index:v4.17

Additional info:
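For reference, a quick way to list the images referenced by the default CatalogSources directly (rather than grepping the marketplace pods):

$ oc get catalogsources -n openshift-marketplace \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.image}{"\n"}{end}'

On an affected 4.17 cluster these still point at the :v4.16 index images shown above.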

    

This is a clone of issue OCPBUGS-38802. The following is the description of the original issue:

Description of problem:

    Infrastructure object with platform None is ignored by node-joiner tool

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Run the node-joiner add-nodes command
    

Actual results:

    Currently the node-joiner tool retrieves the platform type from the kube-system/cluster-config-v1 config map

Expected results:

Retrieve the platform type from the infrastructure cluster object

Additional info:
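A quick way to see where the expected value lives (on a platform-None cluster the command below returns None):

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}'
None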

    

Description of problem:

Version-Release number of selected component (if applicable):

There is an intermittent issue with the UploadImage() implementation in github.com/nutanix-cloud-native/prism-go-client@v0.3.4, on which the OCP installer depends. When testing the OCP installer with ClusterAPIInstall=true, I frequently hit an error from UploadImage() when uploading the bootstrap image to PC from the local image file.

The error logs:
INFO creating the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8), taskUUID: c8eafd49-54e2-4fb9-a3df-c456863d71fd.
INFO created the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8).
INFO preparing to upload the bootstrap image demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso (uuid: 75694edf-f9c4-4d9a-9a44-731a4d103cc8) data from file /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso
ERROR failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: {
ERROR   "api_version": "3.1",
ERROR   "code": 400,
ERROR   "message_list": [
ERROR   {
ERROR     "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete",
ERROR     "reason": "INVALID_ARGUMENT"
ERROR    }
ERROR   ],
ERROR   "state": "ERROR"
ERROR }
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: failed to upload the bootstrap image data "demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso" from filepath /Users/yanhuali/Library/Caches/openshift-installer/image_cache/demo-ocp-cluster-g1-lrmwb-bootstrap-ign.iso: status: 400 Bad Request, error-response: {
ERROR   "api_version": "3.1",
ERROR   "code": 400,
ERROR   "message_list": [
ERROR   {
ERROR     "message": "Given input is invalid. Image 75694edf-f9c4-4d9a-9a44-731a4d103cc8 is already complete",
ERROR     "reason": "INVALID_ARGUMENT"
ERROR    }
ERROR   ],
ERROR   "state": "ERROR"
ERROR }

The OCP installer code calling the prism-go-client function UploadImage() is here: https://github.com/openshift/installer/blob/master/pkg/infrastructure/nutanix/clusterapi/clusterapi.go#L172-L207

How reproducible:

Use OCP IPI 4.16 to provision a Nutanix OCP cluster with the install-config ClusterAPIInstall=true. This is an intermittent issue, so you need to repeat the test several times to reproduce.

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The installer intermittently failed at uploading the bootstrap image data to PC from the local image data file.

Expected results:

The installer successfully creates the Nutanix OCP cluster with the install-config ClusterAPIInstall=true.

Additional info:

    

Description of problem:

As a user, when I manually type in a git repo URL, the console sends tens of unnecessary API calls to the git provider, which makes me hit the rate limit very quickly and reduces my productivity.

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

Always

Steps to Reproduce:

1. Developer perspective > +Add > Import from Git
2. Open devtools and switch to the networking tab
3. Start typing a GitHub link

Actual results:

There are many API calls to GitHub

Expected results:

There should not be that many

Additional info:

 

 

Description of problem:

Creating any type of RoleBinding triggers an Admission Webhook Warning: xxxx unknown field: subjectRef

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-20-005211

How reproducible:

Always

Steps to Reproduce:

1. Go to the RoleBinding creation page: User Management -> RoleBindings -> Create binding, or /k8s/cluster/rolebindings/~new
2. Create any type of RoleBinding

Actual results:

2. We can see a warning message on submit: Admission Webhook Warning: RoleBinding test-ns-1 violates policy 299 - "unknown field \"subjectRef\""

Expected results:

2. no warning message

Additional info:

 

Description of problem:
The following parameters have been added to the safe sysctls list since k8s v1.29 [1]:

net.ipv4.tcp_keepalive_time
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
However, the list of safe sysctls returned by SafeSysctlAllowlist() in OpenShift is not updated [2].

[1] https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#safe-and-unsafe-sysctls
[2] https://github.com/openshift/apiserver-library-go/blob/e88385a79b1724850143487d507f606f8540f437/pkg/securitycontextconstraints/sysctl/mustmatchpatterns.go#L32

Due to this, the pod with these safe sysctls configuration is blocked by SCC for non-privileged users.
(Look at "Steps to Reproduce" for details.)

$ oc apply -f pod-sysctl.yaml
Error from server (Forbidden): error when creating "pod-sysctl.yaml": pods "pod-sysctl" is forbidden: unable to validate against any security context constraint: [provider "trident-controller": Forbidden: not usable by user or serviceaccount, provider "anyuid": Forbidden: not usable by user or serviceaccount, pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_fin_timeout" is not allowed, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "trident-node-linux": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
Version-Release number of selected component (if applicable):
OpenShift v4.16.4
How reproducible:
Always

Steps to Reproduce:
Step1. Login as a non-privileged user.

$ oc login -u user
Step2. Create the following yaml file and apply it.

$ cat pod-sysctl.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-sysctl
spec:
  containers:
  - name: con-sysctl
    image: registry.nec.test:5000/ubi8/ubi
    command: ["/bin/bash", "-c", "tail -f /dev/null & wait"]
  securityContext:
    sysctls:
    - name: net.ipv4.tcp_fin_timeout
      value: "30"

$ oc apply -f pod-sysctl.yaml

Actual results:
Applying the pod was blocked by SCC.

$ oc apply -f pod-sysctl.yaml
Error from server (Forbidden): error when creating "pod-sysctl.yaml": pods "pod-sysctl" is forbidden: unable to validate against any security context constraint: [provider "trident-controller": Forbidden: not usable by user or serviceaccount, provider "anyuid": Forbidden: not usable by user or serviceaccount, pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_fin_timeout" is not allowed, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "trident-node-linux": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
Expected results:
The yaml with safe sysctls can be applied by a non-privileged user.
The specified sysctls are enabled in the pod.
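Until the allowlist is updated, one possible cluster-admin workaround (a sketch only; the SCC and project names below are hypothetical) is to clone an existing SCC, explicitly allow the sysctls via allowedUnsafeSysctls, and grant it to the workload's service account:

$ oc get scc restricted-v2 -o yaml > restricted-v2-sysctls.yaml
# edit the file: rename the SCC (e.g. restricted-v2-sysctls), drop the server-set
# metadata (uid, resourceVersion, creationTimestamp), and add:
#   allowedUnsafeSysctls:
#   - net.ipv4.tcp_fin_timeout
$ oc apply -f restricted-v2-sysctls.yaml
$ oc adm policy add-scc-to-user restricted-v2-sysctls -z default -n <project>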

Description of problem:

    ImageStreams on hosted clusters pointing to images on private registries are failing TLS verification although the registry is correctly trusted.

example:
$ oc create namespace e2e-test

$ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest

$ oc --namespace=e2e-test  set image-lookup busybox

stirabos@t14s:~$ oc get imagestream -n e2e-test 
NAME      IMAGE REPOSITORY                                                    TAGS     UPDATED
busybox   image-registry.openshift-image-registry.svc:5000/e2e-test/busybox   latest   
stirabos@t14s:~$ oc get imagestream -n e2e-test busybox -o yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  annotations:
    openshift.io/image.dockerRepositoryCheck: "2024-03-27T12:43:56Z"
  creationTimestamp: "2024-03-27T12:43:56Z"
  generation: 3
  name: busybox
  namespace: e2e-test
  resourceVersion: "49021"
  uid: 847281e7-e307-4057-ab57-ccb7bfc49327
spec:
  lookupPolicy:
    local: true
  tags:
  - annotations: null
    from:
      kind: DockerImage
      name: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ
    generation: 2
    importPolicy:
      importMode: Legacy
    name: latest
    referencePolicy:
      type: Source
status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2024-03-27T12:43:56Z"
      message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ:
        Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to
        verify certificate: x509: certificate signed by unknown authority'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items: null
    tag: latest

Meanwhile, the image virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ can be properly consumed if used directly for a container in a pod on the same cluster.

user-ca-bundle config map is properly propagated from hypershift:

$ oc get configmap -n openshift-config user-ca-bundle
NAME             DATA   AGE
user-ca-bundle   1      3h32m

$ openssl x509 -text -noout -in <(oc get cm -n openshift-config user-ca-bundle -o json | jq -r '.data["ca-bundle.crt"]')
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            11:3f:15:23:97:ac:c2:d5:f6:54:06:1a:9a:22:f2:b5:bf:0c:5a:00
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org
        Validity
            Not Before: Mar 27 08:28:07 2024 GMT
            Not After : Mar 27 08:28:07 2025 GMT
        Subject: C = US, ST = NC, L = Raleigh, O = Test Company, OU = Testing, CN = test.metalkube.org
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:c1:49:1f:18:d2:12:49:da:76:05:36:3e:6b:1a:
                    82:a7:22:0d:be:f5:66:dc:97:44:c7:ca:31:4d:f3:
                    7f:0a:d3:de:df:f2:b6:23:f9:09:b1:7a:3f:19:cc:
                    22:c9:70:90:30:a7:eb:49:28:b6:d1:e0:5a:14:42:
                    02:93:c4:ac:cc:da:b1:5a:8f:9c:af:60:19:1a:e3:
                    b1:34:c2:b6:2f:78:ec:9f:fe:38:75:91:0f:a6:09:
                    78:28:36:9e:ab:1c:0d:22:74:d5:52:fe:0a:fc:db:
                    5a:7c:30:9d:84:7d:f7:6a:46:fe:c5:6f:50:86:98:
                    cc:35:1f:6c:b0:e6:21:fc:a5:87:da:81:2c:7b:e4:
                    4e:20:bb:35:cc:6c:81:db:b3:95:51:cf:ff:9f:ed:
                    00:78:28:1d:cd:41:1d:03:45:26:45:d4:36:98:bd:
                    bf:5c:78:0f:c7:23:5c:44:5d:a6:ae:85:2b:99:25:
                    ae:c0:73:b1:d2:87:64:3e:15:31:8e:63:dc:be:5c:
                    ed:e3:fe:97:29:10:fb:5c:43:2f:3a:c2:e4:1a:af:
                    80:18:55:bc:40:0f:12:26:6b:f9:41:da:e2:a4:6b:
                    fd:66:ae:bc:9c:e8:2a:5a:3b:e7:2b:fc:a6:f6:e2:
                    73:9b:79:ee:0c:86:97:ab:2e:cc:47:e7:1b:e5:be:
                    0c:9f
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Basic Constraints: 
                CA:TRUE, pathlen:0
            X509v3 Subject Alternative Name: 
                DNS:virthost.ostest.test.metalkube.org
    Signature Algorithm: sha256WithRSAEncryption
    Signature Value:
        58:d2:da:f9:2a:c0:2d:7a:d9:9f:1f:97:e1:fd:36:a7:32:d3:
        ab:3f:15:cd:68:8e:be:7c:11:ec:5e:45:50:c4:ec:d8:d3:c5:
        22:3c:79:5a:01:63:9e:5a:bd:02:0c:87:69:c6:ff:a2:38:05:
        21:e4:96:78:40:db:52:c8:08:44:9a:96:6a:70:1e:1e:ae:74:
        e2:2d:fa:76:86:4d:06:b1:cf:d5:5c:94:40:17:5d:9f:84:2c:
        8b:65:ca:48:2b:2d:00:3b:42:b9:3c:08:1b:c5:5d:d2:9c:e9:
        bc:df:9a:7c:db:30:07:be:33:2a:bb:2d:69:72:b8:dc:f4:0e:
        62:08:49:93:d5:0f:db:35:98:18:df:e6:87:11:ce:65:5b:dc:
        6f:f7:f0:1c:b0:23:40:1e:e3:45:17:04:1a:bc:d1:57:d7:0d:
        c8:26:6d:99:fe:28:52:fe:ba:6a:a1:b8:d1:d1:50:a9:fa:03:
        bb:b7:ad:0e:82:d2:e8:34:91:fa:b4:f9:81:d1:9b:6d:0f:a3:
        8c:9d:c4:4a:1e:08:26:71:b9:1a:e8:49:96:0f:db:5c:76:db:
        ae:c7:6b:2e:ea:89:5d:7f:a3:ba:ea:7e:12:97:12:bc:1e:7f:
        49:09:d4:08:a6:4a:34:73:51:9e:a2:9a:ec:2a:f7:fc:b5:5c:
        f8:20:95:ad

This is probably a side effect of https://issues.redhat.com/browse/RFE-3093 (making imagestreams trust the CA added during installation), which also affects imagestreams that require a CA cert injected by hypershift during hosted-cluster creation in the disconnected use case.

Version-Release number of selected component (if applicable):

    v4.14, v4.15, v4.16

How reproducible:

    100%

Steps to Reproduce:

once connected to a disconnected hosted cluster, create an image stream pointing to an image on the internal mirror registry:
    1. $ oc --namespace=e2e-test tag virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ busybox:latest

    2. $ oc --namespace=e2e-test  set image-lookup busybox
    3. then check the image stream
    

Actual results:

    status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox
  tags:
  - conditions:
    - generation: 2
      lastTransitionTime: "2024-03-27T12:43:56Z"
      message: 'Internal error occurred: virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ:
        Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to
        verify certificate: x509: certificate signed by unknown authority'

although the same image can be directly consumed by a pod on the same cluster

Expected results:

    status:
  dockerImageRepository: image-registry.openshift-image-registry.svc:5000/e2e-test/busybox
  tags:
  - conditions:
    - generation: 8
      lastTransitionTime: "2024-03-27T13:30:46Z"
      message: dockerimage.image.openshift.io "virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-29-4-4zE9mRvED4RQoUxQ"
        not found
      reason: NotFound
      status: "False"
      type: ImportSuccess

Additional info:

    This is probably a side effect of https://issues.redhat.com/browse/RFE-3093

Marking the imagestream as:
    importPolicy:
      importMode: Legacy
      insecure: true
is enough to work around this.
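The same workaround from the CLI — note this only skips TLS verification for the import, it does not make the import use the trusted CA — could look like the following sketch (tag index 0 matches the imagestream shown above):

$ oc patch imagestream busybox -n e2e-test --type=json \
    -p '[{"op":"add","path":"/spec/tags/0/importPolicy/insecure","value":true}]'
$ oc import-image busybox:latest -n e2e-test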

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/121

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The MAPI GCP code hasn't changed since 4.14, so if it were a MAPI issue I'd expect to see other things breaking.

The Installer doesn't seem to have changed either; however, there's a discrepancy between what's created by

terraform: https://github.com/openshift/installer/blame/916b3a305691dcbf1e47f01137e0ceee89ed0f59/data/data/gcp/post-bootstrap/main.tf#L14 

https://github.com/openshift/installer/blob/916b3a305691dcbf1e47f01137e0ceee89ed0f59/data/data/gcp/cluster/network/lb-private.tf#L10 

and the UPI instructions 

https://github.com/openshift/installer/blame/916b3a305691dcbf1e47f01137e0ceee89ed0f59/docs/user/gcp/install_upi.md#L560 

https://github.com/openshift/installer/blob/916b3a305691dcbf1e47f01137e0ceee89ed0f59/upi/gcp/02_lb_int.py#L19

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Builds running in hosted clusters have issues git-cloning repositories from external URLs whose CA is configured in the ca-bundle.crt of the trustedCA section:

 spec:
    configuration:
      apiServer:
       [...]
      proxy:
        trustedCA:
          name: user-ca-bundle <---

In traditional OCP implementations, the *-global-ca configmap is installed in the same namespace as the build and the ca-bundle.crt is injected into this configmap. In hosted clusters the configmap is being created empty:

$ oc get cm -n <app-namespace> <build-name>-global-ca  -oyaml
apiVersion: v1
data:
  ca-bundle.crt: ""


As mentioned, the user-ca-bundle has the certificates configured:

$ oc get cm -n openshift-config user-ca-bundle -oyaml
apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE----- <---


 

Version-Release number of selected component (if applicable):

 

How reproducible:

Easily

Steps to Reproduce:

1. Install hosted cluster with trustedCA configmap
2. Run a build in the hosted cluster
3. Check the global-ca configmap

Actual results:

global-ca is empty

Expected results:

global-ca injects the ca-bundle.crt properly

Additional info:
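A quick way to cross-check both sides of this, assuming the proxy configuration is reflected into the guest cluster as usual (namespace and build name placeholders as above):

$ oc get proxy cluster -o jsonpath='{.spec.trustedCA.name}'
user-ca-bundle
$ oc get cm -n <app-namespace> <build-name>-global-ca -o jsonpath='{.data.ca-bundle\.crt}' | wc -c
0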

 

Description of problem:


An unexpected validation failure occurs when creating the agent ISO image if the RendezvousIP is a substring of the next-hop-address set for a worker node.

For example this configuration snippet in agent-config.yaml:

apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: agent-config
rendezvousIP: 7.162.6.1
hosts:
...
 - hostname: worker-0
    role: worker
    networkConfig:
     interfaces:
        - name: eth0
          type: Ethernet
          state: up
          ipv4:
            enabled: true
            address:
              - ip: 7.162.6.4
                prefix-length: 25
            dhcp: false
     routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 7.162.6.126
            next-hop-interface: eth0
            table-id: 254

Will result in the validation failure when creating the image:

FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to fetch dependency of "Agent Installer Artifacts": failed to fetch dependency of "Agent Installer Ignition": failed to fetch dependency of "Agent Manifests": failed to fetch dependency of "NMState Config": failed to generate asset "Agent Hosts": invalid Hosts configuration: [Hosts[3].Host: Forbidden: Host worker-0 has role 'worker' and has the rendezvousIP assigned to it. The rendezvousIP must be assigned to a control plane host.

The problem is this check: https://github.com/openshift/installer/pull/6716/files#diff-fa305fe33630f77b65bd21cc9473b620f67cfd9ce35f7ddf24d03b26ec2ccfffR293
It's checking for the IP in the raw nmConfig. The routes stanza is also included in the nmConfig, and the route is
next-hop-address: 7.162.6.126
so when the rendezvousIP is 7.162.6.1, that strings.Contains() check returns true and the validation fails.

4.16.0-0.nightly-2024-05-16-165920 aws-sdn-upgrade failures in 1791152612112863232

Undiagnosed panic detected in pod

{  pods/openshift-controller-manager_controller-manager-8d46bf695-cvdc6_controller-manager.log.gz:E0516 17:36:26.515398       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3ca66c0), concrete:(*abi.Type)(0x3e9f720), asserted:(*abi.Type)(0x41dd660), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Secret)

Please review the following PR: https://github.com/openshift/origin/pull/28827

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

openshift-install is creating user-defined tags (platform.aws.userTags) on AWS subnets of a BYO VPC (unmanaged VPC) deployment when using CAPA.

The documentation[1] for userTags state:
> A map of keys and values that the installation program adds as tags to all resources that it creates.

So when the network (VPC and subnets) is managed by the user (BYO VPC), the installer should not apply the additional tags provided in install-config.yaml to those resources.

Investigating in CAPA codebase, the feature gate TagUnmanagedNetworkResources is enabled, and the subnet is propagating the userTags in the reconciliation loop[2].

[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installation-config-parameters-aws.html
[2] https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/subnets.go#L618

Version-Release number of selected component (if applicable):

4.16.0-ec.6-x86_64

How reproducible:

always

Steps to Reproduce:

1. Create the VPC and subnets using CloudFormation. Example template: https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
2. Create an install-config with userTags and the subnet IDs to install the cluster.
3. Create the cluster with the feature gate for CAPI enabled:

```
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true
metadata:
  name: "${CLUSTER_NAME}"
platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0165c70573a45651c
    - subnet-08540527fffeae3e9
    userTags:
      x-red-hat-clustertype: installer
      x-red-hat-managed: "true"
```

 

Actual results:

installer/CAPA is setting the user-defined tags in unmanaged subnets

Expected results:

- installer/CAPA does not create userTags on unmanaged subnets 
- userTags is applied for regular/standard workflow (managed VPC) with CAPA

Additional info:

- Impacting on SD/ROSA: https://redhat-internal.slack.com/archives/CCPBZPX7U/p1717588837289489 
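A quick way to confirm whether the installer applied the userTags to the pre-existing subnets (using the subnet IDs from the install-config example above):

$ aws ec2 describe-subnets \
    --subnet-ids subnet-0165c70573a45651c subnet-08540527fffeae3e9 \
    --query 'Subnets[].{ID:SubnetId,Tags:Tags}' --output json

On an affected cluster, the x-red-hat-* tags from userTags show up on these unmanaged subnets.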

Ecosystem QE is preparing to create a release-4.16 branch within our test repos. Many pkgs are currently using v0.29 modules, which will not be compatible with v0.28. It would be ideal if we could update the k8s modules to v0.29 to prevent us from needing to re-implement the assisted APIs.

 

Description of problem:

When using the OpenShift Assisted Installer with a pull-secret password containing the `:` (colon) character, the install fails.

    

Version-Release number of selected component (if applicable):

    OpenShift 4.15

    

How reproducible:

    Everytime
    

Steps to Reproduce:

    1. Attempt to install using the Agent-based installer with a pull-secret which includes a colon character in the password.

   The following snippet of code appears to be hit when there is a colon within the user/password section of the pull-secret:
https://github.com/openshift/assisted-service/blob/d3dd2897d1f6fe108353c9241234a724b30262c2/internal/cluster/validations/validations.go#L132-L135

    

Actual results:

    Install fails

    

Expected results:

   Install succeeds

    

Additional info:
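To check whether a given pull-secret entry is affected, decode its auth field (the registry name below is just an example); the decoded value is expected to be of the form user:password, so a password containing a colon yields more than one colon, which appears to be what the linked validation trips over:

$ jq -r '.auths["registry.example.com"].auth' pull-secret.json | base64 -d
myuser:my:pass:word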


    

Backport to 4.17 of AUTH-482 specifically for the cluster-network-operator.

Namespaces with workloads that need pinning:

  • openshift-network-diagnostics         
  • openshift-network-node-identity

See 4.18 PR for more info on what needs pinning.

This is a clone of issue OCPBUGS-41637. The following is the description of the original issue:

Description of problem:

    Console and OLM engineering and BU have decided to remove the Extension Catalog navigation item until the feature has matured more.

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/289

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/243

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-42000. The following is the description of the original issue:

Description of problem:

1. We are making 2 API calls to get the logs for the PipelineRuns. Instead, we can make use of the `results.tekton.dev/record` annotation and replace `records` in the annotation's value with `logs` to get the logs of the PipelineRuns.

2. Tekton Results returns only the v1 version of PipelineRun and TaskRun from Pipelines 1.16, so the data type has to be v1 for 1.16 and v1beta1 for lower versions.

Description of problem:

When running a conformance suite against a hypershift cluster (for example, CNI conformance) the MonitorTests step fails because of missing files from the disruption monitor.
    

Version-Release number of selected component (if applicable):

4.15.13
    

How reproducible:

Consistent
    

Steps to Reproduce:

    1. Create a hypershift cluster
    2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
    3. Note errors in logs
    

Actual results:

found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-130-177.us-west-2.compute.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-152-10.us-west-2.compute.internal: the server could not find the requested resource]
Failed to write events from in-cluster monitors, err: open /tmp/artifacts/junit/AdditionalEvents__in_cluster_disruption.json: no such file or directory
    

Expected results:

No errors 
    

Additional info:

The first error can be avoided by creating the directory it's looking for on all nodes:
for node in $(oc get nodes -oname); do oc debug -n default $node -- chroot /host mkdir -p /var/log/disruption-data/monitor-events; done
However, I'm not sure whether the missing directory means the disruption monitor isn't working properly on hypershift, or whether this check should be skipped on hypershift entirely.

The second error is related to the ARTIFACT_DIR env var not being set locally, and can be avoided by creating a directory, setting that directory as the ARTIFACT_DIR, and then creating an empty "junit" dir inside of it.
It looks like ARTIFACT_DIR defaults to a temporary directory if it's not set in the env, but the "junit" directory doesn't exist inside of it, so file creation in that non-existent directory fails.
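A minimal sketch of that local workaround before invoking the suite:

$ export ARTIFACT_DIR="$(mktemp -d)"
$ mkdir -p "${ARTIFACT_DIR}/junit"
# re-run the ose-tests suite; junit output is then written under ${ARTIFACT_DIR}/junit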
    

Description of problem:

In OCP 4.17, kube-apiserver no longer gets a valid cloud config. Therefore the PersistentVolumeLabel admission plugin rejects in-tree GCE PD PVs that do not have the correct topology with `persistentvolumes \"gce-\" is forbidden: error querying GCE PD volume e2e-4d8656c6-d1d4-4245-9527-33e5ed18dd31: disk is not found`

 

In 4.16, kube-apiserver will not get a valid cloud config after it updates library-go with this PR.

 

How reproducible:

always    

Steps to Reproduce:

    1. Run e2e test "Multi-AZ Cluster Volumes should schedule pods in the same zones as statically provisioned PVs"
    

 

Due to upstream changes (https://github.com/kubernetes/kubernetes/pull/121485) KMSv1 is deprecated starting with k8s 1.29. HyperShift is actively using KMSv1. Migrating a cluster from KMSv1 to KMSv2 is tricky, so we need to at least make sure that new ROSA clusters can only enable KMSv2 while old ones remain on KMSv1.

We need to verify that new installations of ROSA that enable KMS encryption are running the KMSv2 API, and that old clusters upgrading to a version where KMSv2 is available remain on KMSv1.

What

Set the version label of warn, audit and enforce to "latest" from "v1.24".

Why

  • We kept it at "v1.24" due to potential issues with HyperShift.
  • We are moving to "latest" to avoid the cost of maintaining it by-version.
  • It was discussed and decided in a meeting with David Eads (Wed Sep 12 2024).
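For reference, this maps to the standard pod security admission version labels on namespaces; an illustrative example of what the resulting labels look like (the namespace name is hypothetical):

$ oc label namespace my-namespace \
    pod-security.kubernetes.io/warn-version=latest \
    pod-security.kubernetes.io/audit-version=latest \
    pod-security.kubernetes.io/enforce-version=latest \
    --overwrite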

We need to make some minor updates to our tekton files per https://github.com/konflux-ci/build-definitions/blob/main/task/buildah/0.2/MIGRATION.md. Specifically - 

  • Removes the BASE_IMAGES_DIGESTS result. Please remove all the references to this result from your pipeline.
    • Base images and their digests can be found in the SBOM for the output image.
  • No longer writes the base_images_from_dockerfile file into the source workspace.
  • Removes the BUILDER_IMAGE and DOCKER_AUTH params. Neither one did anything in the later releases of version 0.1. Please stop passing these params to the buildah task if you used to do so with version 0.1.

Description of problem:

  Snyk is failing on some deps  

Version-Release number of selected component (if applicable):

  At least master/4.17 and 4.16

How reproducible:

    100% 

Steps to Reproduce:

Open a PR against the master or release-4.16 branch; Snyk will fail. Recent history shows that the test is just being overridden; we should stop overriding the test and fix the deps, or justify excluding them from Snyk.
    

Actual results:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cloud-credential-operator/679/pull-ci-openshift-cloud-credential-operator-master-security/1793098328855023616

 

This is a clone of issue OCPBUGS-43508. The following is the description of the original issue:

Description of problem:

    These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903.

   TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available.

   Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.

Version-Release number of selected component (if applicable):

    4.18, 4.17, 4.16, 4.15, 4.14

How reproducible:

    Sometimes

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1251

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Test flake on 409 conflict during check

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn/1820792184916414464

{Failed  === RUN   TestCreateCluster/Main/EnsureHostedClusterImmutability
    util.go:911: 
        Expected
            <string>: Operation cannot be fulfilled on hostedclusters.hypershift.openshift.io "example-c88md": the object has been modified; please apply your changes to the latest version and try again
        to contain substring
            <string>: Services is immutable
        --- FAIL: TestCreateCluster/Main/EnsureHostedClusterImmutability (0.05s)
}

Description of problem:

    unit test jobs fail due to removed image https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc/1882/pull-ci-openshift-oc-master-unit/1856738411440771072

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    unit test job passes

Additional info:

    

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/304

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please, see the last 3 failures on this test in the link provided in the boilerplate text below:


Component Readiness has found a potential regression in [sig-cluster-lifecycle] Cluster completes upgrade.

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-06-29T00:00:00Z
End Time: 2024-07-05T23:59:59Z
Success Rate: 89.96%
Successes: 242
Failures: 27
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 858
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=ClusterUpgrade&component=Cluster%20Version%20Operator&confidence=95&environment=ovn%20upgrade-minor%20amd64%20aws%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=aws&platform=aws&sampleEndTime=2024-07-05%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-06-29%2000%3A00%3A00&testId=Cluster%20upgrade%3A2b2216bdf4b6c563428ce86dd8db4613&testName=%5Bsig-cluster-lifecycle%5D%20Cluster%20completes%20upgrade&upgrade=upgrade-minor&upgrade=upgrade-minor&variant=standard&variant=standard

Description of problem:

When using `shortestPath: true`, the number of mirrored images is far higher than required.

 

 

Version-Release number of selected component (if applicable):

./oc-mirror version --output=yaml
clientVersion:
  buildDate: "2024-05-08T04:26:09Z"
  compiler: gc
  gitCommit: 9e77c1944f70fed0a85e5051c8f3efdfb09add70
  gitTreeState: clean
  gitVersion: 4.16.0-202405080039.p0.g9e77c19.assembly.stream.el9-9e77c19
  goVersion: go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime
  major: ""
  minor: ""
  platform: linux/amd64

How reproducible:

always

Steps to Reproduce:

1) Use the following ISC to do mirror2mirror for v2:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
archiveSize: 8
mirror:
  platform:
    channels:
    - name: stable-4.15                                             
      type: ocp
      minVersion: '4.15.11'
      maxVersion: '4.15.11'
      shortestPath: true
    graph: true

`oc-mirror --config config.yaml --v2 docker://xxx.com:5000/m2m  --workspace file:///app1/0416/clid20/`

`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http`

2) Without the shortest path:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
      - name: stable-4.15

 

Actual results: 

1) It counted 577 images to mirror 
oc-mirror --config config-11.yaml  file://outsizecheck --v2

2024/05/11 03:57:21  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/05/11 03:57:21  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/05/11 03:57:21  [INFO]   : ⚙️  setting up the environment for you...
2024/05/11 03:57:21  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/05/11 03:57:21  [INFO]   : 🕵️  going to discover the necessary images...
2024/05/11 03:57:21  [INFO]   : 🔍 collecting release images...
2024/05/11 03:57:28  [INFO]   : 🔍 collecting operator images...
2024/05/11 03:57:28  [INFO]   : 🔍 collecting additional images...
2024/05/11 03:57:28  [INFO]   : 🚀 Start copying the images...
2024/05/11 03:57:28  [INFO]   : === Overall Progress - copying image 1 / 577 ===
2024/05/11 03:57:28  [INFO]   : copying release image 1 / 577 




2) Without the shortest path, it only counted 192 images to mirror.
oc-mirror --config config-32547.yaml file://outsizecheck --v2

2024/05/11 03:55:12  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/05/11 03:55:12  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/05/11 03:55:12  [INFO]   : ⚙️  setting up the environment for you...
2024/05/11 03:55:12  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/05/11 03:55:12  [INFO]   : 🕵️  going to discover the necessary images...
2024/05/11 03:55:12  [INFO]   : 🔍 collecting release images...
2024/05/11 03:55:12  [INFO]   : detected minimum version as 4.15.11
2024/05/11 03:55:12  [INFO]   : detected minimum version as 4.15.11
2024/05/11 03:55:18  [INFO]   : 🔍 collecting operator images...
2024/05/11 03:56:09  [INFO]   : 🔍 collecting additional images...
2024/05/11 03:56:09  [INFO]   : 🚀 Start copying the images...
2024/05/11 03:56:09  [INFO]   : === Overall Progress - copying image 1 / 266 ===
2024/05/11 03:56:09  [INFO]   : copying release image 1 / 192

Expected results:

1) If only one OCP payload is requested, the set of images to mirror should be the same.

Additional information:


[sig-arch] events should not repeat pathologically for ns/openshift-etcd-operator

{  1 events happened too frequently
event happened 25 times, something is wrong: namespace/openshift-etcd-operator deployment/etcd-operator hmsg/e2df46f507 - reason/RequiredInstallerResourcesMissing configmaps: etcd-all-bundles-8 (02:05:52Z) result=reject }

Sample failures:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1795986440614580224

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1795986461422522368

It's hitting both of these jobs linked above, but intermittently, 20-40% of the time on this first payload with the regression.

Looks to be this PR: https://github.com/openshift/cluster-etcd-operator/pull/1268

This is a clone of issue OCPBUGS-39285. The following is the description of the original issue:

Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks and we're expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of Ansible, which will happily ignore missing var_files; however, it is a problem with older Ansible versions, which fail with:

[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml"

PLAY [localhost] *****************************************************************************************************************************************************************************************************************************
ERROR! vars file metadata.json was not found                                                                                       
Could not find file on the Ansible Controller.                                                                                      
If you are using a module and expect the file to exist on the remote, see the remote_src option

Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/579

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

CCMs attempt direct connections when the mgmt cluster on which the HCP runs is proxied and does not allow direct outbound connections.

Example from the AWS CCM

 I0731 21:46:33.948466       1 event.go:389] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: error listing AWS instances: \"WebIdentityErr: failed to retrieve credentials\\ncaused by: RequestError: send request failed\\ncaused by: Post \\\"https://sts.us-east-1.amazonaws.com/\\\": dial tcp 72.21.206.96:443: i/o timeout\""

Description of problem:

    Sometimes deleting the bootstrap SSH rule during bootstrap destroy can time out after 5 minutes, failing the installation.

Version-Release number of selected component (if applicable):

    4.16+ with capi/aws

How reproducible:

    Intermittent

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

                level=info msg=Waiting up to 5m0s (until 2:31AM UTC) for bootstrap SSH rule to be destroyed...
                level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: bootstrap ssh rule was not removed within 5m0s: timed out waiting for the condition

Expected results:

    The rule is deleted successfully and in a timely manner.

Additional info:

    This is probably happening because we are changing the AWSCluster object, thus causing capi/capa to trigger a big reconciliation of the resources. We should try to delete the rule directly via the AWS SDK instead.
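Until that change lands, a stuck install can also be unblocked by removing the rule manually with the AWS CLI (the security group ID and CIDR below are placeholders for the bootstrap SSH rule):

$ aws ec2 revoke-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 0.0.0.0/0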

Description of problem:

    IDMS is set on the HostedCluster and reflected in its respective CR in-cluster. Customers can create, update, and delete these today. In-cluster IDMS has no impact.

Version-Release number of selected component (if applicable):

    4.14+

How reproducible:

    100%

Steps to Reproduce:

    1. Create HCP
    2. Create IDMS
    3. Observe it does nothing
    

Actual results:

    IDMS doesn't change anything if manipulated in data plane

Expected results:

    IDMS either allows updates OR IDMS updates are blocked.

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/75

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Given that we create a new pool, enable OCB in this pool, remove the pool and the MachineOSConfig resource, and then create another new pool to enable OCB again, the controller pod panics.
    

Version-Release number of selected component (if applicable):

pre-merge https://github.com/openshift/machine-config-operator/pull/4327
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a new infra MCP

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


    2. Create a MachineOSConfig for infra pool

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: infra
spec:
  machineConfigPool:
    name: infra
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
EOF


    3. When the build is finished, remove the MachineOSConfig and the pool

oc delete machineosconfig infra
oc delete mcp infra

    4. Create a new infra1 pool
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra1
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra1]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra1: ""

    5. Create a new machineosconfig for infra1 pool

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: infra1
spec:
  machineConfigPool:
    name: infra1
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        RUN echo 'test image' > /etc/test-image.file
EOF



    

Actual results:

The MCO controller pod panics (in updateMachineOSBuild):

E0430 11:21:03.779078       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 265 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00035e000?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x3547bc0?, 0x53ebb20?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
	<autogenerated>:1 +0x9
k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc0007097a0, 0x0, 0x0?)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateMachineOSBuild(0xc0007097a0, {0xc001c37800?, 0xc000029678?}, {0x3904000?, 0xc0028361a0})
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:395 +0xd1
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970 +0xea
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e5738?, {0x3de6020, 0xc0008fe780}, 0x1, 0xc0000ac720)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x6974616761706f72?, 0x3b9aca00, 0x0, 0x69?, 0xc0005e5788?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000b97c20)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 248
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]



When the controller pod is restarted, it panics again, but in a different function (addMachineOSBuild):

E0430 11:26:54.753689       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 97 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x15555555aa?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x3547bc0?, 0x53ebb20?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
	<autogenerated>:1 +0x9
k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000899560, 0x0, 0x0?)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).addMachineOSBuild(0xc000899560, {0x3904000?, 0xc0006a8b60})
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:386 +0xc5
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:239
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x13e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00066bf38?, {0x3de6020, 0xc0008f8b40}, 0x1, 0xc000c2ea20)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00066bf88?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000ba6240)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 43
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]





    

Expected results:

No panic should happen; errors should be handled in a controlled way.

    

Additional info:

    To recover from this panic, the MachineOSBuild resources that belong to the pool that no longer exists must be deleted manually.
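
A minimal recovery sketch, assuming the leftover build objects can be identified from their names (the build name below is a placeholder for the MachineOSBuild created for the deleted pool):

oc get machineosbuild
oc delete machineosbuild <machineosbuild-for-deleted-pool>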

Description of problem:

Found in the QE CI case failure https://issues.redhat.com/browse/OCPQE-22045: the 4.16 HCP oauth-openshift server panics when curl'ed anonymously (this is not seen in standalone OCP 4.16 or in HCP 4.15).

Version-Release number of selected component (if applicable):

HCP 4.16: 4.16.0-0.nightly-2024-05-14-165654

How reproducible:

Always

Steps to Reproduce:

1.
$ export KUBECONFIG=HCP.kubeconfig
$ oc get --raw=/.well-known/oauth-authorization-server | jq -r .issuer
https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443

2. Panics when anonymously curl'ed:
$ curl -k "https://oauth-clusters-hypershift-ci-283235.apps.xxxx.com:443/oauth/authorize?response_type=token&client_id=openshift-challenging-client"
This request caused apiserver to panic. Look in the logs for details.

3. Check logs.
$ oc --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 get pod | grep oauth-openshift
oauth-openshift-55c6967667-9bxz9                     2/2     Running   0          6h23m
oauth-openshift-55c6967667-l55fh                     2/2     Running   0          6h22m
oauth-openshift-55c6967667-ntc6l                     2/2     Running   0          6h23m

$ for i in oauth-openshift-55c6967667-9bxz9 oauth-openshift-55c6967667-l55fh oauth-openshift-55c6967667-ntc6l; do oc logs --timestamps --kubeconfig=/home/xxia/my/env/hypershift-management/mjoseph-hyp-283235-416/kubeconfig -n clusters-hypershift-ci-283235 $i > logs/hypershift-management/mjoseph-hyp-283235-416/$i.log; done

$ grep -il panic *.log
oauth-openshift-55c6967667-ntc6l.log

$ cat oauth-openshift-55c6967667-ntc6l.log
2024-05-15T03:43:59.769424528Z I0515 03:43:59.769303       1 secure_serving.go:57] Forcing use of http/1.1 only
2024-05-15T03:43:59.772754182Z I0515 03:43:59.772725       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
2024-05-15T03:43:59.772803132Z I0515 03:43:59.772782       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
2024-05-15T03:43:59.772841518Z I0515 03:43:59.772834       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
2024-05-15T03:43:59.772870498Z I0515 03:43:59.772787       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
2024-05-15T03:43:59.772982605Z I0515 03:43:59.772736       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
2024-05-15T03:43:59.773009678Z I0515 03:43:59.773002       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
2024-05-15T03:43:59.773214896Z I0515 03:43:59.773194       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/etc/kubernetes/certs/serving-cert/tls.crt::/etc/kubernetes/certs/serving-cert/tls.key"
2024-05-15T03:43:59.773939655Z I0515 03:43:59.773923       1 secure_serving.go:213] Serving securely on [::]:6443
2024-05-15T03:43:59.773965659Z I0515 03:43:59.773952       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
2024-05-15T03:43:59.873008524Z I0515 03:43:59.872970       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
2024-05-15T03:43:59.873078108Z I0515 03:43:59.873021       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
2024-05-15T03:43:59.873120163Z I0515 03:43:59.873032       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
2024-05-15T09:25:25.782066400Z E0515 09:25:25.782026       1 runtime.go:77] Observed a panic: runtime error: invalid memory address or nil pointer dereference
2024-05-15T09:25:25.782066400Z goroutine 8662 [running]:
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:110 +0x9c
2024-05-15T09:25:25.782066400Z panic({0x2115f60?, 0x3c45ec0?})
2024-05-15T09:25:25.782066400Z     runtime/panic.go:914 +0x21f
2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*unionAuthenticationHandler).AuthenticationNeeded(0xc0008a90e0, {0x7f2a74268bd8?, 0xc000607760?}, {0x293c340?, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     github.com/openshift/oauth-server/pkg/oauth/handlers/default_auth_handler.go:122 +0xce1
2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauth/handlers.(*authorizeAuthenticator).HandleAuthorize(0xc0008a9110, 0xc0007b06c0, 0x7?, {0x293c340, 0xc0007d1ef0})
2024-05-15T09:25:25.782066400Z     github.com/openshift/oauth-server/pkg/oauth/handlers/authenticator.go:54 +0x21d
2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.AuthorizeHandlers.HandleAuthorize({0xc0008a91a0?, 0x3, 0x772d66?}, 0x22ef8e0?, 0xc0007b2420?, {0x293c340, 0xc0007d1ef0})
2024-05-15T09:25:25.782066400Z     github.com/openshift/oauth-server/pkg/osinserver/interfaces.go:29 +0x95
2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/osinserver.(*osinServer).handleAuthorize(0xc0004a54c0, {0x293c340, 0xc0007d1ef0}, 0xd?)
2024-05-15T09:25:25.782066400Z     github.com/openshift/oauth-server/pkg/osinserver/osinserver.go:77 +0x25e
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x0?, {0x293c340?, 0xc0007d1ef0?}, 0x410acc?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z net/http.(*ServeMux).ServeHTTP(0x2390e60?, {0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2514 +0x142
2024-05-15T09:25:25.782066400Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithRestoreOAuthHeaders.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:57 +0x1ca
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func21({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x4?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthorization.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authorization.go:78 +0x639
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc1893dc16e2d2585?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007fabb8?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x3c5b920?, {0x293c340?, 0xc0007d1ef0?}, 0x3?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/maxinflight.go:196 +0x262
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func23({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x7f2a74226390?, {0x293c340?, 0xc0007d1ef0?}, 0xc0007953c8?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithImpersonation.func4({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/impersonation.go:50 +0x1c3
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func24({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xcd1160?, {0x293c340?, 0xc0007d1ef0?}, 0x0?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x235fda0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func26({0x293c340?, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291a100?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3700)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/authentication.go:120 +0x7e5
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x293c340?, 0xc0007d1ef0?}, 0x291ef40?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x293c340, 0xc0007d1ef0}, 0xc0007a3500)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filterlatency/filterlatency.go:94 +0x37a
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0003e0900?, {0x293c340?, 0xc0007d1ef0?}, 0xc00061af20?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:115 +0x62
2024-05-15T09:25:25.782066400Z created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 8660
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:101 +0x1b2
2024-05-15T09:25:25.782066400Z 
2024-05-15T09:25:25.782066400Z goroutine 8660 [running]:
2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1fb1a00?, 0xc000810260})
2024-05-15T09:25:25.782066400Z     k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:75 +0x85
2024-05-15T09:25:25.782066400Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0005aa840, 0x1, 0x1c08865?})
2024-05-15T09:25:25.782066400Z     k8s.io/apimachinery@v0.29.2/pkg/util/runtime/runtime.go:49 +0x6b
2024-05-15T09:25:25.782066400Z panic({0x1fb1a00?, 0xc000810260?})
2024-05-15T09:25:25.782066400Z     runtime/panic.go:914 +0x21f
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc000528cc0, {0x2944dd0, 0xc000476460}, 0xdf8475800?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/timeout.go:121 +0x35c
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestDeadline.withRequestDeadline.func27({0x2944dd0, 0xc000476460}, 0xc0007a3300)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_deadline.go:100 +0x237
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x2459ac0?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWaitGroup.withWaitGroup.func28({0x2944dd0, 0xc000476460}, 0xc0004764b0?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/waitgroup.go:86 +0x18c
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007a3200?, {0x2944dd0?, 0xc000476460?}, 0xc0004764b0?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithWarningRecorder.func13({0x2944dd0?, 0xc000476460}, 0xc000476410?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/warning.go:35 +0xc6
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x2944dd0?, 0xc000476460?}, 0xd?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithCacheControl.func14({0x2944dd0, 0xc000476460}, 0x0?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/cachecontrol.go:31 +0xa7
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0002a0fa0?, {0x2944dd0?, 0xc000476460?}, 0xc0005aad90?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithHTTPLogging.WithLogging.withLogging.func34({0x2944dd0, 0xc000476460}, 0x1?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/server/httplog/httplog.go:111 +0x95
2024-05-15T09:25:25.782066400Z net/http.HandlerFunc.ServeHTTP(0xc0007b0360?, {0x2944dd0?, 0xc000476460?}, 0x0?)
2024-05-15T09:25:25.782066400Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782066400Z k8s.io/apiserver/pkg/endpoints/filters.WithTracing.func1({0x2944dd0?, 0xc000476460?}, 0xc0007a3200?)
2024-05-15T09:25:25.782066400Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/traces.go:42 +0x222
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x2944dd0?, 0xc000476460?}, 0x291ef40?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP(0xc000289b80, {0x293c340?, 0xc0007d1bf0}, 0xc0007a3100, {0x2923a40, 0xc000528d68})
2024-05-15T09:25:25.782129547Z     go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:217 +0x1202
2024-05-15T09:25:25.782129547Z go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1({0x293c340?, 0xc0007d1bf0?}, 0xc0001fec40?)
2024-05-15T09:25:25.782129547Z     go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.44.0/handler.go:81 +0x35
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2948fb0?, {0x293c340?, 0xc0007d1bf0?}, 0x100?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithLatencyTrackers.func16({0x29377e0?, 0xc0001fec40}, 0xc000289e40?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/webhook_duration.go:57 +0x14a
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2f00?, {0x29377e0?, 0xc0001fec40?}, 0x7f2abb853108?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestInfo.func17({0x29377e0, 0xc0001fec40}, 0x3d02360?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/requestinfo.go:39 +0x118
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0007a2e00?, {0x29377e0?, 0xc0001fec40?}, 0x12a1dc02246f?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithRequestReceivedTimestamp.withRequestReceivedTimestampWithClock.func31({0x29377e0, 0xc0001fec40}, 0xc000508b58?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/request_received_time.go:38 +0xaf
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x3?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab818?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithMuxAndDiscoveryComplete.func18({0x29377e0?, 0xc0001fec40?}, 0xc0007a2e00?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/mux_discovery_complete.go:52 +0xd5
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc000081800?, {0x29377e0?, 0xc0001fec40?}, 0xc0005ab888?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.withPanicRecovery.func32({0x29377e0?, 0xc0001fec40?}, 0xc0007d18f0?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/server/filters/wrap.go:74 +0xa6
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x29490d0?, {0x29377e0?, 0xc0001fec40?}, 0xc00065eea0?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithAuditInit.withAuditInit.func33({0x29377e0, 0xc0001fec40}, 0xc00040c580?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/endpoints/filters/audit_init.go:63 +0x12c
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0x2390e60?, {0x29377e0?, 0xc0001fec40?}, 0xd?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithPreserveOAuthHeaders.func2({0x29377e0, 0xc0001fec40}, 0xc0007a2d00)
2024-05-15T09:25:25.782129547Z     github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:42 +0x16e
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005aba80?, {0x29377e0?, 0xc0001fec40?}, 0x24c95d5?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth.WithStandardHeaders.func3({0x29377e0, 0xc0001fec40}, 0xc0005abb18?)
2024-05-15T09:25:25.782129547Z     github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0xde
2024-05-15T09:25:25.782129547Z net/http.HandlerFunc.ServeHTTP(0xc0005abb68?, {0x29377e0?, 0xc0001fec40?}, 0xc00040c580?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2136 +0x29
2024-05-15T09:25:25.782129547Z k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0x3d33480?, {0x29377e0?, 0xc0001fec40?}, 0xc0005abb50?)
2024-05-15T09:25:25.782129547Z     k8s.io/apiserver@v0.29.2/pkg/server/handler.go:189 +0x25
2024-05-15T09:25:25.782129547Z net/http.serverHandler.ServeHTTP({0xc0007d1830?}, {0x29377e0?, 0xc0001fec40?}, 0x6?)
2024-05-15T09:25:25.782129547Z     net/http/server.go:2938 +0x8e
2024-05-15T09:25:25.782129547Z net/http.(*conn).serve(0xc0007b02d0, {0x29490d0, 0xc000585e90})
2024-05-15T09:25:25.782129547Z     net/http/server.go:2009 +0x5f4
2024-05-15T09:25:25.782129547Z created by net/http.(*Server).Serve in goroutine 249
2024-05-15T09:25:25.782129547Z     net/http/server.go:3086 +0x5cb
2024-05-15T09:25:25.782129547Z http: superfluous response.WriteHeader call from k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithPanicRecovery.func19 (wrap.go:57)
2024-05-15T09:25:25.782129547Z E0515 09:25:25.782066       1 wrap.go:58] "apiserver panic'd" method="GET" URI="/oauth/authorize?response_type=token&client_id=openshift-challenging-client" auditID="ac4795ff-5935-4ff5-bc9e-d84018f29469"      

Actual results:

Panics when anonymously curl'ed

Expected results:

No panic

Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/74

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    Trying to use https://github.com/openshift-metal3/dev-scripts to deploy an OCP 4.16 or 4.17 cluster (with the same configuration, OCP 4.14 and 4.15 deploy successfully) with:
 MIRROR_IMAGES=true
 INSTALLER_PROXY=true

the bootstrap process fails with:

 level=debug msg=    baremetalhost resource not yet available, will retry
level=debug msg=    baremetalhost resource not yet available, will retry
level=info msg=  baremetalhost: ostest-master-0: uninitialized
level=info msg=  baremetalhost: ostest-master-0: registering
level=info msg=  baremetalhost: ostest-master-1: uninitialized
level=info msg=  baremetalhost: ostest-master-1: registering
level=info msg=  baremetalhost: ostest-master-2: uninitialized
level=info msg=  baremetalhost: ostest-master-2: registering
level=info msg=  baremetalhost: ostest-master-1: inspecting
level=info msg=  baremetalhost: ostest-master-2: inspecting
level=info msg=  baremetalhost: ostest-master-0: inspecting
E0514 12:16:51.985417   89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable
W0514 12:16:52.979254   89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable
E0514 12:16:52.979293   89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=5466": Service Unavailable
E0514 12:37:01.927140   89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=7800&timeoutSeconds=383&watch=true": Service Unavailable
W0514 12:37:03.173425   89709 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable
E0514 12:37:03.173473   89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?resourceVersion=7800": Service Unavailable
level=debug msg=Fetching Bootstrap SSH Key Pair...
level=debug msg=Loading Bootstrap SSH Key Pair...

It looks like https://api.ostest.test.metalkube.org:6443 was reachable up to a certain point and then started failing, presumably because the requests are either not going through the proxy when they should, or going through the proxy when they should not.

The 3 master nodes are reported as:
[root@ipi-ci-op-0qigcrln-b54ee-1790684582253694976 home]# oc get baremetalhosts -A
NAMESPACE               NAME              STATE        CONSUMER                ONLINE   ERROR              AGE
openshift-machine-api   ostest-master-0   inspecting   ostest-bbhxb-master-0   true     inspection error   24m
openshift-machine-api   ostest-master-1   inspecting   ostest-bbhxb-master-1   true     inspection error   24m
openshift-machine-api   ostest-master-2   inspecting   ostest-bbhxb-master-2   true     inspection error   24m

With something like:

 status:
  errorCount: 5
  errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection: Validation
    of image href http://0.0.0.0:8084/34427934-f1a6-48d6-9666-66872eec9ba2 failed,
    reason: Got HTTP code 503 instead of 200 in response to HEAD request.'
  errorType: inspection error

on their status

Version-Release number of selected component (if applicable):

    4.16, 4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Try to create an OCP 4.16 cluster with dev-scripts using IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true
    2.
    3.
    

Actual results:

    level=info msg=  baremetalhost: ostest-master-0: inspecting
E0514 12:16:51.985417   89709 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *unstructured.Unstructured: Get "https://api.ostest.test.metalkube.org:6443/apis/metal3.io/v1alpha1/namespaces/openshift-machine-api/baremetalhosts?allowWatchBookmarks=true&resourceVersion=5466&timeoutSeconds=547&watch=true": Service Unavailable

Expected results:

    Successful deployment

Additional info:

I'm using IP_STACK=v4, MIRROR_IMAGES=true and INSTALLER_PROXY=true.
With the same configuration (MIRROR_IMAGES=true and INSTALLER_PROXY=true), OCP 4.14 and OCP 4.15 are working.
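
For reference, a hedged sketch of the relevant dev-scripts configuration fragment (the config_$USER.sh file name follows the usual dev-scripts convention; only the variables mentioned above are shown):

export IP_STACK=v4
export MIRROR_IMAGES=true
export INSTALLER_PROXY=true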

When removing INSTALLER_PROXY=true, OCP 4.16 is also working.

I'm going to attach bootstrap gather logs

Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/231

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/installer/pull/8455

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

For IPI on vSphere: enable CAPI in the installer and install the cluster. After destroying the cluster, the destroy log reports "All folders deleted", but the cluster folder still exists in the vSphere client.

example:

05-08 20:24:38.765 level=debug msg=Delete Folder
05-08 20:24:40.649 level=debug msg=All folders deleted
05-08 20:24:40.649 level=debug msg=Delete StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc
05-08 20:24:41.576 level=info msg=Destroyed StoragePolicy=openshift-storage-policy-wwei-0429g-fdwqc
05-08 20:24:41.576 level=debug msg=Delete Tag=wwei-0429g-fdwqc
05-08 20:24:43.463 level=info msg=Deleted Tag=wwei-0429g-fdwqc
05-08 20:24:43.463 level=debug msg=Delete TagCategory=openshift-wwei-0429g-fdwqc
05-08 20:24:44.825 level=info msg=Deleted TagCategory=openshift-wwei-0429g-fdwqc

govc ls /DEVQEdatacenter/vm | grep wwei-0429g-fdwqc
/DEVQEdatacenter/vm/wwei-0429g-fdwqc

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-07-025557

How reproducible:

Destroy a cluster that was installed with CAPI enabled.

Steps to Reproduce:

    1. Install the cluster with CAPI enabled
    2. Destroy the cluster and check the cluster folder in the vSphere client
    

Actual results:

    cluster folder still exists.

Expected results:

    The cluster folder should not exist in the vSphere client after a successful destroy.

Additional info:

    

This is a clone of issue OCPBUGS-38551. The following is the description of the original issue:

Description of problem:

    If multiple NICs are configured in install-config, the installer provisions the nodes properly but bootstrap fails due to API validation: 4.17 and later support multiple NICs, while releases earlier than 4.17 do not and fail as shown below.

Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
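
For illustration, a hedged install-config.yaml fragment of the kind that triggers this validation on releases earlier than 4.17 (the failure domain and network names are placeholders, and the other required topology fields are omitted for brevity):

platform:
  vsphere:
    failureDomains:
    - name: example-fd
      topology:
        networks:
        - VM Network
        - second-network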

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    When an image is referenced by both a tag and a digest, oc-mirror skips the image.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Run mirror-to-disk and then disk-to-mirror using the catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 and the multiarch-tuning-operator operator.

Steps to Reproduce:

    1. Mirror to disk
    2. Disk to mirror (a hedged configuration sketch follows below)
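
A minimal sketch of the ImageSetConfiguration and commands for such a run, assuming the oc-mirror v2 workflow and its v2alpha1 configuration schema (the config file name, local path and target registry are placeholders):

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
    packages:
    - name: multiarch-tuning-operator

oc mirror --v2 --config imageset-config.yaml file://local-mirror
oc mirror --v2 --config imageset-config.yaml --from file://local-mirror docker://target-registry.example.com:5000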

Actual results:

    docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format

Expected results:

The image should be mirrored    

Additional info:

    

Description of problem:

The samples must be updated for the Samples Operator release in OCP 4.17.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

Not a bug, but filed in OCPBUGS so that CI automation can be used in GitHub. The Samples Operator JIRA project is no longer updated with the required versions.

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41787. The following is the description of the original issue:

Description of problem:

    The test tries to schedule pods on all workers but fails to schedule on infra nodes

 Warning  FailedScheduling  86s                default-scheduler  0/9 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/9 nodes are available: 3 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.

$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-b6fns-infra-0-m4v7t    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-pllsf    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-vnbp8    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-master-0         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-2         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-lmlxf-1   Ready    control-plane,master   17h   v1.30.4
ostest-b6fns-worker-0-h527q   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-kpvdx   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-xfcjf   Ready    worker                 19h   v1.30.4

The test should exclude the infra nodes from the set of worker nodes it tries to schedule onto.
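
A hedged sketch of how the test could restrict itself to plain worker nodes, using a label selector that requires the worker role and excludes the infra role (the actual fix in the test code may differ):

oc get nodes -l 'node-role.kubernetes.io/worker=,!node-role.kubernetes.io/infra'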

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-09-09-173813

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In integration with the latest HyperShift Operator (0.0.39) and a 4.15.x Hosted Cluster:

1. Apply a new hostedCluster.Spec.Configuration.Image (insecureRegistries).
2. The config is rolled out to all the node pools.
3. Nodes with the previous config are stuck because the machines can't be deleted, so the rollout never progresses.

CAPI shows this log

I0624 14:38:22.520708       1 logger.go:67] "Handling deleted AWSMachine"
E0624 14:38:22.520786       1 logger.go:83] "unable to delete machine" err="failed to get raw userdata: failed to retrieve bootstrap data secret for AWSMachine ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6: Secret \"user-data-ad-int1-workers-b14ee318\" not found"
E0624 14:38:22.521364       1 controller.go:324] "Reconciler error" err="failed to get raw userdata: failed to retrieve bootstrap data secret for AWSMachine ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6: Secret \"user-data-ad-int1-workers-b14ee318\" not found" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1/ad-int1-workers-16fe3af3-mdvv6" namespace="ocm-int-2c3is2isdhgqcu5qat4a7qbo8j6vqm62-ad-int1" name="ad-int1-workers-16fe3af3-mdvv6" reconcileID="8ca6fbef-1031-45df-b0cc-78d2f25607da"

The secret seems to be deleted by the HyperShift Operator too early.
Found https://github.com/openshift/hypershift/pull/3969, which may be related.

Version-Release number of selected component (if applicable):{code:none}

How reproducible:

Always in ROSA int environment

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

Patch example
          image:
            additionalTrustedCA:
              name: ""
            registrySources:
              blockedRegistries:
              - badregistry.io
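
A minimal sketch of applying such a configuration change to the HostedCluster (the cluster name and namespace are placeholders, not taken from this report):

oc patch hostedcluster example-cluster -n clusters --type merge -p '{"spec":{"configuration":{"image":{"registrySources":{"blockedRegistries":["badregistry.io"]}}}}}'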

Slack thread https://redhat-external.slack.com/archives/C01C8502FMM/p1719221463858639

Description of problem:

In an attempt to fix https://issues.redhat.com/browse/OCPBUGS-35300, we introduced an Azure-specific dependency on dnsmasq, which created a dependency loop. This bug tracks reverting that change.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Installation on a 4.16 nightly build failed while waiting for install-complete; the API is unavailable.

level=info msg=Waiting up to 20m0s (until 5:00AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443...
level=info msg=API v1.29.2+a0beecc up
level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete...
api available
waiting for bootstrap to complete
level=info msg=Waiting up to 20m0s (until 5:01AM UTC) for the Kubernetes API at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443...
level=info msg=API v1.29.2+a0beecc up
level=info msg=Waiting up to 30m0s (until 5:11AM UTC) for bootstrapping to complete...
level=info msg=It is now safe to remove the bootstrap resources
level=info msg=Time elapsed: 15m54s
Copying kubeconfig to shared dir as kubeconfig-minimal
level=info msg=Destroying the bootstrap resources... 
level=info msg=Waiting up to 40m0s (until 5:39AM UTC) for the cluster at https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443 to initialize...
W0313 04:59:34.272442     229 reflector.go:539] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout
I0313 04:59:34.272658     229 trace.go:236] Trace[533197684]: "Reflector ListAndWatch" name:k8s.io/client-go/tools/watch/informerwatcher.go:146 (13-Mar-2024 04:59:04.271) (total time: 30000ms):
Trace[533197684]: ---"Objects listed" error:Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout 30000ms (04:59:34.272)
...
E0313 05:38:18.669780     229 reflector.go:147] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: failed to list *v1.ClusterVersion: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 172.212.184.131:6443: i/o timeout
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ci-op-4sgxj8jx-8482f.qe.azure.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 172.212.184.131:6443: i/o timeout
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=error msg=failed to initialize the cluster: timed out waiting for the condition 

On the master nodes, it seems that the kube-apiserver container is not running:
[root@ci-op-4sgxj8jx-8482f-hppxj-master-0 ~]# crictl ps | grep apiserver
e4b6cc9622b01       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         7 minutes ago        Running             kube-apiserver-cert-syncer                    22                  3ff4af6614409       kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0
1249824fe5788       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         4 hours ago          Running             kube-apiserver-insecure-readyz                0                   3ff4af6614409       kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0
ca774b07284f0       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         4 hours ago          Running             kube-apiserver-cert-regeneration-controller   0                   3ff4af6614409       kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0
2931b9a2bbabd       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         4 hours ago          Running             openshift-apiserver-check-endpoints           0                   4136bf2183de1       apiserver-7df5bb879-xx74p
0c9534aec3b6b       8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de                                                         4 hours ago          Running             openshift-apiserver                           0                   4136bf2183de1       apiserver-7df5bb879-xx74p
db21a2dd1df33       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         4 hours ago          Running             guard                                         0                   199e1f4e665b9       kube-apiserver-guard-ci-op-4sgxj8jx-8482f-hppxj-master-0
429110f9ea5a3       6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5                                                         4 hours ago          Running             apiserver-watcher                             0                   7664f480df29d       apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-0

[root@ci-op-4sgxj8jx-8482f-hppxj-master-1 ~]# crictl ps | grep apiserver
c64187e7adcc6       ec5ccd782eb003136d9cc1df51a2b20f8a2a489d72ffb894b92f50e363c7cb90                                                         4 hours ago         Running             openshift-apiserver-check-endpoints           0                   1a4a5b247c28a       apiserver-7df5bb879-f6v5x
ff98c52402288       8c9042f97c89d8c8519d6e6235bef5a5346f08e6d7d9864ef0f228b318b4c3de                                                         4 hours ago         Running             openshift-apiserver                           0                   1a4a5b247c28a       apiserver-7df5bb879-f6v5x
2f8a97f959409       faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927                                                         4 hours ago         Running             oauth-apiserver                               0                   ffa2c316a0cca       apiserver-97fbc599c-2ftl7
72897e30e0df0       6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5                                                         4 hours ago         Running             apiserver-watcher                             0                   3b6c3849ce91f       apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-1

[root@ci-op-4sgxj8jx-8482f-hppxj-master-2 ~]# crictl ps | grep apiserver
04c426f07573d       faa1b95089d101cdc907d7affe310bbff5a9aa8f92c725dc6466afc37e731927                                                         4 hours ago         Running             oauth-apiserver                      0                   2172a64fb1a38       apiserver-654dcb4cc6-tq8fj
4dcca5c0e9b99       6a03f3f082f3719e79087d569b3cd1e718fb670d1261fbec9504662f1005b1a5                                                         4 hours ago         Running             apiserver-watcher                    0                   1cd99ec327199       apiserver-watcher-ci-op-4sgxj8jx-8482f-hppxj-master-2


And the following error was found in the kubelet log:
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: E0313 06:10:15.004656   23961 kuberuntime_manager.go:1262] container &Container{Name:kube-apiserver,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:789f242b8bc721b697e265c6f9d025f45e56e990bfd32e331c633fe0b9f076bc,Command:[/bin/bash -ec],Args:[LOCK=/var/log/kube-apiserver/.lock
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: # We should be able to acquire the lock immediatelly. If not, it means the init container has not released it yet and kubelet or CRI-O started container prematurely.
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec {LOCK_FD}>${LOCK} && flock --verbose -w 30 "${LOCK_FD}" || {
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]:   echo "Failed to acquire lock for kube-apiserver. Please check setup container for details. This is likely kubelet or CRI-O bug."
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]:   exit 1
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: }
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]:   echo "Copying system trust bundle ..."
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]:   cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: fi
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: exec watch-termination --termination-touch-file=/var/log/kube-apiserver/.terminating --termination-log-file=/var/log/kube-apiserver/termination.log --graceful-termination-duration=135s --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig -- hyperkube kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=${HOST_IP}  -v=2 --permit-address-sharing
Mar 13 06:10:15 ci-op-4sgxj8jx-8482f-hppxj-master-0 kubenswrapper[23961]: ],WorkingDir:,Ports:[]ContainerPort{ContainerPort{Name:,HostPort:6443,ContainerPort:6443,Protocol:TCP,HostIP:,},},Env:[]EnvVar{EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:STATIC_POD_VERSION,Value:4,ValueFrom:nil,},EnvVar{Name:HOST_IP,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:GOGC,Value:100,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{265 -3} {<nil>} 265m DecimalSI},memory: {{1073741824 0} {<nil>} 1Gi BinarySI},},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:resource-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-resources,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:cert-dir,ReadOnly:false,MountPath:/etc/kubernetes/static-pod-certs,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:audit-dir,ReadOnly:false,MountPath:/var/log/kube-apiserver,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:livez,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:readyz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:1,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:FallbackToLogsOnError,VolumeDevices:[]VolumeDevice{},StartupProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:healthz,Port:{0 6443 },Host:,Scheme:HTTPS,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:10,PeriodSeconds:5,SuccessThreshold:1,FailureThreshold:30,TerminationGracePeriodSeconds:nil,},ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod kube-apiserver-ci-op-4sgxj8jx-8482f-hppxj-master-0_openshift-kube-apiserver(196e0956694ff43707b03f4585f3b6cd): CreateContainerConfigError: host IP unknown; known addresses: []

Version-Release number of selected component (if applicable):

    4.16 latest nightly build

How reproducible:

    frequently

Steps to Reproduce:

    1. Install cluster on 4.16 nightly build
    2.
    3.
    

Actual results:

    Installation failed.

Expected results:

    Installation is successful.

Additional info:

Searched CI jobs and found many jobs failing with the same error; most are on the Azure platform.
https://search.dptools.openshift.org/?search=failed+to+initialize+the+cluster%3A+timed+out+waiting+for+the+condition&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Description of problem:

During the creation of a 4.16 cluster using the nightly build (--channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324) with the following command: 

rosa create cluster --cluster-name $CLUSTER_NAME --sts --mode auto --machine-cidr 10.0.0.0/16 --compute-machine-type m6a.xlarge --region $REGION --oidc-config-id $OIDC_ID --channel-group nightly --version 4.16.0-0.nightly-2024-05-19-235324 --ec2-metadata-http-tokens optional --replicas 2 --service-cidr 172.30.0.0/16 --pod-cidr 10.128.0.0/14 --host-prefix 23 -y

How reproducible:

1. Run the command provided above to create a cluster.
2. Observe the error during the IAM role creation step.

Actual results:

time="2024-05-20T03:21:03Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create IAM roles: failed to create inline policy for role master: AccessDenied: User: arn:aws:sts::890193308254:assumed-role/ManagedOpenShift-Installer-Role/1716175231092827911 is not authorized to perform: iam:PutRolePolicy on resource: role ManagedOpenShift-ControlPlane-Role because no identity-based policy allows the iam:PutRolePolicy action\n\tstatus code: 403, request id: 27f0f631-abdd-47e9-ba02-a2e71a7487dc"
time="2024-05-20T03:21:04Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=wx9l766h
time="2024-05-20T03:21:04Z" level=error msg="error provisioning cluster" error="exit status 4" installID=wx9l766h
time="2024-05-20T03:21:04Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=wx9l766h
time="2024-05-20T03:21:04Z" level=debug msg="OpenShift Installer v4.16.0

Expected results:

The cluster should be created successfully without IAM permission errors.

Additional info:

- The IAM role ManagedOpenShift-Installer-Role does not have the necessary permissions to perform iam:PutRolePolicy on the ManagedOpenShift-ControlPlane-Role.

- This issue was observed with the nightly build 4.16.0-0.nightly-2024-05-19-235324.

 

More context: https://redhat-internal.slack.com/archives/C070BJ1NS1E/p1716182046041269
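One way to confirm the missing permission from the error above (a rough sketch; the account ID in the role ARN is a placeholder):

# Inspect the installer role's policies and simulate the failing action
aws iam list-attached-role-policies --role-name ManagedOpenShift-Installer-Role
aws iam list-role-policies --role-name ManagedOpenShift-Installer-Role
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<account-id>:role/ManagedOpenShift-Installer-Role \
  --action-names iam:PutRolePolicy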

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Check webhook-authentication-integrated-oauth secret annotations in openshift-config namespace
2.
3.

Actual results:

No component annotation set
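A quick way to check the annotations on that secret (a minimal sketch):

oc get secret webhook-authentication-integrated-oauth -n openshift-config \
  -o jsonpath='{.metadata.annotations}{"\n"}'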

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-39298. The following is the description of the original issue:

Description of problem:

cluster-capi-operator's manifests-gen tool would generate CAPI providers transport configmaps with missing metadata details

Version-Release number of selected component (if applicable):

4.17, 4.18

How reproducible:

Not impacting payload, only a tooling bug

Description of problem:

This is related to bug [OCPBUGS-29459](https://issues.redhat.com/browse/OCPBUGS-29459). In addition to fixing that bug, we should fix the logging in the machine-config-controller so it emits detailed warnings/errors about the faulty, malformed certificate: for example, which particular certificate is malformed and what the actual problem with it is (e.g. `x509: malformed algorithm identifier`, `x509: invalid certificate policies`).
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

machine-config-controller just logs `Malformed Cert, not syncing` as an Info message and fails to log the details of the malformed certificate. This makes triage/troubleshooting difficult; it is hard to guess which certificate has the issue.
    

Expected results:

machine-config-controller should emit detailed warnings identifying which certificate has the issue, which makes troubleshooting a lot easier. This should be logged at error or warning level, not info.
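In the meantime, a manual way to locate the offending certificate is to split the trust bundle and parse each certificate individually (a rough sketch; it assumes the bundle in question is the user-ca-bundle ConfigMap in openshift-config):

oc get configmap user-ca-bundle -n openshift-config \
  -o jsonpath='{.data.ca-bundle\.crt}' > ca-bundle.crt
# Split the bundle into one file per certificate and try to parse each one
csplit -z -f cert- ca-bundle.crt '/-----BEGIN CERTIFICATE-----/' '{*}'
for f in cert-*; do
  openssl x509 -in "$f" -noout -subject -enddate || echo "parse error in $f"
done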
    

Additional info:

This is related to BUG [OCPBUGS-29459](https://issues.redhat.com/browse/OCPBUGS-29459)
    

Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/102

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

OCP/RHCOS system daemons like ovs-vswitchd (revalidator process) use the same vCPUs (from the isolated vCPU pool) that are already reserved by the CPU Manager for CNF workloads, causing intermittent CNF workload performance issues (and also vCPU-level overload). Note: NCP 23.11 uses the CPU Manager with the static policy and the Topology Manager set to "single-numa-node". Also, specific isolated and reserved vCPU pools have been defined.

Version-Release number of selected component (if applicable):

4.14.22

How reproducible:

Intermittent at customer environment.

Steps to Reproduce:

1.
2.
3.

Actual results:

ovs-vswitchd is using isolated CPUs
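One way to see this from a debug shell on the affected node (a sketch; compare the output against the isolated CPU set defined in the PerformanceProfile):

# CPU affinity of the ovs-vswitchd process
taskset -cp "$(pidof ovs-vswitchd)"
# CPU each thread (including revalidators) last ran on
ps -T -o pid,tid,psr,comm -p "$(pidof ovs-vswitchd)" | grep revalidator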

Expected results:

ovs-vswitchd should use only the reserved CPUs

Additional info:

We want to understand if customer is hitting the bug:

  https://issues.redhat.com/browse/OCPBUGS-32407

This bug was fixed at 4.14.25. Customer cluster is 4.14.22. Customer is also asking if it is possible to get a private fix since they cannot update at the moment.

All case files have been yanked on both the US and EU instances of Supportshell. In case the case updates or attachments are not accessible, please let me know.

Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2381

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.

Description of problem:

TestAllowedSourceRangesStatus test is flaking with the error:

allowed_source_ranges_test.go:197: expected the annotation to be reflected in status.allowedSourceRanges: timed out waiting for the condition

I also notice it sometimes coincides with a TestScopeChange error. It may be related to LoadBalancer-type update operations, for example, https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/978/pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator/1800249453098045440

Version-Release number of selected component (if applicable):

4.17

How reproducible:

~25-50%

Steps to Reproduce:

1. Run cluster-ingress-operator TestAllowedSourceRangesStatus E2E tests
2.
3.

Actual results:

Test is flaking

Expected results:

Test shouldn't flake

Additional info:

Example Flake

Search.CI Link

Description of problem:

    Geneve port has not been created for a set of nodes.

~~~
[arghosh@supportshell-1 03826869]$ omg get nodes |grep -v NAME|wc -l
83
~~~
~~~
# crictl exec -ti `crictl ps --name nbdb -q` ovn-nbctl show transit_switch | grep tstor-prd-fc-shop09a | wc -l
73
# crictl exec -ti `crictl ps --name nbdb -q` ovn-sbctl list chassis | grep -c ^hostname
41
# ovs-appctl ofproto/list-tunnels | wc -l
40
~~~

Version-Release number of selected component (if applicable):

    4.14.17

How reproducible:

    Not Sure

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    POD to POD connectivity issue when PODs are hosted on different nodes

Expected results:

    POD to POD connectivity should work fine

Additional info:

    As per the customer, https://github.com/openshift/ovn-kubernetes/pull/2179 resolves the issue.

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Please review the following PR: https://github.com/openshift/csi-operator/pull/227

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2372

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-34692.

Refactor the name to Dockerfile.ocp as a better alternative to Dockerfile.rhel7, since the contents are actually RHEL 9.

Description of problem:

Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

: [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available expand_less	1h34m30s
{  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)

But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True through this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is actually required.

Version-Release number of selected component (if applicable):

4.15. Possibly all supported versions of the CPMS operator have this exposure.

How reproducible:

Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today; feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact

Expected results:

CPMS goes Available=False if and only if immediate admin intervention is appropriate.
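For reference, a quick way to watch the condition while the control plane nodes roll (a sketch):

oc get clusteroperator control-plane-machine-set \
  -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'
# or simply watch the operator summary during an update
oc get clusteroperator control-plane-machine-set -w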

Description of problem:

    After https://github.com/openshift/cluster-kube-controller-manager-operator/pull/804 was merged, the controller no longer updates the secret type, and as a result the owner label is no longer added. This PR would ensure the secret is created with this label.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38951. The following is the description of the original issue:

  1. The regular expression for matching the audience string is incorrect
  2. STS functionality behaves incorrectly due to convoluted logic (detected by QE)

This is a clone of issue OCPBUGS-43567. The following is the description of the original issue:

Description of problem:

With the newer azure-sdk-for-go replacing go-autorest, there was a change to use ClientCertificateCredential that does not include the `SendCertificateChain` option that used to be set by default. The ARO team requires this to be set, otherwise the 1P integration for SNI will not work.

Old version: https://github.com/Azure/go-autorest/blob/f7ea664c9cff3a5257b6dbc4402acadfd8be79f1/autorest/adal/token.go#L262-L264

New version: https://github.com/openshift/installer-aro/pull/37/files#diff-da950a4ddabbede621d9d3b1058bb34f8931c89179306ee88a0e4d76a4cf0b13R294

    

Version-Release number of selected component (if applicable):

This was introduced in the OpenShift installer PR: https://github.com/openshift/installer/pull/6003    

How reproducible:

Every time we authenticate using SNI in Azure.  

Steps to Reproduce:

    1.  Configure a service principal in the Microsoft tenant using SNI
    2.  Attempt to run the installer using client-certificate credentials to install a cluster with credentials mode in manual
    

Actual results:

Installation fails as we're unable to authenticate using SNI.  
    

Expected results:

We're able to authenticate using SNI.  
    

Additional info:

This should not have any effect on existing non-SNI-based authentication methods using client certificate credentials. It was previously set in autorest for golang, but is not defaulted to in the newer azure-sdk-for-go.


Note that only first party Microsoft services will be able to leverage SNI in Microsoft tenants.  The test case for this on the installer side would be to ensure it doesn't break manual credential mode installs using a certificate pinned to a service principal.  

 

 

All we would need changed is to pass the `SendCertificateChain: true` option only for client certificate credentials. Ideally we could also backport this to all OpenShift versions that received the migration from AAD to Microsoft Graph.

Currently we download and install the RPM on every build of the upi-installer image. This has caused random timeouts. Determine whether it is possible to use:

https://learn.microsoft.com/en-us/powershell/scripting/install/install-other-linux?view=powershell-7.4#binary-archives

and either copy the tar or follow the steps in the initial container.

 

Acceptance Criteria:

  • Image copy method
  • New story to perform the work

Description of problem:

Mirroring sometimes fails for a variety of reasons, and since the mirror fails, the current code does not generate IDMS and ITMS files. Even if the user retries the mirror two or three times, the operators do not get mirrored and no resources are created to utilize the operators that have already been mirrored. This bug is to create the IDMS and ITMS files even if mirroring fails.
    

Version-Release number of selected component (if applicable):

     4.16
    

How reproducible:

     Always
    

Steps to Reproduce:

    1. Install latest oc-mirror
    2. Use the ImageSetConfig.yaml below
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
archiveSize: 4
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    full: false # only mirror the latest versions
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    full: false # only mirror the latest versions
    3. Mirror using the command `oc-mirror -c config.yaml docker://localhost:5000/m2m --dest-skip-verify=false --workspace=file://test`
    

Actual results:

     Mirroring fails and does not generate any idms or itms files
    

Expected results:

     IDMS and ITMS files should be generated for the mirrored operators, even if mirroring fails
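For context, a hedged sketch of where the generated resources would be picked up with the workspace flag used above (the exact directory layout is an assumption):

# With --workspace=file://test, the generated cluster resources are expected
# under the workspace's working-dir, and should also appear for failed mirrors
ls test/working-dir/cluster-resources/
oc apply -f test/working-dir/cluster-resources/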
    

Additional info:


    

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/123

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  • While upgrading from 4.12.55 to 4.13.42 the network operator seems to be in a degraded state due to the ovnkube-master pods ending up in a crashloopbackoff.
     
    The ovnkube-master container appears to hit a context deadline timeout and is not starting. This happens for all 3 ovnkube-master pods.
     
    ovnkube-master-b5dwz   5/6     CrashLoopBackOff   15 (4m49s ago)   75m
    ovnkube-master-dm6g5   5/6     CrashLoopBackOff   15 (3m50s ago)   72m
    ovnkube-master-lzltc         5/6     CrashLoopBackOff   16 (31s ago)     76m    

          
    Relevant logs :

    1 ovnkube.go:369] failed to start network controller manager: failed to start default network controller: failed to sync address sets on controller init: failed to transact address set sync ops: error in transact with ops [{Op:insert Table:Address_Set Row:map[addresses:{GoSet:[172.21.4.58 172.30.113.119 172.30.113.93 172.30.140.204 172.30.184.23 172.30.20.1 172.30.244.26 172.30.250.254 172.30.29.56 172.30.39.131 172.30.54.87 172.30.54.93 172.30.70.9]} external_ids:{GoMap:map[direction:ingress gress-index:0 ip-family:v4 ...]} log:false match:ip4.src == {$a10011776377603330168, $a10015887742824209439, $a10026019104056290237, $a10029515256826812638, $a5952808452902781817, $a10084011578527782670, $a10086197949337628055, $a10093706521660045086, $a10096260576467608457, $a13012332091214445736, $a10111277808835218114, $a10114713358929465663, $a101155018460287381, $a16191032114896727480, $a14025182946114952022, $a10127722282178953052, $a4829957937622968220, $a10131833063630260035, $a3533891684095375041, $a7785003721317615588, $a10594480726457361847, $a10147006001458235329, $a12372228123457253136, $a10016996505620670018, $a10155660392008449200, $a10155926828030234078, $a15442683337083171453, $a9765064908646909484, $a7550609288882429832, $a11548830526886645428, $a10204075722023637394, $a10211228835433076965, $a5867828639604451547, $a10222049254704513272, $a13856077787103972722, $a11903549070727627659,.... (this is a very long list of ACL)

This is a clone of issue OCPBUGS-39339. The following is the description of the original issue:

Description of problem:

    The issue comes from https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25386451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25386451.
An error message is shown when gathering the bootstrap log bundle, although the log bundle gzip file is generated.

ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected.

Version-Release number of selected component (if applicable):

    4.17+

How reproducible:

    Always

Steps to Reproduce:

    1. Run `openshift-install gather bootstrap --dir <install-dir>`
    2.
    3.
    

Actual results:

    Error message shown in output of command `openshift-install gather bootstrap --dir <install-dir>`

Expected results:

    No error message shown there.

Additional info:

Analysis from Rafael, https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25387767&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25387767     

This looks very much like a 'downstream a thing' process, but only making a modification to an existing one.

Currently, the operator-framework-olm monorepo generates a self-hosting catalog from operator-registry.Dockerfile.  This image also contains cross-compiled opm binaries for windows and mac, and joins the payload as ose-operator-registry.

To separate concerns, this introduces a new operator-framework-cli image which will be based on scratch, not self-hosting in any way, and just a container to convey repeatably produced o-f CLIs.  Right now, this will focus on opm for olm v0 only, but others can be added in future.

 

Description of problem:

    NodePool machine instances are failing to join a HostedCluster. The nodepool status reports InvalidConfig with the following condition:

  - lastTransitionTime: "2024-05-02T15:08:58Z"
    message: 'Failed to generate payload: error getting ignition payload: machine-config-server
      configmap is out of date, waiting for update 5c59871d != 48a6b276'
    observedGeneration: 1
    reason: InvalidConfig
    status: ""
    type: ValidGeneratedPayload

Version-Release number of selected component (if applicable):

    4.14.21 (HostedCluster), with HyperShift operator db9d81eb56e35b145bbbd878bbbcf742c9e75be2

How reproducible:

    100%

Steps to Reproduce:

* Create a ROSA HCP cluster
* Waiting for the nodepools to come up
* Add an IdP
* Delete ignition-server pods in the hostedcontrolplane namespace on the management cluster
* Confirm nodepools complain about machine-config-server/token-secret hash mismatch
* Scale down/up by deleting machine.cluster.x-k8s.io resources or otherwise     

Actual results:

    Nodes are not created

Expected results:

    Node is created

Additional info:

    The AWSMachine resources were created along with the corresponding ec2 instances. However, they were never ignited.

Deleting AWSMachine resources, resulted in successful ignition of new nodes.

Note: Please find logs from both HCP namespaces in the comment https://issues.redhat.com/browse/OCPBUGS-33377?focusedId=24690046&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24690046

Description of problem:

Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id> set to 'shared' instead of 'owned'.

Version-Release number of selected component (if applicable):

4.16.z

How reproducible:

Any time a 4.16 cluster is installed

Steps to Reproduce:

    1. Install a fresh 4.16 cluster without providing an existing VPC.

Actual results:

Subnets are tagged with kubernetes.io/cluster/<infra_id>: shared    

Expected results:

Subnets created by the installer are tagged with kubernetes.io/cluster/<infra_id>: owned
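A way to check the current tag value on a given cluster (a sketch; requires the AWS CLI with credentials for the cluster's account):

INFRA_ID=$(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
aws ec2 describe-subnets \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/${INFRA_ID}" \
  --query 'Subnets[].Tags[?starts_with(Key, `kubernetes.io/cluster/`)]' --output json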

Additional info:

Slack discussion here - https://redhat-internal.slack.com/archives/C68TNFWA2/p1720728359424529

Description of problem:

Some events have time-related information set to null (firstTimestamp, lastTimestamp, eventTime).

Version-Release number of selected component (if applicable):

cluster-logging.v5.8.0

How reproducible:

100% 

Steps to Reproduce:

    1. Stop one of the masters
    2. Start the master
    3. Wait until the environment stabilizes
    4. oc get events -A | grep unknown

Actual results:

oc get events -A | grep unknow
default                                      <unknown>   Normal    TerminationStart                             namespace/kube-system                                                            Received signal to terminate, becoming unready, but keeping serving
default                                      <unknown>   Normal    TerminationPreShutdownHooksFinished          namespace/kube-system                                                            All pre-shutdown hooks have been finished
default                                      <unknown>   Normal    TerminationMinimalShutdownDurationFinished   namespace/kube-system                                                            The minimal shutdown duration of 0s finished
....

Expected results:

    All time related information is set correctly

Additional info:

   This causes issues with external monitoring systems. Events with no timestamp will either never show up or will push other events out of the view, depending on the sort order of the timestamp. The operator of the environment then has trouble seeing what is happening there.
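A more targeted way to list the affected events (a sketch using jq):

oc get events -A -o json | jq -r '
  .items[]
  | select(.firstTimestamp == null and .eventTime == null)
  | "\(.metadata.namespace)/\(.metadata.name): \(.reason)"'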

Description of problem:

    During HyperShift operator updates/rollouts, previous ignition-server token and user-data secrets are not properly cleaned up, causing them to be abandoned on the control plane.

Version-Release number of selected component (if applicable):

    4.15.6+

How reproducible:

    100%

Steps to Reproduce:

    1. Deploy hypershift-operator <4.15.6
    2. Create HostedCluster and NodePool
    3. Update hypershift-operator to 4.15.8+
    

Actual results:

    Previous token and user-data secrets are now unmanaged and abandoned

Expected results:

    HyperShift operator to properly clean them up
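For reference, a rough way to spot the leftovers on the management cluster (a sketch; the namespace placeholder and name patterns are assumptions):

HCP_NS=<hosted-control-plane-namespace>   # e.g. clusters-<cluster-name>
oc get secrets -n "$HCP_NS" -o name | grep -E 'token-|user-data-'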

Additional info:

    Introduced by https://github.com/openshift/hypershift/pull/3730

`openshift-tests` doesn't have an easy way to figure out what version it's running; not every subcommand prints it out.

 

This is a clone of issue OCPBUGS-42546. The following is the description of the original issue:

Description of problem:

When a MachineConfig fails to generate, we set Upgradeable=False and degrade pools. The expectation is that the CO would also degrade after some time (normally 30 minutes) since the master pool is degraded, but that doesn't seem to be happening. Based on our initial investigation, the event/degrade is happening but it seems to be getting cleared.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Should be always

Steps to Reproduce:

    1. Apply a wrong config, such as a bad image.config object:
spec:
  registrySources:
    allowedRegistries:
    - test.reg
    blockedRegistries:
    - blocked.reg
    
    2. upgrade the cluster or roll out a new MCO pod
    3. observe that pools are degraded but the CO isn't
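A quick way to compare the two for step 3 (a sketch):

# The pool reports Degraded=True...
oc get mcp master -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
# ...while the machine-config ClusterOperator stays undegraded
oc get clusteroperator machine-config \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'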
    

Actual results:

    

Expected results:

    

Additional info:

    

Currently, several of our projects are using registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-openshift-4.17 (or other versions of images from that family) as part of their build.

The following image has more tooling in it and is more closely aligned with what is used for building shipping images:

registry.ci.openshift.org/openshift/release:rhel-9-release-golang-1.22-openshift-4.17

As an OpenShift developer, I would like to use the same builder images across our team's builds where possible to reduce confusion. Please change all non-UBI builds to use the openshift/release image instead of the ocp/builder image in these repos:
https://github.com/openshift/vertical-pod-autoscaler-operator

https://github.com/openshift/kubernetes-autoscaler (VPA images only)

https://github.com/openshift/cluster-resource-override-admission-operator

https://github.com/openshift/cluster-resource-override-admission

Also update the main branch to match images of any CI builds that are changed:

https://github.com/openshift/release

 

 

When switching from ipForwarding: Global to Restricted, sysctl settings are not adjusted

Switch from:

# oc edit network.operator/cluster
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  annotations:
    networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66
  creationTimestamp: "2023-11-22T12:14:46Z"
  generation: 207
  name: cluster
  resourceVersion: "235152"
  uid: 225d404d-4e26-41bf-8e77-4fc44948f239
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipForwarding: Global
(...)

To:

# oc edit network.operator/cluster
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  annotations:
    networkoperator.openshift.io/ovn-cluster-initiator: 10.19.1.66
  creationTimestamp: "2023-11-22T12:14:46Z"
  generation: 207
  name: cluster
  resourceVersion: "235152"
  uid: 225d404d-4e26-41bf-8e77-4fc44948f239
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipForwarding: Restricted

You'll see that the pods are updated:

# oc get pods -o yaml -n openshift-ovn-kubernetes ovnkube-node-fnl9z | grep sysctl -C10
      fi

      admin_network_policy_enabled_flag=
      if [[ "false" == "true" ]]; then
        admin_network_policy_enabled_flag="--enable-admin-network-policy"
      fi

      # If IP Forwarding mode is global set it in the host here.
      ip_forwarding_flag=
      if [ "Restricted" == "Global" ]; then
        sysctl -w net.ipv4.ip_forward=1
        sysctl -w net.ipv6.conf.all.forwarding=1
      else
        ip_forwarding_flag="--disable-forwarding"
      fi

      NETWORK_NODE_IDENTITY_ENABLE=
      if [[ "true" == "true" ]]; then
        NETWORK_NODE_IDENTITY_ENABLE="
          --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig
          --cert-dir=/etc/ovn/ovnkube-node-certs
          --cert-duration=24h

And that ovnkube correctly takes the settings:

# ps aux | grep disable-for
root       74963  0.3  0.0 8085828 153464 ?      Ssl  Nov22   3:38 /usr/bin/ovnkube --init-ovnkube-controller master1.site1.r450.org --init-node master1.site1.r450.org --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --metrics-enable-config-duration --export-ovs-metrics --disable-snat-multiple-gws --enable-multi-network --enable-multicast --zone master1.site1.r450.org --enable-interconnect --acl-logging-rate-limit 20 --enable-multi-external-gateway=true --disable-forwarding --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
root     2096007  0.0  0.0   3880  2144 pts/0    S+   10:07   0:00 grep --color=auto disable-for

But sysctls are never restricted:

[root@master1 ~]# sysctl -a | grep forward
net.ipv4.conf.0eca9d9e7fd3231.bc_forwarding = 0
net.ipv4.conf.0eca9d9e7fd3231.forwarding = 1
net.ipv4.conf.0eca9d9e7fd3231.mc_forwarding = 0
net.ipv4.conf.21a32cf76c3bcdf.bc_forwarding = 0
net.ipv4.conf.21a32cf76c3bcdf.forwarding = 1
net.ipv4.conf.21a32cf76c3bcdf.mc_forwarding = 0
net.ipv4.conf.22f9bca61beeaba.bc_forwarding = 0
net.ipv4.conf.22f9bca61beeaba.forwarding = 1
net.ipv4.conf.22f9bca61beeaba.mc_forwarding = 0
net.ipv4.conf.2ee438a7201c1f7.bc_forwarding = 0
net.ipv4.conf.2ee438a7201c1f7.forwarding = 1
net.ipv4.conf.2ee438a7201c1f7.mc_forwarding = 0
net.ipv4.conf.3560ce219f7b591.bc_forwarding = 0
net.ipv4.conf.3560ce219f7b591.forwarding = 1
net.ipv4.conf.3560ce219f7b591.mc_forwarding = 0
net.ipv4.conf.507c81eb9944c2e.bc_forwarding = 0
net.ipv4.conf.507c81eb9944c2e.forwarding = 1
net.ipv4.conf.507c81eb9944c2e.mc_forwarding = 0
net.ipv4.conf.6278633ca74482f.bc_forwarding = 0
net.ipv4.conf.6278633ca74482f.forwarding = 1
net.ipv4.conf.6278633ca74482f.mc_forwarding = 0
net.ipv4.conf.68b572ce18f3b82.bc_forwarding = 0
net.ipv4.conf.68b572ce18f3b82.forwarding = 1
net.ipv4.conf.68b572ce18f3b82.mc_forwarding = 0
net.ipv4.conf.7291c80dd47a6f3.bc_forwarding = 0
net.ipv4.conf.7291c80dd47a6f3.forwarding = 1
net.ipv4.conf.7291c80dd47a6f3.mc_forwarding = 0
net.ipv4.conf.76abdac44c6aee7.bc_forwarding = 0
net.ipv4.conf.76abdac44c6aee7.forwarding = 1
net.ipv4.conf.76abdac44c6aee7.mc_forwarding = 0
net.ipv4.conf.7f9abb486611f68.bc_forwarding = 0
net.ipv4.conf.7f9abb486611f68.forwarding = 1
net.ipv4.conf.7f9abb486611f68.mc_forwarding = 0
net.ipv4.conf.8cd86bfb8ea635f.bc_forwarding = 0
net.ipv4.conf.8cd86bfb8ea635f.forwarding = 1
net.ipv4.conf.8cd86bfb8ea635f.mc_forwarding = 0
net.ipv4.conf.8e87bd3f6ddc9f8.bc_forwarding = 0
net.ipv4.conf.8e87bd3f6ddc9f8.forwarding = 1
net.ipv4.conf.8e87bd3f6ddc9f8.mc_forwarding = 0
net.ipv4.conf.91079c8f5c1630f.bc_forwarding = 0
net.ipv4.conf.91079c8f5c1630f.forwarding = 1
net.ipv4.conf.91079c8f5c1630f.mc_forwarding = 0
net.ipv4.conf.92e754a12836f63.bc_forwarding = 0
net.ipv4.conf.92e754a12836f63.forwarding = 1
net.ipv4.conf.92e754a12836f63.mc_forwarding = 0
net.ipv4.conf.a5c01549a6070ab.bc_forwarding = 0
net.ipv4.conf.a5c01549a6070ab.forwarding = 1
net.ipv4.conf.a5c01549a6070ab.mc_forwarding = 0
net.ipv4.conf.a621d1234f0f25a.bc_forwarding = 0
net.ipv4.conf.a621d1234f0f25a.forwarding = 1
net.ipv4.conf.a621d1234f0f25a.mc_forwarding = 0
net.ipv4.conf.all.bc_forwarding = 0
net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.all.mc_forwarding = 0
net.ipv4.conf.br-ex.bc_forwarding = 0
net.ipv4.conf.br-ex.forwarding = 1
net.ipv4.conf.br-ex.mc_forwarding = 0
net.ipv4.conf.br-int.bc_forwarding = 0
net.ipv4.conf.br-int.forwarding = 1
net.ipv4.conf.br-int.mc_forwarding = 0
net.ipv4.conf.c3f3da187245cf6.bc_forwarding = 0
net.ipv4.conf.c3f3da187245cf6.forwarding = 1
net.ipv4.conf.c3f3da187245cf6.mc_forwarding = 0
net.ipv4.conf.c7e518fff8ff973.bc_forwarding = 0
net.ipv4.conf.c7e518fff8ff973.forwarding = 1
net.ipv4.conf.c7e518fff8ff973.mc_forwarding = 0
net.ipv4.conf.d17c6fb6d3dd021.bc_forwarding = 0
net.ipv4.conf.d17c6fb6d3dd021.forwarding = 1
net.ipv4.conf.d17c6fb6d3dd021.mc_forwarding = 0
net.ipv4.conf.default.bc_forwarding = 0
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.default.mc_forwarding = 0
net.ipv4.conf.eno8303.bc_forwarding = 0
net.ipv4.conf.eno8303.forwarding = 1
net.ipv4.conf.eno8303.mc_forwarding = 0
net.ipv4.conf.eno8403.bc_forwarding = 0
net.ipv4.conf.eno8403.forwarding = 1
net.ipv4.conf.eno8403.mc_forwarding = 0
net.ipv4.conf.ens1f0.bc_forwarding = 0
net.ipv4.conf.ens1f0.forwarding = 1
net.ipv4.conf.ens1f0.mc_forwarding = 0
net.ipv4.conf.ens1f0/3516.bc_forwarding = 0
net.ipv4.conf.ens1f0/3516.forwarding = 1
net.ipv4.conf.ens1f0/3516.mc_forwarding = 0
net.ipv4.conf.ens1f0/3517.bc_forwarding = 0
net.ipv4.conf.ens1f0/3517.forwarding = 1
net.ipv4.conf.ens1f0/3517.mc_forwarding = 0
net.ipv4.conf.ens1f0/3518.bc_forwarding = 0
net.ipv4.conf.ens1f0/3518.forwarding = 1
net.ipv4.conf.ens1f0/3518.mc_forwarding = 0
net.ipv4.conf.ens1f1.bc_forwarding = 0
net.ipv4.conf.ens1f1.forwarding = 1
net.ipv4.conf.ens1f1.mc_forwarding = 0
net.ipv4.conf.ens3f0.bc_forwarding = 0
net.ipv4.conf.ens3f0.forwarding = 1
net.ipv4.conf.ens3f0.mc_forwarding = 0
net.ipv4.conf.ens3f1.bc_forwarding = 0
net.ipv4.conf.ens3f1.forwarding = 1
net.ipv4.conf.ens3f1.mc_forwarding = 0
net.ipv4.conf.fcb6e9468a65d70.bc_forwarding = 0
net.ipv4.conf.fcb6e9468a65d70.forwarding = 1
net.ipv4.conf.fcb6e9468a65d70.mc_forwarding = 0
net.ipv4.conf.fcd96084b7f5a9a.bc_forwarding = 0
net.ipv4.conf.fcd96084b7f5a9a.forwarding = 1
net.ipv4.conf.fcd96084b7f5a9a.mc_forwarding = 0
net.ipv4.conf.genev_sys_6081.bc_forwarding = 0
net.ipv4.conf.genev_sys_6081.forwarding = 1
net.ipv4.conf.genev_sys_6081.mc_forwarding = 0
net.ipv4.conf.lo.bc_forwarding = 0
net.ipv4.conf.lo.forwarding = 1
net.ipv4.conf.lo.mc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.bc_forwarding = 0
net.ipv4.conf.ovn-k8s-mp0.forwarding = 1
net.ipv4.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv4.conf.ovs-system.bc_forwarding = 0
net.ipv4.conf.ovs-system.forwarding = 1
net.ipv4.conf.ovs-system.mc_forwarding = 0
net.ipv4.ip_forward = 1
net.ipv4.ip_forward_update_priority = 1
net.ipv4.ip_forward_use_pmtu = 0
net.ipv6.conf.0eca9d9e7fd3231.forwarding = 1
net.ipv6.conf.0eca9d9e7fd3231.mc_forwarding = 0
net.ipv6.conf.21a32cf76c3bcdf.forwarding = 1
net.ipv6.conf.21a32cf76c3bcdf.mc_forwarding = 0
net.ipv6.conf.22f9bca61beeaba.forwarding = 1
net.ipv6.conf.22f9bca61beeaba.mc_forwarding = 0
net.ipv6.conf.2ee438a7201c1f7.forwarding = 1
net.ipv6.conf.2ee438a7201c1f7.mc_forwarding = 0
net.ipv6.conf.3560ce219f7b591.forwarding = 1
net.ipv6.conf.3560ce219f7b591.mc_forwarding = 0
net.ipv6.conf.507c81eb9944c2e.forwarding = 1
net.ipv6.conf.507c81eb9944c2e.mc_forwarding = 0
net.ipv6.conf.6278633ca74482f.forwarding = 1
net.ipv6.conf.6278633ca74482f.mc_forwarding = 0
net.ipv6.conf.68b572ce18f3b82.forwarding = 1
net.ipv6.conf.68b572ce18f3b82.mc_forwarding = 0
net.ipv6.conf.7291c80dd47a6f3.forwarding = 1
net.ipv6.conf.7291c80dd47a6f3.mc_forwarding = 0
net.ipv6.conf.76abdac44c6aee7.forwarding = 1
net.ipv6.conf.76abdac44c6aee7.mc_forwarding = 0
net.ipv6.conf.7f9abb486611f68.forwarding = 1
net.ipv6.conf.7f9abb486611f68.mc_forwarding = 0
net.ipv6.conf.8cd86bfb8ea635f.forwarding = 1
net.ipv6.conf.8cd86bfb8ea635f.mc_forwarding = 0
net.ipv6.conf.8e87bd3f6ddc9f8.forwarding = 1
net.ipv6.conf.8e87bd3f6ddc9f8.mc_forwarding = 0
net.ipv6.conf.91079c8f5c1630f.forwarding = 1
net.ipv6.conf.91079c8f5c1630f.mc_forwarding = 0
net.ipv6.conf.92e754a12836f63.forwarding = 1
net.ipv6.conf.92e754a12836f63.mc_forwarding = 0
net.ipv6.conf.a5c01549a6070ab.forwarding = 1
net.ipv6.conf.a5c01549a6070ab.mc_forwarding = 0
net.ipv6.conf.a621d1234f0f25a.forwarding = 1
net.ipv6.conf.a621d1234f0f25a.mc_forwarding = 0
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.all.mc_forwarding = 0
net.ipv6.conf.br-ex.forwarding = 1
net.ipv6.conf.br-ex.mc_forwarding = 0
net.ipv6.conf.br-int.forwarding = 1
net.ipv6.conf.br-int.mc_forwarding = 0
net.ipv6.conf.c3f3da187245cf6.forwarding = 1
net.ipv6.conf.c3f3da187245cf6.mc_forwarding = 0
net.ipv6.conf.c7e518fff8ff973.forwarding = 1
net.ipv6.conf.c7e518fff8ff973.mc_forwarding = 0
net.ipv6.conf.d17c6fb6d3dd021.forwarding = 1
net.ipv6.conf.d17c6fb6d3dd021.mc_forwarding = 0
net.ipv6.conf.default.forwarding = 1
net.ipv6.conf.default.mc_forwarding = 0
net.ipv6.conf.eno8303.forwarding = 1
net.ipv6.conf.eno8303.mc_forwarding = 0
net.ipv6.conf.eno8403.forwarding = 1
net.ipv6.conf.eno8403.mc_forwarding = 0
net.ipv6.conf.ens1f0.forwarding = 1
net.ipv6.conf.ens1f0.mc_forwarding = 0
net.ipv6.conf.ens1f0/3516.forwarding = 0
net.ipv6.conf.ens1f0/3516.mc_forwarding = 0
net.ipv6.conf.ens1f0/3517.forwarding = 0
net.ipv6.conf.ens1f0/3517.mc_forwarding = 0
net.ipv6.conf.ens1f0/3518.forwarding = 0
net.ipv6.conf.ens1f0/3518.mc_forwarding = 0
net.ipv6.conf.ens1f1.forwarding = 1
net.ipv6.conf.ens1f1.mc_forwarding = 0
net.ipv6.conf.ens3f0.forwarding = 1
net.ipv6.conf.ens3f0.mc_forwarding = 0
net.ipv6.conf.ens3f1.forwarding = 1
net.ipv6.conf.ens3f1.mc_forwarding = 0
net.ipv6.conf.fcb6e9468a65d70.forwarding = 1
net.ipv6.conf.fcb6e9468a65d70.mc_forwarding = 0
net.ipv6.conf.fcd96084b7f5a9a.forwarding = 1
net.ipv6.conf.fcd96084b7f5a9a.mc_forwarding = 0
net.ipv6.conf.genev_sys_6081.forwarding = 1
net.ipv6.conf.genev_sys_6081.mc_forwarding = 0
net.ipv6.conf.lo.forwarding = 1
net.ipv6.conf.lo.mc_forwarding = 0
net.ipv6.conf.ovn-k8s-mp0.forwarding = 1
net.ipv6.conf.ovn-k8s-mp0.mc_forwarding = 0
net.ipv6.conf.ovs-system.forwarding = 1
net.ipv6.conf.ovs-system.mc_forwarding = 0

It's logical that this is happening, because nowhere in the code is there a mechanism to tune the global sysctl back to 0 when the mode is switched from `Global` to `Restricted`. There's also no mechanism to sequentially reboot the nodes so that they'd reboot back to their defaults (= sysctl ip forward off).
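A quick per-node check of the global toggles after switching back to Restricted (a sketch; per the expectation above, both values should return to 0):

for node in $(oc get nodes -o name); do
  echo "== $node"
  oc debug "$node" -- chroot /host sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
done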

Description of problem:

    The KubeVirt passt network binding needs a global namespace to work; using the default namespace does not look like the best option.

    We should be able to deploy it in the openshift-cnv namespace and allow users to read the NADs there so they can use passt.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Create a nad at openshift-cnv namespace
    2. Try to use that nad from non openshift-cnv pods
    3.
    

Actual results:

Pods fail to start

Expected results:

Pod can start and use the nad

Additional info:

    

Description of problem:

    The DeploymentConfigs deprecation info alert is shown on the Edit Deployment form. It should be shown only on DeploymentConfig pages.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a deployment
    2. Open Edit deployment form from the actions menu
    3.
    

Actual results:

    DeploymentConfigs deprecation info alert present on the edit deployment form

Expected results:

    DeploymentConfigs deprecation info alert should not be shown for the Deployment 

Additional info:

    

This is a clone of issue OCPBUGS-29497. The following is the description of the original issue:

While updating an HC with a controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:

multus-admission-controller-5b5c95684b-v5qgd          0/2     Pending   0               4m36s
network-node-identity-7b54d84df4-dxx27                0/3     Pending   0               4m12s
ovnkube-control-plane-647ffb5f4d-hk6fg                0/3     Pending   0               4m21s

This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e. 1).

Thus the old pod blocks scheduling of the new pod.

This is a clone of issue OCPBUGS-43518. The following is the description of the original issue:

Description of problem:

Necessary security group rules are not created when using an installer-created VPC.

Version-Release number of selected component (if applicable):

    4.17.2

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy a Power VS cluster and have the installer create the VPC, or remove required rules from a VPC you're bringing.
    2. Control plane nodes fail to bootstrap.
    3. Fail
    

Actual results:

    Install fails

Expected results:

    Install succeeds

Additional info:

    Fix identified

Description of problem:

The labels added by PAC have been deprecated and moved to PLR annotations. So, use the annotations to get the values on the repository list page, the repository PLRs list page, and the PLR details page.

Description of problem:

    Documentation for User Workload Monitoring implies that default retention time is 15d, when it is actually 24h in practice

Version-Release number of selected component (if applicable):

    4.12/4.13/4.14/4.15

How reproducible:

    100%

Steps to Reproduce:

    1. Install a cluster
    2. enable user workload monitoring
    3. check pod manifest and check for retention time     

Actual results:

    Retention time is 24h

Expected results:

    Retention time is 15d instead of 24h
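For reference, the retention can be set explicitly through the user-workload-monitoring-config ConfigMap (a sketch following the documented format):

cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 15d
EOF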

Additional info:

    

In the agent installer, assisted-service must always use the openshift-baremetal-installer binary (which is dynamically linked) to ensure that if the target cluster is in FIPS mode the installer will be able to run. (This was implemented in MGMT-15150.)

A recent change for OCPBUGS-33227 has switched to using the statically-linked openshift-installer for 4.16 and later. This breaks FIPS on the agent-based installer.

It appears that CI tests for the agent installer (the compact-ipv4 job runs with FIPS enabled) did not detect this, because we are unable to correctly determine the "version" of OpenShift being installed when it is in fact a CI payload.

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/809

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Failed to deploy baremetal cluster as cluster nodes are not introspected
    

Version-Release number of selected component (if applicable):

4.15.15
    

How reproducible:

periodically
    

Steps to Reproduce:

    1. Deploy baremetal dualstack cluster with disabled provisioning network
    2.
    3.
    

Actual results:

Cluster fails to deploy as ironic.service fails to start on the bootstrap node:

[root@api ~]# systemctl status ironic.service
○ ironic.service - Ironic baremetal deployment service
     Loaded: loaded (/etc/containers/systemd/ironic.container; generated)
     Active: inactive (dead)

May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: Dependency failed for Ironic baremetal deployment service.
May 27 08:01:05 api.kni-qe-4.lab.eng.rdu2.redhat.com systemd[1]: ironic.service: Job ironic.service/start failed with result 'dependency'.

    

Expected results:

ironic.service is started, nodes are introspected and cluster is deployed
    

Additional info:


    

Description of problem:

`preserveBootstrapIgnition` was named after the implementation details in terraform for how to make deleting S3 objects optional. The motivation behind the change was that some customers run installs in subscriptions where policies do not allow deleting s3 objects. They didn't want the install to fail because of that.

With the move from terraform to capi/capa, this is now implemented differently: capa always tries to delete the s3 objects but will ignore any permission errors if `preserveBootstrapIgnition` is set.

We should rename this option so it's clear that the objects will be deleted if there are enough permissions. My suggestion is to name it something similar to what's used in CAPA: `allowBestEffortDeleteIgnition`.

Ideally we should deprecate `preserveBootstrapIgnition` in 4.16 and remove it in 4.17.

Version-Release number of selected component (if applicable):

    4.14+ but I don't think we want to change this for terraform-based installs

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    https://github.com/openshift/installer/pull/7288

Description of problem:

    If I use custom CVO capabilities via the install config, I can create a capability set that disables the Ingress capability.
However, once the cluster boots up, the Ingress capability will always be enabled.
This creates a dissonance between the desired install config and what happens.
It would be better to fail the install at install-config validation to prevent that dissonance.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38450. The following is the description of the original issue:

Description of problem:

Day2 add node with the oc binary is not working for ARM64 in baremetal CI runs.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Run a compact agent installation on the arm64 platform
    2. After the cluster is ready, run the day2 install
    3. Day2 install fails with an error: worker-a-00 is not reachable

Actual results:

    Day2 install exits with an error.

Expected results:

    Day2 install should work.

Additional info:

Job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/54181/rehearse-54181-periodic-ci-openshift-openshift-tests-private-release-4.17-arm64-nightly-baremetal-compact-abi-ipv4-static-day2-f7/1823641309190033408

Error message from console when running day2 install:
rsync: [sender] link_stat "/assets/node.x86_64.iso" failed: No such file or directory (2) command terminated with exit code 23 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1823) [Receiver=3.2.3] rsync: [Receiver] write error: Broken pipe (32) error: exit status 23 {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-08-13T14:32:20Z"} error: failed to execute wrapped command: exit status 1    

Description of problem:

Compared to other COs, the MCO seems to be doing a lot more direct API calls to CO objects: https://gist.github.com/deads2k/227479c81e9a57af6c018711548e4600

Most of these are GETs but we are also doing a lot of UPDATE calls, neither of which should be all that necessary for us. The MCO pod seems to be doing a lot of direct GETs and no-op UPDATEs in older code, which we should clean up and bring down the count.

Some more context in the slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1712953161264199

Version-Release number of selected component (if applicable):

    All

How reproducible:

    Very

Steps to Reproduce:

    1. look at e2e test bundles under /artifacts/junit/just-users-audit-log-summary__xxxxxx.json
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Now that PowerVS uses the upi-installer image, it is encountering the following error:

mkdir: cannot create directory '/output/.ssh': Permission denied
cp: cannot create regular file '/output/.ssh/': Not a directory
    

Version-Release number of selected component (if applicable):

4.17.0
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Look at CI run
    

This is a clone of issue OCPBUGS-44022. The following is the description of the original issue:

Description of problem:

We should decrease the verbosity level for the IBM CAPI module.  This will affect the output of the file .openshift_install.log
    

User Story:

As a dev, I want to be able to:

  • manage the lifecycle of my API deps easily

so that I can achieve

  • best automation and workflow trust

Acceptance Criteria:

Description of criteria:
We initially mirrored some APIs in the third-party folder.
The MCO API was moved to openshift/api, so we can just consume it via vendoring with no need for ad hoc hacks.

HardwareDetails is a pointer and we fail to check whether it's nil. The installer panics when attempting to gather logs from the masters.

We need to update the CRI-O test for workload partitioning to give more useful information; currently it's hard to tell which container or pod has a CPU affinity mismatch.

More info on change: https://github.com/openshift/origin/pull/28852

Description of problem:

The "Auth Token GCP" filter in OperatorHub is displayed all the time, but instead it should be rendered only for GCP clusters that have Manual credential mode. When a GCP WIF-capable operator is installed and the cluster is in GCP WIF mode, the Console should require the user to enter the necessary information about the GCP project, account, service account, etc., which is in turn injected into the operator's deployment via subscription.config (exactly how Azure and AWS STS got implemented in the Console).

Version-Release number of selected component (if applicable):

4.15

How reproducible:

    

Steps to Reproduce:

    1. On a non-GCP cluster, navigate to OperatorHub
    2. check available filters
    3.
    

Actual results:

    "Auth Token GCP" filter is available in OperatorHub

Expected results:

    "Auth Token GCP" filter should not be available in OperatorHub for a non-GCP cluster. 
    When selecting an operator that supports "Auth token GCP" as indicated by the annotation features.operators.openshift.io/token-auth-gcp: "true" the console needs to, aligned with how it works AWS/Azure auth capable operators, force the user to input the required information to auth against GCP via WIF in the form of env vars that are set up using subscription.config on the operator. The exact names need to come out of https://issues.redhat.com/browse/CCO-574

Additional info:

Azure PR - https://github.com/openshift/console/pull/13082
AWS PR - https://github.com/openshift/console/pull/12778

UI Screen Design can be taken from the existing implementation of the Console support short-lived token setup flow for AWS and Azure described here: https://docs.google.com/document/d/1iFNpyycby_rOY1wUew-yl3uPWlE00krTgr9XHDZOTNo/edit
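
For reference, the subscription.config mechanism described above can be sketched roughly as follows; the GCP/WIF environment variable names are placeholders only, since the real names are still to come out of CCO-574:

# Illustrative only: the generic Subscription.spec.config.env mechanism that the
# Console already uses for AWS/Azure token auth. Variable names are placeholders.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: PROJECT_ID              # placeholder name
      value: "<gcp-project-id>"
    - name: SERVICE_ACCOUNT_EMAIL   # placeholder name
      value: "<gcp-service-account>"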

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/71

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

With the changes in 4.17 that add authentication to the assisted-service API, users need an additional step to retrieve data via this API. This will make it more difficult to request the data in customer cases. It would be useful to set the assisted-service log level to debug in order to capture additional logging in agent-gathers and remove the need to request data from the API.

Description of problem:

    HostedCluster fails to update from 4.14.9 to 4.14.24. This was attempted using the HCP KubeVirt platform, but could impact other platforms as well. 

Version-Release number of selected component (if applicable):

4.14.9    

How reproducible:

    100%

Steps to Reproduce:

    1.Create an HCP KubeVirt cluster with 4.14.9 and wait for it to reach Completed
    2.Update the HostedCluster's release image to 4.14.24
 
    

Actual results:

    HostedCluster is stuck in a partial update state indefinitely with this condition:

    - lastTransitionTime: "2024-05-14T17:37:16Z"
      message: 'Working towards 4.14.24: 478 of 599 done (79% complete), waiting on
        csi-snapshot-controller, image-registry, storage'
      observedGeneration: 4
      reason: ClusterOperatorsUpdating
      status: "True"
      type: ClusterVersionProgressing

Expected results:

    HostedCluster updates successfully. 

Additional info:

    Updating from 4.14.24 to 4.14.25 worked in this environment. We noted that 4.14.9 -> 4.14.24 did not work. This was reproduced in multiple environments.

This was also observed using both MCE 2.4 and MCE 2.5 across 4.14 and 4.15 infra clusters

 

User Story:

As a HyperShift user, I want to be able to:

  • have the multi-arch mgmt/nodepool CPU check run regardless of the multi-arch flag
  • enhance the warning message to let the user know what they need to do to fix the issue

so that I can achieve

  • consistent validation
  • better UX

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/189

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-43625. The following is the description of the original issue:

Component Readiness has found a potential regression in the following test:

install should succeed: infrastructure

installer fails with:

time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded" 

Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=azure&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Installer%20%2F%20openshift-installer&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-10-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-14%2000%3A00%3A00&testId=cluster%20install%3A3e14279ba2c202608dd9a041e5023c4c&testName=install%20should%20succeed%3A%20infrastructure

Description of problem:

    The configured HTTP proxy in a HostedCluster is not used when generating the user data for worker instances.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    always

Steps to Reproduce:

    1. Create a public hosted cluster that has access to the outside only via proxy
    2. Wait for machines to ignite
    

Actual results:

    1. Machines do not ignite/join as nodes

Expected results:

    Machines join as nodes

Additional info:

    The proxy resource that is used to generate the user data snippet is empty.

Description of problem:

The option "Auto deploy when new image is available" becomes unchecked when editing a deployment from web console

Version-Release number of selected component (if applicable):

4.15.17

How reproducible:

100%

Steps to Reproduce:

1. Go to Workloads --> Deployments --> Edit Deployment --> under the Images section, tick the option "Auto deploy when new Image is available" and save the deployment.
2. Edit the deployment again and observe that the option "Auto deploy when new Image is available" is unchecked.
3. The same test works fine in a 4.14 cluster.
    

Actual results:

Option "Auto deploy when new Image is available" is in unchecked state.

Expected results:

Option "Auto deploy when new Image is available" remains in checked state.

Additional info:

    

Since in CI we use the --no-index option to emulate a disconnected pip environment and reproduce the downstream build conditions, it is not possible to test normal dependencies from source, as pip won't be able to retrieve them from any remote source.
To work around that, we first download all the packages without the --no-index option but with the --no-deps option, which prevents downloading dependencies.
This forces pip to download only the packages specified in the requirements file, ensuring total control over the main libraries and their dependencies and allowing us to be as granular as needed, easily switching between RPMs and source packages for testing or even for downstream builds.
When we install them afterwards, if any dependency is missing, the installation fails in CI, allowing us to correct the dependency list directly in the change PR.

Downloading the libraries first and then installing them with the same option allows more flexibility and an almost 1-to-1 copy of the downstream build environment that Cachito uses.
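
A sketch of the two-step flow described above (paths are illustrative):

# 1) Download only the pinned packages, without resolving dependencies.
pip download --no-deps -r requirements.txt -d ./downloads
# 2) Install them offline; any dependency missing from requirements.txt makes
#    this step fail in CI, flagging the change PR.
pip install --no-index --find-links ./downloads -r requirements.txt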

Description of the problem:

Assisted-Service logs pointer addresses instead of actual values during cluster registration.

How reproducible:

100%

Steps to reproduce:

1. Register a cluster and look at the logs

Actual results:

 Apr 17 10:48:09 master service[2732]: time="2024-04-17T10:48:09Z" level=info msg="Register cluster: agent-sno with id 026efda3-fd2c-40d3-a65f-8a22acd6267a and params &{AdditionalNtpSource:<nil> APIVips:[] BaseDNSDomain:abi-ci.com ClusterNetworkCidr:<nil> ClusterNetworkHostPrefix:0 ClusterNetworks:[0xc00111cc00] CPUArchitecture:s390x DiskEncryption:<nil> HighAvailabilityMode:0xc0011ce5a0 HTTPProxy:<nil> HTTPSProxy:<nil> Hyperthreading:<nil> IgnitionEndpoint:<nil> IngressVips:[] MachineNetworks:[0xc001380340] Name:0xc0011ce5b0 NetworkType:0xc0011ce5c0 NoProxy:<nil> OcpReleaseImage: OlmOperators:[] OpenshiftVersion:0xc0011ce5d0 Platform:0xc0009659e0 PullSecret:0xc0011ce5f0 SchedulableMasters:0xc0010cc710 ServiceNetworkCidr:<nil> ServiceNetworks:[0xc001380380]
...

Expected results:

All values should be shown (without the secrets).
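
The symptom can be reproduced with a tiny standalone example (illustrative only, not the assisted-service code): formatting a struct whose fields are pointers prints their addresses unless they are dereferenced first.

package main

import "fmt"

type params struct {
	Name        *string
	NetworkType *string
}

// strOrNil dereferences a *string safely for logging.
func strOrNil(p *string) string {
	if p == nil {
		return "<nil>"
	}
	return *p
}

func main() {
	name, nt := "agent-sno", "OVNKubernetes"
	p := params{Name: &name, NetworkType: &nt}

	fmt.Printf("raw:   %+v\n", p) // prints pointer addresses like 0xc0000102f0
	fmt.Printf("value: Name:%s NetworkType:%s\n",
		strOrNil(p.Name), strOrNil(p.NetworkType)) // prints the actual values
}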

Description of problem:

Setting capabilities as below in install-config:
--------
capabilities:
  baselineCapabilitySet: v4.14
  additionalEnabledCapabilities:
    - CloudCredential 

Continue to create manifests; the installer should exit with an error message saying "the marketplace capability requires the OperatorLifecycleManager capability", as is done in https://github.com/openshift/installer/pull/7495/.
In that PR, it seems the check is only performed when baselineCapabilitySet is set to None.

When baselineCapabilitySet is set to v4.x, it also includes the "marketplace" capability, so the same pre-check is needed.
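
A rough sketch of the requested pre-check (not the installer's actual implementation): once the baseline set and additionalEnabledCapabilities are expanded, reject configurations that enable marketplace without OperatorLifecycleManager.

package main

import "fmt"

// validateCapabilities rejects configurations where marketplace is enabled but
// OperatorLifecycleManager is not.
func validateCapabilities(enabled map[string]bool) error {
	if enabled["marketplace"] && !enabled["OperatorLifecycleManager"] {
		return fmt.Errorf("the marketplace capability requires the OperatorLifecycleManager capability")
	}
	return nil
}

func main() {
	// What "baselineCapabilitySet: v4.14" expands to is assumed here for
	// illustration; the real set comes from the installer/API definitions.
	enabled := map[string]bool{"marketplace": true, "CloudCredential": true}
	fmt.Println(validateCapabilities(enabled)) // expected: error mentioning OperatorLifecycleManager
}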

Version-Release number of selected component (if applicable):

    4.15/4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Prepare install-config and set baselineCapabilitySet to v4.x (x<15)
    2. Create manifests
    3.
    

Actual results:

    Manifests are created successfully.

Expected results:

    Installer exited with error message that something like "the marketplace capability requires the OperatorLifecycleManager capability"

Additional info:

    

The goal is to collect metrics about the AdminNetworkPolicy and BaselineAdminNetworkPolicy CRDs, because it is essential to understand how users are using this feature, and in fact whether they are using it at all. This is required for the 4.16 feature https://issues.redhat.com/browse/SDN-4157, and we hope to get approval and PRs merged before the 4.16 code freeze (April 26th 2024).

admin_network_policy_total

admin_network_policy_total represents the total number of admin network policies in the cluster

Labels: None

See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information

Cardinality of the metric is at most 1.

baseline_admin_network_policy_total

baseline_admin_network_policy_total represents the total number of baseline admin network policies in the cluster (0 or 1)

Labels: None

See https://github.com/ovn-org/ovn-kubernetes/pull/4239 for more information

Cardinality of the metric is at most 1.

We don't need the above two anymore because we have https://redhat-internal.slack.com/archives/C0VMT03S5/p1712567951869459?thread_ts=1712346681.157809&cid=C0VMT03S5 

Instead of that, we are adding two other metrics for rule counts (https://issues.redhat.com/browse/MON-3828):

admin_network_policy_db_objects_total

admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by AdminNetworkPolicy controller in the cluster

Labels:

  • table_name, possible values are "ACL" and "AddressSet" (In future "Port_Group")

See https://github.com/ovn-org/ovn-kubernetes/pull/4254  for more information

Cardinality of the metric is at most 3.

baseline_admin_network_policy_db_objects_total

baseline_admin_network_policy_db_objects_total represents the total number of OVN NBDB objects (table_name) owned by BaselineAdminNetworkPolicy controller in the cluster

Labels:

  • table_name, possible values are "ACL" and "AddressSet" (In future "Port_Group")

See https://github.com/ovn-org/ovn-kubernetes/pull/4254 for more information

Cardinality of the metric is at most 3.

In a cluster with an external OIDC environment, we need to replace the global refresh sync lock in the OIDC provider with a per-refresh-token one. The work should replace the sync lock that applies to all HTTP-serving goroutines with a lock that is specific to each refresh token.
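
A minimal sketch of the per-refresh-token lock (illustrative only; the real change lives in the OIDC provider's request handling and would also need eviction of stale entries):

package main

import "sync"

// tokenLocks hands out one mutex per refresh token instead of a single
// process-wide lock, so refreshes of different tokens no longer serialize
// behind each other.
type tokenLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newTokenLocks() *tokenLocks {
	return &tokenLocks{locks: map[string]*sync.Mutex{}}
}

func (t *tokenLocks) lockFor(refreshToken string) *sync.Mutex {
	t.mu.Lock()
	defer t.mu.Unlock()
	l, ok := t.locks[refreshToken]
	if !ok {
		l = &sync.Mutex{}
		t.locks[refreshToken] = l
	}
	return l
}

func main() {
	locks := newTokenLocks()
	l := locks.lockFor("token-A") // only refreshes of token-A contend on this lock
	l.Lock()
	defer l.Unlock()
	// ... perform the refresh for token-A ...
}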

Description of problem:

 

Version-Release number of selected component (if applicable):

 

Steps to Reproduce:

 

Actual results:

    

Expected results:

    That reduces token refresh request handling time by about 30%.

Additional info:

 

This is a clone of issue OCPBUGS-41631. The following is the description of the original issue:

Description of problem:

Panic seen in below CI job when run the below command

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

Panic observed:

E0910 09:00:04.283647       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 268 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x36c8b40?, 0x5660c90?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308})
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3] 

 

Version-Release number of selected component (if applicable):

    

How reproducible:

Seen in this CI run -https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic

Steps to Reproduce:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'

Actual results:

    

Expected results:

 No panic should be observed.

Additional info:

    

Description of the problem:

Soft timeout, 24h installation: we slow down the network (external link to 40 Mbps), and the installation takes many hours.

After 10 hours, event log messages were not sent.
In the example here we see:

5/29/2024, 11:05:05 AM    
warning Cluster 97709caf-5081-43e7-b5cc-80873ab1442d: finalizing stage Waiting for cluster operators has been active more than the expected completion time (600 minutes)
5/29/2024, 11:03:02 AM    The following operators are experiencing issues: insights, kube-controller-manager
5/29/2024, 1:10:02 AM    Operator console status: available message: All is well
5/29/2024, 1:09:03 AM    Operator console status: progressing message: SyncLoopRefreshProgressing: Working toward version 4.15.14, 1 replicas available
5/29/2024, 1:04:03 AM    Operator console status: failed message: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.test-infra-cluster-356d2e39.redhat.com returns '503 Service Unavailable'

 
 
Based on the event logs, the message below was triggered after 10 hours, and two minutes later we got the timeout.

5/29/2024, 11:03:02 AM The following operators are experiencing issues: insights, kube-controller-manager

It looks like it would be better to allow this messaging earlier; maybe we can tune it so that the "The following operators..." message appears after 1 or 2 hours.

Waiting 10 hours for this info won't help the customer; they may stop the installation, or it may hide real bugs.


test-infra-cluster-356d2e39_97709caf-5081-43e7-b5cc-80873ab1442d.tar

Description of problem:

This is a port of https://issues.redhat.com/browse/OCPBUGS-38470 to 4.17.
    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/187

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem

Seen in a 4.16.1 CI run:

: [bz-Etcd] clusteroperator/etcd should not change condition/Available expand_less	1h28m39s
{  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
Jun 27 14:17:18.966 - 75s   E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy

But further digging turned up no sign that quorum had had any difficulties. It seems like the difficulty was the GetMemberHealth structure, which currently allows timelines like:

  • T0, start probing all known members in GetMemberHealth
  • Tsmall, MemberA Healthy:true Took:41.614949ms Error:<nil>
  • Talmost-30s, MemberB Healthy:false Took:29.869420582s Error:health check failed: context deadline exceeded
  • T30s, DefaultClientTimeout runs out.
  • T30s, MemberC Healthy:false Took:27.199µs Error:health check failed: context deadline exceeded
  • TB, next probe round rolls around, start probing all known members in GetMemberHealth.
  • TBsmall, MemberA Healthy:true Took:...ms Error:<nil>
  • TB+30s, MemberB Healthy:false Took:29....s Error:health check failed: context deadline exceeded
  • TB+30s, DefaultClientTimeout runs out.
  • TB+30s, MemberC Healthy:false Took:...µs Error:health check failed: context deadline exceeded

That can leave 30+s gaps of nominal Healthy:false for MemberC when in fact MemberC was completely fine.
I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth has a 30s timeout per member, while many (all?) of its callers have a 30s DefaultClientTimeout. Which means by the time we get to MemberC, we've already spend our Context and we're starved of time to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep probing in the event of failures while you wait for the slowest member-probe to get back to you, because I suspect a re-probe of MemberC (or even a single probe that was granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.

Exposure is manageable, because this is self-healing, and quorum is actually ok. But still worth fixing because it spooks admins (and the origin CI test suite) if you tell them you're Available=False, and we want to save that for situations where the component is actually having trouble like quorum loss, and not burn signal-to-noise by claiming EtcdMembers_NoQuorum when it's really BriefIssuesScrapingMemberAHealthAndWeWillllTryAgainSoon.

Version-Release number of selected component

Seen in 4.16.1, but the code is old, so likely a longstanding issue.

How reproducible

Luckily for customers, but unluckily for QE, network or whatever hiccups when connecting to members seem rare, so we don't trip the condition that exposes this issue often.

Steps to Reproduce

1. Figure out which order etcd is probing members in.
2. Stop the first or second member, in a way that makes its health probes time out ~30s.
3. Monitor the etcd ClusterOperator Available condition.

Actual results

Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.

Expected results

Available stays True, as the etcd operator takes the full 30s to check on all members and sees that two of them are completely happy.

Description of problem:

Checking the vsphere-problem-detector-operator log in 4.17.0-0.nightly-2024-07-28-191830, it threw the error message as below:

W0729 01:36:04.693729       1 reflector.go:547] k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232: failed to list *v1.ClusterCSIDriver: clustercsidrivers.operator.openshift.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:vsphere-problem-detector-operator" cannot list resource "clustercsidrivers" in API group "operator.openshift.io" at the cluster scope
E0729 01:36:04.693816       1 reflector.go:150] k8s.io/client-go@v0.30.2/tools/cache/reflector.go:232: Failed to watch *v1.ClusterCSIDriver: failed to list *v1.ClusterCSIDriver: clustercsidrivers.operator.openshift.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:vsphere-problem-detector-operator" cannot list resource "clustercsidrivers" in API group "operator.openshift.io" at the cluster scope
And vsphere-problem-detector-operator continue restarting:
vsphere-problem-detector-operator-76d6885898-vsww4   1/1     Running   34 (11m ago)    7h18m
It might be caused by https://github.com/openshift/vsphere-problem-detector/pull/166 and we have not added the clusterrole in openshift/cluster-storage-operator repo yet. 

Version-Release number of selected component (if applicable):

  4.17.0-0.nightly-2024-07-28-191830    

How reproducible:

  Always    

Steps to Reproduce:

   See Description

Actual results:

   vsphere-problem-detector-operator report permission lack and restart 

Expected results:

   vsphere-problem-detector-operator should not report permission lack and restart 

Additional info:

    

This is a clone of issue OCPBUGS-38573. The following is the description of the original issue:

Description of problem:

While working on the readiness probes we have discovered that the single member health check always allocates a new client. 

Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check.

This should reduce CEO's and etcd CPU consumption.
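
A minimal sketch of that idea, assuming a shared clientv3 client (and ignoring the serialization a real implementation would need so concurrent callers do not race on the endpoint list):

package ceosketch

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkSingleMember reuses the pooled, already-connected client: it temporarily
// narrows the endpoint list to one member, asks for its status, and restores
// the original endpoints afterwards.
func checkSingleMember(ctx context.Context, cli *clientv3.Client, memberURL string) error {
	original := cli.Endpoints()
	defer cli.SetEndpoints(original...) // always restore the pooled endpoints

	cli.SetEndpoints(memberURL)
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	_, err := cli.Status(ctx, memberURL)
	return err // nil means the member answered the status call
}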

Version-Release number of selected component (if applicable):

any supported version    

How reproducible:

always, but technical detail

Steps to Reproduce:

 na    

Actual results:

CEO creates a new etcd client when it is checking a single member health

Expected results:

CEO should use the existing pooled client to check for single member health    

Additional info:

    

This is a clone of issue OCPBUGS-43925. The following is the description of the original issue:

Description of problem:

The BuildConfig form breaks on manually entering the Git URL after selecting the source type as Git.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to Create BuildConfig form page
    2. Select source type as Git
    3. Enter the Git URL by typing it manually; do not paste it or select it from the suggestions
    

Actual results:

Console breaks    

Expected results:

   The Console should not break and the user should be able to create the BuildConfig.

Additional info:

    

This is a clone of issue OCPBUGS-42880. The following is the description of the original issue:

Description of problem

When the cluster version operator has already accepted an update to 4.(y+1).z, it should accept retargets to 4.(y+1).z' even if ClusterVersion has Upgradeable=False (unless there are overrides, those are explicitly supposed to block patch updates). It currently blocks these retargets, which can make it hard for a cluster admin to say "hey, this update is stuck on a bug in 4.(y+1).z, and I want to retarget to 4.(y+1).z' to pick up the fix for that bug so the update can complete".

Spun out from Evgeni Vakhonin 's testing of OTA-861.

Version-Release number of selected component

Reproduced in a 4.15.35 CVO.

How reproducible

Reproduced in my first try, but I have not made additional attempts.

Steps to Reproduce

1. Install 4.y, e.g. with Cluster Bot launch 4.14.38 aws.
2. Request an update to a 4.(y+1).z:
a. oc adm upgrade channel candidate-4.15
b. oc adm upgrade --to 4.15.35
3. Wait until the update has been accepted...

$ oc adm upgrade | head -n1
info: An upgrade is in progress. Working towards 4.15.35: 10 of 873 done (1% complete

4. Inject an Upgradeable=False situation for testing:

$ oc -n openshift-config-managed patch configmap admin-gates --type json -p '[ {"op": "add", "path": "/data/ack-4.14-kube-1.29-api-removals-in-4.16", value: "testing"}]'

And after a minute or two, the CVO has noticed and set Upgradeable=False:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver

Upgradeable=False

  Reason: AdminAckRequired
  Message: testing

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16, fast-4.15, fast-4.16)

Recommended updates:

  VERSION     IMAGE
  4.15.36     quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46

5. Request a patch-bumping retarget to 4.(y+1).z':

$ oc adm upgrade --allow-upgrade-with-warnings --to 4.15.36
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:

  Reason: ClusterOperatorsUpdating
  Message: Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver
Requested update to 4.15.36

6. Check the status of the retarget request: oc adm upgrade

Actual results

The retarget was rejected:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.35: 109 of 873 done (12% complete), waiting on etcd, kube-apiserver

Upgradeable=False

  Reason: AdminAckRequired
  Message: testing

ReleaseAccepted=False

  Reason: PreconditionChecks
  Message: Preconditions failed for payload loaded version="4.15.36" image="quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46": Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": testing

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16, fast-4.15, fast-4.16)

Recommended updates:

  VERSION     IMAGE
  4.15.36     quay.io/openshift-release-dev/ocp-release@sha256:a8579cdecf1d45d33b5e88d6e1922df3037d05b09bcff7f08556b75898ab2f46

Expected results

The retarget should have been accepted, because 4.15.35 was already accepted, and 4.15.35 to 4.15.36 is a patch bump, which Upgradeable=False is not supposed to block.

Additional info

This GetCurrentVersion is looking in history for the most recent Completed entry. But for the Upgradeable precondition, we want to be looking in status.desired for the currently accepted entry, regardless of whether we've completed reconciling it or not.
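
In sketch form (field names follow the ClusterVersion API; this is not the CVO's actual code), the difference is between these two lookups:

package cvosketch

import configv1 "github.com/openshift/api/config/v1"

// acceptedVersion is what the Upgradeable precondition arguably should use:
// the currently accepted target, whether or not reconciliation has finished.
func acceptedVersion(cv *configv1.ClusterVersion) string {
	return cv.Status.Desired.Version
}

// lastCompletedVersion mirrors the GetCurrentVersion behaviour described above:
// the most recent Completed entry in the update history.
func lastCompletedVersion(cv *configv1.ClusterVersion) string {
	for _, h := range cv.Status.History {
		if h.State == configv1.CompletedUpdate {
			return h.Version
		}
	}
	return ""
}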

This solves two problems which were introduced when we moved our API to a separate submodule:

1. Duplicate API files: currently all our API is vendored in the main module.
2. API imports in our code use the vendored version, which leads to poor UX because local changes to the API are not reflected in the code until the vendor dir is updated. Also, the IDE "go to" feature will send you to the vendor folder, where you can't make any updates/changes.

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/370

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The documentation files in the GitHub repository of the Installer do not mention apiVIPs and ingressVIPs, mentioning instead the deprecated fields apiVIP and ingressVIP.

Note that the information contained in the customer-facing documentation is correct: https://docs.openshift.com/container-platform/4.16/installing/installing_openstack/installing-openstack-installer-custom.html
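
For reference, a minimal install-config excerpt using the current plural fields might look like the following; the values are illustrative, and equivalent fields exist for the bare-metal platform:

# Illustrative install-config snippet (OpenStack platform shown).
platform:
  openstack:
    cloud: mycloud
    apiVIPs:
    - 192.0.2.10
    ingressVIPs:
    - 192.0.2.11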

Hello Team,

 

After the hard reboot of all nodes due to a power outage, failure to pull the NTO image prevented "ocp-tuned-one-shot.service" from starting, resulting in a dependency failure for the kubelet and crio services:

------------

journalctl_--no-pager

Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0

-----------

-----------

$ oc get proxy config cluster  -oyaml
  status:
    httpProxy: http://proxy_ip:8080
    httpsProxy: http://proxy_ip:8080

$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080

-----------

-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
     Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
   Main PID: 3661 (code=exited, status=125)

Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------

  • The customer has a proxy configured in their environment. However, nodes cannot start after a hard reboot of all nodes, as it appears that NTO ignores the cluster-wide proxy settings. To resolve the NTO image pull issue, the customer had to manually include the proxy variables in /etc/systemd/system.conf.
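
As an illustration only (not a validated fix), the same effect as editing /etc/systemd/system.conf could be achieved with a systemd drop-in for the failing unit that reuses the proxy file the MCO already writes:

# Hypothetical workaround sketch: pass the cluster proxy variables to the unit
# via a drop-in instead of changing the global systemd environment.
mkdir -p /etc/systemd/system/ocp-tuned-one-shot.service.d
cat << 'EOF' > /etc/systemd/system/ocp-tuned-one-shot.service.d/99-proxy.conf
[Service]
EnvironmentFile=-/etc/mco/proxy.env
EOF
systemctl daemon-reload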

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/43

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This comes from this bug https://issues.redhat.com/browse/OCPBUGS-29940

After applying the workaround suggested in [1][2] with "oc adm must-gather --node-name", we found another issue: must-gather creates a debug pod on all master nodes and gets stuck for a while because of the loop in the gather_network_logs_basics script. Filtering out the NotReady nodes would allow us to apply the workaround.

The script gather_network_logs_basics gets the master nodes by label (node-role.kubernetes.io/master) and saves them in the CLUSTER_NODES variable. It then passes this as a parameter to the function gather_multus_logs $CLUSTER_NODES, where it loops through the list of master nodes and performs debugging for each node.

collection-scripts/gather_network_logs_basics
...
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
...
collection-scripts/gather_multus_logs
...
function gather_multus_logs {
  for NODE in "$@"; do
    nodefilename=$(echo "$NODE" | sed -e 's|node/||')
    out=$(oc debug "${NODE}" -- \
    /bin/bash -c "cat $INPUT_LOG_PATH" 2>/dev/null) && echo "$out" 1> "${OUTPUT_LOG_PATH}/multus-log-$nodefilename.log"
  done
}

This could be resolved with something similar to this:

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")).metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES

[1] - https://access.redhat.com/solutions/6962230
[2] - https://issues.redhat.com/browse/OCPBUGS-29940

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/283

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Reviewing https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=operator-conditions&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&environment=ovn%20no-upgrade%20amd64%20azure%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=azure&platform=azure&sampleEndTime=2024-06-05%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2024-05-30%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set&upgrade=no-upgrade&upgrade=no-upgrade&variant=standard&variant=standard, it appears that the Azure tests are failing frequently with "Told to stop trying". Check failed before until passed.

Reviewing this, it appears that the rollout happened as expected, but the until function got a non-retryable error and exited, while the check saw that the Deletion timestamp was set and the Machine went into Running, which caused it to fail.

We should investigate why the until failed in this case as it should have seen the same machines and therefore should have seen a Running machine and passed.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

    Pod stuck in creating state when running performance benchmark

The exact error when describing the pod -
Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  45s (x114 over 3h47m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_client-1-5c978b7665-n4tds_cluster-density-v2-35_f57d8281-5a79-4c91-9b83-bb3e4b553597_0(5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564): error adding pod cluster-density-v2-35_client-1-5c978b7665-n4tds to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&\{ContainerID:5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 Netns:/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597 Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564" 
Netns:"/var/run/netns/e06c9af7-c13d-426f-9a00-73c54441a20b" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=cluster-density-v2-35;K8S_POD_NAME=client-1-5c978b7665-n4tds;K8S_POD_INFRA_CONTAINER_ID=5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564;K8S_POD_UID=f57d8281-5a79-4c91-9b83-bb3e4b553597" Path:"" ERRORED: error configuring pod [cluster-density-v2-35/client-1-5c978b7665-n4tds] networking: [cluster-density-v2-35/client-1-5c978b7665-n4tds/f57d8281-5a79-4c91-9b83-bb3e4b553597:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] [cluster-density-v2-35/client-1-5c978b7665-n4tds 5a8d6897ca792d91f1c52054f5f8c596530fbf72d3abb07b19a20fd9c95cc564 network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
'
': StdinData: \{"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}

Version-Release number of selected component (if applicable):

    4.16.0-ec.5

How reproducible:

    50-60%

It seems to be related to the number of times I have run our test on a single cluster. Many of our performance tests are on ephemeral clusters - so we build the cluster, run the test, tear down. Currently I have a long-lived cluster (1 week old), and I have been running many performance tests against this cluster -- serially. After each test, the previous resources are cleaned up.

Steps to Reproduce:

    1. Use the following cmdline as an example.
    2. ./bin/amd64/kube-burner-ocp cluster-density-v2 --iterations 90
    3. Repeat until the issue arises (usually after 3-4 attempts).

Actual results:
    client-1-5c978b7665-n4tds    0/1     ContainerCreating   0          4h14m

Expected results:

    For the benchmark not to get stuck waiting for this pod.

Additional info:
    Looking at the ovnkube-controller pod logs, grepping for the pod which was stuck

oc logs -n openshift-ovn-kubernetes ovnkube-node-qpkws -c ovnkube-controller | grep client-1-5c978b7665-n4tds

W0425 13:12:09.302395    6996 base_network_controller_policy.go:545] Failed to get get LSP for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default for networkPolicy allow-from-openshift-ingress, err: logical port cluster-density-v2-35/client-1-5c978b7665-n4tds for pod cluster-density-v2-35_client-1-5c978b7665-n4tds not found in cache
I0425 13:12:09.302412    6996 obj_retry.go:370] Retry add failed for *factory.localPodSelector cluster-density-v2-35/client-1-5c978b7665-n4tds, will try again later: unable to get port info for pod cluster-density-v2-35/client-1-5c978b7665-n4tds NAD default
W0425 13:12:09.908446    6996 helper_linux.go:481] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4] pod uid f57d8281-5a79-4c91-9b83-bb3e4b553597: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.963651    6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] ADD finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "", err failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:03:f6 [10.131.3.246/23]
I0425 13:12:09.988397    6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]
W0425 13:12:09.996899    6996 helper_linux.go:697] Failed to delete pod "cluster-density-v2-35/client-1-5c978b7665-n4tds" interface 7f80514901cbc57: failed to lookup link 7f80514901cbc57: Link not found
I0425 13:12:10.009234    6996 cni.go:279] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL finished CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default], result "\{\"dns\":{}}", err <nil>
I0425 13:12:10.059917    6996 cni.go:258] [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default] DEL starting CNI request [cluster-density-v2-35/client-1-5c978b7665-n4tds 7f80514901cbc57517d263f1a5aa143d2c82f470132c01f8ba813c18f3160ee4 network default NAD default]



Description of problem:

Dynamic plugins using PatternFly 4 could be referring to PF4 variables that do not exist in OpenShift 4.15+. Currently this is causing contrast issues for ACM in dark mode for donut charts.    

Version-Release number of selected component (if applicable):

4.15    

How reproducible:

Always    

Steps to Reproduce:

    1. Install ACM on OpenShift 4.15
    2. Switch to dark mode
    3. Observe Home > Overview page
    

Actual results:

 Some categories in the donut charts cannot be seen due to low contrast   

Expected results:

 Colors should match those seen in OpenShift 4.14 and earlier   

Additional info:

Also posted about this on Slack: https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1720467671332249

Variables like --pf-chart-color-gold-300 are no longer provided, although the PF5 equivalent, --pf-v5-chart-color-gold-300, is available. The stylesheet @patternfly/patternfly/patternfly-charts.scss is present, but not the V4 version. Hopefully it is possible to also include these styles since the names now include a version. 

Description of problem:

The 'Getting started resources' card on the Cluster overview includes a link to 'View all steps in documentation', but this link is not valid for ROSA and OSD so it should be hidden.

Description of problem:

If a cluster is running with user-workload-monitoring enabled, running an ose-tests suite against the cluster will fail the data collection step.

This is because there is a query in the test framework that assumes that the number of prometheus instances that the thanos pods will connect to will match exactly the number of platform prometheus instances. However, it doesn't account for thanos also connecting to the user-workload-monitoring instances. As such, the test suite will always fail against a cluster that is healthy and running user-workload-monitoring in addition to the normal openshift-monitoring stack.

    

Version-Release number of selected component (if applicable):

4.15.13
    

How reproducible:

Consistent
    

Steps to Reproduce:

    1. Create an OpenShift cluster
    2. Enable workload monitoring
    3. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
    

Actual results:

The error message `#### at least one Prometheus sidecar isn't ready` will be displayed, and the metrics collection will fail
    

Expected results:

Metrics collection succeeds with no errors
    

Additional info:


    

This is a clone of issue OCPBUGS-35262. The following is the description of the original issue:

Description of problem:

installing into Shared VPC stuck in waiting for network infrastructure ready

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-10-225505

How reproducible:

Always

Steps to Reproduce:

1. "create install-config" and then insert Shared VPC settings (see [1])
2. activate the service account which has the minimum permissions in the host project (see [2])
3. "create cluster"

FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project. 

Actual results:

1. Getting stuck in waiting for network infrastructure to become ready, until Ctrl+C is pressed.
2. 2 firewall-rules are created in the service project unexpectedly (see [3]).

Expected results:

The installation should succeed, and there should be no any firewall-rule getting created either in the service project or in the host project.

Additional info:

 

Description of problem:

ROSA Cluster creation goes into error status sometimes with version 4.16.0-0.nightly-2024-06-14-072943

Version-Release number of selected component (if applicable):

 

How reproducible:

60%

Steps to Reproduce:

1. Prepare VPC
2. Create a rosa sts cluster cluster with subnets
3. Wait for cluster ready

Actual results:

Cluster goes into error status

Expected results:

Cluster get ready

Additional info:

The failure happens when triggered by CI jobs. Here are the jobs:

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-aws-rosa-sts-localzone-f7/180153139425024819 

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-aws-rosa-sts-private-proxy-f7/1801531362717470720

 

Please review the following PR: https://github.com/openshift/router/pull/604

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When creating a hosted cluster with --role-arn and --sts-creds, the command fails

Version-Release number of selected component (if applicable):

4.16 
4.17

How reproducible:

100%    

Steps to Reproduce:

    1.  hypershift-no-cgo create iam cli-role    
    2.  aws sts get-session-token --output json
    3.  hcp create cluster aws --role-arn xxx --sts-creds xxx
    

Actual results:

2024-06-06T04:34:39Z	ERROR	Failed to create cluster	{"error": "failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action\n\tstatus code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd"}
github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1
	/remote-source/app/product-cli/cmd/cluster/aws/create.go:60
github.com/spf13/cobra.(*Command).execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/remote-source/app/product-cli/main.go:60
runtime.main
	/usr/lib/golang/src/runtime/proc.go:271
Error: failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action
	status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd
failed to create iam: AccessDenied: User: arn:aws:sts::301721915996:assumed-role/6cd90f28a6449141869b/cli-create-iam is not authorized to perform: iam:TagOpenIDConnectProvider on resource: arn:aws:iam::301721915996:oidc-provider/hypershift-ci-oidc.s3.us-east-1.amazonaws.com/6cd90f28a6449141869b because no identity-based policy allows the iam:TagOpenIDConnectProvider action
	status code: 403, request id: 20e16ec4-b9a1-4fa4-aa34-1344145d41fd
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-06-06T04:34:39Z"}
error: failed to execute wrapped command: exit status 1

Expected results:

    create hostedcluster successful

Additional info: 
Full Logs: https://docs.google.com/document/d/1AnvAHXPfPYtP6KRcAKOebAx1wXjhWMOn3TW604XK09o/edit 
The same command succeeds when run a second time.
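For reference, the missing permission corresponds to an IAM policy fragment roughly like the following on the role assumed by the IAM creation step (resource scoping is a placeholder; this is not the actual policy shipped by the CLI):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreateOpenIDConnectProvider",
        "iam:TagOpenIDConnectProvider"
      ],
      "Resource": "*"
    }
  ]
}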

Description of problem:

Customer reports that in OpenShift Container Platform, for a single namespace, they are seeing a "TypeError: Cannot read properties of null (reading 'metadata')" error when navigating to the Topology view (Developer Console):

TypeError: Cannot read properties of null (reading 'metadata')
    at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1220454)
    at s (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:424007)
    at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330465)
    at na (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:58879)
    at Hs (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:111315)
    at xl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98327)
    at Cl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98255)
    at _l (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:98118)
    at pl (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:95105)
    at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:263:44774

Screenshot is available in the linked Support Case. The following Stack Trace is shown:

at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:330387)
    at g
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at g
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at g
    at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:245070)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at g
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at g
    at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:426770)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at g
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:242507)
    at svg
    at div
    at https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:603940
    at u (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:602181)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at e.a (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:398426)
    at div
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:353461
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:354168
    at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1405970)
    at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864)
    at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052)
    at withFallback(Connect(withUserSettingsCompatibility(undefined)))
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178)
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565)
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077)
    at div
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280)
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:719437)
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:9899)
    at div
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:512628
    at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:123:75018)
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:511867
    at https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:220157
    at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:375316
    at div
    at R (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183146)
    at N (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:183594)
    at f (https://console.apps.example.com/static/vendors~app/code-refs/actions~delete-revision~dev-console/code-refs/actions~dev-console/code-refs/ad~01887c45-chunk-0fc9a9eb8a528a7c580c.min.js:26:22249)
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509351
    at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:548866
    at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864)
    at div
    at div
    at t.b (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:113711)
    at t.a (https://console.apps.example.com/static/dev-console/code-refs/common-chunk-5e4f38c02bde64a97ae5.min.js:1:116541)
    at u (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:305613)
    at https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:509656
    at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052)
    at withFallback()
    at t.a (https://console.apps.example.com/static/dev-console/code-refs/topology-chunk-e4ae65442e61628a832f.min.js:1:553554)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625)
    at I (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533554)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670)
    at Suspense
    at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:452052)
    at section
    at m (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:720427)
    at div
    at div
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1533801)
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565)
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077)
    at div
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280)
    at l (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1175827)
    at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:458912
    at S (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:98:86864)
    at main
    at div
    at v (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:264220)
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:62178)
    at div
    at div
    at c (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:545565)
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:775077)
    at div
    at d (https://console.apps.example.com/static/vendor-patternfly-core-chunk-cdcfdc55890623d5fc26.min.js:1:458280)
    at Un (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:183620)
    at t.default (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:880042)
    at e.default (https://console.apps.example.com/static/quick-start-chunk-794085a235e14913bdf3.min.js:1:3540)
    at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:239711)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1610459)
    at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636)
    at _t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:142374)
    at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636)
    at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636)
    at ee (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628636)
    at i (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:830807)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604651)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1604840)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1602256)
    at te (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1628767)
    at https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1631899
    at r (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:36:121910)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:67625)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:69670)
    at t (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:21:64230)
    at re (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1632210)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:804787)
    at t.a (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:1079398)
    at s (https://console.apps.example.com/static/main-chunk-876b3080b765b87baa51.min.js:1:654118)
    at t.a (https://console.apps.example.com/static/vendors~main-chunk-4b6445a3b3fc17bf0831.min.js:150:195887)
    at Suspense 

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.38
Developer Console

How reproducible:

Only on customer side, in a single namespace on a single cluster

Steps to Reproduce:

1. On a particular cluster, enter the Developer Console
2. Navigate to "Topology"

Actual results:

Loading the page fails with the error "TypeError: Cannot read properties of null (reading 'metadata')"

Expected results:

No error is shown. The Topology view is shown

Additional info:

- Screenshot available in linked Support Case
- HAR file is available in linked Support Case

This is a clone of issue OCPBUGS-37588. The following is the description of the original issue:

Description of problem:

Creating and destroying transit gateways (TG) during CI testing is costing an abnormal amount of money. Since the monetary cost of creating a TG is high, provide support for a user-created TG when creating an OpenShift cluster.
    

Version-Release number of selected component (if applicable):

all
    

How reproducible:

always
    

Description of problem:
After installing the Pipelines Operator on a local cluster (OpenShift Local), the Pipelines features were shown in the Console.

But when selecting the Build option "Pipelines", a warning was shown:

The pipeline template for Dockerfiles is not available at this time.

Anyway, it was possible to press the Create button and create a Deployment. But because no build process was created, it couldn't start successfully.

About 20 minutes after the Pipelines operator said it was successfully installed, the pipeline templates appeared in the openshift-pipelines namespace, and I could create a valid Deployment.
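One way to check whether the templates have arrived yet (resource kind and namespace follow the report above; adjust if your operator version differs):

oc get pipelines.tekton.dev -n openshift-pipelines
oc get pods -n openshift-pipelines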

Version-Release number of selected component (if applicable):

  1. OpenShift cluster 4.14.7
  2. Pipelines operator 1.14.3

How reproducible:
Sometimes, maybe depending on the internet connection speed.

Steps to Reproduce:

  1. Install OpenShift Local
  2. Install the Pipelines Operator
  3. Import from Git and select Pipeline as option

Actual results:

  1. Error message was shown: The pipeline template for Dockerfiles is not available at this time.
  2. The user can create the Deployment anyway.

Expected results:

  1. The error message is fine.
  2. But as long as the error message is shown, I would expect that the user cannot click Create.

Additional info:

TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.

The problem hits ONLY oauth, affecting both new and reused connections as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the runs above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

The multus-admission-controller does not retain its container resource requests/limits if manually set. The cluster-network-operator overwrites any modifications on the next reconciliation. This resource preservation support has already been added to all other components in https://github.com/openshift/hypershift/pull/1082 and https://github.com/openshift/hypershift/pull/3120. Similar changes should be made for the multus-admission-controller so all hosted control plane components demonstrate the same resource preservation behavior.
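A minimal sketch of the preservation pattern described above (not the actual CNO/HyperShift code; the package and function names are made up for illustration):

// Before applying the desired Deployment, carry over any resource requests or
// limits that were manually set on the live object so reconciliation does not
// wipe them out.
package reconcile

import appsv1 "k8s.io/api/apps/v1"

func preserveContainerResources(existing, desired *appsv1.Deployment) {
	for i := range desired.Spec.Template.Spec.Containers {
		dst := &desired.Spec.Template.Spec.Containers[i]
		for _, src := range existing.Spec.Template.Spec.Containers {
			if src.Name == dst.Name && (len(src.Resources.Requests) > 0 || len(src.Resources.Limits) > 0) {
				// Keep whatever was manually configured on the live container.
				dst.Resources = src.Resources
			}
		}
	}
}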

Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/43

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Inspection is failing on hosts where special characters are found in the serial number of block devices:

Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128

Serial found:
"serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"

Interesting stacktrace error:
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed

Full stack trace:
~~~
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e
xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error ---
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error ---
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last):
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last):
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     stream.write(msg + self.terminator)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack:
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]:     stream.write(msg + self.terminator)
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/bin/ironic-python-agent", line 10, in <module>
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     sys.exit(run())
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     agent.IronicPythonAgent(CONF.api_url,
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack:
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     self.process_lookup_data(content)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     hardware.cache_node(self.node)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     dispatch_to_managers('wait_for_disks')
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     return getattr(manager, method)(*args, **kwargs)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     self.get_os_install_device()
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = self.list_block_devices_check_skip_list(
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = self.list_block_devices(
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = list_all_block_devices()
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     report = il_utils.execute('lsblk', '-bia', '--json',
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     _log(result[0], result[1])
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     LOG.debug('Command stdout is: "%s"', stdout)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"'
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n   "blockdevices": [\n      {\n         "kname": "loop0",\n         "model": null,\n         "size": 67467313152,\n         "rota": false,\n         "type": "loop",\n         "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n         "partuuid": null,\n         "serial": null\n      },{\n         "kname": "loop1",\n         "model": null,\n         "size": 1027846144,\n         "rota": false,\n         "type": "loop",\n         "uuid": null,\n         "partuuid": null,\n         "serial": null\n      },{\n         "kname": "sda",\n         "model": "LITEON IT ECE-12",\n         "size": 120034123776,\n         "rota": false,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "XXXXXXXXXXXXXXXXXX"\n      },{\n         "kname": "sdb",\n         "model": "LITEON IT ECE-12",\n         "size": 120034123776,\n         "rota": false,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "XXXXXXXXXXXXXXXXXXXX"\n      },{\n         "kname": "sdc",\n         "model": "External",\n         "size": 0,\n         "rota": true,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n      }\n   ]\n}\n',)
~~~
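The failure class can be reproduced in a few lines of Python, using the serial from the log above (this assumes the agent decodes command output with surrogateescape, which matches the lone surrogate characters seen in the collected data):

# lsblk output containing invalid UTF-8 bytes in the serial field
raw = b"2HC015KJ0000\xff\xff\xff\xff\xff\xff\xff\xff"

# Decoding with surrogateescape keeps the bad bytes as lone surrogates ('\udcff')...
serial = raw.decode("utf-8", errors="surrogateescape")
assert serial.endswith("\udcff" * 8)

# ...but strictly re-encoding them (e.g. when the log handler writes the line)
# fails with "surrogates not allowed", which is the error in the stack trace.
try:
    serial.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)  # surrogates not allowed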

Version-Release number of selected component (if applicable):

OCP 4.14.28

How reproducible:

Always

Steps to Reproduce:

    1. Add a BMH with bad UTF-8 characters in the serial number
    2.
    3.
    

Actual results:

Inspection fails

Expected results:

Inspection works

Additional info:

    

 

This is a clone of issue OCPBUGS-37780. The following is the description of the original issue:

As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.

Example:

compute:
- name: worker
  architecture: arm64
...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}

See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631

Description of problem:

The pod of a CatalogSource without registryPoll wasn't recreated during a node failure

    jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64               1/1     Running       0              123m
community-operators-8mxh6               1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                   1/1     Running       0              106m
redhat-marketplace-4bgv9                1/1     Running       0              123m
redhat-operators-ww5tb                  1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m

jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
NAME         READY   STATUS    RESTARTS   AGE    IP            NODE                                          NOMINATED NODE   READINESS GATES
test-2xvt8   1/1     Running   0          7m6s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>

jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS     ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   116m   v1.30.2+421e90e

Version-Release number of selected component (if applicable):

     Cluster version is 4.17.0-0.nightly-2024-07-07-131215

How reproducible:

    always

Steps to Reproduce:

    1. Create a CatalogSource without the registryPoll configuration.

jiazha-mac:~ jiazha$ cat cs-32183.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test
  namespace: openshift-marketplace
spec:
  displayName: Test Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  publisher: OpenShift QE
  sourceType: grpc

jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml 
catalogsource.operators.coreos.com/test created

jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE                                          NOMINATED NODE   READINESS GATES
test-2xvt8   1/1     Running   0          3m18s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>


     2. Stop the node 
jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc 
Temporary namespace openshift-debug-q4d5k is created for debugging node...
Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet


Removing debug pod ...
Temporary namespace openshift-debug-q4d5k was removed.

jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS     ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   115m   v1.30.2+421e90e


    3. Check whether this CatalogSource's pod is recreated.

    

Actual results:

No new pod was generated. 

    jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64               1/1     Running       0              123m
community-operators-8mxh6               1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                   1/1     Running       0              106m
redhat-marketplace-4bgv9                1/1     Running       0              123m
redhat-operators-ww5tb                  1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m

once node recovery, a new pod was generated.


jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS   ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   Ready    worker   127m   v1.30.2+421e90e

jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS    RESTARTS       AGE
certified-operators-rcs64               1/1     Running   0              127m
community-operators-8mxh6               1/1     Running   0              127m
marketplace-operator-769fbb9898-czsfn   1/1     Running   4 (121m ago)   140m
qe-app-registry-5jxlx                   1/1     Running   0              109m
redhat-marketplace-4bgv9                1/1     Running   0              127m
redhat-operators-ww5tb                  1/1     Running   0              127m
test-wqxvg                              1/1     Running   0              27s 

Expected results:

During the node failure, a new catalog source pod should be generated.

    

Additional info:

Hi Team,

After investigating the source code of operator-lifecycle-manager some more, we figured out the reason.

  • Commit [1] tries to fix this issue by adding a "force delete dead pod" step to the ensurePod() function.
  • The ensurePod() is called by EnsureRegistryServer() [2].
  • However, syncRegistryServer() returns immediately, without calling EnsureRegistryServer(), if there is no registryPoll in the catalog [3].
  • There is no registryPoll defined in the CatalogSource that was generated when we built the catalog image following Doc [4].
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: redhat-operator-index
      namespace: openshift-marketplace
    spec:
      image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
      sourceType: grpc
    
  • So the catalog pod created by the CatalogSource cannot be recovered.

And we verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the CatalogSource as follows (see the lines marked with <==).

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy:   <==
    registryPoll:   <==
      interval: 10m <==

registryPoll is NOT mandatory for a CatalogSource.
So commit [1], which tries to fix the issue only in EnsureRegistryServer(), is not the proper fix.

[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html

Description of problem:

Once you register an IDMS/ICSP that only contains the root URL for the source registry, the registry-overrides flag is not properly filled and the mirrors are ignored.

Sample:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: image-policy
spec:
  imageDigestMirrors:
  - mirrors:
    - registry.vshiray.net/redhat.io
    source: registry.redhat.io
  - mirrors:
    - registry.vshiray.net/connect.redhat.com
    source: registry.connect.redhat.com
  - mirrors:
    - registry.vshiray.net/gcr.io
    source: gcr.io
  - mirrors:
    - registry.vshiray.net/docker.io
    source: docker.io

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Deploy a IDMS with root registries
    2. Try to deploy a Disconnected HostedCluster using the internal registry
    

Actual results:

The registry-overrides flag is empty, so the disconnected deployment is stuck

Expected results:

The registry-overrides flag is properly filled and the deployment could continue

Additional info: 

To work around this, you must create a new IDMS that points to the exact OCP release repositories:

apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: ocp-release
spec:
  imageDigestMirrors:
  - mirrors:
    - registry.vshiray.net/quay.io/openshift-release-dev/ocp-v4.0-art-dev
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
  - mirrors:
    - registry.vshiray.net/quay.io/openshift-release-dev/ocp-release
    source: quay.io/openshift-release-dev/ocp-release

Description of problem:

azure-disk-csi-driver doesn't use registryOverrides

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Set a registry override on the CPO
    2. Watch that azure-disk-csi-driver continues to use the default registry
    3.
    

Actual results:

    azure-disk-csi-driver uses default registry

Expected results:

    azure-disk-csi-driver uses the mirrored registry

Additional info:
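One way to observe step 2 is to read back the images on the controller Deployment in the hosted control plane namespace (the deployment name and namespace pattern are assumptions):

oc -n clusters-<hosted-cluster-name> get deployment azure-disk-csi-driver-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'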

    

This is a clone of issue OCPBUGS-38936. The following is the description of the original issue:

Description of problem:

    NodePool Controller doesn't respect LatestSupportedVersion https://github.com/openshift/hypershift/blob/main/support/supportedversion/version.go#L19

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create HostedCluster / NodePool
    2. Upgrade both HostedCluster and NodePool at the same time to a version higher than the LatestSupportedVersion
    

Actual results:

    NodePool tries to upgrade to the new version while the HostedCluster ValidReleaseImage condition fails with: 'the latest version supported is: "x.y.z". Attempting to use: "x.y.z"'

Expected results:

    NodePool ValidReleaseImage condition also fails
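A minimal sketch of the missing NodePool-side check, mirroring the HostedCluster ValidReleaseImage validation (the package, function name, and message shape are assumptions, not the actual HyperShift code):

package supportedversion

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// validateReleaseVersion rejects requested versions newer than the latest
// version this operator knows how to manage.
func validateReleaseVersion(requested, latestSupported semver.Version) error {
	if requested.GT(latestSupported) {
		return fmt.Errorf("the latest version supported is: %q. Attempting to use: %q", latestSupported, requested)
	}
	return nil
}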

Additional info:

    

Description of problem:

release-4.17 of openshift/cluster-api-provider-openstack is missing some commits that were backported in the upstream project into the release-0.10 branch.
We should import them into our downstream fork.

Please review the following PR: https://github.com/openshift/images/pull/181

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

All of OCP release process relies on version we expose as a label (io.openshift.build.versions) in hyperkube image, see https://github.com/openshift/kubernetes/blob/master/openshift-hack/images/hyperkube/Dockerfile.rhel.

Unfortunately, our CI is not picking up that version, but rather trying to guess a version based on the available tags. We should ensure that all of the build processes read that label, rather than requiring a manual tag push when doing a k8s bump.
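For example, the label can be read straight off a built hyperkube image instead of being guessed from tags (the pullspec is a placeholder):

skopeo inspect docker://quay.io/example/hyperkube:latest | jq -r '.Labels["io.openshift.build.versions"]'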

Description of problem:

It is currently possible to watch a singular namespaced resource without providing a namespace. This is inconsistent with one-off requests for these resources and could also return unexpected results, since namespaced resource names do not need to be unique at the cluster scope.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

Always

Steps to Reproduce:

1. Visit the details page of a namespaced resource
2. Replace the 'ns/<namespace>' segment of the URL with 'cluster'

Actual results:

Details for the resource are rendered momentarily, then a 404 after a few seconds.

Expected results:

We should show a 404 error when the page loads.

Additional info:

There is also probably a case where we could visit a resource details page of a namespaced resource that has an identically named resource in another namespace, then change the URL to a cluster-scoped path, and we'll see the details for the other resource.

See watchK8sObject for the root cause. We should probably only start the websocket if we have a successful initial poll request. We also should probably terminate the websocket if the poll request fails at any point.

Description of problem:

When mirroring content with oc-mirror v2, some required images for OpenShift installation are missing from the registry    

Version-Release number of selected component (if applicable):

OpenShift installer version: v4.15.17 

[admin@registry ~]$ oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202406131906.p0.g7c0889f.assembly.stream.el9-7c0889f", GitCommit:"7c0889f4bd343ccaaba5f33b7b861db29b1e5e49", GitTreeState:"clean", BuildDate:"2024-06-13T22:07:44Z", GoVersion:"go1.21.9 (Red Hat 1.21.9-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Use oc-mirror v2 to mirror content.

$ cat imageset-config-ocmirrorv2-v4.15.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.15
      minVersion: 4.15.17
      type: ocp
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    full: false
    packages:
      - name: ansible-automation-platform-operator
      - name: cluster-logging
      - name: datagrid
      - name: devworkspace-operator
      - name: multicluster-engine
      - name: multicluster-global-hub-operator-rh
      - name: odf-operator
      - name: quay-operator
      - name: rhbk-operator
      - name: skupper-operator
      - name: servicemeshoperator
      - name: submariner
      - name: lvms-operator
      - name: odf-lvm-operator
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    full: false
    packages:
      - name: crunchy-postgres-operator
      - name: nginx-ingress-operator
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
    full: false
    packages:
      - name: argocd-operator
      - name: cockroachdb
      - name: infinispan
      - name: keycloak-operator
      - name: mariadb-operator
      - name: nfs-provisioner-operator
      - name: postgresql
      - name: skupper-operator
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.access.redhat.com/ubi8/nodejs-18
  - name: registry.redhat.io/openshift4/ose-prometheus:v4.14.0
  - name: registry.redhat.io/service-interconnect/skupper-router-rhel9:2.4.3
  - name: registry.redhat.io/service-interconnect/skupper-config-sync-rhel9:1.4.4
  - name: registry.redhat.io/service-interconnect/skupper-service-controller-rhel9:1.4.4
  - name: registry.redhat.io/service-interconnect/skupper-flow-collector-rhel9:1.4.4
  helm: {}


Run oc-mirror using the command:

oc-mirror --v2 \
-c imageset-config-ocmirrorv2-v4.15.yaml  \
--workspace file:////data/oc-mirror/workdir/ \
docker://registry.local.momolab.io:8443/mirror 

Steps to Reproduce:

    1. Install Red Hat Quay mirror registry
    2. Mirror using oc-mirror v2 command and steps above
    3. Install OpenShift
    

Actual results:

    Installation fails

Expected results:

    Installation succeeds

Additional info:

 ## Check logs on coreos:
[core@sno1 ~]$ journalctl -b -f -u release-image.service -u bootkube.service
Jul 02 03:46:22 sno1.local.momolab.io bootkube.sh[13486]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: (Mirrors also failed: [registry.local.momolab.io:8443/mirror/openshift/release@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift/release: name unknown: repository not found]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized

## Check if that image was pulled:

[admin@registry ~]$ cat /data/oc-mirror/workdir/working-dir/dry-run/mapping.txt | grep -i f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06
docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06=docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06

## Problem is, it doesn't exist on the registry (also via UI):

[admin@registry ~]$ podman pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06
Trying to pull registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06...
Error: initializing source docker://registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev@sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06: reading manifest sha256:f36e139f75b179ffe40f5a234a0cef3f0a051cc38cbde4b262fb2d96606acc06 in registry.local.momolab.io:8443/mirror/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown

Description of problem:

aws capi installs, particularly when running under heavy load in ci, can sometimes fail with:

    level=info msg=Creating private Hosted Zone
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference.
level=error msg=	status code: 409, request id: f173760d-ab43-41b8-a8a0-568cf387bf5e

example job: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-installer-8448-ci-4.16-e2e-aws-ovn/1793287246942572544/artifacts/e2e-aws-ovn-8/ipi-install-install/build-log.txt

Version-Release number of selected component (if applicable):

    

How reproducible:

not reproducible - needs to be discovered in ci

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    install fails due to existing hosted zone

Expected results:

    HostedZoneAlreadyExists error should not cause install to fail

Additional info:

    

 

Description of problem:

This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing.

LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue. 
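The workaround referenced above amounts to adding the LDAP host to the no-proxy list, roughly like the following (hostnames and port are placeholders; for hosted clusters the equivalent setting lives under the HostedCluster proxy configuration):

apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: ldap.example.com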

Version-Release number of selected component (if applicable):

4.15.11     

How reproducible:

    

Steps to Reproduce:

 (From the customer)   
    1. Configure LDAP IDP
    2. Configure Proxy
    3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
    

Actual results:

    LDAP IDP communication from the control plane oauth pod goes through proxy 

Expected results:

    LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings

Additional info:

For more information, see linked tickets.    

Description of problem:

Hypershift Operator pods are running with a higher PriorityClass, but external-dns is set to the default class with lower preemption priority; this caused the pod to be preempted during migration.
Observed while performance testing dynamic serving spec migration on a management cluster (MC).

# oc get pods -n hypershift 
NAME                           READY   STATUS    RESTARTS   AGE
external-dns-7f95b5cdc-9hnjs   0/1     Pending   0          23m
operator-956bdb486-djjvb       1/1     Running   0          116m
operator-956bdb486-ppgzt       1/1     Running   0          115m

external-dns pod.spec
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  priorityClassName: default

operator pods.spec
  preemptionPolicy: PreemptLowerPriority
  priority: 100003000
  priorityClassName: hypershift-operator


    

Version-Release number of selected component (if applicable):

On Management Cluster 4.14.7
    

How reproducible:

Always

Steps to Reproduce:

    1. Setup a MC with request serving and autoscaling machinesets
    2. Load up the MC to its max capacity
    3. Watch the external-dns pod get preempted when resources are needed by other pods
    

Actual results:

External-dns pod goes to pending state until new node comes up
    

Expected results:

Since this is also a critical pod like the hypershift operator, and it would affect HC DNS configuration, it needs to run at a higher priority as well.
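A one-line illustration of the expected change (whether to reuse the hypershift-operator class or introduce a dedicated one is an open design choice, and the real fix belongs in the external-dns manifests rather than a live patch):

oc -n hypershift patch deployment external-dns --type merge \
  -p '{"spec":{"template":{"spec":{"priorityClassName":"hypershift-operator"}}}}'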
    

Additional info:

stage: perf3 sector
    

Description of the problem:
Using ACM, when adding a node to a spoke, it shows as stuck in installing but appears to have installed successfully. Confirmed that workloads can be scheduled on the new nodes.

  • Assisted service pod has no logs of interest.
  • The hive status for the cluster suggests the install was successful, but ACM still says installing step 9/9
  • The status finally flipped to failed with:
    '''
    Error
    This host failed its installation.
    Host failed to install because its installation stage Done took longer than expected 1h0m0s.
     

How reproducible:
Unsure atm.
 

Steps to reproduce:

1. Install a spoke cluster

2. Add a new node day 2, with the operation timing out. (In this case there was a x509 cert issue).

Actual results:
Node eventually gets added to cluster and accepts workloads, but the GUI does not reflect this.
 

Expected results:
If node actually succeeds in joining cluster, update the GUI to say so.

Description of problem:

nmstate-configuration.service failed due to wrong variable name $hostname_file
https://github.com/openshift/machine-config-operator/blob/5a6e8b81f13de2dbf606a497140ac6e9c2a00e6f/templates/common/baremetal/files/nmstate-configuration.yaml#L26

Version-Release number of selected component (if applicable):

4.16.0    

How reproducible:

always

Steps to Reproduce:

    1. install cluster via dev-script, with node-specific network configuration
    

Actual results:

nmstate-configuration failed:

sh-5.1# journalctl -u nmstate-configuration
May 07 02:19:54 worker-0 systemd[1]: Starting Applies per-node NMState network configuration...
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + systemctl -q is-enabled mtu-migration
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + echo 'Cleaning up left over mtu migration configuration'
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: Cleaning up left over mtu migration configuration
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + rm -rf /etc/cno/mtu-migration
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -e /etc/nmstate/openshift/applied ']'
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + src_path=/etc/nmstate/openshift
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + dst_path=/etc/nmstate
May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Main process exited, code=exited, status=1/FAILURE
May 07 02:19:54 worker-0 nmstate-configuration.sh[1565]: ++ hostname -s
May 07 02:19:54 worker-0 systemd[1]: nmstate-configuration.service: Failed with result 'exit-code'.
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + hostname=worker-0
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + host_file=worker-0.yml
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + cluster_file=cluster.yml
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + config_file=
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: + '[' -s /etc/nmstate/openshift/worker-0.yml ']'
May 07 02:19:54 worker-0 nmstate-configuration.sh[1553]: /usr/local/bin/nmstate-configuration.sh: line 22: hostname_file: unbound variable
May 07 02:19:54 worker-0 systemd[1]: Failed to start Applies per-node NMState network configuration.
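The trace suggests the script builds the per-host file name into host_file but then dereferences hostname_file, which is unbound under set -u. A minimal sketch of the class of fix (illustrative only, not the actual repository patch):

#!/bin/bash
set -euo pipefail

hostname=$(hostname -s)
host_file="${hostname}.yml"
cluster_file="cluster.yml"
config_file=""

# Use the variable that was actually defined above (host_file, not hostname_file).
if [ -s "/etc/nmstate/openshift/${host_file}" ]; then
  config_file="${host_file}"
elif [ -s "/etc/nmstate/openshift/${cluster_file}" ]; then
  config_file="${cluster_file}"
fi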

Expected results:

cluster can be setup successfully with node-specific network configuration via new mechanism

Additional info:

    

Starting with payload 4.17.0-0.nightly-2024-06-27-123139 we are seeing hypershift-release-4.17-periodics-e2e-aws-ovn-conformance failures due to

: [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "PrometheusKubernetesListWatchFailures",
          "alertstate": "firing",
          "container": "kube-rbac-proxy",
          "endpoint": "metrics",
          "instance": "10.132.0.19:9092",
          "job": "prometheus-k8s",
          "namespace": "openshift-monitoring",
          "pod": "prometheus-k8s-0",
          "prometheus": "openshift-monitoring/k8s",
          "service": "prometheus-k8s",
          "severity": "warning"
        },

It looks like this was introduced with cluster-monitoring-operator/pull/2392

This is a clone of issue OCPBUGS-38037. The following is the description of the original issue:

Description of problem:

When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and with the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get graph.tar.gz. This causes the mirroring to fail, as that URL is not reachable from an air-gapped environment.
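For context, a minimal mirror-to-disk invocation of the kind described might look as follows; the ImageSetConfiguration contents, OSUS URL, and paths are illustrative and not taken from the report, and the apiVersion depends on whether the v1 or v2 plugin flow is used:

$ cat imageset-config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.16

$ export UPDATE_URL_OVERRIDE=https://osus.example.internal/api/upgrades_info/graph
$ oc-mirror -c imageset-config.yaml file://mirror-workspace

With graph: true the expectation is that the graph data comes from the OSUS endpoint given in UPDATE_URL_OVERRIDE; the reported behaviour is that graph.tar.gz is still requested from api.openshift.com.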
    

Version-Release number of selected component (if applicable):

WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Set up OSUS in a reachable network
    2. Cut all internet connectivity except for the mirror registry and the OSUS service
    3. Run oc-mirror in mirror-to-disk mode with graph: true in the ImageSetConfiguration
    

Actual results:


    

Expected results:

Should not fail
    

Additional info:


    

Description of problem:

    ci/prow/security is failing: k8s.io/client-go/transport

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

    always

Steps to Reproduce:

    1. trigger ci/prow/security on a pull request
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/209

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38174. The following is the description of the original issue:

Description of problem:

The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.    

Version-Release number of selected component (if applicable):

4.15.z and later    

How reproducible:

    Always when AlertmanagerConfig is enabled

Steps to Reproduce:

    1. Enable UWM with AlertmanagerConfig
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
    2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file)
    3. Wait for a couple of minutes.
    

Actual results:

Monitoring ClusterOperator goes Degraded=True.
    

Expected results:

No error
    

Additional info:

The Prometheus operator logs show that it doesn't understand the proxy_from_environment field.
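For reference, the kind of receiver fragment involved looks roughly like this; the receiver name and URL are placeholders, and the attached configuration file from the original report is not reproduced here:

receivers:
- name: example-webhook
  webhook_configs:
  - url: "https://webhook.example.com/alert"
    http_config:
      proxy_from_environment: true

The same applies to related http_config proxy fields such as no_proxy.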
    

This is a clone of issue OCPBUGS-42237. The following is the description of the original issue:

Description of problem:

The samples operator sync for OCP 4.18 includes an update to the ruby imagestream. This removes EOLed versions of Ruby and upgrades the images to be ubi9-based.
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run build suite tests
    2.
    3.
    

Actual results:

Tests fail trying to pull image. Example: Error pulling image "image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8": initializing source docker://image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8: reading manifest 3.0-ubi8 in image-registry.openshift-image-registry.svc:5000/openshift/ruby: manifest unknown
    

Expected results:

Builds can pull image, and the tests succeed.
    

Additional info:

As part of the continued deprecation of the Samples Operator, these tests should create their own Ruby imagestream that is kept current.
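A possible shape for such a test-owned imagestream, sketched only; the namespace, tag, and pullspec below are assumptions and would need to track whichever Ruby stream is current:

$ oc create namespace e2e-test-builds-ruby    # hypothetical test namespace
$ oc -n e2e-test-builds-ruby import-image ruby:3.3-ubi9 \
    --from=registry.access.redhat.com/ubi9/ruby-33 --confirm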
    

Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/112

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When investigating https://issues.redhat.com/browse/OCPBUGS-34819 we encountered an issue with the LB creation, but also noticed that masters use an S3 stub ignition even though they don't have to. Although that can be harmless, it adds an extra hop that isn't needed.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    Change the AWSMachineTemplate ignition.storageType to UnencryptedUserData
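A partial sketch of what that suggestion would look like on the CAPA resource (all other fields omitted; this is illustrative, not the installer's actual manifest):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: example-master    # placeholder name
spec:
  template:
    spec:
      ignition:
        storageType: UnencryptedUserData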

Include English Translations text for Supported Languages in User Preference Dropdown

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38326. The following is the description of the original issue:

Description of problem:

Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

1. apply CRD yaml file
2. check the NetworkAttachmentDefinition status

Actual results:

status with error 

Expected results:

NetworkAttachmentDefinition has been created 

 

 

We removed this in 4.18, but we should also remove it in 4.17, since the SaaS template was not used there either.

Not removing this in 4.17 also causes issues backporting ARO HCP API changes; we need to backport changes related to that work to 4.17.

Example - https://github.com/openshift/hypershift/pull/4640#issuecomment-2320233415

This payload run detected a panic in CVO code. The following payloads did not see the same panic, so the bug should be prioritized by the CVO team accordingly.

Relevant Job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade/1782008003688402944

Panic trace as shown in this log:

I0421 13:06:29.113325       1 availableupdates.go:61] First attempt to retrieve available updates
I0421 13:06:29.119731       1 cvo.go:721] Finished syncing available updates "openshift-cluster-version/version" (6.46969ms)
I0421 13:06:29.120687       1 sync_worker.go:229] Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True"
I0421 13:06:29.120697       1 sync_worker.go:579] Cluster operator etcd changed Degraded from "False" to "True"
E0421 13:06:29.121014       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 185 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bbc580?, 0x30cdc90})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1bbc580?, 0x30cdc90?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d
E0421 13:06:29.121188       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 185 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bbc580?, 0x30cdc90})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000000002?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1bbc580?, 0x30cdc90?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1bbc580?, 0x30cdc90?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d
I0421 13:06:29.120720       1 cvo.go:738] Started syncing upgradeable "openshift-cluster-version/version"
I0421 13:06:29.123165       1 upgradeable.go:69] Upgradeability last checked 5.274200045s ago, will not re-check until 2024-04-21T13:08:23Z
I0421 13:06:29.123195       1 cvo.go:740] Finished syncing upgradeable "openshift-cluster-version/version" (2.469943ms)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x195c018]

goroutine 185 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000000002?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1bbc580?, 0x30cdc90?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1e3efe0?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1bbc580?, 0x30cdc90?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc002944000, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:725 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:584 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000101800?, {0x2194c80, 0xc0026245d0}, 0x1, 0xc000118120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x0?, 0x0?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc002398c80, {0x21b41b8, 0xc0004be230}, 0x10)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:564 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:431 +0x5d
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run in goroutine 118
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:429 +0x49d

Description of problem

CI is occasionally bumping into failures like:

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] expand_less	53m22s
{  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:186]: during upgrade to registry.build05.ci.openshift.org/ci-op-kj8vc4dt/release@sha256:74bc38fc3a1d5b5ac8e84566d54d827c8aa88019dbdbf3b02bef77715b93c210: the "master" pool should be updated before the CVO reports available at the new version
Ginkgo exit error 1: exit with code 1}

where the machine-config operator is rolling the control-plane MachineConfigPool after the ClusterVersion update completes:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | [.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message] | sort[]'
2024-05-17T12:57:04Z RenderDegraded=False : 
2024-05-17T12:58:35Z Degraded=False : 
2024-05-17T12:58:35Z NodeDegraded=False : 
2024-05-17T15:13:22Z Updated=True : All nodes are updated with MachineConfig rendered-master-4fcadad80c9941813b00ca7e3eef8e69
2024-05-17T15:13:22Z Updating=False : 
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[0].completionTime'
2024-05-17T14:15:22Z

Because of changes to registry pull secrets:

$ dump() {
> curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1791442450049404928/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r ".items[] | select(.metadata.name == \"$1\").spec.config.storage.files[] | select(.path == \"/etc/mco/internal-registry-pull-secret.json\").contents.source" | python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.stdin.read()).split(",", 1)[-1])' | jq -c '.auths | to_entries[]'
> }
$ diff -u0 <(dump rendered-master-d6a8cd53ae132250832cc8267e070af6) <(dump rendered-master-4fcadad80c9941813b00ca7e3eef8e69) | sed 's/"value":.*/.../'
--- /dev/fd/63  2024-05-17 12:28:37.882351026 -0700
+++ /dev/fd/62  2024-05-17 12:28:37.883351026 -0700
@@ -1 +1 @@
-{"key":"172.30.124.169:5000",...
+{"key":"172.30.124.169:5000",...
@@ -3,3 +3,3 @@
-{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
-{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
-{"key":"image-registry.openshift-image-registry.svc:5000",...
+{"key":"default-route-openshift-image-registry.apps.ci-op-kj8vc4dt-6c39f.ci2.azure.devcluster.openshift.com",...
+{"key":"image-registry.openshift-image-registry.svc.cluster.local:5000",...
+{"key":"image-registry.openshift-image-registry.svc:5000",...

Version-Release number of selected component (if applicable)

Seen in 4.16-to-4.16 Azure update CI. Unclear what the wider scope is.

How reproducible

Sippy reports Success Rate: 94.27% post regression, so a rare race.

But using CI search to pick jobs with 10 or more runs over the past 2 days:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&type=junit&search=master.*pool+should+be+updated+before+the+CVO+reports+available' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade (all) - 52 runs, 50% failed, 12% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 80 runs, 20% failed, 25% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade (all) - 82 runs, 21% failed, 59% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 80 runs, 53% failed, 14% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-sdn-upgrade (all) - 50 runs, 12% failed, 50% of failures match = 6% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-upgrade (all) - 14 runs, 21% failed, 33% of failures match = 7% impact
pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade (all) - 11 runs, 36% failed, 75% of failures match = 27% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 11 runs, 18% failed, 100% of failures match = 18% impact
pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade (all) - 19 runs, 21% failed, 25% of failures match = 5% impact
pull-ci-openshift-machine-config-operator-master-e2e-azure-ovn-upgrade-out-of-change (all) - 21 runs, 48% failed, 50% of failures match = 24% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade (all) - 16 runs, 81% failed, 15% of failures match = 13% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 16 runs, 25% failed, 75% of failures match = 19% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade (all) - 26 runs, 35% failed, 67% of failures match = 23% impact

shows some flavors like pull-ci-openshift-cluster-network-operator-master-e2e-azure-ovn-upgrade up at a 27% hit rates.

Steps to Reproduce

Unclear.

Actual results

Pull secret changes after the ClusterVersion update cause an unexpected master MachineConfigPool roll.

Expected results

No MachineConfigPool roll after the ClusterVersion update completes.

Additional info

Description of problem:

 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-13-084629

How reproducible:

100%

Steps to Reproduce:

1. Apply the ConfigMap:
*****
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "http://invalid-remote-storage.example.com:9090/api/v1/write"
          queue_config:
            max_retries: 1
*****

2. check logs
% oc logs -c prometheus prometheus-k8s-0 -n openshift-monitoring
...
ts=2024-06-14T01:28:01.804Z caller=dedupe.go:112 component=remote level=warn remote_name=5ca657 url=http://invalid-remote-storage.example.com:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://invalid-remote-storage.example.com:9090/api/v1/write\": dial tcp: lookup invalid-remote-storage.example.com on 172.30.0.10:53: no such host"

3. Query after 15 minutes:
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PrometheusRemoteStorageFailures"}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   145  100    78  100    67    928    797 --:--:-- --:--:-- --:--:--  1726
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [],
    "analysis": {}
  }
}

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=prometheus_remote_storage_failures_total' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   124  100    78  100    46   1040    613 --:--:-- --:--:-- --:--:--  1653
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [],
    "analysis": {}
  }
}

Actual results:

The alert was not triggered.

Expected results:

The alert is triggered, and both the alert and the metrics are visible.

Additional info:

The metrics below show as `No datapoints found.`:
prometheus_remote_storage_failures_total
prometheus_remote_storage_samples_dropped_total
prometheus_remote_storage_retries_total
`prometheus_remote_storage_samples_failed_total` value is 0

Description of the problem:

BE version ~2.32 (master) - block unencoded authenticated proxy URLs such as 'http://ocp-edge:red@hat@10.6.48.65:3132' (with two @ characters).
Currently the BE accepts such a URL, although it is not supported.

How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

This is a clone of issue OCPBUGS-38990. The following is the description of the original issue:

Description of problem:

The node-joiner pod does not honour the cluster-wide proxy configuration.

Version-Release number of selected component (if applicable):

OCP 4.16.6

How reproducible:

Always

Steps to Reproduce:

    1. Configure an OpenShift cluster-wide proxy according to https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html and add the Red Hat URLs (quay.io and others) to the proxy allow list.
    2. Add a node to a cluster using a node joiner pod, following https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/add-nodes.md
    

Actual results:

Error retrieving the images on quay.io
time=2024-08-22T08:39:02Z level=error msg=Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config1164077466]' exited with non-zero exit code 1:time=2024-08-22T08:39:02Z level=error msg=error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)    

Expected results:

  node-joiner is able to download the images using the proxy

Additional info:
By allowing full direct internet access, without a proxy, the node-joiner pod is able to download the image from quay.io.

So there is a strong suspicion that the HTTP timeout error above comes from the pod not being able to use the proxy.

Restricted environments where external internet access is only allowed through a proxy allow list are quite common in corporate settings.

Please consider honouring the OpenShift proxy configuration.
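A quick way to confirm what the cluster-wide proxy resolves to, and therefore what the node-joiner pod would be expected to consume, is the following (cluster-admin access assumed):

$ oc get proxy/cluster -o jsonpath='{.status.httpProxy}{"\n"}{.status.httpsProxy}{"\n"}{.status.noProxy}{"\n"}'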

Description of problem:

- One node (the rendezvous host) failed to be added to the cluster and there are some pending CSRs.

- omc get csr 
NAME                                                            AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-44qjs                                                       21m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9n9hc                                                       5m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9xw24                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-brm6f                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-dz75g                                                       36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-l8c7v                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-mv7w5                                                       52m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-v6pgd                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
In order to complete the installation, the customer needs to approve those CSRs manually.
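For reference, the manual workaround is the standard pending-CSR approval flow, e.g.:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve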

Steps to Reproduce:

   agent-based installation. 
    

Actual results:

    CSRs are in a pending state.

Expected results:

    CSRs should be approved automatically

Additional info:

Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing

Description of problem:

failed job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1023/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1796261717831847936                 

seeing below error:
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist                                                    

Version-Release number of selected component (if applicable):

4.16/4.17    

How reproducible:

100%

Steps to Reproduce:

    1. Create an AWS cluster with the "CustomNoUpgrade" featureSet configured:

install-config.yaml
----------------------
featureSet: CustomNoUpgrade
featureGates: [GatewayAPIEnabled=true]

    2.

    

Actual results:

level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: error unpacking terraform: could not unpack the directory for the aws provider: open mirror/openshift/local/aws: file does not exist

Expected results:

install should be successful    

Additional info:

workaround is to add ClusterAPIInstallAWS=true to feature_gates as well, .e.g
featureSet: CustomNoUpgrade
featureGates: [GatewayAPIEnabled=true,ClusterAPIInstallAWS=true]    

discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1716887301410459 

Description of problem:

Examples in docs/user/gcp/customization can't directly be used to install a cluster.
    

Description of problem:

The installation of compact and HA clusters is failing in the vSphere environment. During the cluster setup, two master nodes were observed to be in a "Not Ready" state, and the rendezvous host failed to join the cluster. 

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-25-131159    

How reproducible:

100%    

Actual results:

level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
level=info msg=Use the following commands to gather logs from the cluster
level=info msg=openshift-install gather bootstrap --help
level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded
ERROR: Bootstrap failed. Aborting execution.

Expected results:

Installation should be successful.    

Additional info:

Agent Gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/54459/rehearse-54459-periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-vsphere-agent-compact-fips-f14/1839389511629410304/artifacts/vsphere-agent-compact-fips-f14/cucushift-agent-gather/artifacts/agent-gather.tar.xz

Description of problem:

Build tests in OCP 4.14 reference Ruby images that are now EOL. The related code in our sample ruby build was deleted.
    

Version-Release number of selected component (if applicable):

4.14
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run the build suite for OCP 4.14 against a 4.14 cluster
    

Actual results:

Test [sig-builds][Feature:Builds][Slow] builds with a context directory s2i context directory build should s2i build an application using a context directory [apigroup:build.openshift.io] fails

2024-05-08T11:11:57.558298778Z I0508 11:11:57.558273       1 builder.go:400] Powered by buildah v1.31.0
  2024-05-08T11:11:57.581578795Z I0508 11:11:57.581509       1 builder.go:473] effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true]
  2024-05-08T11:11:57.583755245Z I0508 11:11:57.583715       1 builder.go:401] redacted build: {"kind":"Build","apiVersion":"build.openshift.io/v1","metadata":{"name":"s2icontext-1","namespace":"e2e-test-contextdir-wpphk","uid":"c2db2893-06e5-4274-96ae-d8cd635a1f8d","resourceVersion":"51882","generation":1,"creationTimestamp":"2024-05-08T11:11:55Z","labels":{"buildconfig":"s2icontext","openshift.io/build-config.name":"s2icontext","openshift.io/build.start-policy":"Serial"},"annotations":{"openshift.io/build-config.name":"s2icontext","openshift.io/build.number":"1"},"ownerReferences":[{"apiVersion":"build.openshift.io/v1","kind":"BuildConfig","name":"s2icontext","uid":"b7dbb52b-ae66-4465-babc-728ae3ceed9a","controller":true}],"managedFields":[{"manager":"openshift-apiserver","operation":"Update","apiVersion":"build.openshift.io/v1","time":"2024-05-08T11:11:55Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.number":{}},"f:labels":{".":{},"f:buildconfig":{},"f:openshift.io/build-config.name":{},"f:openshift.io/build.start-policy":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"b7dbb52b-ae66-4465-babc-728ae3ceed9a\"}":{}}},"f:spec":{"f:output":{"f:to":{}},"f:serviceAccount":{},"f:source":{"f:contextDir":{},"f:git":{".":{},"f:uri":{}},"f:type":{}},"f:strategy":{"f:sourceStrategy":{".":{},"f:env":{},"f:from":{},"f:pullSecret":{}},"f:type":{}},"f:triggeredBy":{}},"f:status":{"f:conditions":{".":{},"k:{\"type\":\"New\"}":{".":{},"f:lastTransitionTime":{},"f:lastUpdateTime":{},"f:status":{},"f:type":{}}},"f:config":{},"f:phase":{}}}}]},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"https://github.com/sclorg/s2i-ruby-container"},"contextDir":"2.7/test/puma-test-app"},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/openshift/ruby:2.7-ubi8"},"pullSecret":{"name":"builder-dockercfg-v9xk2"},"env":[{"name":"BUILD_LOGLEVEL","value":"5"}]}},"output":{"to":{"kind":"DockerImage","name":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest"},"pushSecret":{"name":"builder-dockercfg-v9xk2"}},"resources":{},"postCommit":{},"nodeSelector":null,"triggeredBy":[{"message":"Manually triggered"}]},"status":{"phase":"New","outputDockerImageReference":"image-registry.openshift-image-registry.svc:5000/e2e-test-contextdir-wpphk/test:latest","config":{"kind":"BuildConfig","namespace":"e2e-test-contextdir-wpphk","name":"s2icontext"},"output":{},"conditions":[{"type":"New","status":"True","lastUpdateTime":"2024-05-08T11:11:55Z","lastTransitionTime":"2024-05-08T11:11:55Z"}]}}
  2024-05-08T11:11:57.584949442Z Cloning "https://github.com/sclorg/s2i-ruby-container" ...
  2024-05-08T11:11:57.585044449Z I0508 11:11:57.585030       1 source.go:237] git ls-remote --heads https://github.com/sclorg/s2i-ruby-container
  2024-05-08T11:11:57.585081852Z I0508 11:11:57.585072       1 repository.go:450] Executing git ls-remote --heads https://github.com/sclorg/s2i-ruby-container
  2024-05-08T11:11:57.840621917Z I0508 11:11:57.840572       1 source.go:237] 663daf43b2abb5662504638d017c7175a6cff59d	refs/heads/3.2-experimental
  2024-05-08T11:11:57.840621917Z 88b4e684576b3fe0e06c82bd43265e41a8129c5d	refs/heads/add_test_latest_imagestreams
  2024-05-08T11:11:57.840621917Z 12a863ab4b050a1365d6d59970dddc6743e8bc8c	refs/heads/master
  2024-05-08T11:11:57.840730405Z I0508 11:11:57.840714       1 source.go:69] Cloning source from https://github.com/sclorg/s2i-ruby-container
  2024-05-08T11:11:57.840793509Z I0508 11:11:57.840781       1 repository.go:450] Executing git clone --recursive --depth=1 https://github.com/sclorg/s2i-ruby-container /tmp/build/inputs
  2024-05-08T11:11:59.073229755Z I0508 11:11:59.073183       1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD
  2024-05-08T11:11:59.080132731Z I0508 11:11:59.080079       1 repository.go:450] Executing git rev-parse --verify HEAD
  2024-05-08T11:11:59.083626287Z I0508 11:11:59.083586       1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD
  2024-05-08T11:11:59.115407368Z I0508 11:11:59.115361       1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD
  2024-05-08T11:11:59.195276873Z I0508 11:11:59.195231       1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD
  2024-05-08T11:11:59.198916080Z I0508 11:11:59.198879       1 repository.go:450] Executing git --no-pager show -s --format=%ce HEAD
  2024-05-08T11:11:59.204712375Z I0508 11:11:59.204663       1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD
  2024-05-08T11:11:59.211098793Z I0508 11:11:59.211051       1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD
  2024-05-08T11:11:59.216192627Z I0508 11:11:59.216149       1 repository.go:450] Executing git config --get remote.origin.url
  2024-05-08T11:11:59.218615714Z 	Commit:	12a863ab4b050a1365d6d59970dddc6743e8bc8c (Bump common from `1f774c8` to `a957816` (#537))
  2024-05-08T11:11:59.218661988Z 	Author:	dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
  2024-05-08T11:11:59.218683019Z 	Date:	Tue Apr 9 15:24:11 2024 +0200
  2024-05-08T11:11:59.218722882Z I0508 11:11:59.218711       1 repository.go:450] Executing git rev-parse --abbrev-ref HEAD
  2024-05-08T11:11:59.234411732Z I0508 11:11:59.234366       1 repository.go:450] Executing git rev-parse --verify HEAD
  2024-05-08T11:11:59.237729596Z I0508 11:11:59.237698       1 repository.go:450] Executing git --no-pager show -s --format=%an HEAD
  2024-05-08T11:11:59.255304604Z I0508 11:11:59.255269       1 repository.go:450] Executing git --no-pager show -s --format=%ae HEAD
  2024-05-08T11:11:59.261113560Z I0508 11:11:59.261074       1 repository.go:450] Executing git --no-pager show -s --format=%cn HEAD
  2024-05-08T11:11:59.270006232Z I0508 11:11:59.269961       1 repository.go:450] Executing git --no-pager show -s --format=%ce HEAD
  2024-05-08T11:11:59.278485984Z I0508 11:11:59.278443       1 repository.go:450] Executing git --no-pager show -s --format=%ad HEAD
  2024-05-08T11:11:59.281940527Z I0508 11:11:59.281906       1 repository.go:450] Executing git --no-pager show -s --format=%<(80,trunc)%s HEAD
  2024-05-08T11:11:59.299465312Z I0508 11:11:59.299423       1 repository.go:450] Executing git config --get remote.origin.url
  2024-05-08T11:11:59.374652834Z error: provided context directory does not exist: 2.7/test/puma-test-app
    

Expected results:

Tests succeed
    

Additional info:

Ruby 2.7 is EOL and not searchable in the Red Hat container catalog.

Failing test: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-openshift-controller-manager-operator/344/pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.14-openshift-e2e-aws-builds-techpreview/1788152058105303040
    

Description of problem:

    The Compute nodes table does not display correct filesystem data

Version-Release number of selected component (if applicable):

    4.16.0-0.ci-2024-04-29-054754

How reproducible:

    Always

Steps to Reproduce:

    1. In an Openshift cluster 4.16.0-0.ci-2024-04-29-054754
    2. Go to the Compute / Nodes menu
    3. Check the Filesystem column
    

Actual results:

    There is no storage data displayed

Expected results:

    The query is executed correctly and the storage data is displayed correctly

Additional info:

    The query has an error, as it is not concatenating things correctly: https://github.com/openshift/console/blob/master/frontend/packages/console-app/src/components/nodes/NodesPage.tsx#L413
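For comparison, a well-formed per-node filesystem usage query can be checked against thanos-querier directly; the metric names are the usual node-exporter ones and the mountpoint filter is illustrative:

% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum by (instance) (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})'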


 

Description of problem:

oc-mirror should not panic when a wrong loglevel is specified

Version-Release number of selected component (if applicable):

oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1. Run command: `oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2  --loglevel -h`

Actual results:

The command panic with error: 
oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2  --loglevel -h
2024/07/31 05:22:41  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/07/31 05:22:41  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/07/31 05:22:41  [INFO]   : ⚙️  setting up the environment for you...
2024/07/31 05:22:41  [INFO]   : 🔀 workflow mode: diskToMirror 
2024/07/31 05:22:41  [ERROR]  : parsing config error parsing local storage configuration : invalid loglevel -h Must be one of [error, warn, info, debug]
panic: StorageDriver not registered: 
goroutine 1 [running]:
github.com/distribution/distribution/v3/registry/handlers.NewApp({0x5634e98, 0x76ea4a0}, 0xc000a7c388)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:126 +0x2374
github.com/distribution/distribution/v3/registry.NewRegistry({0x5634e98?, 0x76ea4a0?}, 0xc000a7c388)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/registry.go:141 +0x56
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).setupLocalStorage(0xc000a78488)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:571 +0x3c6
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc00090f208, {0xc0007ae300, 0x1, 0x8})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:201 +0x27f
github.com/spf13/cobra.(*Command).execute(0xc00090f208, {0xc0000520a0, 0x8, 0x8})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc00090f208)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x74bc8d8?)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
	/go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18

Expected results:

Exit with an error; it should not panic
 

 

Download and merge the French and Spanish language translations in the OCP Console.

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Version-Release number of selected component (if applicable):

Cypress test cannot be run locally.  This appears to be the result of `window.SERVER_FLAGS.authDisabled` always having a value of `false` when auth is in fact disabled.

How reproducible:

Always

Steps to Reproduce:

    1.  Run `yarn test-cypress-console` with [auth disabled|https://github.com/openshift/console?tab=readme-ov-file#openshift-no-authentication]
    2.  Run any of the tests (e.g., masthead.cy.ts)
    3.  Note the test fails because the test tries to login even though auth is disabled.  This appears to be because https://github.com/openshift/console/blob/d26868305edc663e8b251e5d73a7c62f7a01cd8c/frontend/packages/integration-tests-cypress/support/login.ts#L28 fails since `window.SERVER_FLAGS.authDisabled` incorrectly has a value of `false`.

Description of problem:

    After successfully creating a NAD of type: "OVN Kubernetes secondary localnet network", when viewing the object in the GUI, it will say that it is of type "OVN Kubernetes L2 overlay network".
When examining the object's YAML, it is still correctly configured as a NAD of type localnet.
Version-Release number of selected component:
OCP Virtualization 4.15.1
How reproducible: 100%
Steps to Reproduce:
1. Create appropriate NNCP and apply
for example:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-br-ex-vlan-101
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: '' 
  desiredState:
    ovn:
      bridge-mappings:
      - localnet: vlan-101 
        bridge: br-ex
        state: present 

2. Create localnet type NAD (from GUI or YAML)
For example:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-101
  namespace: default
spec:
  config: |2 
    {
           "name":"br-ex",
           "type":"ovn-k8s-cni-overlay",
           "cniVersion":"0.4.0",
           "topology":"localnet",
           "vlanID":101,
           "netAttachDefName":"default/vlan-101"
     } 
3. View through the GUI by clicking on Networking -> NetworkAttachmentDefinitions -> the NAD you just created
4. When you look under Type, it will incorrectly display as Type: OVN Kubernetes L2 overlay Network
 
Actual results:
Type is displayed as OVN Kubernetes L2 overlay Network
If you examine the YAML for the NAD you will see that it is indeed still of type localnet
Please see attached screenshots for display of NAD type and the actual YAML of NAD.
At this point in time it looks as though this is just a display error.
Expected results:
Type should be displayed as OVN Kubernetes secondary localnet network

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-33308. The following is the description of the original issue:

Description of problem:

When creating an OCP cluster on AWS and selecting "publish: Internal," 
the ingress operator may create external LB mappings to external 
subnets.

This can occur if public subnets were specified in the install-config during installation.

https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private 

A configuration validation should be added to the installer.    

Version-Release number of selected component (if applicable):

    4.14+ probably older versions as well.

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959

Description of problem:

The TestControllerConfigStuff e2e test was mistakenly merged into the main branch of the MCO repository. This test was supposed to be ephemeral and not actually merged into the repo. It was discovered during the cherrypick process for 4.16 and was removed there. However, it is still part of the main branch and should be removed.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

Run the test-e2e-techpreview CI job

Actual results:

This test is present and executes as part of the job.

Expected results:

This test should not be present, nor should it execute.

Additional info:

    
openshift-install image-based create config-template --dir configuration-dir
INFO Config-Template created in: configuration-dir 


openshift-install image-based create config-template --dir configuration-dir
FATAL failed to fetch Image-based Config ISO configuration: failed to load asset "Image-based Config ISO configuration": invalid Image-based Config configuration: networkConfig: Invalid value: interfaces: 
FATAL - ipv4:                                      
FATAL     address:                                 
FATAL     - ip: 192.168.122.2                      
FATAL       prefix-length: 23                      
FATAL     dhcp: false                              
FATAL     enabled: true                            
FATAL   mac-address: "00:00:00:00:00:00"           
FATAL   name: eth0                                 
FATAL   state: up                                  
FATAL   type: ethernet                             
FATAL : install nmstate package, exec: "nmstatectl": executable file not found in $PATH 

We shouldn't see the above error.

Please review the following PR: https://github.com/openshift/thanos/pull/146

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem

Debug into one of the worker nodes on the hosted cluster:

oc debug node/ip-10-1-0-97.ca-central-1.compute.internal

nslookup kubernetes.default.svc.cluster.local
Server:         10.1.0.2
Address:        10.1.0.2#53

** server can't find kubernetes.default.svc.cluster.local: NXDOMAIN

curl -k https://172.30.0.1:443/readyz
curl: (7) Failed to connect to 172.30.0.1 port 443: Connection refused

sh-5.1# curl -k https://172.20.0.1:443/readyz
ok

Version-Release number of selected component (if applicable):

4.15.20

Steps to Reproduce:

Unknown

Actual results:

Pods on a hosted cluster's workers are unable to connect to their internal kube apiserver via the service IP.

Expected results:

Pods on a hosted cluster's workers have connectivity to their kube apiserver via the service IP.

Additional info:

Checked the "Konnectivity server" logs on Dynatrace and found the error below occurs repeatedly

E0724 01:02:00.223151       1 server.go:895] "DIAL_RSP contains failure" err="dial tcp 172.30.176.80:8443: i/o timeout" dialID=8375732890105363305 agentID="1eab211f-6ea1-46ea-bc78-14d75d6ba325"

E0724 01:02:00.223482       1 tunnel.go:150] "Received failure on connection" err="read tcp 10.128.17.15:8090->10.128.82.107:52462: use of closed network connection" 
  • It looks like the konnectivity server is trying to establish a connection to 172.30.176.80:8443 but is timing out
  • The second error indicates that an existing network connection was closed unexpectedly

Relevant OHSS Ticket: https://issues.redhat.com/browse/OHSS-36053

Slack thread discussion

Description of problem:

When running an agent-based installation with the arm64 or multi payload, after booting the ISO file, assisted-service raises the error below and the installation fails to start:

Openshift version 4.16.0-0.nightly-arm64-2024-04-02-182838 for CPU architecture arm64 is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-arm64-2024-04-02-182838' and CPU architecture 'arm64'" go-id=419 pkg=Inventory request_id=5817b856-ca79-43c0-84f1-b38f733c192f 

The same error when running the installation with multi-arch build in assisted-service.log:

Openshift version 4.16.0-0.nightly-multi-2024-04-01-135550 for CPU architecture multi is not supported: no release image found for openshiftVersion: '4.16.0-0.nightly-multi-2024-04-01-135550' and CPU architecture 'multi'" go-id=306 pkg=Inventory request_id=21a47a40-1de9-4ee3-9906-a2dd90b14ec8 

Amd64 build works fine for now.

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

1. Create the agent ISO with the openshift-install binary: openshift-install agent create image, using the arm64 or multi payload (see the command sketch below)
2. Boot the ISO
3. Track the "openshift-install agent wait-for bootstrap-complete" output and the assisted-service log
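
The steps above in command form (a sketch; <assets-dir> is a placeholder directory containing install-config.yaml and agent-config.yaml built against the arm64 or multi payload):

# Step 1: generate the agent ISO
openshift-install agent create image --dir <assets-dir> --log-level debug

# Step 2: boot the generated agent ISO (agent.x86_64.iso or agent.aarch64.iso) on the hosts

# Step 3: track bootstrap progress while following the assisted-service log on the rendezvous host
openshift-install agent wait-for bootstrap-complete --dir <assets-dir> --log-level debug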
    

Actual results:

The installation cannot start due to the error above.

Expected results:

The installation completes successfully.

Additional info:

assisted-service log: https://docs.google.com/spreadsheets/d/1Jm-eZDrVz5so4BxsWpUOlr3l_90VmJ8FVEvqUwG8ltg/edit#gid=0

Job failure URLs:
multi payload: 
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-baremetal-compact-agent-ipv4-dhcp-day2-amd-mixarch-f14/1774134780246364160

arm64 payload:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-arm64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1773354788239446016

Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/119

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Bootstrap destroy failed in CI with:

level=fatal msg=error destroying bootstrap resources failed during the destroy bootstrap hook: failed to remove bootstrap SSH rule: failed to update AWSCluster during bootstrap destroy: Operation cannot be fulfilled on awsclusters.infrastructure.cluster.x-k8s.io "ci-op-nk1s6685-77004-4gb4d": the object has been modified; please apply your changes to the latest version and try again

Version-Release number of selected component (if applicable):

 

How reproducible:

Unclear. CI search returns no results. Observed it as a single failure (aws-ovn job, linked below) in the testing of https://amd64.ocp.releases.ci.openshift.org/releasestream/4.17.0-0.nightly/release/4.17.0-0.nightly-2024-06-15-004118 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn/1801780204167761920

 

Two possible solutions:

  1. Add retries in the case of a failure like this or to updating in general
  2. Switch to sdk-based destroy rather than capi-based

Description of problem:

    The current api version used by the registry operator does not include the recently added "ChunkSizeMiB" feature gate. We need to bump the openshift/api to latest so that this feature gate becomes available for use.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Private HC provision failed on AWS. 

How reproducible:

Always. 

Steps to Reproduce:

Create a private HC on AWS following the steps in https://hypershift-docs.netlify.app/how-to/aws/deploy-aws-private-clusters/:

RELEASE_IMAGE=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-20-005211
HO_IMAGE=quay.io/hypershift/hypershift-operator:latest
BUCKET_NAME=fxie-hcp-bucket
REGION=us-east-2
AWS_CREDS="$HOME/.aws/credentials"
CLUSTER_NAME=fxie-hcp-1
BASE_DOMAIN=qe.devcluster.openshift.com
EXT_DNS_DOMAIN=hypershift-ext.qe.devcluster.openshift.com
PULL_SECRET="/Users/fxie/Projects/hypershift/.dockerconfigjson"

hypershift install --oidc-storage-provider-s3-bucket-name $BUCKET_NAME --oidc-storage-provider-s3-credentials $AWS_CREDS --oidc-storage-provider-s3-region $REGION --private-platform AWS --aws-private-creds $AWS_CREDS --aws-private-region=$REGION --wait-until-available --hypershift-image $HO_IMAGE

hypershift create cluster aws --pull-secret=$PULL_SECRET --aws-creds=$AWS_CREDS --name=$CLUSTER_NAME --base-domain=$BASE_DOMAIN --node-pool-replicas=2 --region=$REGION --endpoint-access=Private --release-image=$RELEASE_IMAGE --generate-ssh

Additional info:

From the MC:
$ for k in $(oc get secret -n clusters-fxie-hcp-1 | grep -i kubeconfig | awk '{print $1}'); do echo $k; oc extract secret/$k -n clusters-fxie-hcp-1 --to - 2>/dev/null | grep -i 'server:'; done
admin-kubeconfig
    server: https://a621f63c3c65f4e459f2044b9521b5e9-082a734ef867f25a.elb.us-east-2.amazonaws.com:6443
aws-pod-identity-webhook-kubeconfig
    server: https://kube-apiserver:6443
bootstrap-kubeconfig
    server: https://api.fxie-hcp-1.hypershift.local:443
cloud-credential-operator-kubeconfig
    server: https://kube-apiserver:6443
dns-operator-kubeconfig
    server: https://kube-apiserver:6443
fxie-hcp-1-2bsct-kubeconfig
    server: https://kube-apiserver:6443
ingress-operator-kubeconfig
    server: https://kube-apiserver:6443
kube-controller-manager-kubeconfig
    server: https://kube-apiserver:6443
kube-scheduler-kubeconfig
    server: https://kube-apiserver:6443
localhost-kubeconfig
    server: https://localhost:6443
service-network-admin-kubeconfig
    server: https://kube-apiserver:6443

 

The bootstrap-kubeconfig uses an incorrect KAS port (it should be 6443, since the KAS is exposed through an LB), causing the kubelet on each HC node to use the same incorrect port. As a result, AWS VMs are provisioned but cannot join the HC as nodes.

From a bastion:
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.
[ec2-user@ip-10-0-5-182 ~]$ nc -zv api.fxie-hcp-1.hypershift.local 6443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.143.91:6443.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

 

In addition, the CNO passes the wrong KAS port to network components on the HC.

 

The same wrong port appears in the HAProxy configuration on the VMs:

frontend local_apiserver
  bind 172.20.0.1:6443
  log global
  mode tcp
  option tcplog
  default_backend remote_apiserver

backend remote_apiserver
  mode tcp
  log global
  option httpchk GET /version
  option log-health-checks
  default-server inter 10s fall 3 rise 3
  server controlplane api.fxie-hcp-1.hypershift.local:443 

Description of problem:

    We should not require the s3:DeleteObject permission for installs when the `preserveBootstrapIgnition` option is set in the install-config.

Version-Release number of selected component (if applicable):

    4.14+

How reproducible:

    always

Steps to Reproduce:

    1. Use an account without the permission
    2. Set `preserveBootstrapIgnition: true` in the install-config.yaml
    3. Try to deploy a cluster
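
A minimal sketch of steps 1 and 2 (assumes yq v4 and that the option lives under platform.aws in install-config.yaml; the "denys3" profile name is taken from the output below):

# Enable the option in an existing install-config.yaml (field path assumed)
yq -i '.platform.aws.preserveBootstrapIgnition = true' install-config.yaml

# Run the installer with a credentials profile that lacks s3:DeleteObject
AWS_PROFILE=denys3 openshift-install create cluster --dir . --log-level info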
    

Actual results:

INFO Credentials loaded from the "denys3" profile in file "/home/cloud-user/.aws/credentials"
INFO Consuming Install Config from target directory
WARNING Action not allowed with tested creds          action=s3:DeleteBucket
WARNING Action not allowed with tested creds          action=s3:DeleteObject
WARNING Action not allowed with tested creds          action=s3:DeleteObject
WARNING Tested creds not able to perform all requested actions
FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: current credentials insufficient for performing cluster installation

Expected results:

    No permission errors.

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1047

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When one of our partners was trying to deploy a 4.16 spoke cluster with the ZTP/GitOps approach, they got the following error message in their assisted-service pod:

error msg="failed to get corresponding infraEnv" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:409" error="record not found" go-id=497 preprovisioning_image=storage-1.fi-911.tre.nsn-rdnet.net preprovisioning_image_namespace=fi-911 request_id=cc62d8f6-d31f-4f74-af50-3237df186dc2

 

After some discussion in the Assisted-Installer forum (https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1723196754444999), Nick Carboni and Alona Paz suggested that "identifier: mac-address" is not supported. The partner currently has ACM 2.11.0 and MCE 2.6.0. However, their older cluster had ACM 2.10 and MCE 2.4.5, and this parameter worked there. Nick and Alona suggested removing "identifier: mac-address" from the SiteConfig, after which the installation started to progress. Based on Nick's suggestion, I opened this bug ticket to understand why this stopped working. The partner asked for official documentation on why this parameter no longer works, or whether it is no longer supported.
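
A minimal sketch of the workaround described above (the file name and exact location of the field are assumptions about the partner's SiteConfig layout):

# Locate the unsupported field in the SiteConfig manifest used by the ZTP/GitOps repo
grep -n 'identifier: mac-address' siteconfig.yaml

# Drop it (GNU sed shown; adjust the path to the actual SiteConfig file)
sed -i '/identifier: mac-address/d' siteconfig.yaml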

Changing apiserverConfig.Spec.TLSSecurityProfile now makes MCO rollout nodes (see https://github.com/openshift/machine-config-operator/pull/4435) which is disruptive for other tests.

"the simplest solution would be to skip that test for now, and think about how to rewrite/mock it later (even though we try to make tests take that into account, nodes rollout is still a disruptive operation)."

For more context, check https://redhat-internal.slack.com/archives/C026QJY8WJJ/p1722348618775279
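
For context, an edit of the kind that now triggers the rollout looks like this (illustrative only; the profile value is an example, not necessarily what the test sets):

# Switching the cluster-wide TLS security profile on the APIServer config
oc patch apiserver.config.openshift.io cluster --type=merge \
  -p '{"spec":{"tlsSecurityProfile":{"type":"Old","old":{}}}}'

# After https://github.com/openshift/machine-config-operator/pull/4435 this rolls out new
# MachineConfigs and reboots nodes; watch the pools with:
oc get mcp -w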

Description of problem:

We have runbook for OVNKubernetesNorthdInactive: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/OVNKubernetesNorthdInactive.md

But the runbook url is not added for alert OVNKubernetesNorthdInactive:
4.12: https://github.com/openshift/cluster-network-operator/blob/c1a891129c310d01b8d6940f1eefd26058c0f5b6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350
4.13: https://github.com/openshift/cluster-network-operator/blob/257435702312e418be694f4b98b8fe89557030c6/bindata/network/ovn-kubernetes/managed/alert-rules-control-plane.yaml#L350
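
A quick way to check whether the deployed alert carries a runbook_url annotation on a live cluster (a sketch; it greps the rendered PrometheusRules rather than assuming a specific rule name):

oc get prometheusrules -A -o yaml | grep -A 15 'alert: OVNKubernetesNorthdInactive' | grep -E 'alert:|runbook_url'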

Version-Release number of selected component (if applicable):

4.12.z, 4.13.z

How reproducible:

always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/aws-encryption-provider/pull/19

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Previously, in OCPBUGS-32105, we fixed a bug where a race between the assisted-installer and the assisted-installer-controller to mark a Node as Joined would result in 30+ minutes of (unlogged) retries by the former if the latter won. This was indistinguishable from the installation process hanging, and it would eventually time out.

This bug has been fixed, but we were unable to reproduce the circumstances that caused it.

However, a reproduction by the customer reveals another problem: we now correctly retry checking the control plane nodes for readiness if we encounter a conflict with another write from assisted-installer-controller. However, we never reload fresh data from assisted-service - data that would show the host has already been updated and thus prevent us from trying to update it again. Therefore, we continue to get a conflict on every retry. (This is at least now logged, so we can see what is happening.)

This also suggests a potential way to reproduce the problem: whenever one control plane node has booted to the point that the assisted-installer-controller is running before the second control plane node has booted to the point that the Node is marked as ready in the k8s API, there is a possibility of a race. There is in fact no need for the write from assisted-installer-controller to come in the narrow window between when assisted-installer reads vs. writes to the assisted-service API, because assisted-installer is always using a stale read.

The new test: [sig-node] kubelet metrics endpoints should always be reachable

Is picking up some upgrade job runs where we see the metrics endpoint go down for about 30 seconds during the generic node update phase and recover before we reboot the node. This is treated as a reason to flake the test because, as initially written, there was no overlap with a reboot.

Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1806142925785010176
Interval chart showing the problem: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1806142925785010176/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/intervals?filterText=master-1&intervalFile=e2e-timelines_spyglass_20240627-024633.json&overrideDisplayFlag=0&selectedSources=E2EFailed&selectedSources=MetricsEndpointDown&selectedSources=NodeState

The master outage at 3:30:59 is causing a flake when I'd rather it didn't, because it doesn't extend into the reboot.

I'd like to tighten this up to include any overlap with update.

Will be backported to 4.16 to tighten the signal there as well.

Description of problem:

    The Developer perspective dashboards page deduplicates data before showing it in the table.

How reproducible:

    

Steps to Reproduce:

    1. Apply this dashboard yaml (https://drive.google.com/file/d/1PcErgAKqu95yFi5YDAM5LxaTEutVtbrs/view?usp=sharing)
    2. Open the dashboard in the Admin console; it should list all the rows
    3. Open the dashboard in the Developer console, selecting the openshift-kube-scheduler project
    4. See that when varying Plugin values are available under the Execution Time table, they are combined per Pod in the Developer perspective
    

 

Actual results:

    The Developer Perspective Dashboards Table doesn't display all rows returned from a query.

Expected results:

    The Developer Perspective Dashboards Table displays all rows returned from a query.

Additional info:

Admin Console: https://drive.google.com/file/d/1EIMYHBql0ql1zYiKlqOJh7hyqG-JFjla/view?usp=sharing

Developer Console: https://drive.google.com/file/d/1jk-Fxq9I6LDYzBGLFTUDDsGqERzwWJrl/view?usp=sharing

 

It works as expected on OCP <= 4.14

Description of problem:
Trying to delete an application deployed using Serverless, with a user with limited permissions, causes the "Delete application" form to complain:

pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"

This prevents the deletion. Worth adding that the cluster doesn't have Pipelines installed.
See the screenshot: https://drive.google.com/file/d/1bsQ_NFO_grj_fE-UInUJXum39bPsHJh1
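
A quick way to confirm the permission gap from the CLI (the user and namespace names are taken from the error message; illustrative only):

# Check whether the limited user can list Tekton pipelines in the affected namespace
oc auth can-i list pipelines.tekton.dev -n test-cluster-local --as uitesting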

Version-Release number of selected component (if applicable):

4.15.0
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a limited user
    2. Deploy some application, not necessarily a Serverless one
    3. Try to delete the "application" using the Dev Console
    

Actual results:

An irrelevant error is shown, preventing the deletion: pipelines.tekton.dev is forbidden: User "uitesting" cannot list resource "pipelines" in API group "tekton.dev" in the namespace "test-cluster-local"
    

Expected results:

The app should be removed, along with everything that is labelled as part of it.
    

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When setting an overlapping CIDR for v4InternalSubnet, live migration is not blocked.

#oc patch Network.operator.openshift.io cluster --type='merge' --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig": {"v4InternalSubnet": "10.128.0.0/16"}}}}'

#oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

Version-Release number of selected component (if applicable):

https://github.com/openshift/cluster-network-operator/pull/2392/commits/50201625861ba30570313d8f28c14e59e83f112a

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

I can see: 
network                                    4.16.0-0.ci.test-2024-06-05-023541-ci-ln-hhzztr2-latest   True        False         True       163m    The cluster configuration is invalid (network clusterNetwork(10.128.0.0/14) overlaps with network v4InternalSubnet(10.128.0.0/16)). Use 'oc edit network.config.openshift.io cluster' to fix.
   
 However, the migration still proceeds afterwards.
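
A quick way to compare the two subnets before starting the migration (a sketch):

# Cluster network CIDR(s) from the cluster Network config
oc get network.config.openshift.io cluster -o jsonpath='{.spec.clusterNetwork[*].cidr}{"\n"}'

# v4InternalSubnet from the operator config (empty output means the default is in use)
oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.v4InternalSubnet}{"\n"}'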

Expected results:

    Migration should be blocked.

Additional info:

    

Description of problem:

    Bursting hosted cluster creation on a management cluster with size tagging enabled results in some hosted clusters taking a very long time to be scheduled.

Version-Release number of selected component (if applicable):

    4.16.0

How reproducible:

    always

Steps to Reproduce:

    1. Setup a management cluster with request serving architecture and size tagging enabled.
    2. Configure clustersizing configuration for concurrency of 5 clusters at a time over a 10 minute window.
    3. Create many clusters at the same time
    

Actual results:

    Some of the created clusters take a very long time to be scheduled.

Expected results:

    New clusters take at most the time required to bring up request serving nodes to be scheduled. 

Additional info:

    The concurrency settings of the clustersizing configuration are being applied to both existing clusters and new incoming clusters. They should not be applied to net-new HostedClusters.

Description of problem:

When using IPI for IBM Cloud to create a Private BYON cluster, the installer attempts to fetch the VPC resource to verify whether it is already a PermittedNetwork for the DNS Services zone.
However, IBM Cloud currently lists a new VPC region, eu-es, which has not yet GA'd. While it appears in the list of available VPC regions to search for resources, requests to eu-es fail. Any attempt to use a VPC region that sorts alphabetically after eu-es (they appear to be returned in that order) fails because of the requests made to eu-es. This includes eu-gb, us-east, and us-south, and it causes a golang panic.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Create IBM Cloud BYON resources in us-east or us-south
2. Attempt to create a Private BYON based cluster in us-east or us-south

Actual results:

DEBUG   Fetching Common Manifests...               
DEBUG   Reusing previously-fetched Common Manifests 
DEBUG Generating Terraform Variables...            
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x2bdb706]

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/installconfig/ibmcloud.(*Metadata).IsVPCPermittedNetwork(0xc000e89b80, {0x1a8b9918, 0xc00007c088}, {0xc0009d8678, 0x8})
	/go/src/github.com/openshift/installer/pkg/asset/installconfig/ibmcloud/metadata.go:175 +0x186
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1dc55040, 0x5?)
	/go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:606 +0x3a5a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000ca0d80, {0x1a8ab280, 0x1dc55040}, {0x0, 0x0})
	/go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x5fa
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffd948754cc?, {0x1a8ab280, 0x1dc55040}, {0x1dc32840, 0x8, 0x8})
	/go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x48
main.runTargetCmd.func1({0x7ffd948754cc, 0xb})
	/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:261 +0x125
main.runTargetCmd.func2(0x1dc38800?, {0xc000ca0a80?, 0x3?, 0x3?})
	/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:291 +0xe7
github.com/spf13/cobra.(*Command).execute(0x1dc38800, {0xc000ca0a20, 0x3, 0x3})
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000bc8000)
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
	/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
	/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff

Expected results:

Successful Private cluster creation using BYON on IBM Cloud

Additional info:

IBM Cloud development has identified the issue and is working on a fix to all affected supported releases (4.12, 4.13, 4.14+)

This is a clone of issue OCPBUGS-38085. The following is the description of the original issue:

Description of problem:

Multipart upload issues with Cloudflare R2 using the S3 API. Some S3-compatible object storage systems like R2 require that all multipart chunks be the same size. This was mostly true before, except that the final chunk could be larger than the requested chunk size, which causes uploads to fail.
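
An illustration with made-up numbers: with a requested chunk size of 100 MiB, a 1050 MiB blob would previously be uploaded as nine 100 MiB parts plus a 150 MiB final part; R2 rejects the oversized final part, whereas ten 100 MiB parts followed by a 50 MiB final part is accepted (the last part may be smaller than the chunk size, but not larger).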

Version-Release number of selected component (if applicable):

    

How reproducible:

    Problem shows itself on OpenShift CI clusters intermittently.

Steps to Reproduce:

This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters.
It is connected to uploading big images (e.g. 35 GB), but we do not currently have the exact steps to reproduce it.

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    https://github.com/distribution/distribution/issues/3873 
    https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705
    https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/292

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    For a cluster with one worker machine of the A3 instance type, "destroy cluster" keeps reporting the failure below until I stopped the instance via "gcloud".

WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-multi-2024-05-29-143245

How reproducible:

    Always

Steps to Reproduce:

    1. "create install-config" and then "create manifests"
    2. edit a worker machineset YAML, to specify "machineType: a3-highgpu-8g" along with "onHostMaintenance: Terminate"
    3. "create cluster", and make sure it succeeds
    4. "destroy cluster"     

Actual results:

    Uninstalling the cluster keeps reporting the stop-instance error.

Expected results:

    "destroy cluster" should proceed without any warning/error, and delete everything finally.

Additional info:

FYI the .openshift-install.log is available at https://drive.google.com/file/d/15xIwzi0swDk84wqg32tC_4KfUahCalrL/view?usp=drive_link

FYI, stopping the A3 instance via "gcloud" with "--discard-local-ssd=false" does succeed.

$ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
2024-05-29 21:10:31  us-central1-c  RUNNING     jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
$ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
ERROR: (gcloud.compute.instances.stop) HTTPError 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command.
$ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c --discard-local-ssd=false
Stopping instance(s) jiwei-0530b-q9t8w-worker-c-ck6s8...done.                                                                                    
Updated [https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
$ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
2024-05-29 21:10:31  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
$ gcloud compute instances delete -q jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
$ 

Description of problem:

When installing a 4.16 cluster whose public API DNS record already exists, the installer reports Terraform Variables initialization errors, which is unexpected since Terraform support is supposed to have been removed from the installer.


05-19 17:36:32.935  level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]
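
A quick way to confirm the pre-existing record before re-running the installer (a sketch; the hosted zone ID is a placeholder):

aws route53 list-resource-record-sets --hosted-zone-id <ZONE_ID> \
  --query "ResourceRecordSets[?Name=='api.gpei-0519a.qe.devcluster.openshift.com.']"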

Version-Release number of selected component (if applicable):

 4.16.0-0.nightly-2024-05-18-212906, which has the CAPI install as default   

How reproducible:

    

Steps to Reproduce:

    1. Create a 4.16 cluster with the cluster name: gpei-0519a
    2. After the cluster installation finishes, try to create a second cluster with the same cluster name
    

Actual results:

    05-19 17:36:26.390  level=debug msg=OpenShift Installer 4.16.0-0.nightly-2024-05-18-212906
05-19 17:36:26.390  level=debug msg=Built from commit 3eed76e1400cac88af6638bb097ada1607137f3f
05-19 17:36:26.390  level=debug msg=Fetching Metadata...
05-19 17:36:26.390  level=debug msg=Loading Metadata...
05-19 17:36:26.390  level=debug msg=  Loading Cluster ID...
05-19 17:36:26.390  level=debug msg=    Loading Install Config...
05-19 17:36:26.390  level=debug msg=      Loading SSH Key...
05-19 17:36:26.390  level=debug msg=      Loading Base Domain...
05-19 17:36:26.390  level=debug msg=        Loading Platform...
05-19 17:36:26.390  level=debug msg=      Loading Cluster Name...
05-19 17:36:26.390  level=debug msg=        Loading Base Domain...
05-19 17:36:26.390  level=debug msg=        Loading Platform...
05-19 17:36:26.390  level=debug msg=      Loading Pull Secret...
05-19 17:36:26.390  level=debug msg=      Loading Platform...
05-19 17:36:26.390  level=debug msg=    Using Install Config loaded from state file
05-19 17:36:26.391  level=debug msg=  Using Cluster ID loaded from state file
05-19 17:36:26.391  level=debug msg=  Loading Install Config...
05-19 17:36:26.391  level=debug msg=  Loading Bootstrap Ignition Config...
05-19 17:36:26.391  level=debug msg=    Loading Ironic bootstrap credentials...
05-19 17:36:26.391  level=debug msg=    Using Ironic bootstrap credentials loaded from state file
05-19 17:36:26.391  level=debug msg=    Loading CVO Ignore...
05-19 17:36:26.391  level=debug msg=      Loading Common Manifests...
05-19 17:36:26.391  level=debug msg=        Loading Cluster ID...
05-19 17:36:26.391  level=debug msg=        Loading Install Config...
05-19 17:36:26.391  level=debug msg=        Loading Ingress Config...
05-19 17:36:26.391  level=debug msg=          Loading Install Config...
05-19 17:36:26.391  level=debug msg=        Using Ingress Config loaded from state file
05-19 17:36:26.391  level=debug msg=        Loading DNS Config...
05-19 17:36:26.391  level=debug msg=          Loading Install Config...
05-19 17:36:26.392  level=debug msg=          Loading Cluster ID...
05-19 17:36:26.392  level=debug msg=          Loading Platform Credentials Check...
05-19 17:36:26.392  level=debug msg=            Loading Install Config...
05-19 17:36:26.392  level=debug msg=          Using Platform Credentials Check loaded from state file
05-19 17:36:26.392  level=debug msg=        Using DNS Config loaded from state file
05-19 17:36:26.392  level=debug msg=        Loading Infrastructure Config...
05-19 17:36:26.392  level=debug msg=          Loading Cluster ID...
05-19 17:36:26.392  level=debug msg=          Loading Install Config...
05-19 17:36:26.392  level=debug msg=          Loading Cloud Provider Config...
05-19 17:36:26.392  level=debug msg=            Loading Install Config...
05-19 17:36:26.392  level=debug msg=            Loading Cluster ID...
05-19 17:36:26.392  level=debug msg=            Loading Platform Credentials Check...
05-19 17:36:26.392  level=debug msg=          Using Cloud Provider Config loaded from state file
05-19 17:36:26.393  level=debug msg=          Loading Additional Trust Bundle Config...
05-19 17:36:26.393  level=debug msg=            Loading Install Config...
05-19 17:36:26.393  level=debug msg=          Using Additional Trust Bundle Config loaded from state file
05-19 17:36:26.393  level=debug msg=        Using Infrastructure Config loaded from state file
05-19 17:36:26.393  level=debug msg=        Loading Network Config...
05-19 17:36:26.393  level=debug msg=          Loading Install Config...
05-19 17:36:26.393  level=debug msg=        Using Network Config loaded from state file
05-19 17:36:26.393  level=debug msg=        Loading Proxy Config...
05-19 17:36:26.393  level=debug msg=          Loading Install Config...
05-19 17:36:26.393  level=debug msg=          Loading Network Config...
05-19 17:36:26.393  level=debug msg=        Using Proxy Config loaded from state file
05-19 17:36:26.393  level=debug msg=        Loading Scheduler Config...
05-19 17:36:26.394  level=debug msg=          Loading Install Config...
05-19 17:36:26.394  level=debug msg=        Using Scheduler Config loaded from state file
05-19 17:36:26.394  level=debug msg=        Loading Image Content Source Policy...
05-19 17:36:26.394  level=debug msg=          Loading Install Config...
05-19 17:36:26.394  level=debug msg=        Using Image Content Source Policy loaded from state file
05-19 17:36:26.394  level=debug msg=        Loading Cluster CSI Driver Config...
05-19 17:36:26.394  level=debug msg=          Loading Install Config...
05-19 17:36:26.394  level=debug msg=          Loading Cluster ID...
05-19 17:36:26.394  level=debug msg=        Using Cluster CSI Driver Config loaded from state file
05-19 17:36:26.394  level=debug msg=        Loading Image Digest Mirror Set...
05-19 17:36:26.394  level=debug msg=          Loading Install Config...
05-19 17:36:26.394  level=debug msg=        Using Image Digest Mirror Set loaded from state file
05-19 17:36:26.394  level=debug msg=        Loading Machine Config Server Root CA...
05-19 17:36:26.395  level=debug msg=        Using Machine Config Server Root CA loaded from state file
05-19 17:36:26.395  level=debug msg=        Loading Certificate (mcs)...
05-19 17:36:26.395  level=debug msg=          Loading Machine Config Server Root CA...
05-19 17:36:26.395  level=debug msg=          Loading Install Config...
05-19 17:36:26.395  level=debug msg=        Using Certificate (mcs) loaded from state file
05-19 17:36:26.395  level=debug msg=        Loading CVOOverrides...
05-19 17:36:26.395  level=debug msg=        Using CVOOverrides loaded from state file
05-19 17:36:26.395  level=debug msg=        Loading KubeCloudConfig...
05-19 17:36:26.395  level=debug msg=        Using KubeCloudConfig loaded from state file
05-19 17:36:26.395  level=debug msg=        Loading KubeSystemConfigmapRootCA...
05-19 17:36:26.395  level=debug msg=        Using KubeSystemConfigmapRootCA loaded from state file
05-19 17:36:26.395  level=debug msg=        Loading MachineConfigServerTLSSecret...
05-19 17:36:26.396  level=debug msg=        Using MachineConfigServerTLSSecret loaded from state file
05-19 17:36:26.396  level=debug msg=        Loading OpenshiftConfigSecretPullSecret...
05-19 17:36:26.396  level=debug msg=        Using OpenshiftConfigSecretPullSecret loaded from state file
05-19 17:36:26.396  level=debug msg=      Using Common Manifests loaded from state file
05-19 17:36:26.396  level=debug msg=      Loading Openshift Manifests...
05-19 17:36:26.396  level=debug msg=        Loading Install Config...
05-19 17:36:26.396  level=debug msg=        Loading Cluster ID...
05-19 17:36:26.396  level=debug msg=        Loading Kubeadmin Password...
05-19 17:36:26.396  level=debug msg=        Using Kubeadmin Password loaded from state file
05-19 17:36:26.396  level=debug msg=        Loading OpenShift Install (Manifests)...
05-19 17:36:26.396  level=debug msg=        Using OpenShift Install (Manifests) loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading Feature Gate Config...
05-19 17:36:26.397  level=debug msg=          Loading Install Config...
05-19 17:36:26.397  level=debug msg=        Using Feature Gate Config loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading CloudCredsSecret...
05-19 17:36:26.397  level=debug msg=        Using CloudCredsSecret loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading KubeadminPasswordSecret...
05-19 17:36:26.397  level=debug msg=        Using KubeadminPasswordSecret loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading RoleCloudCredsSecretReader...
05-19 17:36:26.397  level=debug msg=        Using RoleCloudCredsSecretReader loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading Baremetal Config CR...
05-19 17:36:26.397  level=debug msg=        Using Baremetal Config CR loaded from state file
05-19 17:36:26.397  level=debug msg=        Loading Image...
05-19 17:36:26.397  level=debug msg=          Loading Install Config...
05-19 17:36:26.398  level=debug msg=        Using Image loaded from state file
05-19 17:36:26.398  level=debug msg=        Loading AzureCloudProviderSecret...
05-19 17:36:26.398  level=debug msg=        Using AzureCloudProviderSecret loaded from state file
05-19 17:36:26.398  level=debug msg=      Using Openshift Manifests loaded from state file
05-19 17:36:26.398  level=debug msg=    Using CVO Ignore loaded from state file
05-19 17:36:26.398  level=debug msg=    Loading Install Config...
05-19 17:36:26.398  level=debug msg=    Loading Kubeconfig Admin Internal Client...
05-19 17:36:26.398  level=debug msg=      Loading Certificate (admin-kubeconfig-client)...
05-19 17:36:26.398  level=debug msg=        Loading Certificate (admin-kubeconfig-signer)...
05-19 17:36:26.398  level=debug msg=        Using Certificate (admin-kubeconfig-signer) loaded from state file
05-19 17:36:26.398  level=debug msg=      Using Certificate (admin-kubeconfig-client) loaded from state file
05-19 17:36:26.399  level=debug msg=      Loading Certificate (kube-apiserver-complete-server-ca-bundle)...
05-19 17:36:26.399  level=debug msg=        Loading Certificate (kube-apiserver-localhost-ca-bundle)...
05-19 17:36:26.399  level=debug msg=          Loading Certificate (kube-apiserver-localhost-signer)...
05-19 17:36:26.399  level=debug msg=          Using Certificate (kube-apiserver-localhost-signer) loaded from state file
05-19 17:36:26.399  level=debug msg=        Using Certificate (kube-apiserver-localhost-ca-bundle) loaded from state file
05-19 17:36:26.399  level=debug msg=        Loading Certificate (kube-apiserver-service-network-ca-bundle)...
05-19 17:36:26.399  level=debug msg=          Loading Certificate (kube-apiserver-service-network-signer)...
05-19 17:36:26.399  level=debug msg=          Using Certificate (kube-apiserver-service-network-signer) loaded from state file
05-19 17:36:26.399  level=debug msg=        Using Certificate (kube-apiserver-service-network-ca-bundle) loaded from state file
05-19 17:36:26.400  level=debug msg=        Loading Certificate (kube-apiserver-lb-ca-bundle)...
05-19 17:36:26.400  level=debug msg=          Loading Certificate (kube-apiserver-lb-signer)...
05-19 17:36:26.400  level=debug msg=          Using Certificate (kube-apiserver-lb-signer) loaded from state file
05-19 17:36:26.400  level=debug msg=        Using Certificate (kube-apiserver-lb-ca-bundle) loaded from state file
05-19 17:36:26.400  level=debug msg=      Using Certificate (kube-apiserver-complete-server-ca-bundle) loaded from state file
05-19 17:36:26.400  level=debug msg=      Loading Install Config...
05-19 17:36:26.400  level=debug msg=    Using Kubeconfig Admin Internal Client loaded from state file
05-19 17:36:26.400  level=debug msg=    Loading Kubeconfig Kubelet...
05-19 17:36:26.400  level=debug msg=      Loading Certificate (kube-apiserver-complete-server-ca-bundle)...
05-19 17:36:26.400  level=debug msg=      Loading Certificate (kubelet-client)...
05-19 17:36:26.401  level=debug msg=        Loading Certificate (kubelet-bootstrap-kubeconfig-signer)...
05-19 17:36:26.401  level=debug msg=        Using Certificate (kubelet-bootstrap-kubeconfig-signer) loaded from state file
05-19 17:36:26.401  level=debug msg=      Using Certificate (kubelet-client) loaded from state file
05-19 17:36:26.401  level=debug msg=      Loading Install Config...
05-19 17:36:26.401  level=debug msg=    Using Kubeconfig Kubelet loaded from state file
05-19 17:36:26.401  level=debug msg=    Loading Kubeconfig Admin Client (Loopback)...
05-19 17:36:26.401  level=debug msg=      Loading Certificate (admin-kubeconfig-client)...
05-19 17:36:26.401  level=debug msg=      Loading Certificate (kube-apiserver-localhost-ca-bundle)...
05-19 17:36:26.401  level=debug msg=      Loading Install Config...
05-19 17:36:26.401  level=debug msg=    Using Kubeconfig Admin Client (Loopback) loaded from state file
05-19 17:36:26.401  level=debug msg=    Loading Master Ignition Customization Check...
05-19 17:36:26.402  level=debug msg=      Loading Install Config...
05-19 17:36:26.402  level=debug msg=      Loading Machine Config Server Root CA...
05-19 17:36:26.402  level=debug msg=      Loading Master Ignition Config...
05-19 17:36:26.402  level=debug msg=        Loading Install Config...
05-19 17:36:26.402  level=debug msg=        Loading Machine Config Server Root CA...
05-19 17:36:26.402  level=debug msg=      Loading Master Ignition Config from both state file and target directory
05-19 17:36:26.402  level=debug msg=      On-disk Master Ignition Config matches asset in state file
05-19 17:36:26.402  level=debug msg=      Using Master Ignition Config loaded from state file
05-19 17:36:26.402  level=debug msg=    Using Master Ignition Customization Check loaded from state file
05-19 17:36:26.402  level=debug msg=    Loading Worker Ignition Customization Check...
05-19 17:36:26.402  level=debug msg=      Loading Install Config...
05-19 17:36:26.402  level=debug msg=      Loading Machine Config Server Root CA...
05-19 17:36:26.403  level=debug msg=      Loading Worker Ignition Config...
05-19 17:36:26.403  level=debug msg=        Loading Install Config...
05-19 17:36:26.403  level=debug msg=        Loading Machine Config Server Root CA...
05-19 17:36:26.403  level=debug msg=      Loading Worker Ignition Config from both state file and target directory
05-19 17:36:26.403  level=debug msg=      On-disk Worker Ignition Config matches asset in state file
05-19 17:36:26.403  level=debug msg=      Using Worker Ignition Config loaded from state file
05-19 17:36:26.403  level=debug msg=    Using Worker Ignition Customization Check loaded from state file
05-19 17:36:26.403  level=debug msg=    Loading Master Machines...
05-19 17:36:26.403  level=debug msg=      Loading Cluster ID...
05-19 17:36:26.403  level=debug msg=      Loading Platform Credentials Check...
05-19 17:36:26.403  level=debug msg=      Loading Install Config...
05-19 17:36:26.403  level=debug msg=      Loading Image...
05-19 17:36:26.404  level=debug msg=      Loading Master Ignition Config...
05-19 17:36:26.404  level=debug msg=    Using Master Machines loaded from state file
05-19 17:36:26.404  level=debug msg=    Loading Worker Machines...
05-19 17:36:26.404  level=debug msg=      Loading Cluster ID...
05-19 17:36:26.404  level=debug msg=      Loading Platform Credentials Check...
05-19 17:36:26.404  level=debug msg=      Loading Install Config...
05-19 17:36:26.404  level=debug msg=      Loading Image...
05-19 17:36:26.404  level=debug msg=      Loading Release...
05-19 17:36:26.404  level=debug msg=        Loading Install Config...
05-19 17:36:26.404  level=debug msg=      Using Release loaded from state file
05-19 17:36:26.404  level=debug msg=      Loading Worker Ignition Config...
05-19 17:36:26.404  level=debug msg=    Using Worker Machines loaded from state file
05-19 17:36:26.404  level=debug msg=    Loading Common Manifests...
05-19 17:36:26.404  level=debug msg=    Loading Openshift Manifests...
05-19 17:36:26.404  level=debug msg=    Loading Proxy Config...
05-19 17:36:26.405  level=debug msg=    Loading Certificate (admin-kubeconfig-ca-bundle)...
05-19 17:36:26.405  level=debug msg=      Loading Certificate (admin-kubeconfig-signer)...
05-19 17:36:26.405  level=debug msg=    Using Certificate (admin-kubeconfig-ca-bundle) loaded from state file
05-19 17:36:26.405  level=debug msg=    Loading Certificate (aggregator)...
05-19 17:36:26.405  level=debug msg=    Using Certificate (aggregator) loaded from state file
05-19 17:36:26.405  level=debug msg=    Loading Certificate (aggregator-ca-bundle)...
05-19 17:36:26.405  level=debug msg=      Loading Certificate (aggregator-signer)...
05-19 17:36:26.405  level=debug msg=      Using Certificate (aggregator-signer) loaded from state file
05-19 17:36:26.405  level=debug msg=    Using Certificate (aggregator-ca-bundle) loaded from state file
05-19 17:36:26.405  level=debug msg=    Loading Certificate (system:kube-apiserver-proxy)...
05-19 17:36:26.405  level=debug msg=      Loading Certificate (aggregator-signer)...
05-19 17:36:26.406  level=debug msg=    Using Certificate (system:kube-apiserver-proxy) loaded from state file
05-19 17:36:26.406  level=debug msg=    Loading Certificate (aggregator-signer)...
05-19 17:36:26.406  level=debug msg=    Loading Certificate (system:kube-apiserver-proxy)...
05-19 17:36:26.406  level=debug msg=      Loading Certificate (aggregator)...
05-19 17:36:26.406  level=debug msg=    Using Certificate (system:kube-apiserver-proxy) loaded from state file
05-19 17:36:26.406  level=debug msg=    Loading Bootstrap SSH Key Pair...
05-19 17:36:26.406  level=debug msg=    Using Bootstrap SSH Key Pair loaded from state file
05-19 17:36:26.406  level=debug msg=    Loading User-provided Service Account Signing key...
05-19 17:36:26.406  level=debug msg=    Using User-provided Service Account Signing key loaded from state file
05-19 17:36:26.406  level=debug msg=    Loading Cloud Provider CA Bundle...
05-19 17:36:26.406  level=debug msg=      Loading Install Config...
05-19 17:36:26.407  level=debug msg=    Using Cloud Provider CA Bundle loaded from state file
05-19 17:36:26.407  level=debug msg=    Loading Certificate (journal-gatewayd)...
05-19 17:36:26.407  level=debug msg=      Loading Machine Config Server Root CA...
05-19 17:36:26.407  level=debug msg=    Using Certificate (journal-gatewayd) loaded from state file
05-19 17:36:26.407  level=debug msg=    Loading Certificate (kube-apiserver-lb-ca-bundle)...
05-19 17:36:26.407  level=debug msg=    Loading Certificate (kube-apiserver-external-lb-server)...
05-19 17:36:26.407  level=debug msg=      Loading Certificate (kube-apiserver-lb-signer)...
05-19 17:36:26.407  level=debug msg=      Loading Install Config...
05-19 17:36:26.407  level=debug msg=    Using Certificate (kube-apiserver-external-lb-server) loaded from state file
05-19 17:36:26.407  level=debug msg=    Loading Certificate (kube-apiserver-internal-lb-server)...
05-19 17:36:26.407  level=debug msg=      Loading Certificate (kube-apiserver-lb-signer)...
05-19 17:36:26.408  level=debug msg=      Loading Install Config...
05-19 17:36:26.408  level=debug msg=    Using Certificate (kube-apiserver-internal-lb-server) loaded from state file
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-lb-signer)...
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-localhost-ca-bundle)...
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-localhost-server)...
05-19 17:36:26.408  level=debug msg=      Loading Certificate (kube-apiserver-localhost-signer)...
05-19 17:36:26.408  level=debug msg=    Using Certificate (kube-apiserver-localhost-server) loaded from state file
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-localhost-signer)...
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-service-network-ca-bundle)...
05-19 17:36:26.408  level=debug msg=    Loading Certificate (kube-apiserver-service-network-server)...
05-19 17:36:26.409  level=debug msg=      Loading Certificate (kube-apiserver-service-network-signer)...
05-19 17:36:26.409  level=debug msg=      Loading Install Config...
05-19 17:36:26.409  level=debug msg=    Using Certificate (kube-apiserver-service-network-server) loaded from state file
05-19 17:36:26.409  level=debug msg=    Loading Certificate (kube-apiserver-service-network-signer)...
05-19 17:36:26.409  level=debug msg=    Loading Certificate (kube-apiserver-complete-server-ca-bundle)...
05-19 17:36:26.409  level=debug msg=    Loading Certificate (kube-apiserver-complete-client-ca-bundle)...
05-19 17:36:26.409  level=debug msg=      Loading Certificate (admin-kubeconfig-ca-bundle)...
05-19 17:36:26.409  level=debug msg=      Loading Certificate (kubelet-client-ca-bundle)...
05-19 17:36:26.409  level=debug msg=        Loading Certificate (kubelet-signer)...
05-19 17:36:26.409  level=debug msg=        Using Certificate (kubelet-signer) loaded from state file
05-19 17:36:26.410  level=debug msg=      Using Certificate (kubelet-client-ca-bundle) loaded from state file
05-19 17:36:26.410  level=debug msg=      Loading Certificate (kube-control-plane-ca-bundle)...
05-19 17:36:26.410  level=debug msg=        Loading Certificate (kube-control-plane-signer)...
05-19 17:36:26.410  level=debug msg=        Using Certificate (kube-control-plane-signer) loaded from state file
05-19 17:36:26.410  level=debug msg=        Loading Certificate (kube-apiserver-lb-signer)...
05-19 17:36:26.410  level=debug msg=        Loading Certificate (kube-apiserver-localhost-signer)...
05-19 17:36:26.410  level=debug msg=        Loading Certificate (kube-apiserver-service-network-signer)...
05-19 17:36:26.410  level=debug msg=      Using Certificate (kube-control-plane-ca-bundle) loaded from state file
05-19 17:36:26.410  level=debug msg=      Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)...
05-19 17:36:26.411  level=debug msg=        Loading Certificate (kube-apiserver-to-kubelet-signer)...
05-19 17:36:26.411  level=debug msg=        Using Certificate (kube-apiserver-to-kubelet-signer) loaded from state file
05-19 17:36:26.411  level=debug msg=      Using Certificate (kube-apiserver-to-kubelet-ca-bundle) loaded from state file
05-19 17:36:26.411  level=debug msg=      Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)...
05-19 17:36:26.411  level=debug msg=        Loading Certificate (kubelet-bootstrap-kubeconfig-signer)...
05-19 17:36:26.411  level=debug msg=      Using Certificate (kubelet-bootstrap-kubeconfig-ca-bundle) loaded from state file
05-19 17:36:26.411  level=debug msg=    Using Certificate (kube-apiserver-complete-client-ca-bundle) loaded from state file
05-19 17:36:26.411  level=debug msg=    Loading Certificate (kube-apiserver-to-kubelet-ca-bundle)...
05-19 17:36:26.411  level=debug msg=    Loading Certificate (kube-apiserver-to-kubelet-client)...
05-19 17:36:26.412  level=debug msg=      Loading Certificate (kube-apiserver-to-kubelet-signer)...
05-19 17:36:26.412  level=debug msg=    Using Certificate (kube-apiserver-to-kubelet-client) loaded from state file
05-19 17:36:26.412  level=debug msg=    Loading Certificate (kube-apiserver-to-kubelet-signer)...
05-19 17:36:26.412  level=debug msg=    Loading Certificate (kube-control-plane-ca-bundle)...
05-19 17:36:26.412  level=debug msg=    Loading Certificate (kube-control-plane-kube-controller-manager-client)...
05-19 17:36:26.412  level=debug msg=      Loading Certificate (kube-control-plane-signer)...
05-19 17:36:26.412  level=debug msg=    Using Certificate (kube-control-plane-kube-controller-manager-client) loaded from state file
05-19 17:36:26.412  level=debug msg=    Loading Certificate (kube-control-plane-kube-scheduler-client)...
05-19 17:36:26.412  level=debug msg=      Loading Certificate (kube-control-plane-signer)...
05-19 17:36:26.412  level=debug msg=    Using Certificate (kube-control-plane-kube-scheduler-client) loaded from state file
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kube-control-plane-signer)...
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kubelet-bootstrap-kubeconfig-ca-bundle)...
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kubelet-client-ca-bundle)...
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kubelet-client)...
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kubelet-signer)...
05-19 17:36:26.413  level=debug msg=    Loading Certificate (kubelet-serving-ca-bundle)...
05-19 17:36:26.413  level=debug msg=      Loading Certificate (kubelet-signer)...
05-19 17:36:26.413  level=debug msg=    Using Certificate (kubelet-serving-ca-bundle) loaded from state file
05-19 17:36:26.413  level=debug msg=    Loading Certificate (mcs)...
05-19 17:36:26.413  level=debug msg=    Loading Machine Config Server Root CA...
05-19 17:36:26.413  level=debug msg=    Loading Key Pair (service-account.pub)...
05-19 17:36:26.414  level=debug msg=    Using Key Pair (service-account.pub) loaded from state file
05-19 17:36:26.414  level=debug msg=    Loading Release Image Pull Spec...
05-19 17:36:26.414  level=debug msg=    Using Release Image Pull Spec loaded from state file
05-19 17:36:26.414  level=debug msg=    Loading Image...
05-19 17:36:26.414  level=debug msg=  Loading Bootstrap Ignition Config from both state file and target directory
05-19 17:36:26.414  level=debug msg=  On-disk Bootstrap Ignition Config matches asset in state file
05-19 17:36:26.414  level=debug msg=  Using Bootstrap Ignition Config loaded from state file
05-19 17:36:26.414  level=debug msg=Using Metadata loaded from state file
05-19 17:36:26.414  level=debug msg=Reusing previously-fetched Metadata
05-19 17:36:26.415  level=info msg=Consuming Worker Ignition Config from target directory
05-19 17:36:26.415  level=debug msg=Purging asset "Worker Ignition Config" from disk
05-19 17:36:26.415  level=info msg=Consuming Master Ignition Config from target directory
05-19 17:36:26.415  level=debug msg=Purging asset "Master Ignition Config" from disk
05-19 17:36:26.415  level=info msg=Consuming Bootstrap Ignition Config from target directory
05-19 17:36:26.415  level=debug msg=Purging asset "Bootstrap Ignition Config" from disk
05-19 17:36:26.415  level=debug msg=Fetching Master Ignition Customization Check...
05-19 17:36:26.415  level=debug msg=Reusing previously-fetched Master Ignition Customization Check
05-19 17:36:26.415  level=debug msg=Fetching Worker Ignition Customization Check...
05-19 17:36:26.415  level=debug msg=Reusing previously-fetched Worker Ignition Customization Check
05-19 17:36:26.415  level=debug msg=Fetching Terraform Variables...
05-19 17:36:26.415  level=debug msg=Loading Terraform Variables...
05-19 17:36:26.416  level=debug msg=  Loading Cluster ID...
05-19 17:36:26.416  level=debug msg=  Loading Install Config...
05-19 17:36:26.416  level=debug msg=  Loading Image...
05-19 17:36:26.416  level=debug msg=  Loading Release...
05-19 17:36:26.416  level=debug msg=  Loading BootstrapImage...
05-19 17:36:26.416  level=debug msg=    Loading Install Config...
05-19 17:36:26.416  level=debug msg=    Loading Image...
05-19 17:36:26.416  level=debug msg=  Loading Bootstrap Ignition Config...
05-19 17:36:26.416  level=debug msg=  Loading Master Ignition Config...
05-19 17:36:26.416  level=debug msg=  Loading Master Machines...
05-19 17:36:26.416  level=debug msg=  Loading Worker Machines...
05-19 17:36:26.416  level=debug msg=  Loading Ironic bootstrap credentials...
05-19 17:36:26.416  level=debug msg=  Loading Platform Provisioning Check...
05-19 17:36:26.416  level=debug msg=    Loading Install Config...
05-19 17:36:26.416  level=debug msg=  Loading Common Manifests...
05-19 17:36:26.417  level=debug msg=  Fetching Cluster ID...
05-19 17:36:26.417  level=debug msg=  Reusing previously-fetched Cluster ID
05-19 17:36:26.417  level=debug msg=  Fetching Install Config...
05-19 17:36:26.417  level=debug msg=  Reusing previously-fetched Install Config
05-19 17:36:26.417  level=debug msg=  Fetching Image...
05-19 17:36:26.417  level=debug msg=  Reusing previously-fetched Image
05-19 17:36:26.417  level=debug msg=  Fetching Release...
05-19 17:36:26.417  level=debug msg=  Reusing previously-fetched Release
05-19 17:36:26.417  level=debug msg=  Fetching BootstrapImage...
05-19 17:36:26.417  level=debug msg=    Fetching Install Config...
05-19 17:36:26.417  level=debug msg=    Reusing previously-fetched Install Config
05-19 17:36:26.417  level=debug msg=    Fetching Image...
05-19 17:36:26.417  level=debug msg=    Reusing previously-fetched Image
05-19 17:36:26.417  level=debug msg=  Generating BootstrapImage...
05-19 17:36:26.417  level=debug msg=  Fetching Bootstrap Ignition Config...
05-19 17:36:26.418  level=debug msg=  Reusing previously-fetched Bootstrap Ignition Config
05-19 17:36:26.418  level=debug msg=  Fetching Master Ignition Config...
05-19 17:36:26.418  level=debug msg=  Reusing previously-fetched Master Ignition Config
05-19 17:36:26.418  level=debug msg=  Fetching Master Machines...
05-19 17:36:26.418  level=debug msg=  Reusing previously-fetched Master Machines
05-19 17:36:26.418  level=debug msg=  Fetching Worker Machines...
05-19 17:36:26.418  level=debug msg=  Reusing previously-fetched Worker Machines
05-19 17:36:26.418  level=debug msg=  Fetching Ironic bootstrap credentials...
05-19 17:36:26.418  level=debug msg=  Reusing previously-fetched Ironic bootstrap credentials
05-19 17:36:26.418  level=debug msg=  Fetching Platform Provisioning Check...
05-19 17:36:26.418  level=debug msg=    Fetching Install Config...
05-19 17:36:26.418  level=debug msg=    Reusing previously-fetched Install Config
05-19 17:36:26.418  level=debug msg=  Generating Platform Provisioning Check...
05-19 17:36:26.419  level=info msg=Credentials loaded from the "flexy-installer" profile in file "/home/installer1/workspace/ocp-common/Flexy-install@2/flexy/workdir/awscreds20240519-580673-bzyw8l"
05-19 17:36:32.935  level=fatal msg=failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": baseDomain: Invalid value: "qe.devcluster.openshift.com": the zone already has record sets for the domain of the cluster: [api.gpei-0519a.qe.devcluster.openshift.com. (A)]

Expected results:

 Remove all TF checks on AWS/vSphere/Nutanix platforms

Additional info:

    

This is a clone of issue OCPBUGS-30811. The following is the description of the original issue:

Description of problem:

On CI, all the software for the OpenStack and Ansible related pieces is pulled from pip and ansible-galaxy instead of the OS repositories.
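For illustration, the two install paths being compared (the package and collection names below are examples, not taken from the actual CI configuration):

# from the OS repositories
dnf install -y python3-openstackclient

# from pip / ansible-galaxy (what CI currently does)
pip install python-openstackclient
ansible-galaxy collection install openstack.cloud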

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/231

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-37584. The following is the description of the original issue:

Description of problem:

The Topology screen crashes and reports "Oh no! Something went wrong" when a pod in the Completed state is selected.

Version-Release number of selected component (if applicable):

RHOCP 4.15.18    

How reproducible:

100%

Steps to Reproduce:

1. Switch to developer mode
2. Select Topology
3. Select a project that has completed cron jobs like openshift-image-registry
4. Click the green CronJob Object
5. Observe Crash

Actual results:

The Topology screen crashes with error "Oh no! Something went wrong."

Expected results:

After clicking the completed pod / workload, the screen should display the information related to it.

Additional info:

    

After some investigation, the issues we have seen with util-linux missing from some images are due to the CentOS Stream base image not installing subscription-manager.

 

[root@92063ff10998 /]# yum install subscription-manager
CentOS Stream 9 - BaseOS                                                                                                                                                                                      3.3 MB/s | 8.9 MB     00:02    
CentOS Stream 9 - AppStream                                                                                                                                                                                   2.1 MB/s |  17 MB     00:08    
CentOS Stream 9 - Extras packages                                                                                                                                                                              14 kB/s |  17 kB     00:01    
Dependencies resolved.
==============================================================================================================================================================================================================================================
 Package                                                                        Architecture                                    Version                                                  Repository                                      Size
==============================================================================================================================================================================================================================================
Installing:
 subscription-manager                                                           aarch64                                         1.29.40-1.el9                                            baseos                                         911 k
Installing dependencies:
 acl                                                                            aarch64                                         2.3.1-4.el9                                              baseos                                          71 k
 checkpolicy                                                                    aarch64                                         3.6-1.el9                                                baseos                                         348 k
 cracklib                                                                       aarch64                                         2.9.6-27.el9                                             baseos                                          95 k
 cracklib-dicts                                                                 aarch64                                         2.9.6-27.el9                                             baseos                                         3.6 M
 dbus                                                                           aarch64                                         1:1.12.20-8.el9                                          baseos                                         3.7 k
 dbus-broker                                                                    aarch64                                         28-7.el9                                                 baseos                                         166 k
 dbus-common                                                                    noarch                                          1:1.12.20-8.el9                                          baseos                                          15 k
 dbus-libs                                                                      aarch64                                         1:1.12.20-8.el9                                          baseos                                         150 k
 diffutils                                                                      aarch64                                         3.7-12.el9                                               baseos                                         392 k
 dmidecode                                                                      aarch64                                         1:3.3-7.el9                                              baseos                                          70 k
 gobject-introspection                                                          aarch64                                         1.68.0-11.el9                                            baseos                                         248 k
 iproute                                                                        aarch64                                         6.2.0-5.el9                                              baseos                                         818 k
 kmod-libs                                                                      aarch64                                         28-9.el9                                                 baseos                                          62 k
 libbpf                                                                         aarch64                                         2:1.3.0-2.el9                                            baseos                                         172 k
 libdb                                                                          aarch64                                         5.3.28-53.el9                                            baseos                                         712 k
 libdnf-plugin-subscription-manager                                             aarch64                                         1.29.40-1.el9                                            baseos                                          63 k
 libeconf                                                                       aarch64                                         0.4.1-4.el9                                              baseos                                          26 k
 libfdisk                                                                       aarch64                                         2.37.4-18.el9                                            baseos                                         150 k
 libmnl                                                                         aarch64                                         1.0.4-16.el9                                             baseos                                          28 k
 libpwquality                                                                   aarch64                                         1.4.4-8.el9                                              baseos                                         119 k
 libseccomp                                                                     aarch64                                         2.5.2-2.el9                                              baseos                                          72 k
 libselinux-utils                                                               aarch64                                         3.6-1.el9                                                baseos                                         190 k
 libuser                                                                        aarch64                                         0.63-13.el9                                              baseos                                         405 k
 libutempter                                                                    aarch64                                         1.2.1-6.el9                                              baseos                                          27 k
 openssl                                                                        aarch64                                         1:3.2.1-1.el9                                            baseos                                         1.3 M
 pam                                                                            aarch64                                         1.5.1-19.el9                                             baseos                                         627 k
 passwd                                                                         aarch64                                         0.80-12.el9                                              baseos                                         121 k
 policycoreutils                                                                aarch64                                         3.6-2.1.el9                                              baseos                                         242 k
 policycoreutils-python-utils                                                   noarch                                          3.6-2.1.el9                                              baseos                                          77 k
 psmisc                                                                         aarch64                                         23.4-3.el9                                               baseos                                         243 k
 python3-audit                                                                  aarch64                                         3.1.2-2.el9                                              baseos                                          83 k
 python3-chardet                                                                noarch                                          4.0.0-5.el9                                              baseos                                         239 k
 python3-cloud-what                                                             aarch64                                         1.29.40-1.el9                                            baseos                                          77 k
 python3-dateutil                                                               noarch                                          1:2.8.1-7.el9                                            baseos                                         288 k
 python3-dbus                                                                   aarch64                                         1.2.18-2.el9                                             baseos                                         144 k
 python3-decorator                                                              noarch                                          4.4.2-6.el9                                              baseos                                          28 k
 python3-distro                                                                 noarch                                          1.5.0-7.el9                                              baseos                                          37 k
 python3-dnf-plugins-core                                                       noarch                                          4.3.0-15.el9                                             baseos                                         264 k
 python3-gobject-base                                                           aarch64                                         3.40.1-6.el9                                             baseos                                         184 k
 python3-gobject-base-noarch                                                    noarch                                          3.40.1-6.el9                                             baseos                                         161 k
 python3-idna                                                                   noarch                                          2.10-7.el9.1                                             baseos                                         102 k
 python3-iniparse                                                               noarch                                          0.4-45.el9                                               baseos                                          47 k
 python3-inotify                                                                noarch                                          0.9.6-25.el9                                             baseos                                          53 k
 python3-librepo                                                                aarch64                                         1.14.5-2.el9                                             baseos                                          48 k
 python3-libselinux                                                             aarch64                                         3.6-1.el9                                                baseos                                         183 k
 python3-libsemanage                                                            aarch64                                         3.6-1.el9                                                baseos                                          79 k
 python3-policycoreutils                                                        noarch                                          3.6-2.1.el9                                              baseos                                         2.1 M
 python3-pysocks                                                                noarch                                          1.7.1-12.el9                                             baseos                                          35 k
 python3-requests                                                               noarch                                          2.25.1-8.el9                                             baseos                                         125 k
 python3-setools                                                                aarch64                                         4.4.4-1.el9                                              baseos                                         595 k
 python3-setuptools                                                             noarch                                          53.0.0-12.el9                                            baseos                                         944 k
 python3-six                                                                    noarch                                          1.15.0-9.el9                                             baseos                                          37 k
 python3-subscription-manager-rhsm                                              aarch64                                         1.29.40-1.el9                                            baseos                                         162 k
 python3-systemd                                                                aarch64                                         234-18.el9                                               baseos                                          89 k
 python3-urllib3                                                                noarch                                          1.26.5-5.el9                                             baseos                                         215 k
 subscription-manager-rhsm-certificates                                         noarch                                          20220623-1.el9                                           baseos                                          21 k
 systemd                                                                        aarch64                                         252-33.el9                                               baseos                                         4.0 M
 systemd-libs                                                                   aarch64                                         252-33.el9                                               baseos                                         641 k
 systemd-pam                                                                    aarch64                                         252-33.el9                                               baseos                                         271 k
 systemd-rpm-macros                                                             noarch                                          252-33.el9                                               baseos                                          69 k
 usermode                                                                       aarch64                                         1.114-4.el9                                              baseos                                         189 k
 util-linux                                                                     aarch64                                         2.37.4-18.el9                                            baseos                                         2.3 M
 util-linux-core                                                                aarch64                                         2.37.4-18.el9                                            baseos                                         463 k
 virt-what                                                                      aarch64                                         1.25-5.el9                                               baseos                                          33 k
 which                                                                          aarch64                                         2.21-29.el9                                              baseos                                          41 k

Transaction Summary
==============================================================================================================================================================================================================================================
Install  66 Packages

Total download size: 26 M
Installed size: 92 M
Is this ok [y/N]: 
 

subscription-manager does bring in quite a few dependencies. We can probably get away with installing only:

systemd util-linux iproute dbus

We may still hit some edge cases where something works in OCP but doesn't in OKD due to a missing package; we have hit at least 6 or 7 containers using tools from util-linux so far.
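A minimal sketch of that slimmer approach, run against the same CentOS Stream 9 base image as above (the exact package set is an assumption and would still need validation against the affected images):

[root@92063ff10998 /]# dnf install -y systemd util-linux iproute dbus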

Description of the problem:

While installing many SNOs via ACM/ZTP using the infrastructure operator, every SNO ends up with a leftover pod in the `assisted-installer` namespace that is stuck in the ImagePullBackOff state. It seems that no sha digest is referenced in the pod spec, so the container image cannot be pulled in a disconnected environment. In large-scale tests this results in 3500+ pods across 3500 SNO clusters stuck in ImagePullBackOff. Despite this, the clusters succeed in installing, so it doesn't block tests; however, it should be addressed, because a pod that cannot run still consumes additional resources on an SNO.
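For illustration, the difference between the tag-only reference currently generated and a digest-pinned reference that disconnected image mirroring could resolve (the digest below is a placeholder, not a value from this report):

# tag-only reference, as seen in the pod spec further down
image: registry.redhat.io/rhel8/support-tools
# digest-pinned reference that a disconnected mirror can satisfy
image: registry.redhat.io/rhel8/support-tools@sha256:<digest>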

Versions

Hub and Deployed SNO - 4.15.2

ACM - 2.10.0-DOWNSTREAM-2024-03-14-14-53-38 

 

How reproducible: 

Always in disconnected

Steps to reproduce:

1.

2.

3.

Actual results:

# oc --kubeconfig /root/hv-vm/kc/vm00006/kubeconfig get po -n assisted-installer 
NAME                                  READY   STATUS             RESTARTS   AGE
assisted-installer-controller-z569s   0/1     Completed          0          25h
vm00006-debug-n477b                   0/1     ImagePullBackOff   0          25h

Yaml of pod in question:

# oc --kubeconfig /root/hv-vm/kc/vm00006/kubeconfig get po -n assisted-installer vm00006-debug-n477b -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/vm00006
    openshift.io/scc: privileged
  creationTimestamp: "2024-03-25T16:31:48Z"
  name: vm00006-debug-n477b
  namespace: assisted-installer
  resourceVersion: "501965"
  uid: 4c46ed25-5d81-4e27-8f2a-5a5eb89cc474
spec:
  containers:
  - command:
    - chroot
    - /host
    - last
    - reboot
    env:
    - name: TMOUT
      value: "900"
    image: registry.redhat.io/rhel8/support-tools
    imagePullPolicy: Always
    name: container-00
    resources: {}
    securityContext:
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      name: host
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-zfn5z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  hostPID: true
  nodeName: vm00006
  preemptionPolicy: PreemptLowerPriority
  priority: 1000000000
  priorityClassName: openshift-user-critical
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: host
  - name: kube-api-access-zfn5z
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-25T16:31:48Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-25T16:31:48Z"
    message: 'containers with unready status: [container-00]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-25T16:31:48Z"
    message: 'containers with unready status: [container-00]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-25T16:31:48Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.redhat.io/rhel8/support-tools
    imageID: ""
    lastState: {}
    name: container-00
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "registry.redhat.io/rhel8/support-tools"
        reason: ImagePullBackOff
  hostIP: fc00:1005::3ed
  phase: Pending
  podIP: fc00:1005::3ed
  podIPs:
  - ip: fc00:1005::3ed
  qosClass: BestEffort
  startTime: "2024-03-25T16:31:48Z"

ACI Object for this cluster:

# oc get aci -n vm00005 vm00005 -o yaml
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  annotations:
    agent-install.openshift.io/install-config-overrides: '{"networking":{"networkType":"OVNKubernetes"},"capabilities":{"baselineCapabilitySet":
      "None", "additionalEnabledCapabilities": [ "OperatorLifecycleManager", "NodeTuning"
      ] }}'
    argocd.argoproj.io/sync-wave: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions.hive.openshift.io/v1beta1","kind":"AgentClusterInstall","metadata":{"annotations":{"agent-install.openshift.io/install-config-overrides":"{\"networking\":{\"networkType\":\"OVNKubernetes\"},\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"OperatorLifecycleManager\", \"NodeTuning\" ] }}","argocd.argoproj.io/sync-wave":"1","ran.openshift.io/ztp-gitops-generated":"{}"},"labels":{"app.kubernetes.io/instance":"ztp-clusters-01"},"name":"vm00005","namespace":"vm00005"},"spec":{"clusterDeploymentRef":{"name":"vm00005"},"imageSetRef":{"name":"openshift-4.15.2"},"manifestsConfigMapRef":{"name":"vm00005"},"networking":{"clusterNetwork":[{"cidr":"fd01::/48","hostPrefix":64}],"machineNetwork":[{"cidr":"fc00:1005::/64"}],"serviceNetwork":["fd02::/112"]},"provisionRequirements":{"controlPlaneAgents":1},"sshPublicKey":"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC6351YHGIvE6DZcAt0RXqodzKbglUeCNqxRyd7OOfcN+p88RaKZahg9/dL81IbdlGPbhPPG9ga7BdkLN8VbyYNCjE7kBIKM47JS1pbjeoJhI8bBrFjjp6LIJW/tWh1Pyl7Mk3DbPiZOkJ9HXvOgE/HRh44jJPtMLnzrZU5VNgNRgaEBWOG+j06pxdK9giMji1mFkJXSr43YUZYYgM3egNfNzxeTG0SshbZarRDEKeAlnDJkZ70rbP2krL2MgZJDv8vIK1PcMMFhjsJ/4Pp7F0Tl2Rm/qlZhTn4ptWagZmM0Z3N2WkNdX6Z9i2lZ5K+5jNHEFfjw/CPOFqpaFMMckpfFMsAJchbqnh+F5NvKJSFNB6L77iRCp5hbhGBbZncwc3UDO3FZ9ZuYZ8Ws+2ZyS5uVxd5ZUsvZFO+mWwySytFbsc0nUUcgkXlBiGKF/eFm9SQTURkyNzJkJfPm7awRwYoidaf8MTSp/kUCCyloAjpFIOJAa0SoVerhLp8uhQzfeU= root@e38-h01-000-r650.rdu2.scalelab.redhat.com"}}
    ran.openshift.io/ztp-gitops-generated: '{}'
  creationTimestamp: "2024-03-25T15:39:40Z"
  finalizers:
  - agentclusterinstall.agent-install.openshift.io/ai-deprovision
  generation: 3
  labels:
    app.kubernetes.io/instance: ztp-clusters-01
  name: vm00005
  namespace: vm00005
  ownerReferences:
  - apiVersion: hive.openshift.io/v1
    kind: ClusterDeployment
    name: vm00005
    uid: 4d647db3-88fb-4b64-8a47-d50c9e2dfe7b
  resourceVersion: "267225"
  uid: f6472a4f-d483-4563-8a36-388e7d3874c9
spec:
  clusterDeploymentRef:
    name: vm00005
  clusterMetadata:
    adminKubeconfigSecretRef:
      name: vm00005-admin-kubeconfig
    adminPasswordSecretRef:
      name: vm00005-admin-password
    clusterID: b758099e-3556-4fec-9190-ba709d4fbcaf
    infraID: 1de248f6-6767-47dc-8617-c02e8d0d457e
  imageSetRef:
    name: openshift-4.15.2
  manifestsConfigMapRef:
    name: vm00005
  networking:
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    machineNetwork:
    - cidr: fc00:1005::/64
    serviceNetwork:
    - fd02::/112
    userManagedNetworking: true
  provisionRequirements:
    controlPlaneAgents: 1
  sshPublicKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC6351YHGIvE6DZcAt0RXqodzKbglUeCNqxRyd7OOfcN+p88RaKZahg9/dL81IbdlGPbhPPG9ga7BdkLN8VbyYNCjE7kBIKM47JS1pbjeoJhI8bBrFjjp6LIJW/tWh1Pyl7Mk3DbPiZOkJ9HXvOgE/HRh44jJPtMLnzrZU5VNgNRgaEBWOG+j06pxdK9giMji1mFkJXSr43YUZYYgM3egNfNzxeTG0SshbZarRDEKeAlnDJkZ70rbP2krL2MgZJDv8vIK1PcMMFhjsJ/4Pp7F0Tl2Rm/qlZhTn4ptWagZmM0Z3N2WkNdX6Z9i2lZ5K+5jNHEFfjw/CPOFqpaFMMckpfFMsAJchbqnh+F5NvKJSFNB6L77iRCp5hbhGBbZncwc3UDO3FZ9ZuYZ8Ws+2ZyS5uVxd5ZUsvZFO+mWwySytFbsc0nUUcgkXlBiGKF/eFm9SQTURkyNzJkJfPm7awRwYoidaf8MTSp/kUCCyloAjpFIOJAa0SoVerhLp8uhQzfeU=
    root@e38-h01-000-r650.rdu2.scalelab.redhat.com
status:
  apiVIP: fc00:1005::3ec
  apiVIPs:
  - fc00:1005::3ec
  conditions:
  - lastProbeTime: "2024-03-25T15:39:40Z"
    lastTransitionTime: "2024-03-25T15:39:40Z"
    message: SyncOK
    reason: SyncOK
    status: "True"
    type: SpecSynced
  - lastProbeTime: "2024-03-25T15:55:46Z"
    lastTransitionTime: "2024-03-25T15:55:46Z"
    message: The cluster's validations are passing
    reason: ValidationsPassing
    status: "True"
    type: Validated
  - lastProbeTime: "2024-03-25T16:48:06Z"
    lastTransitionTime: "2024-03-25T16:48:06Z"
    message: The cluster requirements are met
    reason: ClusterAlreadyInstalling
    status: "True"
    type: RequirementsMet
  - lastProbeTime: "2024-03-25T16:48:06Z"
    lastTransitionTime: "2024-03-25T16:48:06Z"
    message: 'The installation has completed: Cluster is installed'
    reason: InstallationCompleted
    status: "True"
    type: Completed
  - lastProbeTime: "2024-03-25T15:39:40Z"
    lastTransitionTime: "2024-03-25T15:39:40Z"
    message: The installation has not failed
    reason: InstallationNotFailed
    status: "False"
    type: Failed
  - lastProbeTime: "2024-03-25T16:48:06Z"
    lastTransitionTime: "2024-03-25T16:48:06Z"
    message: The installation has stopped because it completed successfully
    reason: InstallationCompleted
    status: "True"
    type: Stopped
  - lastProbeTime: "2024-03-25T15:57:07Z"
    lastTransitionTime: "2024-03-25T15:57:07Z"
    reason: There is no failing prior preparation attempt
    status: "False"
    type: LastInstallationPreparationFailed
  connectivityMajorityGroups: '{"IPv4":[],"IPv6":[],"fc00:1005::/64":[]}'
  debugInfo:
    eventsURL: https://assisted-service-multicluster-engine.apps.acm-lta.rdu2.scalelab.redhat.com/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMWRlMjQ4ZjYtNjc2Ny00N2RjLTg2MTctYzAyZThkMGQ0NTdlIn0.rC8eJsuw3mVOtBNWFUu6Nq5pRjmiBLuC6b_xV_FetO5D_8Tc4qz7_hit29C92Xrl_pjysD3tTey2c9NJoI9kTA&cluster_id=1de248f6-6767-47dc-8617-c02e8d0d457e
    logsURL: https://assisted-service-multicluster-engine.apps.acm-lta.rdu2.scalelab.redhat.com/api/assisted-install/v2/clusters/1de248f6-6767-47dc-8617-c02e8d0d457e/logs?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMWRlMjQ4ZjYtNjc2Ny00N2RjLTg2MTctYzAyZThkMGQ0NTdlIn0._CjP1uE4SpWQ93CIAVjHqN1uyse4d0AU5FThBSAyfKoRF8IvvY9Zpsr2kx3jNd_CyzHp0wPT5EWWpIMvQx79Kw
    state: adding-hosts
    stateInfo: Cluster is installed
  ingressVIP: fc00:1005::3ec
  ingressVIPs:
  - fc00:1005::3ec
  machineNetwork:
  - cidr: fc00:1005::/64
  platformType: None
  progress:
    totalPercentage: 100
  userManagedNetworking: true
  validationsInfo:
    configuration:
    - id: platform-requirements-satisfied
      message: Platform requirements satisfied
      status: success
    - id: pull-secret-set
      message: The pull secret is set.
      status: success
    hosts-data:
    - id: all-hosts-are-ready-to-install
      message: All hosts in the cluster are ready to install.
      status: success
    - id: sufficient-masters-count
      message: The cluster has the exact amount of dedicated control plane nodes.
      status: success
    network:
    - id: api-vips-defined
      message: 'API virtual IPs are not required: User Managed Networking'
      status: success
    - id: api-vips-valid
      message: 'API virtual IPs are not required: User Managed Networking'
      status: success
    - id: cluster-cidr-defined
      message: The Cluster Network CIDR is defined.
      status: success
    - id: dns-domain-defined
      message: The base domain is defined.
      status: success
    - id: ingress-vips-defined
      message: 'Ingress virtual IPs are not required: User Managed Networking'
      status: success
    - id: ingress-vips-valid
      message: 'Ingress virtual IPs are not required: User Managed Networking'
      status: success
    - id: machine-cidr-defined
      message: The Machine Network CIDR is defined.
      status: success
    - id: machine-cidr-equals-to-calculated-cidr
      message: 'The Cluster Machine CIDR is not required: User Managed Networking'
      status: success
    - id: network-prefix-valid
      message: The Cluster Network prefix is valid.
      status: success
    - id: network-type-valid
      message: The cluster has a valid network type
      status: success
    - id: networks-same-address-families
      message: Same address families for all networks.
      status: success
    - id: no-cidrs-overlapping
      message: No CIDRS are overlapping.
      status: success
    - id: ntp-server-configured
      message: No ntp problems found
      status: success
    - id: service-cidr-defined
      message: The Service Network CIDR is defined.
      status: success
    operators:
    - id: cnv-requirements-satisfied
      message: cnv is disabled
      status: success
    - id: lso-requirements-satisfied
      message: lso is disabled
      status: success
    - id: lvm-requirements-satisfied
      message: lvm is disabled
      status: success
    - id: mce-requirements-satisfied
      message: mce is disabled
      status: success
    - id: odf-requirements-satisfied
      message: odf is disabled
      status: success

Expected results:

 

 

Description of problem:

The creation of an Azure HC with secret encryption failed with
# azure-kms-provider-active container log (within the KAS pod)
I0516 09:38:22.860917       1 exporter.go:17] "metrics backend" exporter="prometheus"
I0516 09:38:22.861178       1 prometheus_exporter.go:56] "Prometheus metrics server running" address="8095"
I0516 09:38:22.861199       1 main.go:90] "Starting KeyManagementServiceServer service" version="" buildDate=""
E0516 09:38:22.861439       1 main.go:59] "unrecoverable error encountered" err="failed to create key vault client: key vault name, key name and key version are required"

How reproducible:

Always

Steps to Reproduce:

1. export RESOURCEGROUP="fxie-1234-rg" LOCATION="eastus" KEYVAULT_NAME="fxie-1234-keyvault" KEYVAULT_KEY_NAME="fxie-1234-key" KEYVAULT_KEY2_NAME="fxie-1234-key-2"
2. az group create --name $RESOURCEGROUP --location $LOCATION
3. az keyvault create -n $KEYVAULT_NAME -g $RESOURCEGROUP -l $LOCATION --enable-purge-protection true
4. az keyvault set-policy -n $KEYVAULT_NAME --key-permissions decrypt encrypt --spn fa5abf8d-ed43-4637-93a7-688e2a0efd82
5. az keyvault key create --vault-name $KEYVAULT_NAME -n $KEYVAULT_KEY_NAME --protection software
6. KEYVAULT_KEY_URL="$(az keyvault key show --vault-name $KEYVAULT_NAME --name $KEYVAULT_KEY_NAME --query 'key.kid' -o tsv)"
7. hypershift create cluster azure            --pull-secret $PULL_SECRET            --name $CLUSTER_NAME            --azure-creds $HOME/.azure/osServicePrincipal.json            --node-pool-replicas=1            --location eastus            --base-domain $BASE_DOMAIN    --release-image registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-05-15-001800 --encryption-key-id $KEYVAULT_KEY_URL     

Root cause:

The ENTRYPOINT statement in azure-kubernetes-kms's Dockerfile is in shell form, which prevents any command-line arguments from being passed to the process.
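For context, this is generic Dockerfile behavior, illustrated with a placeholder binary name (not the actual azure-kubernetes-kms Dockerfile contents):

# shell form: the process runs under /bin/sh -c, and any arguments supplied at
# runtime (for example via the KAS pod spec) are silently ignored
ENTRYPOINT ./kms-plugin
# exec form: runtime arguments such as --keyvault-name are appended as expected
ENTRYPOINT ["./kms-plugin"]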

Description of problem:

 When trying to onboard an xFusion baremetal node using redfish virtual media (no provisioning network), it fails after the node registration with this error:

Normal InspectionError 60s metal3-baremetal-controller Failed to inspect hardware. Reason: unable to start inspection: The attribute Links/ManagedBy is missing from the resource /redfish/v1/Systems/1

Version-Release number of selected component (if applicable):

    4.14.18

How reproducible:

    Just add an xFusion baremetal node, specifying the following in the manifest:

Spec: 
  Automated Cleaning Mode: metadata 
  Bmc: 
    Address: redfish-virtualmedia://w.z.x.y/redfish/v1/Systems/1 
    Credentials Name: hu28-tovb-bmc-secret 
    Disable Certificate Verification: true 
  Boot MAC Address: MAC
  Boot Mode: UEFI
  Online: false
  Preprovisioning Network Data Name: openstack-hu28-tovb-network-config-secret
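
Reconstructed as a BareMetalHost manifest for clarity (a sketch only: the field names follow the metal3.io v1alpha1 schema, while the object name, namespace, and MAC are placeholders inferred from the secret names above):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: hu28-tovb
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: metadata
  bmc:
    address: redfish-virtualmedia://w.z.x.y/redfish/v1/Systems/1
    credentialsName: hu28-tovb-bmc-secret
    disableCertificateVerification: true
  bootMACAddress: <MAC>
  bootMode: UEFI
  online: false
  preprovisioningNetworkDataName: openstack-hu28-tovb-network-config-secret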

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Inspection fails with the aforementioned error; no preprovisioning image is mounted on the host's virtual media.

Expected results:

    The virtual media gets mounted and inspection starts.

Additional info:

    

Description of problem: When the bootstrap times out, the installer tries to download the logs from the bootstrap VM and gives an analysis of what happened. On the OpenStack platform, we are currently failing to download the bootstrap logs (tracked in OCPBUGS-34950), which causes the analysis to always return an erroneous message:

time="2024-06-05T08:34:45-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2024-06-05T08:34:45-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
time="2024-06-05T08:34:45-04:00" level=error msg="The bootstrap machine did not execute the release-image.service systemd unit"

The affirmation that the bootstrap machine did not execute the release-image.service systemd unit is wrong, as I can confirm by SSH'ing to the bootstrap node:

systemctl status release-image.service
● release-image.service - Download the OpenShift Release Image
     Loaded: loaded (/etc/systemd/system/release-image.service; static)
     Active: active (exited) since Wed 2024-06-05 11:57:33 UTC; 1h 16min ago
    Process: 2159 ExecStart=/usr/local/bin/release-image-download.sh (code=exited, status=0/SUCCESS)
   Main PID: 2159 (code=exited, status=0/SUCCESS)
        CPU: 47.364s

Jun 05 11:57:05 mandre-tnvc8bootstrap systemd[1]: Starting Download the OpenShift Release Image...
Jun 05 11:57:06 mandre-tnvc8bootstrap podman[2184]: 2024-06-05 11:57:06.895418265 +0000 UTC m=+0.811028632 system refresh
Jun 05 11:57:06 mandre-tnvc8bootstrap release-image-download.sh[2159]: Pulling quay.io/openshift-release-dev/ocp-release@sha256:31cdf34b1957996d5c79c48466abab2fcfb9d9843>
Jun 05 11:57:32 mandre-tnvc8bootstrap release-image-download.sh[2269]: 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d10e64dd0ec125d
Jun 05 11:57:32 mandre-tnvc8bootstrap podman[2269]: 2024-06-05 11:57:32.82473216 +0000 UTC m=+25.848290388 image pull 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d1>
Jun 05 11:57:33 mandre-tnvc8bootstrap systemd[1]: Finished Download the OpenShift Release Image.

The installer was just unable to retrieve the bootstrap logs. Earlier, buried in the installer logs, we can see:

time="2024-06-05T08:34:42-04:00" level=info msg="Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.196.2.10:22: connect: connection
 timed out"

This is what should be reported by the analyzer.

Description of problem

Currently the manifests directory has:

0000_30_cluster-api_00_credentials-request.yaml
0000_30_cluster-api_00_namespace.yaml
...

CredentialsRequests go into the openshift-cloud-credential-operator namespace, so they can come before or after the openshift-cluster-api namespace. But because they ask for Secrets in the openshift-cluster-api namespace, there would be less risk of races if the CredentialsRequest manifests were given a name that sorted them after the namespace, e.g. 0000_30_cluster-api_01_credentials-request.yaml.

Version-Release number of selected component

I haven't gone digging in the history; it may have been like this since forever.

How reproducible

Every time.

Steps to Reproduce

With a release image pullspec like registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-27-184535
$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request'

Actual results

$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request'
manifests/0000_30_cluster-api_00_credentials-request.yaml
manifests/0000_30_cluster-api_00_namespace.yaml

Expected results

$ ls manifests/0000_30_cluster-api_* | grep 'namespace\|credentials-request'
manifests/0000_30_cluster-api_00_namespace.yaml
manifests/0000_30_cluster-api_01_credentials-request.yaml

Description of problem:

Given that sessions can now modify their expiration during their lifetime (by going through the token refresh process), the current pruning mechanism might randomly remove active sessions.
We need to fix that behavior by ordering the byAge index of the session storage by expiration and only removing sessions that are either actually expired or close to expiration.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

In 4.16, OCP starts to place an annotation on service accounts when it creates a dockercfg secret. Some operators/reconciliation loops will then (incorrectly) try to set the annotations on the SA back to exactly what they wanted. OCP annotates again and creates a new secret. The operator sets it back without the annotation. Rinse, repeat.

Eventually etcd will get completely overloaded with secrets, will start to OOM, and the entire cluster will come down.

 

There is a belief that at least otel, tempo, acm, odf/ocs, strimzi, elasticsearch, and possibly other operators reconciled the annotations on the SA by setting them back exactly how they wanted them set.

 

These seem to be related (but not complete):

https://issues.redhat.com/browse/LOG-5776

https://issues.redhat.com/browse/ENTMQST-6129

https://issues.redhat.com/browse/TRACING-4435

https://issues.redhat.com/browse/ACM-10987
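
A quick way to check whether a cluster is caught in this loop (a diagnostic sketch only; the namespace and service account names are placeholders):

# count dockercfg secrets piling up in the affected namespace
oc get secrets -n <namespace> --no-headers | grep -c dockercfg

# inspect the service account's annotations to see whether they keep flipping
oc get sa <service-account> -n <namespace> -o jsonpath='{.metadata.annotations}'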

This is a clone of issue OCPBUGS-29240. The following is the description of the original issue:

Manila drivers and node-registrar should be configured to use healthchecks.

Description of problem:

Not specifying "kmsKeyServiceAccount" for controlPlane leads to a failure when creating the bootstrap and control-plane machines.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-multi-2024-06-12-211551

How reproducible:

Always

Steps to Reproduce:

1. "create install-config" and then insert disk encryption settings, but not set "kmsKeyServiceAccount" for controlPlane (see [2])
2. "create cluster" (see [3])

Actual results:

"create cluster" failed with below error: 

ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: GCPMachine.infrastructure.cluster.x-k8s.io "jiwei-0613d-capi-84z69-bootstrap" is invalid: spec.rootDiskEncryptionKey.kmsKeyServiceAccount: Invalid value: "": spec.rootDiskEncryptionKey.kmsKeyServiceAccount in body should match '[-_[A-Za-z0-9]+@[-_[A-Za-z0-9]+.iam.gserviceaccount.com

Expected results:

Installation should succeed.

Additional info:

FYI the QE test case: 

OCP-61160 - [IPI-on-GCP] install cluster with different custom managed keys for control-plane and compute nodes https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-61160
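
For reference, a sketch of the install-config stanza involved, with kmsKeyServiceAccount set explicitly for controlPlane (the field layout is assumed from the GCP install-config schema; all values are placeholders):

controlPlane:
  platform:
    gcp:
      osDisk:
        encryptionKey:
          kmsKey:
            name: <key-name>
            keyRing: <key-ring>
            location: <region>
            projectID: <kms-project>
          kmsKeyServiceAccount: <sa-name>@<project>.iam.gserviceaccount.com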

Description of problem:

    When installing a disconnected and/or private cluster, an existing VPC and subnets are used and inserted into install-config; unfortunately, the CAPI installation fails with the error 'failed to generate asset "Cluster API Manifests": failed to generate GCP manifests: failed to get control plane subnet'.

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-07-07-131215

How reproducible:

    Always

Steps to Reproduce:

1. create VPC/subnets/etc.
2. "create install-config", and insert the VPC/subnets settings (see [1])
3. "create cluster" (or "create manifests")

Actual results:

Failed with below error, although the subnet does exist (see [2]):

07-08 17:46:44.755  level=fatal msg=failed to fetch Cluster API Manifests: failed to generate asset "Cluster API Manifests": failed to generate GCP manifests: failed to get control plane subnet: failed to find subnet jiwei-0708b-master-subnet: googleapi: Error 400: Invalid resource field value in the request.
07-08 17:46:44.755  level=fatal msg=Details:
07-08 17:46:44.755  level=fatal msg=[
07-08 17:46:44.755  level=fatal msg=  {
07-08 17:46:44.755  level=fatal msg=    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
07-08 17:46:44.755  level=fatal msg=    "domain": "googleapis.com",
07-08 17:46:44.755  level=fatal msg=    "metadatas": {
07-08 17:46:44.755  level=fatal msg=      "method": "compute.v1.SubnetworksService.Get",
07-08 17:46:44.755  level=fatal msg=      "service": "compute.googleapis.com"
07-08 17:46:44.755  level=fatal msg=    },
07-08 17:46:44.755  level=fatal msg=    "reason": "RESOURCE_PROJECT_INVALID"
07-08 17:46:44.755  level=fatal msg=  }
07-08 17:46:44.756  level=fatal msg=]
07-08 17:46:44.756  level=fatal msg=, invalidParameter

Expected results:

    Installation should succeed.

Additional info:

FYI one of the problem PROW CI tests: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-disc-priv-capi-amd-mixarch-f28-destructive/1810155486213836800

QE's Flexy-install/295772/
VARIABLES_LOCATION private-templates/functionality-testing/aos-4_17/ipi-on-gcp/versioned-installer-private_cluster
LAUNCHER_VARS
feature_set: "TechPreviewNoUpgrade"    
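
The existing-VPC stanza inserted in step 2 typically looks like the following (a sketch; the field names follow the GCP install-config schema, the subnet name is taken from the error message above, and the remaining values are placeholders):

platform:
  gcp:
    projectID: <project>
    region: <region>
    network: <existing-vpc>
    controlPlaneSubnet: jiwei-0708b-master-subnet
    computeSubnet: <existing-worker-subnet>
publish: Internal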

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Manually approving a client CSR causes an Admission Webhook Warning.

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-23-013426

How reproducible:

    Always

Steps to Reproduce:

    1. Set up a UPI cluster and manually add more workers; don't approve the node CSRs
    2. Navigate to Compute -> Nodes page, click on 'Discovered' status then click on 'Approve' button in the modal
    

Actual results:

2. The console displays `Admission Webhook Warning: CertificateSigningRequest xxx violates policy 299 - unknown field metadata.Originalname`

Expected results:

2. No warning message is displayed.

Additional info:
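
For reference, the equivalent manual approval from the CLI (standard oc commands, shown here only to contrast with the console flow):

# list pending CSRs
oc get csr
# approve a specific CSR by name
oc adm certificate approve <csr-name>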

    

Description of problem:

When setting up a cluster on vSphere, sometimes a machine is powered off and stuck in the "Provisioning" phase; the controller then triggers a new machine creation and reports the error "failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists"

Version-Release number of selected component (if applicable):

 4.12.0-0.ci.test-2022-09-26-235306-ci-ln-vh4qjyk-latest

How reproducible:

Sometimes; seen two times so far

Steps to Reproduce:

1. Setup a vsphere cluster
2.
3.

Actual results:

Cluster installation failed; the machine is stuck in the Provisioning status.
$ oc get machine                      
NAME                             PHASE          TYPE   REGION   ZONE   AGE
jima-ipi-27-d97wp-master-0       Running                               4h
jima-ipi-27-d97wp-master-1       Running                               4h
jima-ipi-27-d97wp-master-2       Running                               4h
jima-ipi-27-d97wp-worker-7qn9b   Provisioning                          3h56m
jima-ipi-27-d97wp-worker-dsqd2   Running                               3h56m

$ oc edit machine jima-ipi-27-d97wp-worker-7qn9b
status:
  conditions:
  - lastTransitionTime: "2022-09-27T01:27:29Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2022-09-27T01:27:29Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2022-09-27T01:27:29Z"
    status: "True"
    type: Terminable
  lastUpdated: "2022-09-27T01:27:29Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastTransitionTime: "2022-09-27T01:36:09Z"
      message: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists.
      reason: MachineCreationSucceeded
      status: "False"
      type: MachineCreation
    taskRef: task-11363480

$ govc vm.info /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b
Name:           jima-ipi-27-d97wp-worker-7qn9b
  Path:         /SDDC-Datacenter/vm/jima-ipi-27-d97wp/jima-ipi-27-d97wp-worker-7qn9b
  UUID:         422cb686-6585-f05a-af13-b2acac3da294
  Guest name:   Red Hat Enterprise Linux 8 (64-bit)
  Memory:       16384MB
  CPU:          8 vCPU(s)
  Power state:  poweredOff
  Boot time:    <nil>
  IP address:   
  Host:         10.3.32.8

I0927 01:44:42.568599       1 session.go:91] No existing vCenter session found, creating new session
I0927 01:44:42.633672       1 session.go:141] Find template by instance uuid: 9535891b-902e-410c-b9bb-e6a57aa6b25a
I0927 01:44:42.641691       1 reconciler.go:270] jima-ipi-27-d97wp-worker-7qn9b: already exists, but was not powered on after clone, requeue
I0927 01:44:42.641726       1 controller.go:380] jima-ipi-27-d97wp-worker-7qn9b: reconciling machine triggers idempotent create
I0927 01:44:42.641732       1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine
I0927 01:44:42.659651       1 reconciler.go:935] task: task-11363480, state: error, description-id: VirtualMachine.clone
I0927 01:44:42.659684       1 reconciler.go:951] jima-ipi-27-d97wp-worker-7qn9b: Updating provider status
E0927 01:44:42.659696       1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists.
I0927 01:44:42.659762       1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine
I0927 01:44:42.660100       1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning"
W0927 01:44:42.688562       1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists.
E0927 01:44:42.688651       1 controller.go:326]  "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: The name 'jima-ipi-27-d97wp-worker-7qn9b' already exists." "controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="d765f02c-bd54-4e6c-88a4-c578f16c7149"
...
I0927 03:18:45.118110       1 actuator.go:66] jima-ipi-27-d97wp-worker-7qn9b: actuator creating machine
E0927 03:18:45.131676       1 actuator.go:57] jima-ipi-27-d97wp-worker-7qn9b error: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created
I0927 03:18:45.131725       1 machine_scope.go:101] jima-ipi-27-d97wp-worker-7qn9b: patching machine
I0927 03:18:45.131873       1 recorder.go:103] events "msg"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"jima-ipi-27-d97wp-worker-7qn9b","uid":"9535891b-902e-410c-b9bb-e6a57aa6b25a","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"17614"} "reason"="FailedCreate" "type"="Warning"
W0927 03:18:45.150393       1 controller.go:382] jima-ipi-27-d97wp-worker-7qn9b: failed to create machine: jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created
E0927 03:18:45.150492       1 controller.go:326]  "msg"="Reconciler error" "error"="jima-ipi-27-d97wp-worker-7qn9b: reconciler failed to Create machine: ServerFaultCode: The object 'vim.Task:task-11363480' has already been deleted or has not been completely created" "controller"="machine-controller" "name"="jima-ipi-27-d97wp-worker-7qn9b" "namespace"="openshift-machine-api" "object"={"name":"jima-ipi-27-d97wp-worker-7qn9b","namespace":"openshift-machine-api"} "reconcileID"="5d92bc1d-2f0d-4a0b-bb20-7f2c7a2cb5af"
I0927 03:18:45.150543       1 controller.go:187] jima-ipi-27-d97wp-worker-dsqd2: reconciling Machine


Expected results:

Machine is created successfully.

Additional info:

machine-controller log: http://file.rdu.redhat.com/~zhsun/machine-controller.log

Description of problem:

My CSV recently added a v1beta2 API version in addition to the existing v1beta1 version. When I create a v1beta2 CR and view it in the console, I see v1beta1 API fields and not the expected v1beta2 fields.

 
Version-Release number of selected component (if applicable):

4.15.14 (could affect other versions)

How reproducible:

Install 3.0.0 development version of Cryostat Operator

Steps to Reproduce:

    1. operator-sdk run bundle quay.io/ebaron/cryostat-operator-bundle:ocpbugs-34901
    2. cat << 'EOF' | oc create -f -
    apiVersion: operator.cryostat.io/v1beta2
    kind: Cryostat
    metadata:
      name: cryostat-sample
    spec:
      enableCertManager: false
    EOF
    3. Navigate to https://<openshift console>/k8s/ns/openshift-operators/clusterserviceversions/cryostat-operator.v3.0.0-dev/operator.cryostat.io~v1beta2~Cryostat/cryostat-sample
    4. Observe v1beta1 properties are rendered including "Minimal Deployment"
    5. Attempt to toggle "Minimal Deployment", observe that this fails.

Actual results:

v1beta1 properties are rendered in the details page instead of v1beta2 properties

Expected results:

v1beta2 properties are rendered in the details page

Additional info:

    

This is a clone of issue OCPBUGS-38842. The following is the description of the original issue:

Component Readiness has found a potential regression in the following test:

[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry

Probability of significant regression: 98.02%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=aws&Platform=aws&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=micro&Upgrade=micro&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Image%20Registry&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20unknown%20ha%20micro&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-22%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-15%2000%3A00%3A00&testId=openshift-tests-upgrade%3A10a9e2be27aa9ae799fde61bf8c992f6&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers%20for%20ns%2Fopenshift-image-registry

Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.

The problem appears to be a permissions error preventing the pods from starting:

2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied

Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489

Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:

container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch

With slightly different versions in each stream, but both were on 3-2.231.
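If the container-selinux bump is indeed the culprit, one way to confirm it (a sketch; assumes debug access to an affected node) is to look for recent AVC denials around the symlink creation:

$ oc debug node/<node-name> -- chroot /host ausearch -m avc -ts recent | grep -i ca-trust

A denial mentioning /etc/pki/ca-trust/extracted/pem/directory-hash would point at an SELinux policy change rather than plain file-permission drift.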

Hits other tests too:

operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

This is a clone of issue OCPBUGS-38474. The following is the description of the original issue:

Description of problem:

    AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it appears to be broken

Version-Release number of selected component (if applicable):

    4.16.5

How reproducible:

    Always

Steps to Reproduce:

1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17  
2. Refer to it in .spec.configuration.image.additionalTrustedCA.name     
3. image-registry-config-operator is not able to find the cm and the CO is degraded
    

Actual results:

   CO is degraded

Expected results:

    certs are used.

Additional info:

I think we may be missing a copy of the configmap from the cluster namespace to the target namespace. It should also be deleted when the original configmap is deleted.
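As a possible stop-gap while the wiring is fixed (a sketch only; it assumes the operator expects the CA configmap in the hosted cluster's openshift-config namespace, which should be verified before use), the configmap could be created there by hand with the same name and data:

$ oc --kubeconfig=<hosted-cluster-kubeconfig> -n openshift-config create configmap registry-additional-ca-q9f6x5i4 \
    --from-file=<registry-hostname>=/path/to/registry-ca.pem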

 

 % oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq
{
  "name": "registry-additional-ca-q9f6x5i4"
}

 

 

% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4
NAME                              DATA   AGE
registry-additional-ca-q9f6x5i4   1      16m

 

 

logs of cluster-image-registry operator

 

E0814 13:22:32.586416       1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing

 

 

CO is degraded

 

% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.16.5    True        False         False      3h58m
csi-snapshot-controller                    4.16.5    True        False         False      4h11m
dns                                        4.16.5    True        False         False      3h58m
image-registry                             4.16.5    True        False         True       3h58m   ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress                                    4.16.5    True        False         False      3h59m
insights                                   4.16.5    True        False         False      4h
kube-apiserver                             4.16.5    True        False         False      4h11m
kube-controller-manager                    4.16.5    True        False         False      4h11m
kube-scheduler                             4.16.5    True        False         False      4h11m
kube-storage-version-migrator              4.16.5    True        False         False      166m
monitoring                                 4.16.5    True        False         False      3h55m

 

 

This is a clone of issue OCPBUGS-42745. The following is the description of the original issue:

flowschemas.v1beta3.flowcontrol.apiserver.k8s.io is used in manifests/09_flowschema.yaml

Description of problem:

Failed to deploy the cluster with the following error:
time="2024-06-13T14:01:11Z" level=debug msg="Creating the security group rules"time="2024-06-13T14:01:19Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create security groups: failed to create the security group rule on group \"cb9a607c-9799-4186-bc22-26f141ce91aa\" for IPv4 tcp on ports 1936-1936: Bad request with: [POST https://10.46.44.159:13696/v2.0/security-group-rules], error message: {\"NeutronError\": {\"type\": \"SecurityGroupRuleParameterConflict\", \"message\": \"Conflicting value ethertype IPv4 for CIDR fd2e:6f44:5dd8:c956::/64\", \"detail\": \"\"}}"time="2024-06-13T14:01:20Z" level=debug msg="OpenShift Installer 4.17.0-0.nightly-2024-06-13-083330"time="2024-06-13T14:01:20Z" level=debug msg="Built from commit 6bc75dfebaca79ecf302263af7d32d50c31f371a"time="2024-06-13T14:01:20Z" level=debug msg="Loading Install Config..."time="2024-06-13T14:01:20Z" level=debug msg="  Loading SSH Key..."time="2024-06-13T14:01:20Z" level=debug msg="  Loading Base Domain..."time="2024-06-13T14:01:20Z" level=debug msg="    Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg="  Loading Cluster Name..."time="2024-06-13T14:01:20Z" level=debug msg="    Loading Base Domain..."time="2024-06-13T14:01:20Z" level=debug msg="    Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg="  Loading Pull Secret..."time="2024-06-13T14:01:20Z" level=debug msg="  Loading Platform..."time="2024-06-13T14:01:20Z" level=debug msg="Using Install Config loaded from state file"time="2024-06-13T14:01:20Z" level=debug msg="Loading Agent Config..."time="2024-06-13T14:01:20Z" level=info msg="Waiting up to 40m0s (until 2:41PM UTC) for the cluster at https://api.ostest.shiftstack.com:6443 to initialize..."

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-13-083330

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-41111. The following is the description of the original issue:

Description of problem:

v4.17 baselineCapabilitySet is not recognized.
  
# ./oc adm release extract --install-config v4.17-basecap.yaml --included --credentials-requests --from quay.io/openshift-release-dev/ocp-release:4.17.0-rc.1-x86_64 --to /tmp/test

error: unrecognized baselineCapabilitySet "v4.17"

# cat v4.17-basecap.yaml
---
apiVersion: v1
platform:
  gcp:
    foo: bar
capabilities:
  baselineCapabilitySet: v4.17 

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-04-132247

How reproducible:

    always

Steps to Reproduce:

    1. Run `oc adm release extract --install-config --included` against an install-config file including baselineCapabilitySet: v4.17. 
    2.
    3.
    

Actual results:

    `oc adm release extract` throw unrecognized error

Expected results:

    `oc adm release extract` should extract correct manifests

Additional info:

    If specifying baselineCapabilitySet: v4.16, it works well. 

Description of problem:

revert "force cert rotation every couple days for development" in 4.16

Below is the steps to verify this bug:

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                7764681777edfa3126981a0a1d390a6060a840a3

# git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307"
08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation

# oc get clusterversions.config.openshift.io 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         64m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

$ cat scripts/check_secret_expiry.sh
FILE="$1"
if [ ! -f "$1" ]; then
  echo "must provide \$1" && exit 0
fi
export IFS=$'\n'
for i in `cat "$FILE"`
do
  if `echo "$i" | grep "^#" > /dev/null`; then
    continue
  fi
  NS=`echo $i | cut -d ' ' -f 1`
  SECRET=`echo $i | cut -d ' ' -f 2`
  rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null
  echo "Check cert dates of $SECRET in project $NS:"
  openssl x509 -noout --dates -in tls.crt; echo
done

$ cat certs.txt
openshift-kube-controller-manager-operator csr-signer-signer
openshift-kube-controller-manager-operator csr-signer
openshift-kube-controller-manager kube-controller-manager-client-cert-key
openshift-kube-apiserver-operator aggregator-client-signer
openshift-kube-apiserver aggregator-client
openshift-kube-apiserver external-loadbalancer-serving-certkey
openshift-kube-apiserver internal-loadbalancer-serving-certkey
openshift-kube-apiserver service-network-serving-certkey
openshift-config-managed kube-controller-manager-client-cert-key
openshift-config-managed kube-scheduler-client-cert-key
openshift-kube-scheduler kube-scheduler-client-cert-key

Checking the certs, they have one-day expiry times, as expected.
# ./check_secret_expiry.sh certs.txt
Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:41:38 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of csr-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:52:21 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator:
notBefore=Jun 27 04:41:37 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of aggregator-client in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:49 2022 GMT
notAfter=Jul 27 04:52:50 2022 GMT

Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:28 2022 GMT
notAfter=Jul 27 04:52:29 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
# 

# cat check_secret_expiry_within.sh
#!/usr/bin/env bash
# usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year
WITHIN=${1:-24hours}
echo "Checking validity within $WITHIN ..."
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before")  \(.metadata.annotations."auth.openshift.io/certificate-not-after")  \(.metadata.namespace)\t\(.metadata.name)"'

# ./check_secret_expiry_within.sh 1day
Checking validity within 1day ...
2022-06-27T04:41:37Z  2022-06-28T04:41:37Z  openshift-kube-apiserver-operator	aggregator-client-signer
2022-06-27T04:52:26Z  2022-06-28T04:41:37Z  openshift-kube-apiserver	aggregator-client
2022-06-27T04:52:21Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer
2022-06-27T04:41:38Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer-signer

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

Description of problem:

If an infra-id (which is uniquely generated by the installer) is reused, the installer will fail with:

level=info msg=Creating private Hosted Zone
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create private hosted zone: error creating private hosted zone: HostedZoneAlreadyExists: A hosted zone has already been created with the specified caller reference.


Users should not be reusing installer state in this manner, but we do it purposefully in our ipi-install-install step to mitigate infrastructure provisioning flakes:

https://steps.ci.openshift.org/reference/ipi-install-install#line720

We can fix this by ensuring the caller ref is unique on each invocation.
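The caller reference is Route 53's idempotency token, so re-using one from a previous invocation is what returns HostedZoneAlreadyExists. A sketch of the intended behaviour using the AWS CLI (hypothetical values):

$ aws route53 create-hosted-zone \
    --name mycluster.example.com \
    --caller-reference "$(uuidgen)" \
    --vpc VPCRegion=us-east-2,VPCId=vpc-0123456789abcdef0

Generating a fresh unique value per invocation is the equivalent of what the installer should do for its caller reference.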

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-36871. The following is the description of the original issue:

Description of problem:

Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15.
During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". 
During node provisioning triggered by MachineSet scaling, the Machine can successfully be created at the cloud provider, but the Node is never added to the cluster. 
The CSRs remain pending and do not get auto-approved

This issue is eventually related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

   CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually. 
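As a manual mitigation (standard commands, shown here only as a sketch, and subject to the caveat above that manually approved nodes join with a different domain name), the pending CSRs can be listed and approved by hand:

$ oc get csr | grep Pending
$ oc get csr -o name | xargs oc adm certificate approve   # approves every outstanding CSR; inspect the list first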

Expected results:

    CSRs should get approved automatically and the domain name scheme should not change.

Additional info:

    

Description of problem:

user.openshift.io and oauth.openshift.io APIs are not available in an external OIDC cluster, which causes all of the common pull/push of blobs from/to the image registry to fail.

Version-Release number of selected component (if applicable):

4.15.15

How reproducible:

always

Steps to Reproduce:

1.Create a ROSA HCP cluster which configured external oidc users
2.Push data to image registry under a project
oc new-project wxj1
oc new-build httpd~https://github.com/openshift/httpd-ex.git 
3.

Actual results:

$ oc logs -f build/httpd-ex-1
Cloning "https://github.com/openshift/httpd-ex.git" ...	Commit:	1edee8f58c0889616304cf34659f074fda33678c (Update httpd.json)	Author:	Petr Hracek <phracek@redhat.com>	Date:	Wed Jun 5 13:00:09 2024 +0200time="2024-06-12T09:55:13Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"I0612 09:55:13.306937       1 defaults.go:112] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].Caching blobs under "/var/cache/blobs".Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...Trying to pull image-registry.openshift-image-registry.svc:5000/openshift/httpd@sha256:765aa645587f34e310e49db7cdc97e82d34122adb0b604eea891e0f98050aa77...Warning: Pull failed, retrying in 5s ...error: build error: After retrying 2 times, Pull image still failed due to error: unauthorized: unable to validate token: NotFound


oc logs -f deploy/image-registry -n openshift-image-registry

time="2024-06-12T09:55:13.36003996Z" level=error msg="invalid token: the server could not find the requested resource (get users.user.openshift.io ~)" go.version="go1.20.12 X:strictfipsruntime" http.request.host="image-registry.openshift-image-registry.svc:5000" http.request.id=0c380b81-99d4-4118-8de3-407706e8767c http.request.method=GET http.request.remoteaddr="10.130.0.35:50550" http.request.uri="/openshift/token?account=serviceaccount&scope=repository%3Aopenshift%2Fhttpd%3Apull" http.request.useragent="containers/5.28.0 (github.com/containers/image)"

Expected results:

Should pull/push blob from/to image registry on external oidc cluster

Additional info:

 

This is a clone of issue OCPBUGS-43428. The following is the description of the original issue:

As part of TRT investigations of k8s API disruptions, we have discovered there are times when haproxy considers the underlying apiserver as Down, yet from the k8s perspective the apiserver is healthy and functional.

From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.

Thorough investigation leads us to the following difference in how haproxy perceives apiserver being alive versus how k8s perceives it, i.e.

inter 1s fall 2 rise 3

and

     readinessProbe:
      httpGet:
        scheme: HTTPS
        port: 6443
        path: readyz
      initialDelaySeconds: 0
      periodSeconds: 5
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3

We can see the top check is much stricter. And it belongs to haproxy. As a result, haproxy sees the following

2024-10-08T12:37:32.779247039Z [WARNING]  (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.

much faster than k8s would consider something as wrong.

In order to remediate this issue, it has been agreed the haproxy checks should be softened and adjusted to the k8s readiness probe.

This is a clone of issue OCPBUGS-35297. The following is the description of the original issue:

Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the kube update exposed or introduced some kind of bug.

 

This is a clear regression and it is only present on 4.17, not 4.16.  It is present across all platforms, though I've selected AWS for links and screenshots.

 

4.17 graph - shows the change

4.16 graph - shows no change

slack thread if there are questions

courtesy screen shot

Description of problem

Similar to OCPBUGS-20061, but for a different situation:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=48h&name=pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep 'failures match' | sort
pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling (all) - 15 runs, 60% failed, 33% of failures match = 20% impact

In that test, since ETCD-329, the test suite deletes a control-plane Machine and waits for the ControlPlaneMachineSet controller to scale in a replacement. But in runs like this, the outgoing Node goes Ready=Unknown for not-yet-diagnosed reasons, and that somehow misses cpmso#294's inertia (maybe the running guard should be dropped?), and the ClusterOperator goes Available=False complaining about Missing 1 available replica(s).

It's not clear from the message which replica it's worried about (that would be helpful information to include in the message), but I suspect it's the Machine/Node that's in the deletion process. But regardless of the message, this does not seem like a situation worth a cluster-admin-midnight-page Available=False alarm.

Version-Release number of selected component

Seen in dev-branch CI. I haven't gone back to check older 4.y.

How reproducible

CI Search shows 20% impact, see my earlier query in this message.

Steps to Reproduce

Run a bunch of pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling and check CI Search results.

Actual results

20% impact

Expected results

No hits.

As a dev I'd like to run the HO locally smoothly.
Usually when you run it locally you scale down the in-cluster deployment so the two do not interfere.
However, the HO code expects a pod to always be running.
We should improve the dev UX and remove that hard dependency.
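The local-dev flow implied above looks roughly like this (a sketch; the deployment name and namespace assume a default `hypershift install`):

$ oc -n hypershift scale deployment/operator --replicas=0   # stop the in-cluster HO
$ # ...then run the operator binary locally from the source tree

Today the locally-run copy trips over the code path that expects an HO pod to exist in the cluster; removing that assumption is the ask here.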

Description of problem:
Affects only developers with a local build.

Version-Release number of selected component (if applicable):
4.15

How reproducible:
Always

Steps to Reproduce:
Build and run the console locally.

Actual results:
The user toggle menu isn't shown, so developers cannot access the user preference, such as the language or theme.

Expected results:
The user toggle should be there.

Please review the following PR: https://github.com/openshift/images/pull/183

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-37988. The following is the description of the original issue:

Description of problem:

    In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Open /settings/cluster using Firefox with Dark mode selected
    2.
    3.
    

Actual results:

    The version numbers under Update status are black

Expected results:

    The version numbers under Update status are white

Additional info:

    

Description of problem:

Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented.

On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power.

Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful.

[1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371

Version-Release number of selected component (if applicable):

Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions    

How reproducible:

Always

Steps to Reproduce:

    1. Deploy SNO node using ACM and fakefish as redfish interface
    2. Check metal3-ironic pod logs    

Actual results:

We can see a soft power_off command sent to the ironic agent running on the ramdisk:

2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197
2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234

Expected results:

There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.

Additional info:

    

This is a clone of issue OCPBUGS-38571. The following is the description of the original issue:

Description of problem:

Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-multi-2024-08-15-212448    

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", then optionally insert interested settings (see [1])
2. "create cluster", and make sure the cluster turns healthy finally (see [2])
3. check the cluster's addresses on GCP (see [3])
4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])

Actual results:

The global address "<infra id>-apiserver" is not deleted during "destroy cluster".

Expected results:

Everything belonging to the cluster should get deleted during "destroy cluster".    

Additional info:

FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306    

Description of the problem:

After ACM installation, namespace `local-cluster` contains AgentClusterInstall and ClusterDeployment but not InfraEnv:

[kni@provisionhost-0-0 ~]$ oc project local-cluster 
Now using project "local-cluster" on server "https://api.ocp-edge-cluster-0.qe.lab.redhat.com:6443".
[kni@provisionhost-0-0 ~]$ oc get aci
NAME            CLUSTER         STATE
local-cluster   local-cluster   adding-hosts
[kni@provisionhost-0-0 ~]$ oc get clusterdeployment
NAME            INFRAID   PLATFORM          REGION   VERSION                              CLUSTERTYPE   PROVISIONSTATUS   POWERSTATE   AGE
local-cluster             agent-baremetal            4.16.0-0.nightly-2024-06-03-060250                 Provisioned       Running      61m
[kni@provisionhost-0-0 ~]$ oc get infraenv
No resources found in local-cluster namespace.
[kni@provisionhost-0-0 ~]$ 

 

 

How reproducible:

100%

Steps to reproduce:

1. Deploy OCP 4.16

2. Deploy ACM build 2.11.0-DOWNSTREAM-2024-06-03-20-28-43

3. Execute `oc get infraenv -n local-cluster`

Actual results:

No infraenvs displayed.

Expected results:

Local-cluster infraenv displayed.

Description of problem:

The instructions for running a local console with authentication documented in https://github.com/openshift/console?tab=readme-ov-file#openshift-with-authentication appear to no longer work.

Steps to Reproduce:

    1.  Follow the steps in https://github.com/openshift/console?tab=readme-ov-file#openshift-with-authentication     

Actual results:

console on  master [$?] via 🐳 desktop-linux via 🐹 v1.19.5 via  v18.20.2 via 💎 v3.1.3 on ☁️  openshift-dev (us-east-1) 
❯ ./examples/run-bridge.sh
++ oc whoami --show-token
++ oc whoami --show-server
++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.alertmanagerPublicURL}'
++ oc -n openshift-config-managed get configmap monitoring-shared-config -o 'jsonpath={.data.thanosPublicURL}'
+ ./bin/bridge --base-address=http://localhost:9000 --ca-file=examples/ca.crt --k8s-auth=bearer-token --k8s-auth-bearer-token=sha256~EEIDh9LGIzlQ83udktABnEEIse3bintNzKNBJwQfvNI --k8s-mode=off-cluster --k8s-mode-off-cluster-endpoint=https://api.rhamilto.devcluster.openshift.com:6443 --k8s-mode-off-cluster-skip-verify-tls=true --listen=http://127.0.0.1:9000 --public-dir=./frontend/public/dist --user-auth=openshift --user-auth-oidc-client-id=console-oauth-client --user-auth-oidc-client-secret-file=examples/console-client-secret --user-auth-oidc-ca-file=examples/ca.crt --k8s-mode-off-cluster-alertmanager=https://alertmanager-main-openshift-monitoring.apps.rhamilto.devcluster.openshift.com --k8s-mode-off-cluster-thanos=https://thanos-querier-openshift-monitoring.apps.rhamilto.devcluster.openshift.com
W0515 11:18:55.835781   57122 authoptions.go:103] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
F0515 11:18:55.836030   57122 authoptions.go:299] Error initializing authenticator: file examples/ca.crt contained no CA data

Expected results:

Local console with authentication should work.
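One quick sanity check (a sketch) before re-running the script is whether examples/ca.crt actually holds a PEM certificate, since the failure says the file contained no CA data:

$ test -s examples/ca.crt && openssl x509 -in examples/ca.crt -noout -subject -enddate

If that prints nothing or errors out, the CA extraction step in the documented setup is what needs fixing.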

Description of problem

Seen in a 4.17 nightly-to-nightly CI update:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator") | .reason' | sort | uniq -c | sort -n | tail -n3
     82 Pulled
     82 Started
   2116 ValidatingAdmissionPolicyUpdated
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade/1809154554084724736/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c
    705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed
    705 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed
    706 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed

I'm not sure what those are about (which may be a bug on its own? Would be nice to know what changed), but it smells like a hot loop to me.

Version-Release number of selected component

Seen in 4.17. Not clear yet how to audit for exposure frequency or versions, short of teaching the origin test suite to fail if it sees too many of these kinds of events? Maybe a "for openshift-... namespaces" version of the current "events should not repeat pathologically in e2e namespaces" test-case? Which we may have, but it's not tripping?

How reproducible

Besides the initial update, also seen in this 4.17.0-0.nightly-2024-07-05-091056 serial run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1809154615350923264/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .reason == "ValidatingAdmissionPolicyUpdated").message' | sort | uniq -c
   1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/machine-configuration-guards because it changed
   1006 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/managed-bootimages-platform-check because it changed
   1007 Updated ValidatingAdmissionPolicy.admissionregistration.k8s.io/mcn-guards because it changed

So possibly every time, in all 4.17 clusters?

Steps to Reproduce

1. Unclear. Possibly just install 4.17.
2. Run oc -n openshift-machine-config-operator get -o json events | jq -r '.items[] | select(.reason == "ValidatingAdmissionPolicyUpdated")'.

Actual results

Thousands of hits.

Expected results

Zero to few hits.

This is a clone of issue OCPBUGS-42115. The following is the description of the original issue:

This is a clone of issue OCPBUGS-41184. The following is the description of the original issue:

Description of problem:

    The disk and instance types for gcp machines should be validated further. The current implementation provides validation for each individually, but the disk types and instance types should be checked against each other for valid combinations.

The attached spreadsheet displays the combinations of valid disk and instance types.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

In assisted-service, we check that a hostname contains 64 or fewer characters:
https://github.com/openshift/assisted-service/blob/master/internal/host/hostutil/host_utils.go#L28

But the kubelet will fail to register a node if its hostname has 64 characters:
https://access.redhat.com/solutions/7068042

This issue was faced by a customer in case https://access.redhat.com/support/cases/#/case/03876918
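In other words, the accepted maximum should be 63 characters, matching the Kubernetes label-value limit, not 64. A minimal pre-check on a host (sketch):

$ test "$(printf %s "$(hostname -f)" | wc -c)" -le 63 && echo OK || echo "hostname too long for kubelet node registration"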

How reproducible:
Always
 

Steps to reproduce:

1. Have a node with a hostname made of 64 characters

2. Install the cluster and notice that the node will be marked as "Pending user action"

3. Look at the kubelet log on the node, and notice the following error:

 Jul 22 12:53:39 synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com kubenswrapper[4193]: E0722 12:53:39.167645    4193 kubelet_node_status.go:94] "Unable to register node with API server" err="Node \"synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com\" is invalid: metadata.labels: Invalid value: \"synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com\": must be no more than 63 characters" node="synergywatson-control-plane-1.private.openshiftvcn.oraclevcn.com"

Actual results:
Cluster installation fails
 

Expected results:
The assisted installer does not allow the installation of a cluster with a hostname of 64 characters or longer

/cc Alona Kaplan

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/79

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    If a deployment is created using BuildConfig as the build option and then edited, the build option is shown as Shipwright, and an error occurs on Save

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Install Shipwright
    2. Create a deployment using BuildConfig as build option
    3. Edit it and see build option dropdown
    4. Click on Save
    

Actual results:

    Build option is set as Shipwright

Expected results:

    Build option should be BuildConfig

Additional info:

    

Description of problem:

When using the registry-overrides flag to override registries for control plane components, it seems like the current implementation propagates the override to some data plane components. 

It seems that certain components like multus, dns, and ingress get values for their containers' images from env vars set in operators on the control plane (cno/dns operator/konnectivity), and hence also get the overridden registry propagated to them. 

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1.Input a registry override through the HyperShift Operator
    2.Check registry fields for components on data plane
    3.
    

Actual results:

Data plane components that get registry values from env vars set in dns-operator, ingress-operator, cluster-network-operator, and cluster-node-tuning-operator get overridden registries. 
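One way to spot-check this on an affected cluster (a sketch; the object names are the usual defaults and may differ) is to dump the images actually used by the data-plane components and look for the overridden registry:

$ oc -n openshift-multus get ds multus -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
$ oc -n openshift-dns get ds dns-default -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'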

Expected results:

Overridden registries should not get propagated to the data plane

Additional info:

    

Description of problem:

    The capitalization of "Import from Git" is inconsistent between the serverless and git repository buttons in the Add page

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    Always

Steps to Reproduce:

    1. Install serverless operator
    2. In the developer perspective click +Add
    3. Observe "Add from git" is inconsistently capitalized
    

Actual results:

    The capitalization is inconsistent

Expected results:

    The capitalization is consistent

Additional info:

    

This is a clone of issue OCPBUGS-38281. The following is the description of the original issue:

Description of problem:

We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
at:
github.com/openshift/cluster-openshift-controller-manager-operator/pkg/operator/internalimageregistry/cleanup_controller.go:146 +0xd65

This is a clone of issue OCPBUGS-39232. The following is the description of the original issue:

Description of problem:

    The smoke test for OLM run by the OpenShift e2e suite is specifying an unavailable operator for installation, causing it to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always (when using 4.17+ catalog versions)

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-32812. The following is the description of the original issue:

Description of problem:

    When the image from a build is rolling out on the nodes, the update progress on the node is not displaying correctly. 

Version-Release number of selected component (if applicable):

    

How reproducible:

Always     

Steps to Reproduce:

    1. Enable OCL functionality 
    2. Opt the pool in by MachineOSConfig 
    3. Wait for the image to build and roll out
    4. Track mcp update status by oc get mcp 
    

Actual results:

The MCP starts with 0 ready machines. While 1-2 machines have already been updated, the ready count still remains 0. The count only jumps to 3 when all the machines are ready.     

Expected results:

The update progress should be reflected in the mcp status correctly. 

Additional info:

    

Description of problem:

In https://issues.redhat.com//browse/STOR-1453 (TLSSecurityProfile feature), storage clustercsidriver.spec.observedConfig gets the value from APIServer.spec.tlsSecurityProfile to set cipherSuites and minTLSVersion in all corresponding CSI drivers. However, it doesn't work well in a hypershift cluster: when a different value is set only in hostedclusters.spec.configuration.apiServer.tlsSecurityProfile in the management cluster, the APIServer.spec in the hosted cluster is not synced and the CSI driver doesn't get the updated value either. 

Version-Release number of selected component (if applicable):

Pre-merge test with openshift/csi-operator#69,openshift/csi-operator#71

How reproducible:

Always

Steps to Reproduce:

1. Have a hypershift cluster, the clustercsidriver get the default value like "minTLSVersion": "VersionTLS12"
$ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo
{
  "cipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "minTLSVersion": "VersionTLS12"
}
 
2. set the tlsSecurityProfile in hostedclusters.spec.configuration.apiServer in mgmtcluster, like the "minTLSVersion": "VersionTLS11":
 $ oc -n clusters get hostedclusters hypershift-ci-14206 -o json | jq .spec.configuration
{
  "apiServer": {
    "audit": {
      "profile": "Default"
    },
    "tlsSecurityProfile": {
      "custom": {
        "ciphers": [
          "ECDHE-ECDSA-CHACHA20-POLY1305",
          "ECDHE-RSA-CHACHA20-POLY1305",
          "ECDHE-RSA-AES128-GCM-SHA256",
          "ECDHE-ECDSA-AES128-GCM-SHA256"
        ],
        "minTLSVersion": "VersionTLS11"
      },
      "type": "Custom"
    }
  }
}     

3. This doesn't pass to apiserver in hosted cluster
oc get apiserver cluster -ojson | jq .spec
{
  "audit": {
    "profile": "Default"
  }
}     

4. CSI Driver still use the default value which is different from mgmtcluster.hostedclusters.spec.configuration.apiServer
$ oc get clustercsidriver ebs.csi.aws.com -ojson | jq .spec.observedConfig.targetcsiconfig.servingInfo
{
  "cipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "minTLSVersion": "VersionTLS12"
}

Actual results:

The tlsSecurityProfile doesn't get synced 

Expected results:

The tlsSecurityProfile should get synced 

Additional info:

    

Please review the following PR: https://github.com/openshift/router/pull/602

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/210

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38657. The following is the description of the original issue:

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535

Description of problem:

INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision...
E0819 14:17:33.676051    2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
E0819 14:17:33.708233    2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
I0819 14:17:33.708279    2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"

Openshift Dedicated is in the process of developing an offering of GCP clusters that uses only short-lived credentials from the end user. For these clusters to be deployed, the pod running the Openshift Installer needs to function with GCP credentials that fit the short-lived credential formats. This worked in prior Installer versions, such as 4.14, but was not an explicit requirement.
 

This is a clone of issue OCPBUGS-29528. The following is the description of the original issue:

Description of problem:

Camel K provides a list of Kamelets that are able to act as an event source or sink for a Knative eventing message broker.

Usually the list of Kamelets installed with the Camel K operator are displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration".

When a user adds a custom Kamelet custom resource to the user namespace the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.

Version-Release number of selected component (if applicable):

    

How reproducible:

Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.

Steps to Reproduce:

    1. install global Camel K operator in operator namespace (e.g. openshift-operators)
    2. list all available event sources in "default" user namespace and see all Kamelets listed as event sources/sinks
    3. add a custom Kamelet custom resource to the default namespace
    4. see the list of available event sources only listing the custom Kamelet and the default Kamelets are gone from that list
    

Actual results:

Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.    

Expected results:

Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.   

Additional info:

Reproduced with Camel K operator 2.2 and OCP 4.14.8

screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link

 

Description of the problem:

Recently, image registries listed in the assisted mirror config map are triggering a validation error if their registry is not listed in the spoke pull secret (And spoke fails to deploy). The error is:

message: 'The Spec could not be synced due to an input error: pull secret for
      new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"' 

Mirror registries should be automatically excluded per this doc, and have been up until this point. This issue just started happening, so it appears some change has caused this.

 

This will impact:

  • customers using disconnected registries 
  • RH internal using icsp for pre-ga images (brew, etc..)

 

Versions:

So far this occurs with 2.11.0-DOWNSTREAM-2024-05-17-03-42-35, but we are still checking other y/z streams 

 

How reproducible:

100%

 

Steps to reproduce:

1. Create a mirror config map such as in this doc and ensure registry.ci.openshift.org is an entry

2. Create an imagecontentsource policy containing 
registry.ci.openshift.org/ocp/release:4.15.0-0.ci-2024-05-21-131652
3. Deploy a spoke cluster using CRDs with the above 4.15 ci image - ensure the spoke pull secret does NOT contain an entry for registry.ci.openshift.org

 

Actual results:

Spoke will fail deployment - aci will show

message: 'The Spec could not be synced due to an input error: pull secret for
      new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"' 
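To see which registries the spoke pull secret actually carries auth for (a sketch; adjust the secret name and namespace to wherever the spoke pull secret referenced by the ClusterDeployment lives), decode it and list the keys:

$ oc -n <spoke-namespace> get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq -r '.auths | keys[]'

With the new behaviour, any registry listed in the mirror config map must also show up here, which is the regression being reported.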

Expected results:

Spoke deployment proceeds and deploys successfully

 

 

Work around

 

 

Description of problem:

We need to bump the Kubernetes version to the latest API version OCP is using.

This is what was done last time:

https://github.com/openshift/cluster-samples-operator/pull/409

Find latest stable version from here: https://github.com/kubernetes/api

This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities

    

Version-Release number of selected component (if applicable):


    

How reproducible:

Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
    

Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/1040

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Customer defines a proxy in its HostedCluster resource definition. The variables are propagated to some pods but not to the oauth one:

 oc describe pod kube-apiserver-5f5dbf78dc-8gfgs | grep PROX
      HTTP_PROXY:   http://ocpproxy.corp.example.com:8080
      HTTPS_PROXY:  http://ocpproxy.corp.example.com:8080
      NO_PROXY:     .....
oc describe pod oauth-openshift-6d7b7c79f8-2cf99| grep PROX
      HTTP_PROXY:   socks5://127.0.0.1:8090
      HTTPS_PROXY:  socks5://127.0.0.1:8090
      ALL_PROXY:    socks5://127.0.0.1:8090
      NO_PROXY:     kube-apiserver

 

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster

...

spec:
  autoscaling: {}
  clusterID: 9c8db607-b291-4a72-acc7-435ec23a72ea
  configuration:

   .....
    proxy:
      httpProxy: http://ocpproxy.corp.example.com:8080
      httpsProxy: http://ocpproxy.corp.example.com:8080

 

Version-Release number of selected component (if applicable): 4.14
 

Description of problem:

When setting the env var OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP to keep the bootstrap node and launching a CAPI-based installation, the installer exits with an error while "collecting applied cluster api manifests...", since the local Cluster API control plane was already stopped.


06-20 15:26:51.216  level=debug msg=Machine jima417aws-gjrzd-bootstrap is ready. Phase: Provisioned
06-20 15:26:51.216  level=debug msg=Checking that machine jima417aws-gjrzd-master-0 has provisioned...
06-20 15:26:51.217  level=debug msg=Machine jima417aws-gjrzd-master-0 has status: Provisioned
06-20 15:26:51.217  level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-0...
06-20 15:26:51.217  level=debug msg=Checked IP InternalDNS: ip-10-0-50-47.us-east-2.compute.internal
06-20 15:26:51.217  level=debug msg=Found internal IP address: 10.0.50.47
06-20 15:26:51.217  level=debug msg=Machine jima417aws-gjrzd-master-0 is ready. Phase: Provisioned
06-20 15:26:51.217  level=debug msg=Checking that machine jima417aws-gjrzd-master-1 has provisioned...
06-20 15:26:51.217  level=debug msg=Machine jima417aws-gjrzd-master-1 has status: Provisioned
06-20 15:26:51.217  level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-1...
06-20 15:26:51.218  level=debug msg=Checked IP InternalDNS: ip-10-0-75-199.us-east-2.compute.internal
06-20 15:26:51.218  level=debug msg=Found internal IP address: 10.0.75.199
06-20 15:26:51.218  level=debug msg=Machine jima417aws-gjrzd-master-1 is ready. Phase: Provisioned
06-20 15:26:51.218  level=debug msg=Checking that machine jima417aws-gjrzd-master-2 has provisioned...
06-20 15:26:51.218  level=debug msg=Machine jima417aws-gjrzd-master-2 has status: Provisioned
06-20 15:26:51.218  level=debug msg=Checking that IP addresses are populated in the status of machine jima417aws-gjrzd-master-2...
06-20 15:26:51.218  level=debug msg=Checked IP InternalDNS: ip-10-0-60-118.us-east-2.compute.internal
06-20 15:26:51.218  level=debug msg=Found internal IP address: 10.0.60.118
06-20 15:26:51.218  level=debug msg=Machine jima417aws-gjrzd-master-2 is ready. Phase: Provisioned
06-20 15:26:51.218  level=info msg=Control-plane machines are ready
06-20 15:26:51.218  level=info msg=Cluster API resources have been created. Waiting for cluster to become ready...
06-20 15:26:51.219  level=warning msg=OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP is set, shutting down local control plane.
06-20 15:26:51.219  level=info msg=Shutting down local Cluster API control plane...
06-20 15:26:51.473  level=info msg=Stopped controller: Cluster API
06-20 15:26:51.473  level=info msg=Stopped controller: aws infrastructure provider
06-20 15:26:52.830  level=info msg=Local Cluster API system has completed operations
06-20 15:26:52.830  level=debug msg=Collecting applied cluster api manifests...
06-20 15:26:52.831  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: [failed to get manifest openshift-cluster-api-guests: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest default: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities/default": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/clusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsclusters/jima417aws-gjrzd": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/infrastructure.cluster.x-k8s.io/v1beta2/namespaces/openshift-cluster-api-guests/awsmachines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-0: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-0": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-1: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-1": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master-2: Get "https://127.0.0.1:46555/apis/cluster.x-k8s.io/v1beta1/namespaces/openshift-cluster-api-guests/machines/jima417aws-gjrzd-master-2": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-bootstrap: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-bootstrap": dial tcp 127.0.0.1:46555: connect: connection refused, failed to get manifest jima417aws-gjrzd-master: Get "https://127.0.0.1:46555/api/v1/namespaces/openshift-cluster-api-guests/secrets/jima417aws-gjrzd-master": dial tcp 127.0.0.1:46555: connect: connection refused]

Version-Release number of selected component (if applicable):

4.16/4.17 nightly build

How reproducible:

always

Steps to Reproduce:

1. Set ENV OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP 
2. Trigger the capi-based installation
3.
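
For reference, a minimal sketch of steps 1-2 (the install directory name is arbitrary; the environment variable is the one named in step 1):

export OPENSHIFT_INSTALL_PRESERVE_BOOTSTRAP=true
openshift-install create cluster --dir ./capi-install --log-level debug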

Actual results:

Installer exited when collecting capi manifests.

Expected results:

Installation should be successful.

Additional info:

 

 

This is a clone of issue OCPBUGS-38966. The following is the description of the original issue:

Description of problem:

    installing into GCP shared VPC with BYO hosted zone failed with error "failed to create the private managed zone"

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-multi-2024-08-26-170521

How reproducible:

    Always

Steps to Reproduce:

    1. pre-create the DNS private zone in the service project, with the zone's DNS name like "<cluster name>.<base domain>", and bind it to the shared VPC (see the sketch after these steps)
    2. activate the service account having minimum permissions, i.e. no permission to bind a private zone to the shared VPC in the host project (see [1])
    3. "create install-config" and then insert the interested settings (e.g. see [2])
    4. "create cluster"     

Actual results:

    It still tries to create a private zone, which is unexpected.

failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create the private managed zone: failed to create private managed zone: googleapi: Error 403: Forbidden, forbidden

Expected results:

    The installer should use the pre-configured dns private zone, rather than try to create a new one. 

Additional info:

The 4.16 epic adding the support: https://issues.redhat.com/browse/CORS-2591

One PROW CI test which succeeded using Terraform installation: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-4.17-upgrade-from-stable-4.17-gcp-ipi-xpn-mini-perm-byo-hosted-zone-arm-f28/1821177143447523328

The PROW CI test which failed: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-xpn-mini-perm-byo-hosted-zone-amd-f28-destructive/1828255050678407168

Description of the problem:

When an installation uses a proxy whose password contains special characters, the proxy variables are not passed to the agent. The special characters are URL-encoded (e.g. %2C).

How reproducible:

Always

Steps to reproduce:

1. Set up an environment with a proxy whose password contains special characters; the special characters should be URL-encoded (see the example after these steps).

2.

3.
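
For illustration of what the URL-encoded proxy credentials look like, a sketch with made-up credentials and proxy host (not the customer's actual configuration):

PROXY_PASS_ENC=$(python3 -c 'import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1], safe=""))' 'p@ss,word')
# "," becomes %2C, "@" becomes %40, etc.
export HTTPS_PROXY="http://proxyuser:${PROXY_PASS_ENC}@proxy.example.com:3128"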

Actual results:

The agent ignored the proxy variables, so it tried to connect to the destinations without the proxy and failed.

Expected results:

Agent should use the proxy with the special characters.

Slack thread: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1715779618711999

There is a solution for similar issue here: https://access.redhat.com/solutions/7053351 (not assisted).
 

Description of problem:

In the Quick Start guided tour, the user needs to click the "Next" button twice to move to the next step. If you skip the alert (Yes/No input) and click the "Next" button, the first click does nothing.

Steps to Reproduce

  1. Open any Quick Start from quick start catalog page
  2. Click Start
  3. Click "yes" for the "Check your work Alert" on the first step
  4. Click Next button to go to second step
  5. Skip the "Check your work Alert" and click Next Button

Actual results:

The Next button does not respond to the first click

Expected results:

The Next button should navigate to the next step whether or not the user has answered the alert message.

Reproducibility (Always/Intermittent/Only Once): Always

This is a clone of issue OCPBUGS-34849. The following is the description of the original issue:

Description of problem:

On 4.17, ABI jobs fail with error

level=debug msg=Failed to register infra env. Error: 1 error occurred:
level=debug msg=	* mac-interface mapping for interface eno12399np0 is missing

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-05-24-193308

How reproducible:

On Prow CI ABI jobs, always

Steps to Reproduce:

    1. Generate ABI ISO starting with an agent-config file defining multiple network interfaces with `enabled: false`
    2. Boot the ISO
    3. Wait for error
    

Actual results:

    Install fails with error 'mac-interface mapping for interface xxxx is missing'

Expected results:

    Install completes

Additional info:

The check fails on the 1st network interface defined with `enabled: false`
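
For illustration, a minimal sketch of an agent-config.yaml host entry with a second NIC disabled in the nmstate networkConfig (hostnames, interface names, MACs and addresses are made up; the agent-config.yaml linked below is the real reproducer):

cat > agent-config.yaml << 'EOF'
apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: example-cluster
rendezvousIP: 192.168.111.10
hosts:
  - hostname: master-0
    interfaces:                 # mac-interface mapping
      - name: eno1
        macAddress: 52:54:00:aa:bb:01
      - name: eno12399np0
        macAddress: 52:54:00:aa:bb:02
    networkConfig:
      interfaces:
        - name: eno1
          type: ethernet
          state: up
          ipv4:
            enabled: true
            dhcp: false
            address:
              - ip: 192.168.111.10
                prefix-length: 24
        - name: eno12399np0     # present but disabled
          type: ethernet
          state: down
          ipv4:
            enabled: false
          ipv6:
            enabled: false
EOF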

Prow CI ABI Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808

agent-config.yaml: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808/artifacts/baremetal-pxe-ha-agent-ipv4-static-connected-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml

Description of problem:

Network policy doesn't work properly during SDN live migration. During the migration, when the 2 CNI plugins are running in parallel. Cross-CNI traffic will be denied by ACLs generated for the network policy.

Version-release number of selected component (if applicable):

How reproducible:

Steps to reproduce:

1. Deploy a cluster with openshift-sdn

2. Create testpods in 2 different namespaces, z1 and z2.

3. In namespace z1, create a network policy that allows traffic from z2.

4. Trigger SDN live migration

5. Monitor the accessibility between the pods in Z1 and Z2.
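
A sketch of the step-3 policy (namespaces z1/z2 from the steps above; the kubernetes.io/metadata.name label is set automatically on namespaces):

cat << 'EOF' | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-z2
  namespace: z1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: z2
EOF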

Actual results:

When the pods in z1 and z2 on different nodes are using different CNI, the traffic is denied.

Expected results:

The traffic shall be allowed regardless of the CNI utilized by either pod.

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so, please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so, please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g., AWS, Azure, GCP, baremetal, etc) ? If so, please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What are the srcNode, srcIP, srcNamespace and srcPodName?
  • What are the dstNode, dstIP, dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, SOSR report, or other attachment, please provide the following details:
    • If the issue is in a customer namespace, then provide a namespace inspection.
    • If it is a connectivity issue:
      • What are the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What are the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem have happened, if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with "sbr-triaged.”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged.”
  • Do not set the priority; that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template.”
  • For guidance on using this template, please see
    OCPBUGS Template Training for Networking  components

Description of problem

The cluster-update-keys repository has some old Red Hat keys which are self-signed with SHA-1. The keys that we use have recently been re-signed with SHA256. We don't rely on the self-signing to establish trust in the keys (that trust is established by baking a ConfigMap manifest into release images, where it can be read by the cluster-version operator), but we do need to avoid spooking the key-loading library. Currently, Go-1.22-built CVOs in FIPS mode fail to bootstrap, like this aws-ovn-fips run's install artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar | tar -tvz | grep 'cluster-version.*log'
-rw-r--r-- core/core 54653 2024-06-12 09:13 log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-fips/1800906552731766784/artifacts/e2e-aws-ovn-fips/ipi-install-install/artifacts/log-bundle-20240612161314.tar | tar -xOz log-bundle-20240612161314/bootstrap/containers/cluster-version-operator-bd9f61984afa844dcd284f68006ffc9548377c045eff840096c74bcdcbe5cca3.log | grep GPG
I0612 16:06:15.952567       1 start.go:256] Failed to initialize from payload; shutting down: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set
E0612 16:06:15.952600       1 start.go:309] Collected payload initialization goroutine: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set

That's this code attempting to call ReadArmoredKeyRing (which fails with a currently-unlogged "openpgp: invalid data: user ID self-signature invalid: openpgp: invalid signature: RSA verification failure", complaining about the SHA-1 signature), and then falling back to ReadKeyRing, which fails on the reported "openpgp: invalid data: tag byte does not have MSB set".

To avoid these failures, we should:

  • Improve the library-go function, so we get both the ReadArmoredKeyRing error and the ReadKeyRing error back on load failures.
  • Update our keys in cluster-update-keys to ones with SHA256 or other still-acceptable digest algorithm.
  • Drop verifier-public-key-redhat-release-auxiliary, which we have versioned in cluster-update-keys despite no known users ever.

Version-Release number of selected component

Only 4.17 will use Go 1.22, so that's the only release that needs patching. But the changes would be fine to backport if we wanted.

How reproducible

100%.

Steps to Reproduce

1. Build the CVO with Go 1.22
2. Launch a FIPS cluster.

Actual results

Fails to bootstrap, with the bootstrap CVO complaining, as shown in the Description of problem section.

Expected results

Successful install

Description of problem:

    With the changes in https://github.com/openshift/machine-config-operator/pull/4425, RHEL worker nodes fail as follows:

[root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# systemctl --failed
  UNIT                  LOAD   ACTIVE SUB    DESCRIPTION                
● disable-mglru.service loaded failed failed Disables MGLRU on Openshfit

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
[root@ptalgulk-0807c-fq97t-w-a-l-rhel-1 cloud-user]# journalctl -u disable-mglru.service
-- Logs begin at Mon 2024-07-08 06:23:03 UTC, end at Mon 2024-07-08 08:31:35 UTC. --
Jul 08 06:23:14 localhost.localdomain systemd[1]: Starting Disables MGLRU on Openshfit...
Jul 08 06:23:14 localhost.localdomain bash[710]: /usr/bin/bash: /sys/kernel/mm/lru_gen/enabled: No such file or directory
Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Main process exited, code=exited, status=1/FAILURE
Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Failed with result 'exit-code'.
Jul 08 06:23:14 localhost.localdomain systemd[1]: Failed to start Disables MGLRU on Openshfit.
Jul 08 06:23:14 localhost.localdomain systemd[1]: disable-mglru.service: Consumed 4ms CPU time

We should only disable mglru if it exists.

Version-Release number of selected component (if applicable):

    4.16, 4.17

How reproducible:

    Attempt to bring up rhel worker node

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-42574. The following is the description of the original issue:

Description of problem:

On "VolumeSnapshot" list page, when project dropdown is "All Projects", click "Create VolumeSnapshot", the project "Undefined" is shown on project field.

    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-27-213503
4.18.0-0.nightly-2024-09-28-162600
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Go to "VolumeSnapshot" list page, set "All Projects" in project dropdown list.
    2.Click "Create VolumeSnapshot", check project field on the creation page.
    3.
    

Actual results:

2. The project is "Undefined"
    

Expected results:

2. The project should be "default".
    

Additional info:


    

This is a clone of issue OCPBUGS-42783. The following is the description of the original issue:

Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.

Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.

We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.

Version-Release number of selected component
OCP v4.14.37

How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific

Steps to Reproduce

  1. Run the following command from the HCP data plane
    oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
    
  2. Observe the command output, the resulting ImageStream object, and openshift-apiserver logs

Actual Results

error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway
imagestream.image.openshift.io/imagename imported with errors

Name:            imagename
Namespace:        mynamespace
Created:        Less than a second ago
Labels:            <none>
Annotations:        openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z
Image Repository:    default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename
Image Lookup:        local=false
Unique Images:        0
Tags:            1

v1.2.3
  tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3

  ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway
      Less than a second ago

error: imported completed with errors

Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error

Please review the following PR: https://github.com/openshift/cluster-api-provider-openstack/pull/315

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

We are in a live migration scenario.

If a project has a networkpolicy to allow from the host network (more concretely, to allow from the ingress controllers and the ingress controllers are in the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not migrated) and an already migrated application node.

I'll expand later in the description and internal comments, but the TL;DR is that the IPs of tun0 on non-migrated source nodes and the IPs of ovn-k8s-mp0 on migrated source nodes are not added to the address sets related to the network policy ACL on the target OVN-Kubernetes node, so that traffic is not allowed.

Version-Release number of selected component (if applicable):

4.16.13

How reproducible:

Always

Steps to Reproduce:

1. Before the migration: have a project with a network policy that allows traffic from the ingress controller, with the ingress controller in the host network (see the sketch after these steps). Everything must work properly at this point.

2. Start the migration

3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
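
As a sketch of the kind of policy described in step 1, following the commonly documented pattern for host-network ingress controllers (the namespace name is hypothetical and the labels may differ in the affected cluster):

oc label namespace default network.openshift.io/policy-group=ingress --overwrite
cat << 'EOF' | oc apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: my-project
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress
EOF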

Actual results:

The pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is on the same node as the ingress controller), which causes the ingress controller routes to return a 503 error.

Expected results:

The pod on the worker node should be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.

Additional info:

This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.

This is a customer issue. More details to be included in private comments for privacy.

Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/98

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Attempts to update a cluster to a release payload with a signature published by Red Hat fail with the CVO failing to verify the signature, signalled by the ReleaseAccepted=False condition:

Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat

CVO shows evidence of not being able to find the proper signature in its stores:

$ grep verifier-public-key-redhat cvo.log | head
I0610 07:38:16.208595       1 event.go:364] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.16.0-rc.4" image="quay.io/openshift-release-dev/ocp-release@sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74" failure=The update cannot be verified: unable to verify sha256:5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 against keyrings: verifier-public-key-redhat // [2024-06-10T07:38:16Z: prefix sha256-5e76f8c2cdc81fa40abb809ee5e2d56cb84f409aab773aa9b9c7e8ed8811bf74 in config map signatures-managed: no more signatures to check, 2024-06-10T07:38:16Z: ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature stores]
...

Version-Release number of selected component (if applicable):

4.16.0-rc.3
4.16.0-rc.4
4.17.0-ec.0

How reproducible:

Seems always. All CI build farm clusters showed this behavior when trying to update from 4.16.0-rc.3

Steps to Reproduce:

1. Launch update to a version with a signature published by RH

Actual results:

ReleaseAccepted=False and update is stuck

Expected results:

ReleaseAccepted=True and update proceeds

Additional info:

Suspected culprit is https://github.com/openshift/cluster-version-operator/pull/1030/ so the fix may be a revert or an attempt to fix-forward, but revert seems safer at this point.

Evidence:

  • #1030 was added in 4.16.0-rc.3
  • #1030 code is supposed to work with an updated ClusterVersion CRD field .spec.signatureStores which right now is in TechPreview, so it is not enabled by default
  • CVO log hints that it is trying to process the field but fails [1], and that somehow the signature store may be considered as an explicitly, intentionally empty array instead of the default set of signature locations
  • easily reproducible on any release that contains #1030
  • testing a cluster with a #1030 revert did not reproduce the bug; it successfully started updating to rc4

[1]

...ClusterVersion spec.signatureStores is an empty array. Unset signatureStores entirely if you want to enable the default signature store...
W0610 07:58:59.095970       1 warnings.go:70] unknown field "spec.signatureStores"
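
To check whether a cluster is in this state, a quick diagnostic sketch (empty output from the first command means the field is unset):

oc get clusterversion version -o jsonpath='{.spec.signatureStores}'
oc -n openshift-cluster-version logs deployment/cluster-version-operator | grep -i signatureStores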

Description of problem:

1. https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/Documentation/resources.md?plain=1#L50

monitoritoring-alertmanager-api-writer  should be monitoring-alertmanager-api-writer

2. https://github.com/openshift/cluster-monitoring-operator/blob/release-4.16/Documentation/resources.md?plain=1#L93

Port 9092 provides access the `/metrics` and `/federate` endpoints only. This port is for internal use, and no other usage is guaranteed.

should be

Port 9092 provides access to the `/metrics` and `/federate` endpoints only. This port is for internal use, and no other usage is guaranteed.

Version-Release number of selected component (if applicable):

4.16+

How reproducible:

always

Steps to Reproduce:

1. check Documentation/resources.md

Actual results:

errors in doc

The haproxy image here is currently amd64-only. This is preventing testing of arm NodePools on Azure on AKS. We should use a manifest-list version with both arm and amd.

Description of problem:

        Update Docs links for "Learn More" in Display Warning Policy Notification

Actual link: 

https://docs.redhat.com/en/documentation/openshift_container_platform/4.15/html/architecture/admission-plug-ins#admission-plug-ins-about_admission-plug-ins

 

Version-Release number of selected component (if applicable):

    

How reproducible:

Code: https://github.com/openshift/console/blob/master/frontend/public/components/utils/documentation.tsx#L88

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-35358. The following is the description of the original issue:

I'm working with the GitOps operator (1.7), and when there is a high number of CRs (38,000 Application objects in this case) the related InstallPlan gets stuck with the following error:

 

- lastTransitionTime: "2024-06-11T14:28:40Z"
    lastUpdateTime: "2024-06-11T14:29:42Z"
    message: 'error validating existing CRs against new CRD''s schema for "applications.argoproj.io":
      error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"argoproj.io",
      Version:"v1alpha1", Resource:"applications"}: the server was unable to return
      a response in the time allotted, but may still be processing the request' 

Even after waiting for a long time, the operator is unable to move forward, and does not remove or reinstall its components.

 

In a lab, the issue was not present until we started to add load to the cluster (applications.argoproj.io); once we hit 26,000 applications we were no longer able to upgrade or reinstall the operator.
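
A quick way to check how many Application CRs the cluster holds (the resources OLM has to list when validating the new CRD schema):

oc get applications.argoproj.io --all-namespaces --no-headers 2>/dev/null | wc -l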

 

This is a clone of issue OCPBUGS-42782. The following is the description of the original issue:

Description of problem:
The OpenShift Pipelines operator automatically installs an OpenShift console plugin. The console plugin metric reports it as unknown after the plugin was renamed from "pipeline-console-plugin" to "pipelines-console-plugin".

Version-Release number of selected component (if applicable):
4.14+

How reproducible:
Always

Steps to Reproduce:

  1. Install the OpenShift Pipelines operator with the plugin
  2. Navigate to Observe > Metrics
  3. Check the metrics console_plugins_info

Actual results:
It shows an "unknown" plugin in the metrics.

Expected results:
It should show a "pipelines" plugin in the metrics.

Additional info:
None

Currently, the assisted-service generated-code test does not ensure that modules are tidy. This results in untidy builds, which could potentially lead to build failures or redundant packages in builds.
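
A minimal sketch of the kind of tidiness gate that could run in CI (an assumed approach, not necessarily the repo's actual make target):

go mod tidy
git diff --exit-code go.mod go.sum || { echo "go.mod/go.sum are not tidy; run 'go mod tidy' and commit"; exit 1; }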

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/104

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus is flaking because sometimes the IngressController gets modified after the E2E test retrieves the state of the IngressController object; we then get this error when trying to apply changes:

=== NAME  TestAll/parallel/TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus
    router_status_test.go:134: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-route-selector-test": the object has been modified; please apply your changes to the latest version and try again

We should use updateIngressControllerSpecWithRetryOnConflict to repeatedly attempt to update the IngressController while refreshing the state.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

 

Steps to Reproduce:

1. Run TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus E2E test 

Actual results:

=== NAME TestAll/parallel/TestIngressControllerRouteSelectorUpdateShouldClearRouteStatus router_status_test.go:134: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-route-selector-test": the object has been modified; please apply your changes to the latest version and try again

Expected results:

Test should pass

Additional info:

Example Flake

Search.CI Link

This is a clone of issue OCPBUGS-39226. The following is the description of the original issue:

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible: Always

Repro Steps:

Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.

The linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options, which will be processed by dracut.
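
A sketch of such a MachineConfig (the role label and name are illustrative):

cat << 'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-bridge-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - bridge=br0:enpf0,enpf2
    - ip=br0:dhcp
EOF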

The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
 --------                ---------------------
  Host                    PCIe Co-processor
  eth0  <------->  enpf0  <br0>  enpf2  <--->  network
 --------                ---------------------
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.

Actual results:

ovs-configuration service fails.

Expected results:

ovs-configuration service passes with the bridge interface added to the ovs bridge.

Description of problem:

This issue is opened to track a known issue: https://github.com/openshift/console/pull/13677/files#r1567936852

Version-Release number of selected component (if applicable):

  4.16.0-0.nightly-2024-04-22-023835  

How reproducible:

Always    

Steps to Reproduce:

Actual results:

currently we are printing: 
console-telemetry-plugin: telemetry disabled - ignoring telemetry event: page 

Expected results:

    

Additional info:

    

Description of problem:

    When creating a Serverless Function via the Web Console from a Git repository, the validation claims that the builder strategy is not s2i. However, if the build strategy is not set in func.yaml, then s2i should be assumed implicitly. There should be no error.

There should be an error only if the strategy is explicitly set to something other than s2i in func.yaml.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try to create Serverless function from git repository where func.yaml does not explicitly specify builder.
    2. The Serverless Function cannot be created because of the validation.
 
    

Actual results:

    The Function cannot be created.

Expected results:

    The function can be created.

Additional info:

    

 

This is a clone of issue OCPBUGS-38111. The following is the description of the original issue:

Description of problem:

See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700

This is a clone of issue OCPBUGS-38132. The following is the description of the original issue:

The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.

This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC.

xref

Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071

RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638

PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273

 

Description of problem:

Collect number of resources in etcd with must-gather

 

Version-Release number of selected component (if applicable):

4.14, 4.15, 4.16, 4.17    

 

Actual results:

The number of resources in etcd is not available in the must-gather

Expected results:

The number of resources in etcd should be available in the must-gather

 

Additional info: RFE-5765

PR for 4.17 [1]

 

 [1] https://github.com/openshift/must-gather/pull/453
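
For reference, one way to approximate per-resource object counts from an etcd member today (the pod name is a placeholder; the must-gather implementation in [1] may differ):

oc -n openshift-etcd exec etcd-<control-plane-node> -c etcdctl -- \
  etcdctl get /kubernetes.io --prefix --keys-only \
  | awk -F/ 'NF>2 {print $3}' | sort | uniq -c | sort -rn | head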

Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/346

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

The docs [1] on how to remove a cluster without it being destroyed for MCE don't account for clusters installed with the infrastructure operator and late-binding.

If these steps are followed the hosts will be booted back into the discovery ISO, effectively destroying the cluster.

How reproducible:

Not sure, just opening this after a conversation with a customer and a check in the docs.

Steps to reproduce:

1. Deploy a cluster using late-binding and BMHs

2. Follow the referenced doc to "detach" a cluster from management

Actual results:

ClusterDeployment is removed and agents are unbound, rebooting them back into the discovery ISO.

Expected results:

"detaching" a cluster should not reboot the agents into the infraenv and should delete them instead even if late binding was used.
Alternatively the docs should be updated to ensure late-binding users don't accidentally destroy their clusters.

[1] https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#remove-a-cluster-by-using-the-cli

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/116

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/314

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38549. The following is the description of the original issue:

In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.

cc Ali Mobrem 

Description of problem:

    The etcd data store is left over in the <install dir>/.clusterapi_output/etcd dir when infrastructure provisioning fails. This takes up a lot of local storage space and is useless.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Start an install
    2. introduce an infra provisioning failure -- I did this by editing the 02_infra-cluster.yaml manifest to point to a non-existent region
    3. Check .clusterapi_output dir for etcd dir
    

Actual results:

    etcd data store remains

Expected results:

    etcd data store should be deleted during infra provisioning failures. It should only be persisted to disk if there is a failure/interrupt in between infrastructure provisioning and bootstrap destroy, in which case it can be used in conjunction with the wait-for and destroy bootstrap commands.

Additional info:

    

Description of problem:


https://github.com/prometheus/prometheus/pull/14446 is a fix for https://github.com/prometheus/prometheus/issues/14087 (see there for details)

This was introduced in Prom 2.51.0 https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/deps-versions.md

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

The code introduced by https://github.com/openshift/hypershift/pull/4354 is potentially disruptive for legitimate clusters. This needs to be dropped before releasing. This is a blocker, but I don't happen to be able to set prio: blocker any more.

Description of problem:

The console frontend does not respect the basePath when loading locales    

Version-Release number of selected component (if applicable):

4.17.0    

How reproducible:

    Always

Steps to Reproduce:

    1. Launch ./bin/bridge --base-path=/okd/
    2. Open browser console and observe 404 errors
    

Actual results:

There are 404 errors    

Expected results:

   There are no 404 errors

Additional info:

Copied from https://github.com/openshift/console/issues/12671

Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1688

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/437

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

A provisioning object such as:

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningInterface: enp175s0f1
  provisioningNetwork: Disabled

Would cause Metal3 to not function properly; errors such as:

waiting for IP to be configured ens175s0f1

would be seen in the metal-ironic container logs.

A workaround is to delete all related provisioning fields, i.e.:

    provisioningDHCPRange: 172.22.0.10,172.22.7.254
    provisioningIP: 172.22.0.3
    provisioningInterface: enp175s0f1

If the provisioning network is disabled all related provisioning options should be ignored.
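
One possible way to apply the workaround above (a sketch; verify the fields present in your Provisioning spec first, since a merge patch with null removes the listed keys):

oc patch provisioning provisioning-configuration --type=merge \
  -p '{"spec":{"provisioningDHCPRange":null,"provisioningIP":null,"provisioningInterface":null}}'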

 

 

Remove duplicate code in storage module

 

Code to remove: https://github.com/rhamilto/console/commit/73ef85da64b4fc310e1b72f8237805a98161d96d#diff-35ffecb511f64d24da7035a73455d9dce171deb791d8d5433d7c7d61f13082afR122

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819

4.18 Micro upgrade failures began with the initial payload  4.18.0-0.ci-2024-08-09-234503

CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346

The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518

Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/83

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During OpenShift cluster installation, the 4.16 OpenShift installer, which uses the Terraform module, is unable to create tags for the security groups associated with master/worker nodes, since the tag is in key-value format (i.e. key=value).

Error log for reference:
level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: faile                    d to create security groups: failed to tag the Control plane security group: Resource not found: [PUT
https://example.cloud:443/v2.0/security-groups/sg-id/tags/openshiftClusterID=ocpclientprod2-vwgsc]    

Version-Release number of selected component (if applicable):

4.16.0    

How reproducible:

100%    

Steps to Reproduce:

    1. Create install-config
    2. run the 4.16 installer
    3. Observe the installation logs
    

Actual results:

installation fails to tag the security group    

Expected results:

installation to be successful    

Additional info:

    

Description of problem:

- Pods managed by DaemonSets are being evicted.
- This is causing some pods of OCP components, such as CSI drivers (and possibly more), to be evicted before the application pods, causing those application pods to go into an Error status (because the CSI pod cannot tear down the volumes).
- As application pods remain in Error status, the drain operation also fails after the maxPodGracePeriod

Version-Release number of selected component (if applicable):

- 4.11

How reproducible:

- Wait for a new scale-down event

Steps to Reproduce:

1. Wait for a new scale-down event
2. Monitor CSI pods (or dns, or ingress...); you will notice that they are evicted, and since they come from DaemonSets, they are scheduled again as new pods.
3. More evidence can be found in the kube-apiserver audit logs (see the sketch after these steps).
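
A sketch for spotting autoscaler-driven evictions in the audit logs (the jq filter is illustrative):

oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | jq -cR 'fromjson? | select(.objectRef.subresource=="eviction" and .user.username=="system:serviceaccount:openshift-machine-api:cluster-autoscaler") | {ns: .objectRef.namespace, pod: .objectRef.name, time: .requestReceivedTimestamp}'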

Actual results:

- From audit logs we can see that pods are evicted by the clusterautoscaler

  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "ec999193-2c94-4710-a8c7-ff9460e30f70",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/openshift-cluster-csi-drivers/pods/aws-efs-csi-driver-node-2l2xn/eviction",
  "verb": "create",
  "user": {
    "username": "system:serviceaccount:openshift-machine-api:cluster-autoscaler",
    "uid": "44aa427b-58a4-438a-b56e-197b88aeb85d",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:openshift-machine-api",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "cluster-autoscaler-default-5d4c54c54f-dx59s"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "d57837b1-3941-48da-afeb-179141d7f265"
      ]
    }
  },
  "sourceIPs": [
    "10.0.210.157"
  ],
  "userAgent": "cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "pods",
    "namespace": "openshift-cluster-csi-drivers",
    "name": "aws-efs-csi-driver-node-2l2xn",
    "apiVersion": "v1",
    "subresource": "eviction"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Success",
    "code": 201

## Even if they come from a daemonset
$ oc get ds -n openshift-cluster-csi-drivers
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
aws-ebs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   146m
aws-efs-csi-driver-node   8         8         8       8            8           kubernetes.io/os=linux   127m

Expected results:

DaemonSet Pods should not be evicted

Additional info:

 

In our hypershift test, we see the openshift-controller-manager undoing the work of our controllers to set an imagePullSecrets entry on our ServiceAccounts.  The result is a rapid updating of ServiceAccounts as the controllers fight.

This started happening after https://github.com/openshift/openshift-controller-manager/pull/305

Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/35

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


4.16 installs fail for ROSA STS installations

time="2024-06-11T14:05:48Z" level=debug msg="\t[failed to apply security groups to load balancer \"jamesh-sts-52g29-int\": AccessDenied: User: arn:aws:sts::476950216884:assumed-role/ManagedOpenShift-Installer-Role/1718114695748673685 is not authorized to perform: elasticloadbalancing:SetSecurityGroups on resource: arn:aws:elasticloadbalancing:us-east-1:476950216884:loadbalancer/net/jamesh-sts-52g29-int/bf7ef748daa739ce because no identity-based policy allows the elasticloadbalancing:SetSecurityGroups action"

Version-Release number of selected component (if applicable):


4.16+

How reproducible:


Every time

Steps to Reproduce:

1. Create an installer policy with the permissions listed in the installer here: https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go
2. Run an install in AWS IPI

Actual results:


The installer fails to install a cluster in AWS

The installer log should show AccessDenied messages for the IAM action elasticloadbalancing:SetSecurityGroups 

The installer should show the error message "failed to apply security groups to load balancer"

Expected results:


Install completes successfully

Additional info:


Managed OpenShift (ROSA) installs STS clusters with this permission policy for the installer: https://github.com/openshift/managed-cluster-config/blob/master/resources/sts/4.16/sts_installer_permission_policy.json, which should be what is required by the installer policy (https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/permissions.go) plus permissions needed for OCM to do pre-install validation.
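
For illustration, the kind of additional IAM statement the installer role needs for this action (a sketch; scope the Resource as appropriate for your accounts):

cat > elb-set-security-groups.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["elasticloadbalancing:SetSecurityGroups"],
      "Resource": "*"
    }
  ]
}
EOF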

This is a clone of issue OCPBUGS-38479. The following is the description of the original issue:

Description of problem:

When using an installer with an amd64 payload, configuring the machines to use aarch64 (arm64) is possible through the install-config.yaml:

additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

However, the installation will fail with ambiguous error messages:

ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused

The actual error hides in the bootstrap VM's System Log:

Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17

SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)

SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)

SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)

ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac

Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)

Ignition: user-provided config was applied

Ignition: warning at $.kernelArguments: Unused key kernelArguments



Release image arch amd64 does not match host arch arm64

ip-10-29-3-15 login: [   89.141099] Warning: Unmaintained driver is detected: nft_compat

    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

Use amd64 installer to install a cluster with aarch64 nodes
    

Steps to Reproduce:

    1. download amd64 installer
    2. generate the install-config.yaml
    3. edit install-config.yaml to use aarch64 nodes
    4. invoke the installer
    

Actual results:

installation timed out after ~30mins
    

Expected results:

installation failed immediately with proper error message indicating the installation is not possible
    

Additional info:

https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
    

This is a clone of issue OCPBUGS-39402. The following is the description of the original issue:

There is a typo here: https://github.com/openshift/installer/blob/release-4.18/upi/openstack/security-groups.yaml#L370

It should be os_subnet6_range.

That task is only run if os_master_schedulable is defined and greater than 0 in the inventory.yaml

Tracker issue for bootimage bump in 4.17. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-36318.

This is a clone of issue OCPBUGS-41932. The following is the description of the original issue:

Description of problem:

When the Insights Operator is disabled (as described in the docs here or here), the RemoteConfigurationAvailable and RemoteConfigurationValid clusteroperator conditions are reporting the previous (before disabling the gathering) state (which might be Available=True and Valid=True).

 
Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Disable the data gathering in the Insights Operator followings the docs links above
    2. Watch the clusteroperator conditions with "oc get co insights -o json | jq .status.conditions"
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance%22%7D%5D%7D

Tests failing:
hosted cluster version rollout succeeds

Tanked on May 25th (Sat) but may have come in late Friday the 24th.

Failure message:

{hosted cluster version rollout never completed  
      
error: hosted cluster version rollout never completed, dumping relevant hosted cluster condition messages
Degraded: [capi-provider deployment has 1 unavailable replicas, kube-apiserver deployment has 1 unavailable replicas]
ClusterVersionSucceeding: Condition not found in the CVO.
      
    }

Looks to be broadly failing hypershift presubmits all over as well:

https://search.dptools.openshift.org/?search=capi-provider+deployment+has+1+unavailable+replicas&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Description of problem:

When a user tries to run the oc-mirror delete command with --generate after a (M2D + D2M) run, it fails with the error below
2024/08/02 12:18:03  [ERROR]  : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
2024/08/02 12:18:03  [ERROR]  : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
2024/08/02 12:18:03  [ERROR]  :  pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
    

Version-Release number of selected component (if applicable):

    WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407302009.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-31T00:37:18Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

     Always
    

Steps to Reproduce:

    1. Download latest oc-mirror binary
    2. Use the ImageSetConfig below and perform (M2D + D2M)
    3. oc-mirror -c config.yaml file://CLID-136 --v2
    4. oc-mirror -c config.yaml --from file://CLID-136 --v2 docker://localhost:5000 --dest-tls-verify=false
    5. Now create deleteImageSetConfig as shown below and run delete command with --generate
     6. oc-mirror delete -c delete-config.yaml --generate --workspace file://CLID-136-delete docker://localhost:5000 --v2
    

Actual results:

    Below errors are seen
    2024/08/02 12:18:03  [ERROR]  : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
2024/08/02 12:18:03  [ERROR]  : [OperatorImageCollector] pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
2024/08/02 12:18:03  [ERROR]  :  pinging container registry localhost:55000: Get "https://localhost:55000/v2/": http: server gave HTTP response to HTTPS client
    

Expected results:

    No errors should be seen
    

Additional info:

    This error is resolved upon using  --src-tls-verify=false with the oc-mirror delete --generate command
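For example, based on the delete command in step 6, the working invocation looks like:

oc-mirror delete -c delete-config.yaml --generate --workspace file://CLID-136-delete docker://localhost:5000 --v2 --src-tls-verify=false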
   More details in the slack thread here https://redhat-internal.slack.com/archives/C050P27C71S/p1722601331671649?thread_ts=1722597021.825099&cid=C050P27C71S
    

This is a clone of issue OCPBUGS-37819. The following is the description of the original issue:

Description of problem:

    When we added the new bundle metadata encoding `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+), we created situations where
- konflux-onboarded operators, encouraged to use upstream:latest to generate FBC from templates, could generate the new format; and
- IIB-generated catalog images, which used earlier opm versions to serve content, could not serve it.

One only has to `opm render` an SQLite catalog image, or expand a catalog template.
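For illustration (the image reference is a placeholder), rendering an SQLite-based catalog with a recent opm and checking for the new schema:

opm render <sqlite-based-catalog-image> -o yaml | grep olm.csv.metadata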

 

 

Version-Release number of selected component (if applicable):

    

How reproducible:

every time    

Steps to Reproduce:

    1. opm render an SQLite catalog image
    2.
    3.
    

Actual results:

    uses `olm.csv.metadata` in the output

Expected results:

    only using `olm.bundle.object` in the output

Additional info:

    

Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/83

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the OpenShift web console, on the Dashboards tab, data is not getting loaded for the "Prometheus/Overview" dashboard

Version-Release number of selected component (if applicable):

4.16.0-ec.5

How reproducible:

OCP 4.16.0-ec.5 cluster deployed on Power using UPI installer    

Steps to Reproduce:

1. Deploy 4.16.0-ec.5 cluster using UPI installer
2. Login to web console  
3. Select "Dashboards" panel under "Observe" tab
4. Select "Prometheus/Overview" from the "Dashboard" drop down
    

Actual results:

Data/graphs are not getting loaded. "No datapoints found." message is being displayed in all panels    

Expected results:

Data/Graphs should be displayed

Additional info:

Screenshots and must-gather.log are available at 
https://drive.google.com/drive/folders/1XnotzYBC_UDN97j_LNVygwrc77Tmmbtx?usp=drive_link     


Status of Prometheus pods:

[root@ha-416-sajam-bastion-0 ~]# oc get pods -n openshift-monitoring | grep prometheus
prometheus-adapter-dc7f96748-mczvq                       1/1     Running   0          3h18m
prometheus-adapter-dc7f96748-vl4n8                       1/1     Running   0          3h18m
prometheus-k8s-0                                         6/6     Running   0          7d2h
prometheus-k8s-1                                         6/6     Running   0          7d2h
prometheus-operator-677d4c87bd-8prnx                     2/2     Running   0          7d2h
prometheus-operator-admission-webhook-54549595bb-gp9bw   1/1     Running   0          7d3h
prometheus-operator-admission-webhook-54549595bb-lsb2p   1/1     Running   0          7d3h
[root@ha-416-sajam-bastion-0 ~]#

Logs of Prometheus pods are available at https://drive.google.com/drive/folders/13DhLsQYneYpouuSsxYJ4VFhVrdJfQx8P?usp=drive_link  

Please review the following PR: https://github.com/openshift/images/pull/186

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The MCO currently lays down a file at /etc/mco/internal-registry-pull-secret.json, which is extracted from the machine-os-puller SA into ControllerConfig. It is then templated down to a MachineConfig. For some reason, this SA is now being refreshed every hour or so, causing a new MachineConfig to be generated every hour. This also causes CI issues as the machineconfigpools will randomly update to a new config in the middle of a test.

More context: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1715888365021729
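The churn can be observed by listing MachineConfigs sorted by creation time, for example:

oc get machineconfigs --sort-by=.metadata.creationTimestamp | tail -n 5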

Please review the following PR: https://github.com/openshift/route-controller-manager/pull/42

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

If there was no DHCP network name, then the destroy code would skip deleting the DHCP resource. Now we add a check for whether the VM backing the DHCP resource is in the ERROR state and, if so, delete it.
    

This was necessary for kuryr, which we no longer support. We should therefore stop creating machines with trunking enabled.

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/63

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Large portions of the code for setting the kargs in the agent ISO are duplicated from the implementation in assisted-image-service. By implementing an API in a-i-s comparable to the one used for ignition images in MULTIARCH-2678, we can eliminate this duplication and ensure that there is only one library that needs to be updated if anything changes in how kargs are set.

Description of problem:

For the fix of OCPBUGS-29494, only the hosted cluster was fixed, and changes to the node pool were ignored. The node pool encountered the following error:

    - lastTransitionTime: "2024-05-31T09:11:40Z"
      message: 'failed to check if we manage haproxy ignition config: failed to look
        up image metadata for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51:
        failed to obtain root manifest for registry.ci.openshift.org/ocp/4.14-2024-05-29-171450@sha256:9b88c6e3f7802b06e5de7cd3300aaf768e85d785d0847a70b35857e6d1000d51:
        unauthorized: authentication required'
      observedGeneration: 1
      reason: ValidationFailed
      status: "False"
      type: ValidMachineConfig

Version-Release number of selected component (if applicable):

    4.14, 4.15, 4.16, 4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Try to deploy a HostedCluster in a disconnected environment without explicitly setting the hypershift.openshift.io/control-plane-operator-image annotation.
    2.
    3.

Expected results:

Without setting the hypershift.openshift.io/control-plane-operator-image annotation, the NodePool can become ready.

This is a clone of issue OCPBUGS-41785. The following is the description of the original issue:

Context: https://redhat-internal.slack.com/archives/CH98TDJUD/p1682969691044039?thread_ts=1682946070.139719&cid=CH98TDJUD

If a Neutron network MTU is too small, br-ex will be set to 1280 anyway, which might be problematic if the Neutron MTU is smaller than that. We should have some validation in the installer to prevent this.

We need an MTU of 1380 at least, where 1280 is the minimum allowed for IPv6 + 100 for the OVN-Kubernetes encapsulation overhead.
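As a rough illustration of the check (the network name is a placeholder), something like the following could be validated before installing:

# Read the Neutron network MTU and compare it against the 1380 minimum described above
MTU=$(openstack network show <machines-network> -f value -c mtu)
if [ "$MTU" -lt 1380 ]; then
  echo "Neutron MTU ${MTU} is too small: need at least 1380 (1280 IPv6 minimum + 100 OVN-Kubernetes overhead)"
fi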

Please review the following PR: https://github.com/openshift/node_exporter/pull/147

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/kubernetes/pull/1976

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-36196. The following is the description of the original issue:

Description of problem:

Launch a CAPI-based installation on Azure Government Cloud; the installer timed out while waiting for the network infrastructure to become ready.

06-26 09:08:41.153  level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready...
...
06-26 09:09:33.455  level=debug msg=E0625 21:09:31.992170   22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=<
06-26 09:09:33.455  level=debug msg=	failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	RESPONSE 404: 404 Not Found
06-26 09:09:33.456  level=debug msg=	ERROR CODE: SubscriptionNotFound
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	{
06-26 09:09:33.456  level=debug msg=	  "error": {
06-26 09:09:33.456  level=debug msg=	    "code": "SubscriptionNotFound",
06-26 09:09:33.456  level=debug msg=	    "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
06-26 09:09:33.456  level=debug msg=	  }
06-26 09:09:33.456  level=debug msg=	}
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	. Object will not be requeued
06-26 09:09:33.456  level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl"
06-26 09:09:33.457  level=debug msg=I0625 21:09:31.992215   22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"SubscriptionNotFound\",\n    \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n  }\n}\n--------------------------------------------------------------------------------\n. Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError"
06-26 09:17:40.081  level=debug msg=I0625 21:17:36.066522   22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84"
06-26 09:23:46.611  level=debug msg=Collecting applied cluster api manifests...
06-26 09:23:46.611  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
06-26 09:23:46.611  level=info msg=Shutting down local Cluster API control plane...
06-26 09:23:46.612  level=info msg=Stopped controller: Cluster API
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azure exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azure infrastructure provider
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azureaso infrastructure provider
06-26 09:23:46.612  level=info msg=Local Cluster API system has completed operations
06-26 09:23:46.612  [ERROR] Installation failed with error code '4'. Aborting execution.

From above log, Azure Resource Management API endpoint is not correct, endpoint "management.azure.com" is for Azure Public cloud, the expected one for Azure Government should be "management.usgovcloudapi.net".
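Which ARM endpoint the installer actually hit can be confirmed from the install log, for example:

grep -o 'https://management\.[a-z.]*' .openshift_install.log | sort -u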

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-06-23-145410

How reproducible:

    Always

Steps to Reproduce:

    1. Install cluster on Azure Government Cloud, capi-based installation 
    2.
    3.
    

Actual results:

    Installation failed because of the wrong Azure Resource Management API endpoint used.

Expected results:

    Installation succeeded.

Additional info:

    

This is a clone of issue OCPBUGS-39126. The following is the description of the original issue:

Description of problem:


Difficult to tell which component this bug should be reported against. The description is as follows.

Today we can install RH operators either into one specific namespace or into all namespaces, which installs the operator in the "openshift-operators" namespace.

If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, it will have token authentication and security configured in its definition.

But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that will try to scrape it, since that namespace does not have the corresponding label to be scraped by platform monitoring, and we don't want it to have that label because community operators can also be installed there.

The result is that user workload monitoring will scrape this namespace and the ServiceMonitors will be skipped, since they are configured with security against platform monitoring and UWM will not handle this.

A possible workaround is to do:

oc label namespace openshift-operators openshift.io/user-monitoring=false

but this loses functionality, since some RH operators will not be monitored if they are installed in openshift-operators.
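The label applied by the workaround can be checked with:

oc get namespace openshift-operators --show-labels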



    

Version-Release number of selected component (if applicable):

 4.16

    

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/323

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the Safari browser, when creating an app with either the pipeline or the build option, the topology view shows the build status in the left-hand corner of the topology rather than in its usual position (more details are in the screenshot and video)
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. Create an app 
    2. Go to topology 
    3.
    

Actual results:

The UI is distorted: build labels are not in the appropriate position
    

Expected results:

UI should show labels properly
    

Additional info:

Safari 17.4.1
    

Description of problem:

A breaking API change (Catalog -> ClusterCatalog) is blocking downstreaming of operator-framework/catalogd and operator-framework/operator-controller       

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

Downstreaming script fails.
https://prow.ci.openshift.org/?job=periodic-auto-olm-v1-downstreaming      

Actual results:

Downstreaming fails.    

Expected results:

Downstreaming succeeds.    

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/281

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/413

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Need to add the networking-console-plugin image to the OCP 4.17 release payload, so it can be consumed in the hosted CNO.

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

done at: https://github.com/openshift/cluster-network-operator/pull/2474

This is a clone of issue OCPBUGS-32773. The following is the description of the original issue:

Description of problem:

In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared.

This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. 


Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:


https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)


It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared.


This bug prevents users from successfully creating instances from templates in the WebConsole.

Version-Release number of selected component (if applicable):

4.15 4.14 

How reproducible:

YES

Steps to Reproduce:

1. Log in with a non-administrator account.
2. Select a template from the developer catalog and click on Instantiate Template.
3. Enter values into the initially empty form.
4. Wait for several seconds, and the entered values will disappear.

Actual results:

The entered values disappear

Expected results:

The entered values remain displayed in the form

Additional info:

I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.

Description of problem:

While investigating a problem with OpenShift Container Platform 4 - Node scaling, I found the below messages reported in my OpenShift Container Platform 4 - Cluster.

E0513 11:15:09.331353       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076       1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

The same events are reported in must-gather reviewed from customers. Given that we have https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 that appear to be solved via https://github.com/kubernetes/autoscaler/pull/6677 and https://github.com/kubernetes/autoscaler/pull/6038 I'm wondering whether we should pull in those changes as they seem to eventually impact automated scaling of OpenShift Container Platform 4 - Node(s).

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15

How reproducible:

Always

Steps to Reproduce:

1. Setup OpenShift Container Platform 4 with ClusterAutoscaler configured
2. Trigger scaling activity and verify the cluster-autoscaler-default logs
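For example, the autoscaler logs referenced in step 2 can be gathered with (deployment name as referenced above; the namespace is assumed to be openshift-machine-api):

oc -n openshift-machine-api logs deployment/cluster-autoscaler-default | grep orchestrator.go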

Actual results:

Logs like the below are being reported.

E0513 11:15:09.331353       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076       1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

Expected results:

Scale-up of OpenShift Container Platform 4 - Node to happen without error being reported

I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

Additional info:

Please review https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 as they seem to document the problem and also have a solution linked/merged

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/217

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/telemeter/pull/533

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The image registry operator and ingress operator use the `/metrics` endpoint for liveness/readiness probes which in the case of the former results in a payload of ~100kb. This at scale can be non-performant and is also not best practice. The teams which own these operators should instead introduce health endpoints if these probes are needed. 
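For illustration only (the deployment name, namespace, port, and path are placeholders, and the real fix belongs in the operators' own manifests), pointing probes at a dedicated health endpoint instead of /metrics could look like:

oc set probe deployment/<operator-deployment> -n <operator-namespace> --liveness --readiness --get-url=http://:8080/healthz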

Please review the following PR: https://github.com/openshift/images/pull/182

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38272. The following is the description of the original issue:

Description of problem:

When a user changes the Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (a Secret named vsphere-csi-config-secret), but the controller pods are not restarted and keep using the old config.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly *after* 2024-08-09-031511

How reproducible: always

Steps to Reproduce:

  1. Enable TechPreviewNoUpgrade
  2. Add a new vCenter to the infrastructure. It can be the same one as the existing one - we just need to trigger "disable CSI migration when there are 2 or more vCenters"
  3. See that vsphere-csi-config-secret changed and has `migration-datastore-url =` (i.e. empty string value)

Actual results: the controller pods are not restarted

Expected results: the controller pods are  restarted

This is a clone of issue OCPBUGS-39246. The following is the description of the original issue:

Description of problem:

    Alerts with non-standard severity labels are sent to Telemeter.

Version-Release number of selected component (if applicable):

    All supported versions

How reproducible:

    Always

Steps to Reproduce:

    1. Create an always firing alerting rule with severity=foo.
    2. Make sure that telemetry is enabled for the cluster.
    3.
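For step 1, a minimal always-firing rule with a non-standard severity might look like the following sketch (names are illustrative, and it assumes the rule is created in a namespace evaluated by the in-cluster monitoring stack):

cat << EOF | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-nonstandard-severity
  namespace: openshift-monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: AlwaysFiringExample
      # vector(1) always returns a sample, so the alert fires immediately
      expr: vector(1)
      labels:
        severity: foo
EOF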
    

Actual results:

    The alert can be seen on the telemeter server side.

Expected results:

    The alert is dropped by the telemeter allow-list.

Additional info:

Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide
Looking at the current data, it looks like ~2% of the alerts reported to Telemter have an invalid severity.

AC:

  • update the Getting Started banner with the following items and link to them:
    • Impersonation QuickStart - add above 'Monitor your sample application'
    • Spanish and French language - add as second option to 'Explore new features and capabilities' that will be overridden by Lightspeed once available

Please review the following PR: https://github.com/openshift/image-registry/pull/399

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  When creating a HostedCluster that requires a request serving node larger than the ones for which we have placeholders (say a >93 node cluster in ROSA), in some cases the cluster creation does not succeed.  

Version-Release number of selected component (if applicable):

HyperShift operator 0f9f686

How reproducible:

    Sometimes

Steps to Reproduce:

    1. In request serving scaling management cluster, create a cluster with a size that is greater than that for which we have placeholders.
    2. Wait for the hosted cluster to schedule and run
    

Actual results:

    The hosted cluster never schedules

Expected results:

    The hosted cluster scales and comes up

Additional info:

    It is more likely that this occurs when there are many existing hosted clusters on the management cluster already. 

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/67

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-42196. The following is the description of the original issue:

Description of problem:

    When setting .spec.storage.azure.networkAccess.type: Internal (without providing vnet and subnet names), the image registry will attempt to discover the vnet by tag. 

Previous to the installer switching to cluster-api, the vnet tagging happened here: https://github.com/openshift/installer/blob/10951c555dec2f156fad77ef43b9fb0824520015/pkg/asset/cluster/azure/azure.go#L79-L92.

After the switch to cluster-api, this code no longer seems to be in use, so the tags are no longer there.

From inspection of a failed job, the new tags in use seem to be in the form of `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID` instead of the previous `kubernetes.io_cluster.$infraID`.

Image registry operator code responsible for this: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L678-L682

More details in slack discussion with installer team: https://redhat-internal.slack.com/archives/C68TNFWA2/p1726732108990319

Version-Release number of selected component (if applicable):

    4.17, 4.18

How reproducible:

    Always

Steps to Reproduce:

    1. Get an Azure 4.17 or 4.18 cluster
    2. oc edit configs.imageregistry/cluster
    3. set .spec.storage.azure.networkAccess.type to Internal  

Actual results:

    The operator cannot find the vnet (look for "not found" in operator logs)

Expected results:

    The operator should be able to find the vnet by tag and configure the storage account as private

Additional info:

If we make the switch to look for vnet tagged with `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID`, one thing that needs to be tested is BYO vnet/subnet clusters. What I have currently observed in CI is that the cluster has the new tag key with `owned` value, but for BYO networks the value *should* be `shared`, but I have not tested it.
---

Although this bug is a regression, I'm not going to mark it as such because this affects a fairly new feature (introduced on 4.15), and there's a very easy workaround (manually setting the vnet and subnet names when configuring network access to internal).
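As a sketch of that workaround (the values are placeholders, and the exact field names under spec.storage.azure.networkAccess.internal are an assumption here and should be checked against the configs.imageregistry API):

oc patch configs.imageregistry/cluster --type=merge -p '{"spec":{"storage":{"azure":{"networkAccess":{"type":"Internal","internal":{"networkResourceGroupName":"<network-rg>","vnetName":"<vnet-name>","subnetName":"<subnet-name>"}}}}}}'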

 

This is a clone of issue OCPBUGS-39375. The following is the description of the original issue:

Description of problem:

Given 2 images with different names, but same layers, "oc image mirror" will only mirror 1 of them. For example:

$ cat images.txt
quay.io/openshift/community-e2e-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
quay.io/openshift/community-e2e-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS

$ oc image mirror -f images.txt
quay.io/
  bertinatto/test-images
    manifests:
      sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 -> e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
  stats: shared=0 unique=0 size=0B

phase 0:
  quay.io bertinatto/test-images blobs=0 mounts=0 manifests=1 shared=0

info: Planning completed in 2.6s
sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
info: Mirroring completed in 240ms (0B/s)    

Version-Release number of selected component (if applicable):

4.18    

How reproducible:

Always    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Only one of the images was mirrored.

Expected results:

Both images should be mirrored.     

Additional info:

    

Description of problem:

    Pending CSRs in a PowerVS HyperShift cluster cause the monitoring ClusterOperator to never become available. Upon investigation, the providerID format set by CAPI is incorrect.

Version-Release number of selected component (if applicable):

4.17.0    

How reproducible:

   100% 

Steps to Reproduce:

    1.Create a cluster with hypershift on powervs with 4.17 release image.
    2.
    3.
    

Actual results:

Pending CSRs and monitoring CO in unavailable state    

Expected results:

No pending CSRs and all CO should be available  

Additional info:

    

Description of problem:

Sometimes a DNS name configured in an EgressFirewall is not resolved
    

Version-Release number of selected component (if applicable):

Using the build from openshift/cluster-network-operator#2131

How reproducible:

    

Steps to Reproduce:

  

    % for i in {1..7};do oc create ns test$i;oc create -f  data/egressfirewall/eg_policy_wildcard.yaml -n test$i; oc create -f data/list-for-pod.json -n test$i;sleep 1;done
    namespace/test1 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test2 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test3 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test4 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test5 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test6 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
    namespace/test7 created
    egressfirewall.k8s.ovn.org/default created
    replicationcontroller/test-rc created
    service/test-service created
     
    % cat data/egressfirewall/eg_policy_wildcard.yaml
    kind: EgressFirewall
    apiVersion: k8s.ovn.org/v1
    metadata:
      name: default
    spec:
      egress:
      - type: Allow
        to:
          dnsName: "*.google.com" 
      - type: Deny 
        to:
          cidrSelector: 0.0.0.0/0
     
     
    Then I created namespace test8, created an egressfirewall and updated the DNS name; it worked well. Then I deleted test8.
     
    After that I created namespace test11 with the steps below, and the issue happened again.
     % oc create ns test11
    namespace/test11 created
    % oc create -f data/list-for-pod.json -n test11
    replicationcontroller/test-rc created
    service/test-service created
    % oc create -f data/egressfirewall/eg_policy_dnsname1.yaml -n test11
    egressfirewall.k8s.ovn.org/default created
    % oc get egressfirewall -n test11
    NAME      EGRESSFIREWALL STATUS
    default   EgressFirewall Rules applied
     % oc get egressfirewall -n test11 -o yaml
    apiVersion: v1
    items:
    - apiVersion: k8s.ovn.org/v1
      kind: EgressFirewall
      metadata:
        creationTimestamp: "2024-05-16T05:32:07Z"
        generation: 1
        name: default
        namespace: test11
        resourceVersion: "101288"
        uid: 18e60759-48bf-4337-ac06-2e3252f1223a
      spec:
        egress:
        - to:
            dnsName: registry-1.docker.io
          type: Allow
        - ports:
          - port: 80
            protocol: TCP
          to:
            dnsName: www.facebook.com
          type: Allow
        - to:
            cidrSelector: 0.0.0.0/0
          type: Deny
      status:
        messages:
        - 'hrw-0516i-d884f-worker-a-m7769: EgressFirewall Rules applied'
        - 'hrw-0516i-d884f-master-0.us-central1-b.c.openshift-qe.internal: EgressFirewall
          Rules applied'
        - 'hrw-0516i-d884f-worker-b-q4fsm: EgressFirewall Rules applied'
        - 'hrw-0516i-d884f-master-1.us-central1-c.c.openshift-qe.internal: EgressFirewall
          Rules applied'
        - 'hrw-0516i-d884f-master-2.us-central1-f.c.openshift-qe.internal: EgressFirewall
          Rules applied'
        - 'hrw-0516i-d884f-worker-c-4kvgr: EgressFirewall Rules applied'
        status: EgressFirewall Rules applied
    kind: List
    metadata:
      resourceVersion: ""
     % oc get pods -n test11                  
    NAME            READY   STATUS    RESTARTS   AGE
    test-rc-ffg4g   1/1     Running   0          61s
    test-rc-lw4r8   1/1     Running   0          61s
     % oc rsh -n test11 test-rc-ffg4g
    ~ $ curl registry-1.docker.io -I
     
    ^C
    ~ $ curl www.facebook.com
    ^C
    ~ $ 
    ~ $ curl www.facebook.com --connect-timeout 5
    curl: (28) Failed to connect to www.facebook.com port 80 after 2706 ms: Operation timed out
    ~ $ curl registry-1.docker.io --connect-timeout 5
    curl: (28) Failed to connect to registry-1.docker.io port 80 after 4430 ms: Operation timed out
    ~ $ ^C
    ~ $ exit
    command terminated with exit code 130
    % oc get dnsnameresolver     -n openshift-ovn-kubernetes         
    NAME             AGE
    dns-67b687cfb5   7m47s
    dns-696b6747d9   2m12s
    dns-b6c74f6f4    2m12s
     
     % oc get dnsnameresolver  dns-696b6747d9  -n openshift-ovn-kubernetes  -o yaml
    apiVersion: network.openshift.io/v1alpha1
    kind: DNSNameResolver
    metadata:
      creationTimestamp: "2024-05-16T05:32:07Z"
      generation: 1
      name: dns-696b6747d9
      namespace: openshift-ovn-kubernetes
      resourceVersion: "101283"
      uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5
    spec:
      name: www.facebook.com.
 % oc get dnsnameresolver  dns-696b6747d9  -n openshift-ovn-kubernetes  -o yaml
    apiVersion: network.openshift.io/v1alpha1
    kind: DNSNameResolver
    metadata:
      creationTimestamp: "2024-05-16T05:32:07Z"
      generation: 1
      name: dns-696b6747d9
      namespace: openshift-ovn-kubernetes
      resourceVersion: "101283"
      uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5
    spec:
      name: www.facebook.com.
     
     % oc get dnsnameresolver  dns-696b6747d9  -n openshift-ovn-kubernetes  -o yaml
    apiVersion: network.openshift.io/v1alpha1
    kind: DNSNameResolver
    metadata:
      creationTimestamp: "2024-05-16T05:32:07Z"
      generation: 1
      name: dns-696b6747d9
      namespace: openshift-ovn-kubernetes
      resourceVersion: "101283"
      uid: a8546ad8-b16d-4d81-a943-46bdd0d82aa5
    spec:
      name: www.facebook.com.


    

Actual results:

A DNS name like www.facebook.com configured in the egressfirewall did not get resolved to an IP
    

Expected results:

EgressFirewall works as expected.
 
   

Additional info:


    

Description of problem:

    The */network-status annotation does not reflect multiple interfaces

Version-Release number of selected component (if applicable):

    latest release

How reproducible:

    Always

Steps to Reproduce:

   https://gist.github.com/dougbtv/1eb8ac2d61d494b56d65a6b236a86e61     

Description of problem: After changing the value of enable_topology in the openshift-config/cloud-provider-config config map, the CSI controller pods should restart to pick up the new value. This is not happening.

It seems like our understanding in https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/127#issuecomment-1780967488 was wrong.

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/268

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    [AWS] securityGroups and subnet are not consistent between the machine YAML and the AWS console
    The security group huliu-aws531d-vlzbw-master-sg is not present for masters on the AWS console, but shows up in the master machines' YAML
    The security group huliu-aws531d-vlzbw-worker-sg is not present for workers on the AWS console, but shows up in the worker machines' YAML
    The subnet huliu-aws531d-vlzbw-private-us-east-2a is not present for masters or workers on the AWS console, but shows up in the master and worker machines' YAML

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-30-130713
This happens in the latest 4.16(CAPI) AWS cluster

How reproducible:

    Always

Steps to Reproduce:

    1. Install a AWS 4.16 cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-30-130713   True        False         46m     Cluster version is 4.16.0-0.nightly-2024-05-30-130713
liuhuali@Lius-MacBook-Pro huali-test % oc  get machine
NAME                                          PHASE     TYPE         REGION      ZONE         AGE
huliu-aws531d-vlzbw-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   65m
huliu-aws531d-vlzbw-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   65m
huliu-aws531d-vlzbw-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   65m
huliu-aws531d-vlzbw-worker-us-east-2a-swwmk   Running   m6i.xlarge   us-east-2   us-east-2a   62m
huliu-aws531d-vlzbw-worker-us-east-2b-f2gw9   Running   m6i.xlarge   us-east-2   us-east-2b   62m
huliu-aws531d-vlzbw-worker-us-east-2c-x6gbz   Running   m6i.xlarge   us-east-2   us-east-2c   62m

    2. Check the machine YAML: there are 4 securityGroups and 2 subnet values for master machines, and 3 securityGroups and 2 subnet values for worker machines.
But on the AWS console there are only 3 security groups and 1 subnet for masters, and 2 security groups and 1 subnet for workers.

liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-master-0  -oyaml
…
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-master-sg
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-node
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-lb
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-controlplane
      subnet:
        filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-private-us-east-2a
          - huliu-aws531d-vlzbw-subnet-private-us-east-2a
…
https://drive.google.com/file/d/1YyPQjSCXOm-1gbD3cwktDQQJter6Lnk4/view?usp=sharing
https://drive.google.com/file/d/1MhRIm8qIZWXdL9-cDZiyu0TOTFLKCAB6/view?usp=sharing
https://drive.google.com/file/d/1Qo32mgBerWp5z6BAVNqBxbuH5_4sRuBv/view?usp=sharing
https://drive.google.com/file/d/1seqwluMsPEFmwFL6pTROHYyJ_qPc0cCd/view?usp=sharing


liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-aws531d-vlzbw-worker-us-east-2a-swwmk  -oyaml
…
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-worker-sg
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-node
      - filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-lb
      subnet:
        filters:
        - name: tag:Name
          values:
          - huliu-aws531d-vlzbw-private-us-east-2a
          - huliu-aws531d-vlzbw-subnet-private-us-east-2a
…


https://drive.google.com/file/d/1FM7dxfSK0CGnm81dQbpWuVz1ciw9hgpq/view?usp=sharing
https://drive.google.com/file/d/1QClWivHeGGhxK7FdBUJnGu-vHylqeg5I/view?usp=sharing
https://drive.google.com/file/d/12jgyFfyP8fTzQu5wRoEa6RrXbYt_Gxm1/view?usp=sharing 
    

Actual results:

    securityGroups and subnet are not consistent between the machine YAML and the AWS console

Expected results:

    securityGroups and subnet should be consistent between the machine YAML and the AWS console

Additional info:

    

Description of problem:

GCP private cluster with CCO in Passthrough mode failed to install because the cloud-credential operator is degraded:

status:
  conditions:
  - lastTransitionTime: "2024-06-24T06:04:39Z"
    message: 1 of 7 credentials requests are failing to sync.
    reason: CredentialsFailing
    status: "True"
    type: Degraded

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2024-06-21-203120    

How reproducible:

Always    

Steps to Reproduce:

    1.Create GCP private cluster with CCO Passthrough mode, flexy template is private-templates/functionality-testing/aos-4_13/ipi-on-gcp/versioned-installer-xpn-private     
    2.Wait for cluster installation
    

Actual results:

jianpingshu@jshu-mac ~ % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         23m     Error while reconciling 4.13.0-0.nightly-2024-06-21-203120: the cluster operator cloud-credential is degraded

status:
  conditions:
  - lastTransitionTime: "2024-06-24T06:04:39Z"
    message: 1 of 7 credentials requests are failing to sync.
    reason: CredentialsFailing
    status: "True"
    type: Degraded

jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort
CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: error while validating permissions: error testing permissions: googleapi: Error 400: Permission commerceoffercatalog.agreements.list is not valid for this resource., badRequest
NoConditions= openshift-cloud-network-config-controller-gcp :
NoConditions= openshift-gcp-ccm :
NoConditions= openshift-gcp-pd-csi-driver-operator :
NoConditions= openshift-image-registry-gcs :
NoConditions= openshift-ingress-gcp :
NoConditions= openshift-machine-api-gcp :    

Expected results:

The cluster installs successfully without the cloud-credential operator being degraded

Additional info:

Some problem PROW CI tests: 
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-multi-nightly-gcp-ipi-user-labels-tags-filestore-csi-tp-arm-f14/1805064266043101184
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-upgrade-from-stable-4.13-gcp-ipi-xpn-fips-f28/1804676149503070208    

 

This is a clone of issue OCPBUGS-43378. The following is the description of the original issue:

In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:

 Undiagnosed panic detected in pod expand_less 	0s
{  pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}

This issue seems relatively common on OpenStack; these runs very frequently hit this failure.

Linked test name: Undiagnosed panic detected in pod

Description of problem:

The OpenShift Assisted Installer reports the Dell PowerEdge C6615 node's four 960GB SATA solid state disks as removable and subsequently refuses to continue installing OpenShift onto at least one of those disks.
This is an issue whereby the OpenShift agent installer reports installed SATA SSDs as removable and refuses to use any of them as installation targets.

Linux Kernel reports:
sd 4:0:0:0 [sdb] Attached SCSI removable disk
sd 5:0:0:0 [sdc] Attached SCSI removable disk
sd 6:0:0:0 [sdd] Attached SCSI removable disk
sd 3:0:0:0 [sda] Attached SCSI removable disk
Each removable disk is clean: 894.3GiB free space, no partitions, etc.
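For reference, the removable flag that the validation is reacting to can be inspected directly on the host (a minimal sketch; the device names are taken from the kernel messages above and may differ on other hardware):

# lsblk prints the removable flag in the RM column (1 = removable)
lsblk -o NAME,RM,SIZE,MODEL /dev/sda /dev/sdb /dev/sdc /dev/sdd

# the same flag as exposed by the kernel via sysfs
for d in sda sdb sdc sdd; do
  echo "$d removable=$(cat /sys/block/$d/removable)"
done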

However, the host is reported as Insufficient:
This host does not meet the minimum hardware or networking requirements and will not be included in the cluster.
Hardware:
Failed
  Warning alert:
    Insufficient
Minimum disks of required size: No eligible disks were found, please check specific disks to see why they are not eligible.

    

Version-Release number of selected component (if applicable):

   4.15.z 

How reproducible:

    100 %

Steps to Reproduce:

    1. Install with assisted Installer
    2. Generate ISO using option over console.
    3. Boot the ISO on dell HW mentioned in description
    4. Observe journal logs for disk validations
    

Actual results:

    Installation fails at disk validation

Expected results:

    Installation should complete

Additional info:

    

Description of problem:

    Due to the branching of 4.17 not having happened yet, the mce-2.7 Konflux application can't merge the .tekton pipeline 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38070. The following is the description of the original issue:

Description of problem:

Create cluster with publish:Mixed by using CAPZ,
1. publish: Mixed + apiserver: Internal
install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External

In this case, api dns should not be created in public dns zone, but it was created.
==================
$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}

2. publish: Mixed + ingress: Internal
install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal

In this case, load balance rule on port 6443 should be created in external load balancer, but it could not be found.
================
$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]

Version-Release number of selected component (if applicable):

    4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Specify publish: Mixed + mixed External/Internal for api/ingress 
    2. Create cluster
    3. check public dns records and load balancer rules in internal/external load balancer to be created expected
    

Actual results:

    see description, some resources are unexpected to be created or missed.

Expected results:

    public dns records and load balancer rules in internal/external load balancer to be created expected based on setting in install-config

Additional info:

    

Description of problem:

    The ingress cluster capability has been introduced in OCP 4.16 (https://github.com/openshift/enhancements/pull/1415). It includes the cluster ingress operator and all its controllers. If the ingress capability is disabled all the routes of the cluster become unavailable (no router to back them up). The console operator heavily depends on the working (admitted/active) routes to do the health checks, configure the authentication flows, client downloads, etc. The console operator goes degraded if the routes are not served by a router. The console operator needs to be able to tolerate the absence of the ingress capability.

 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Create ROSA HCP cluster.
    2. Scale the default ingresscontroller to 0: oc -n openshift-ingress-operator patch ingresscontroller default --type='json' -p='[{"op": "replace", "path": "/spec/replicas", "value":0}]'
    3. Check the status of console cluster operator: oc get co console
    

Actual results:

$ oc get co console
NAME      VERSION  AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console   4.16.0   False       False         False      53s     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.49e4812b7122bc833b72.hypershift.aws-2.ci.openshift.org": EOF

Expected results:

$ oc get co console
NAME      VERSION  AVAILABLE   PROGRESSING   DEGRADED   
console   4.16.0   True        False         False

Additional info:

    The ingress capability cannot be disabled on a standalone OpenShift (when the payload is managed by ClusterVersionOperator). Only clusters managed HyperShift with HostedControlPlane are impacted.

Description of problem:

We are seeing:

dhcp server failed to create private network: unable to retrieve active status for new PER connection information after create private network

This error might be alleviated by not conflicting the CIDR when we create the DHCP network. One way to mitigate this is to use a random subnet address.
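A minimal sketch of what "use a random subnet address" could look like, assuming a private /24 is picked out of 192.168.0.0/16 before the DHCP network is created (the range and variable name are illustrative, not the actual installer code):

# pick a random /24 to reduce the chance of a CIDR clash with existing networks
DHCP_CIDR="192.168.$(( RANDOM % 254 + 1 )).0/24"
echo "using ${DHCP_CIDR} for the DHCP private network"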
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Sometimes
    

Steps to Reproduce:

    1. Create a cluster
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/513

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   I have a CU who reported that they are not able to edit the "Until" option from the Developer perspective.

Version-Release number of selected component (if applicable):

    OCP v4.15.11     

Screenshot
https://redhat-internal.slack.com/archives/C04BSV48DJS/p1716889816419439

Description of problem

Seen in a 4.15.19 cluster, the PrometheusOperatorRejectedResources alert was firing, but did not link a runbook, despite the runbook existing since MON-2358.

Version-Release number of selected component

Seen in 4.15.19, but likely applies to all versions where the PrometheusOperatorRejectedResources alert exists.

How reproducible

Every time.

Steps to Reproduce:

Check the cluster console at /monitoring/alertrules?rowFilter-alerting-rule-source=platform&name=PrometheusOperatorRejectedResources, and click through to the alert definition.
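The missing link can also be confirmed from the CLI by inspecting the alert's annotations (a sketch; it assumes jq is available and that the rule ships in the openshift-monitoring namespace):

oc get prometheusrules -n openshift-monitoring -o json \
  | jq -r '.items[].spec.groups[]?.rules[]?
           | select(.alert == "PrometheusOperatorRejectedResources")
           | {alert, annotations}'

With the fix in place, the output should contain a runbook_url annotation pointing at the runbook added in MON-2358.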

Actual results

No mention of runbooks.

Expected results

A Runbook section linking the runbook.

Additional info

I haven't dug into the upstream/downstream sync process, but the runbook information likely needs to at least show up here, although that may or may not be the root location for injecting our canonical runbook into the upstream-sourced alert.

This is a clone of issue OCPBUGS-42528. The following is the description of the original issue:

Description of problem:

The created Node ISO is missing the architecture (<arch>) in its filename, which breaks consistency with other generated ISOs such as the Agent ISO.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%    

Actual results:

Currently, the Node ISO is being created with the filename node.iso.

Expected results:

Node ISO should be created as node.<arch>.iso to maintain consistency.

Description of the problem:

In my ACM 2.10.5 / MCE 2.5.6 Hub, I updated successfully to ACM 2.11.0 / MCE 2.6.0

At the completion of the update, I now see some odd aspects with my `local-cluster` object.

How reproducible:

Any existing hub that updates to 2.11 will see these changes introduced.

Steps to reproduce:

1.Have an existing OCP IPI on AWS

2.Install ACM 2.10 on it

3.Upgrade to ACM 2.11

 

Actual results:

Before the upgrade, the ACM local-cluster shows an AWS provider type in the UI.

After the upgrade, the ACM local-cluster shows a Host Inventory provider type in the UI.

The ACM local-cluster now also has an 'Add hosts' menu action in the UI. This does not make sense for an IPI-style OCP on AWS.

Expected results:

I did not expect that the Hub on AWS IPI OCP would be shown as Host Inventory, nor would I expect that I can / should use this Add hosts menu action.

Description of problem:

Autoscaler balanceSimilarNodeGroups failed on AWS when running the regression for https://issues.redhat.com/browse/OCPCLOUD-2616

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-20-165244

How reproducible:

Always 

Steps to Reproduce:

1. Create a ClusterAutoscaler with balanceSimilarNodeGroups: true (see the sketch after these steps)
2. Create 2 MachineAutoscalers with min/max 1/8
3. Add workload
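A minimal sketch of the resources from steps 1 and 2, following the heredoc pattern used elsewhere in this log; the MachineSet names are placeholders for the two zonal worker MachineSets of the cluster:

cat << EOF | oc create -f -
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-zone-b
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <worker-machineset-zone-b>   # placeholder
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-zone-c
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 8
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <worker-machineset-zone-c>   # placeholder
EOF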

Actual results:

Couldn't see the "splitting scale-up" message in the cluster-autoscaler logs.
must-gather: https://drive.google.com/file/d/17aZmfQHKZxJEtqPvl37HPXkXA36Yp6i8/view?usp=sharing

2024-06-21T13:21:08.678016167Z I0621 13:21:08.678006       1 compare_nodegroups.go:157] nodes template-node-for-MachineSet/openshift-machine-api/zhsun-aws21-5slwv-worker-us-east-2b-5109433294514062211 and template-node-for-MachineSet/openshift-machine-api/zhsun-aws21-5slwv-worker-us-east-2c-760092546639056043 are not similar, labels do not match
2024-06-21T13:21:08.678030474Z I0621 13:21:08.678021       1 orchestrator.go:249] No similar node groups found

Expected results:

balanceSimilarNodeGroups works well 

Additional info:

 

Owner: Architect:

_<Architect is responsible for completing this section to define the
details of the story>_

Story (Required)

As an OpenShift user, I'd like to see the LATEST charts, sorted by release (semver) version.

But today the list is in the order they were released, which makes sense if your chart is single-stream, but not if you are releasing for multiple product version streams (1.0.z, 1.1.z, 1.2.z)

Background (Required)

RHDH has charts for 2 concurrent release streams today, with a 3rd stream coming in June, so we'll have 1.0.z updates after 1.2, and 1.1.z after that. Mixing them together is confusing, especially with the Freshmaker CVE updates.
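For illustration only, a plain version sort already produces the ordering described in option a) below; the chart versions here are made up to mirror the streams above:

printf '1.1.0\n1.0.3\n1.2.0\n1.0.4\n1.1.1\n' | sort -V
# 1.0.3
# 1.0.4
# 1.1.0
# 1.1.1
# 1.2.0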

Acceptance Criteria

One of two implementations:

a) default sorting is by logical semver, with latest being the bottom of the list (default selected) and top of the list being the oldest chart; or

b) UI to allow choosing to sort by release date or version

Development:

QE:
Documentation: Yes/No (needs-docs|upstream-docs / no-doc)

Upstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable

Downstream: <Inputs/Requirement details: Concept/Procedure>/ Not
Applicable

Release Notes Type: <New Feature/Enhancement/Known Issue/Bug
fix/Breaking change/Deprecated Functionality/Technology Preview>

INVEST Checklist

Dependencies identified

Blockers noted and expected delivery timelines set

Design is implementable

Acceptance criteria agreed upon

Story estimated


Legend

Unknown

Verified

Unsatisfied

Description of problem:

    Admission webhook warning on creation of Route - violates policy 299 - unknown field "metadata.defaultAnnotations"
    Admission webhook warning on creation of BuildConfig - violates policy 299 - unknown field "spec.source.git.type"

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to Import from git form and create a deployment
    2. See the `Admission webhook warning` toast notification
    

Actual results:

    Admission webhook warning - violates policy 299 - unknown field "metadata.defaultAnnotations" shows up on creation of a Route, and Admission webhook warning - violates policy 299 - unknown field "spec.source.git.type" shows up on creation of a BuildConfig

Expected results:

    No Admission webhook warning should show

Additional info:

    

Description of problem:

Fix spelling "Rememeber" to "Remember"

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The cloud provider feature of NTO doesn't work as expected

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create a cloud-provider profile like as 
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: provider-aws
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=GCE Cloud provider-specific profile
      # Your tuning for GCE Cloud provider goes here.
      [sysctl]
      vm.admin_reserve_kbytes=16386
    name: provider-aws
    2. Check the value of vm.admin_reserve_kbytes on a node running on the matching cloud provider (see the sketch after these steps)
    3.
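A minimal way to verify whether the profile was applied (a sketch; the node name is a placeholder for a node running on the matching cloud provider):

oc debug node/<node-name> -- chroot /host sysctl vm.admin_reserve_kbytes
# expected after the fix: vm.admin_reserve_kbytes = 16386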
    

Actual results:

    the value of vm.admin_reserve_kbytes still using default value

Expected results:

    the value of vm.admin_reserve_kbytes should change to 16386

Additional info:

    

This is a clone of issue OCPBUGS-43157. The following is the description of the original issue:

Description of problem:

    When running the `make fmt` target in the repository the command can fail due to a mismatch of versions between the go language and the goimports dependency.

 

Version-Release number of selected component (if applicable):

    4.16.z

How reproducible:

    always

Steps to Reproduce:

    1.checkout release-4.16 branch
    2.run `make fmt`
    

Actual results:

INFO[2024-10-01T14:41:15Z] make fmt make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator' hack/goimports.sh go: downloading golang.org/x/tools v0.25.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local) 

Expected results:

    successful completion of `make fmt`

Additional info:

    Our goimports.sh script references `goimports@latest`, which means that this problem will most likely affect older branches as well. We will need to set a specific version of the goimports package for those branches.

Given that the CCCMO already includes golangci-lint and uses it for a test, we should include goimports through golangci-lint, which will solve this problem without needing to pin special versions of goimports.
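If pinning is chosen for the older branches instead, hack/goimports.sh could be changed along these lines (a sketch; the pinned x/tools version is an assumption and has to be compatible with the branch's go toolchain):

#!/usr/bin/env bash
set -euo pipefail
# pin goimports instead of @latest so older branches keep building with their go toolchain
go run golang.org/x/tools/cmd/goimports@v0.17.0 -w -local github.com/openshift/cluster-cloud-controller-manager-operator .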

Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/149

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/84

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The default channel of 4.17 clusters is stable-4.16.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-03-193825

How reproducible:

Always

Steps to Reproduce:

    1. Install a 4.16 cluster
    2. Check default channel 

❯ oc adm upgrade
Cluster version is 4.17.0-0.test-2024-07-07-082848-ci-ln-htjr9ib-latest
Upstream is unset, so the cluster will use an appropriate default.

Channel: stable-4.16
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.17.0-0.test-2024-07-07-082848-ci-ln-htjr9ib-latest not found in the "stable-4.16" channel

Actual results:

Default channel is stable-4.16 in a 4.17 cluster

Expected results:

Default channel should be stable-4.17

Additional info:

similar issue was observed and fixed in previous versions
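As a possible workaround until the default is corrected, the channel can be switched manually (a sketch; flag requirements depend on the client version and on whether the target channel is advertised by the cluster):

oc adm upgrade channel stable-4.17 --allow-explicit-channel
oc adm upgrade   # should now resolve available updates from stable-4.17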

This is a clone of issue OCPBUGS-38228. The following is the description of the original issue:

Description of problem:

On overview page's getting started resources card, there is "OpenShift LightSpeed" link when this operator is available on the cluster, the text should be updated to "OpenShift Lightspeed" to keep consistent with operator name.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-08-013133
4.16.0-0.nightly-2024-08-08-111530
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check overview page's getting started resources card,  
    2.
    3.
    

Actual results:

1. There is "OpenShift LightSpeed" link  in "Explore new features and capabilities"
    

Expected results:

1. The text should be "OpenShift Lightspeed" to keep consistent with the operator name.
    

Additional info:


    

Description of problem

Seen in some 4.17 update runs, like this one:

disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged (2h30m29s)
{Your Test Panicked
github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:153
  When you, or your assertion library, calls Ginkgo's Fail(),
  Ginkgo panics to prevent subsequent assertions from running.

  Normally Ginkgo rescues this panic so you shouldn't see it.

  However, if you make an assertion in a goroutine, Ginkgo can't capture the
  panic.
  To circumvent this, you should call

  	defer GinkgoRecover()

  at the top of the goroutine that caused this panic.
...
github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.getClusterVersion({0x8b34870, 0xc0004d3b20}, 0xc0004d3b20?)
	github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:153 +0xee
github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.getCurrentVersion({0x8b34870?, 0xc0004d3b20?}, 0xd18c2e2800?)
	github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:163 +0x2c
github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.(*AdminAckTest).Test(0xc001599de0, {0x8b34800, 0xc005a36910})
	github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator/adminack.go:72 +0x28d
github.com/openshift/origin/test/e2e/upgrade/adminack.(*UpgradeTest).Test(0xc0018b23a0, {0x8b34608?, 0xccc6580?}, 0xc0055a3320?, 0xc0055ba180, 0x0?)
	github.com/openshift/origin/test/e2e/upgrade/adminack/adminack.go:53 +0xfa
...

We should deal with that noise, and get nicer error messages out of this test-case when there are hiccups calling getClusterVersion and similar.

Version-Release number of selected component

Seen in 4.17 CI, but it's fairly old code, so likely earlier 4.y are also exposed.

How reproducible:

Sippy reports 18 failures for this test-case vs. 250 success in the last week of periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade runs. So fairly rare, but not crazy rare.

Steps to Reproduce

Run hundreds of periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade and watch the Verify presence of admin ack gate blocks upgrade until acknowledged test-case.

Actual results

Occasional failures complaining about Ginko Fail panics.

Expected results

Reliable success.

Observed in 

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-serial-ovn-ipv6/1786198211774386176

 

There was a delay provisioning one of the master nodes; we should figure out why this is happening and whether it can be prevented.

 

From the Ironic logs, there was a 5 minute delay during cleaning; on the other 2 masters this took a few seconds.

 

 

01:20:53 1f90131a...moved to provision state "verifying" from state "enroll"
01:20:59 1f90131a...moved to provision state "manageable" from state "verifying"
01:21:04 1f90131a...moved to provision state "inspecting" from state "manageable"
01:21:35 1f90131a...moved to provision state "inspect wait" from state "inspecting"
01:26:26 1f90131a...moved to provision state "inspecting" from state "inspect wait" 
01:26:26 1f90131a...moved to provision state "manageable" from state "inspecting"
01:26:30 1f90131a...moved to provision state "cleaning" from state "manageable"
01:27:17 1f90131a...moved to provision state "clean wait" from state "cleaning"
>>> what's this 5 minute gap about? <<<
01:32:07 1f90131a...moved to provision state "cleaning" from state "clean wait" 
01:32:08 1f90131a...moved to provision state "clean wait" from state "cleaning"
01:32:12 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:13 1f90131a...moved to provision state "available" from state "cleaning"
01:32:23 1f90131a...moved to provision state "deploying" from state "available"
01:32:28 1f90131a...moved to provision state "wait call-back" from state "deploying"
01:32:58 1f90131a...moved to provision state "deploying" from state "wait call-back"
01:33:14 1f90131a...moved to provision state "active" from state "deploying"

Please review the following PR: https://github.com/openshift/api/pull/1903

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-37663. The following is the description of the original issue:

Description of problem:

    CAPZ creates an empty route table during installs

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Very

Steps to Reproduce:

    1.Install IPI cluster using CAPZ
    2.
    3.
    

Actual results:

    Empty route table created and attached to worker subnet

Expected results:

    No route table created

Additional info:

    

Description of problem:

oc command cannot be used with RHEL 8 based bastion    

Version-Release number of selected component (if applicable):

    4.16.0-rc.1

How reproducible:

    Very

Steps to Reproduce:

    1. Have a bastion for z/VM installation at Red Hat Enterprise Linux release 8.9 (Ootpa) 
    2. Download and install the 4.16.0-rc.1 client on the bastion
    3.Attempt to use the oc command
    

Actual results:

    oc get nodes
oc: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by oc)
oc: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by oc)
oc: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by oc)

Expected results:

    oc command returns without error

Additional info:

    This was introduced in 4.16.0-rc.1 - 4.16.0-rc.0 works fine
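As a workaround sketch for RHEL 8 based bastions (assuming the mirror publishes a separate RHEL 8 client build for the release; the artifact name, version and architecture below are placeholders to check against the mirror listing):

# confirm the bastion's glibc is older than what the default (RHEL 9 built) oc needs
ldd --version | head -1

# fetch the RHEL 8 compatible client build instead of the default one
curl -LO "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/<version>/openshift-client-linux-<arch>-rhel8-<version>.tar.gz"
tar -xzf "openshift-client-linux-<arch>-rhel8-<version>.tar.gz" oc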

Please review the following PR: https://github.com/openshift/images/pull/184

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When doing offline SDN migration, setting the parameter "spec.migration.features.egressIP" to "false" to disable automatic migration of egressIP configuration doesn't work.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Launch a cluster with OpenShiftSDN. Configure an egressip to a node.
    2. Start offline SDN migration.
    3. In step-3, execute 
       oc patch Network.operator.openshift.io cluster --type='merge' \
  --patch '{
    "spec": {
      "migration": {
        "networkType": "OVNKubernetes",
        "features": {
          "egressIP": false
        }
      }
    }
  }'
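If the flag were honoured, a quick check right after applying the patch should return no OVN-Kubernetes EgressIP objects (a verification sketch):

# should report "No resources found" while egressIP migration is disabled
oc get egressips.k8s.ovn.org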

Actual results:

An egressip.k8s.ovn.org CR is created automatically.

Expected results:

No egressip CR shall be created for OVN-K

Additional info:

    

Description of problem:

The ingress operator sets the "SyncLoadBalancerFailed" status with a message that says "The kube-controller-manager logs may contain more details.".

Depending on the platform, that isn't accurate, as the CCM is transitioning out-of-tree to the "cloud-controller-manager".


Code Link: https://github.com/openshift/cluster-ingress-operator/blob/55780444031714fc931d90af298a4b193888977a/pkg/operator/controller/ingress/status.go#L874 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    

Steps to Reproduce:

    1. Create a IngressController with a broken LoadBalancer-type service so it produces "SyncLoadBalancerFailed" (TBD, I'll try to figure out how to produce this...)
    2.
    3.
    

Actual results:

    "The kube-controller-manager logs may contain more details."

Expected results:

    "The cloud-controller-manager logs may contain more details."

Additional info:

    

Description of problem:

When the cloud-credential operator is used in manual mode and awsSTSIAMRoleARN is not present in the secret, the operator pods throw aggressive errors every second.

One of the customer's concerns is the number of errors from the operator pods.

Two errors per second
============================
time="2024-05-10T00:43:45Z" level=error msg="error syncing credentials: an empty awsSTSIAMRoleARN was found so no Secret was created" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials

time="2024-05-10T00:43:46Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials

Version-Release number of selected component (if applicable):

    4.15.3

How reproducible:

    Always present in managed rosa clusters 

Steps to Reproduce:

    1.create a rosa cluster 
    2.check the errors of cloud credentials operator pods 
    3.
    

Actual results:

    The CCO logs continually throw errors

Expected results:

    The CCO logs should not be continually throwing these errors.

Additional info:

    The focus of this bug is only to remove the error lines from the logs. The underlying issue, of continually attempting to reconcile the CRs will be handled by other bugs.

Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/43

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38860. The following is the description of the original issue:

Description of problem:

In the 4.16 version we can collapse and expand the "Getting started resources" section under the Administrator perspective.

In earlier versions we could also remove this tab entirely with the [X] control, which is no longer there in the 4.16 version.

Only the expand and collapse functions are available; the option to remove the tab, as in previous versions, is missing.


 

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Go to Web console. Click on the "Getting started resources." 
    2. Then you can expand and collapse this tab.
    3. But there is no option to directly remove this tab.      

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1].
This task fails with:
- ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in OSP 16 env (RHEL 8.5) or
- openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in OSP 17 env (RHEL 9.2)

Besides that, we need a way to identify the resources for a particular deployment, as they may interfere with an existing one.

Version-Release number of selected component (if applicable):

4.15.0-0.nightly-2024-01-22-160236

How reproducible:

Always

Steps to Reproduce:

1. Set the os_subnet6 in the inventory file for setting dualstack
2. Run the 4.15 network.yaml playbook

Actual results:

Playbook fails:
TASK [Add IPv6 subnet to the external router] ********************************** fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}

Expected results:

Successful playbook execution

Additional info:

The router can be created in two different tasks, the playbook [2] worked for me.

[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml
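For reference, the equivalent manual operations with the OpenStack CLI (a sketch; the resource names are placeholders, and which command applies depends on what the failing task is meant to achieve):

# attach the IPv6 machine subnet as an additional interface on the router
openstack router add subnet <external-router> <machines-v6-subnet>

# or, if the intent is an extra fixed IP on the external gateway:
openstack router set <external-router> \
  --external-gateway <external-network> \
  --fixed-ip subnet=<external-v6-subnet>

openstack router show <external-router> -c interfaces_info -c external_gateway_info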

This is a clone of issue OCPBUGS-37850. The following is the description of the original issue:

Description of problem:

Occasional machine-config daemon panics in test-preview. For example this run has:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736

And the referenced logs include a full stack trace, the crux of which appears to be:

E0801 19:23:55.012345    2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact

How reproducible:

looks like ~15% impact in those CI runs CI Search turns up.

Steps to Reproduce:

Run lots of CI. Look for MCD panics.

Actual results

CI Search results above.

Expected results

No hits.

Please review the following PR: https://github.com/openshift/oc/pull/1780

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    There are 2 problematic tests in the ImageEcosystem test suite: the Rails sample and the s2i Perl test. This issue tries to fix them both at once so that we can get a passing ImageEcosystem run.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

always
    

Steps to Reproduce:

    1. Run the imageecosystem testsuite
    2. observe the {[Feature:ImageEcosystem][ruby]} and {[Feature:ImageEcosystem][perl]} test fail
    

Actual results:

The two tests fail
    

Expected results:

No test failures
    

Additional info:


    

This is a clone of issue OCPBUGS-42873. The following is the description of the original issue:

Description of problem:

openshift-apiserver is sending traffic intended for the local audit-webhook service through the konnectivity proxy. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.

    

Version-Release number of selected component (if applicable):

    4.14.z, 4.15.z, 4.16.z

How reproducible:

    Always

    

Steps to Reproduce:

    1. Create a rosa hosted cluster
    2. Obeserve logs of the konnectivity-proxy sidecar of openshift-apiserver
    3.
    

Actual results:

     Logs include requests to the audit-webhook local service

    

Expected results:

      Logs do not include requests to audit-webhook 
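A quick way to confirm the exclusion is to inspect the NO_PROXY env var of the openshift-apiserver container (a sketch; the hosted control plane namespace is a placeholder):

oc get deployment openshift-apiserver -n <hcp-namespace> \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="openshift-apiserver")].env[?(@.name=="NO_PROXY")].value}'
# after the fix, the output should list audit-webhook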
    

Additional info:


    

Description of problem:

cluster-api-provider-openstack panics when fed a non-existent network ID as the additional network of a Machine, or as any network in which it is asked to create a port.

This issue has been fixed upstream in https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/2064, which is present in the latest release v0.10.3.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

1. For Linux nodes, the container runtime is CRI-O and port 9537 has a crio process listening on it. Windows nodes, however, do not have the CRI-O container runtime.
2. Prometheus tries to connect to the /metrics endpoint on the Windows nodes on port 9537, which does not have any process listening on it.
3. TargetDown alerts on the crio job since it cannot reach the endpoint http://windows-node-ip:9537/metrics.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Install 4.13 cluster with windows operator
    2. In the Prometheus UI, go to > Status > Targets to know which targets are down.   
    

Actual results:

    It gives the alert for targetDown

Expected results:

    It should not give any such alert.

Additional info:

    

Description of problem:

This was discovered by a new alert that was reverted in https://issues.redhat.com/browse/OCPBUGS-36299 as the issue is making Hypershift Conformance fail.

Platform prometheus is asked to scrape targets from the namespace "openshift-operator-lifecycle-manager", but Prometheus isn't given the appropriate RBAC to do so.

The alert was revealing an RBAC issue on platform Prometheus: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.17-periodics-e2e-aws-ovn-conformance/1806305841511403520/artifacts/e2e-aws-ovn-conformance/dump/artifacts/hostedcluster-8a4fd7515fb581e231c4/namespaces/openshift-monitoring/pods/prometheus-k8s-0/prometheus/prometheus/logs/current.log

2024-06-27T14:59:38.968032082Z ts=2024-06-27T14:59:38.967Z caller=klog.go:108 level=warn component=k8s_client_runtime func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-operator-lifecycle-manager\""
2024-06-27T14:59:38.968032082Z ts=2024-06-27T14:59:38.968Z caller=klog.go:116 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"openshift-operator-lifecycle-manager\""

Before adding this alert, such issues went unnoticed.

https://docs.google.com/document/d/1rCKAYTrYMESjJDyJ0KvNap05NNukmNXVVi6MShISnOw/edit#heading=h.13rhihr867kk explains what should be done (cf the "Also, in order for Prometheus to be able to discover..." paragraph) in order to make Prometheus able to discover the targets.
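A minimal sketch of the RBAC the linked document describes, assuming the targets really should stay in that namespace (otherwise the ServiceMonitor/PodMonitor should simply be removed, as noted below):

cat << EOF | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: openshift-operator-lifecycle-manager
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: openshift-operator-lifecycle-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring
EOF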

Because no test was failing before, maybe the metrics from "openshift-operator-lifecycle-manager" are not needed and we should stop asking Prometheus to discover targets from there: delete the ServiceMonitor/PodMonitor

Expected results:

Description of problem:

Apply an NNCP to configure DNS, then edit the NNCP to update the nameserver; /etc/resolv.conf is not updated.

Version-Release number of selected component (if applicable):

OCP version: 4.16.0-0.nightly-2024-03-13-061822
knmstate operator version: kubernetes-nmstate-operator.4.16.0-202403111814

How reproducible:

always

Steps to Reproduce:

1. install knmstate operator
2. apply below nncp to configure dns on one of the node
---
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dns-staticip-4
spec:
  nodeSelector:
    kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt
  desiredState:
    dns-resolver:
      config:
        search:
        - example.org
        server:
        - 192.168.221.146
        - 8.8.9.9
    interfaces:
    - name: dummy44
      type: dummy
      state: up
      ipv4:
        address:
        - ip: 192.0.2.251
          prefix-length: 24
        dhcp: false
        enabled: true
        auto-dns: false
% oc apply -f dns-staticip-noroute.yaml 
nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 created
% oc get nncp
NAME             STATUS      REASON
dns-staticip-4   Available   SuccessfullyConfigured
% oc get nnce
NAME                                                 STATUS      STATUS AGE   REASON
qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4   Available   5s           SuccessfullyConfigured


3. check dns on the node, dns configured correctly
sh-5.1# cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
search qiowang-031510.qe.devcluster.openshift.com example.org
nameserver 192.168.221.146
nameserver 192.168.221.146
nameserver 8.8.9.9
# nameserver 192.168.221.1
sh-5.1# 
sh-5.1# cat /var/run/NetworkManager/resolv.conf 
# Generated by NetworkManager
search example.org
nameserver 192.168.221.146
nameserver 8.8.9.9
nameserver 192.168.221.1
sh-5.1# 
sh-5.1# nmcli | grep 'DNS configuration' -A 10
DNS configuration:
	servers: 192.168.221.146 8.8.9.9
	domains: example.org
	interface: dummy44
... ...


4. edit nncp, update nameserver, save the modification
---
spec:
  desiredState:
    dns-resolver:
      config:
        search:
        - example.org
        server:
        - 192.168.221.146
        - 8.8.8.8       <---- update from 8.8.9.9 to 8.8.8.8
    interfaces:
    - ipv4:
        address:
        - ip: 192.0.2.251
          prefix-length: 24
        auto-dns: false
        dhcp: false
        enabled: true
      name: dummy44
      state: up
      type: dummy
  nodeSelector:
    kubernetes.io/hostname: qiowang-031510-k4cjs-worker-0-rw4nt
% oc edit nncp dns-staticip-4
nodenetworkconfigurationpolicy.nmstate.io/dns-staticip-4 edited
% oc get nncp
NAME             STATUS      REASON
dns-staticip-4   Available   SuccessfullyConfigured
% oc get nnce
NAME                                                 STATUS      STATUS AGE   REASON
qiowang-031510-k4cjs-worker-0-rw4nt.dns-staticip-4   Available   8s           SuccessfullyConfigured


5. check dns on the node again

Actual results:

The DNS nameserver in the file /etc/resolv.conf is not updated after the NNCP update, while the file /var/run/NetworkManager/resolv.conf is updated correctly:

sh-5.1# cat /etc/resolv.conf 
# Generated by KNI resolv prepender NM dispatcher script
search qiowang-031510.qe.devcluster.openshift.com example.org
nameserver 192.168.221.146
nameserver 192.168.221.146
nameserver 8.8.9.9        <---- it is not updated
# nameserver 192.168.221.1
sh-5.1# 
sh-5.1# cat /var/run/NetworkManager/resolv.conf 
# Generated by NetworkManager
search example.org
nameserver 192.168.221.146
nameserver 8.8.8.8        <---- updated correctly
nameserver 192.168.221.1
sh-5.1# 
sh-5.1# nmcli | grep 'DNS configuration' -A 10
DNS configuration:
	servers: 192.168.221.146 8.8.8.8
	domains: example.org
	interface: dummy44
... ...

Expected results:

The DNS nameserver in the file /etc/resolv.conf should be updated accordingly

Additional info:

 

Description of problem:

    When creating an application from a code sample, some of the icons are stretched out

 

 

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    Always

Steps to Reproduce:

    1. With a project selected, click +Add, then click samples
    2. Observe that some icons are stretched, e.g., Basic .NET, Basic Python

Actual results:

    Icons are stretched horizontally

Expected results:

    They are not stretched

Additional info:

    Access sample page via route /samples/ns/default

Description of problem:

    PAC provides a log link in Git to see the log of the PipelineRun (PLR). This is broken on 4.15 after this change: https://github.com/openshift/console/pull/13470. That PR changed the log URL after the react-router package upgrade.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38789. The following is the description of the original issue:

Description of problem:

The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.

Version-Release number of selected component (if applicable):
4.18

How reproducible:
Always

Steps to Reproduce:

  1. Open the network section

Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section

Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section

Additional info:

Description of problem:

Re-enable the knative and A-04-TC01 tests that were disabled in the PR https://github.com/openshift/console/pull/13931

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

This is a clone of issue OCPBUGS-37953. The following is the description of the original issue:

Description of problem:

Specify long cluster name in install-config, 
==============
metadata:
  name: jima05atest123456789test123

Create cluster, installer exited with below error:
08-05 09:46:12.788  level=info msg=Network infrastructure is ready
08-05 09:46:12.788  level=debug msg=Creating storage account
08-05 09:46:13.042  level=debug msg=Collecting applied cluster api manifests...
08-05 09:46:13.042  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa
08-05 09:46:13.042  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.042  level=error msg=RESPONSE 400: 400 Bad Request
08-05 09:46:13.043  level=error msg=ERROR CODE: AccountNameInvalid
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error msg={
08-05 09:46:13.043  level=error msg=  "error": {
08-05 09:46:13.043  level=error msg=    "code": "AccountNameInvalid",
08-05 09:46:13.043  level=error msg=    "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only."
08-05 09:46:13.043  level=error msg=  }
08-05 09:46:13.043  level=error msg=}
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error
08-05 09:46:13.043  level=info msg=Shutting down local Cluster API controllers...
08-05 09:46:13.298  level=info msg=Stopped controller: Cluster API
08-05 09:46:13.298  level=info msg=Stopped controller: azure infrastructure provider
08-05 09:46:13.298  level=info msg=Stopped controller: azureaso infrastructure provider
08-05 09:46:13.298  level=info msg=Shutting down local Cluster API control plane...
08-05 09:46:15.177  level=info msg=Local Cluster API system has completed operations    

See the Azure doc[1] for the naming rules on storage account names: the name must be between 3 and 24 characters in length and may contain numbers and lowercase letters only.
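The generated name in this report is indeed over that limit, which is easy to confirm:

echo -n "jima05atest123456789tsh586sa" | wc -c   # 28 characters, over the 24 character limit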

The prefix of the storage account created by the installer seems to have changed to use the infraID with the CAPI-based installation; it was "cluster" when installing with Terraform.

Is it possible to change back to using "cluster" as the storage account prefix, to keep consistent with Terraform? Several storage accounts exist once cluster installation is complete: one created by the installer starting with "cluster", and others created by image-registry starting with "imageregistry". QE has CI profiles[2] and automated test cases that rely on the installer storage account and search for the "cluster" prefix, and customers may have similar scenarios.

[1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview
[2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41228. The following is the description of the original issue:

Description of problem:

The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form     

Version-Release number of selected component (if applicable):

    

How reproducible:

Everytime. Only in developer perspective and if the Pipelines dynamic plugin is enabled.
    

Steps to Reproduce:

    1. Create a pipeline through add flow and open start pipeline page 
    2. Under show credentials select add secret
    3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
    

Actual results:

Console crashes
    

Expected results:

UI should work as expected
    

Additional info:

Attaching console log screenshot
    

https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing

Description of problem:

    The VirtualizedTable component in the console dynamic plugin SDK doesn't have a default sorting column. We need a default sorting column for list pages.
https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#virtualizedtable

This is a clone of issue OCPBUGS-43360. The following is the description of the original issue:

Description of problem:

Start last run option from the Action menu does not work on the BuildConfig details page
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Create workloads with builds
    2. Go to the Builds page from navigation
    3. Select the build config
    4. Select the `Start last run` option from the Action menu
    

Actual results:

The option doesn't work
    

Expected results:

The option should work
    

Additional info:

Attaching video
    

https://drive.google.com/file/d/10shQqcFbIKfE4Jv60AxNYBXKz08EdUAK/view?usp=sharing

Searching for "Unable to obtain risk analysis from sippy after retries" indicates that sometimes the Risk Analysis request fails (which of course does not fail any tests, we just don't get RA for the job). It's pretty rare, but since we run a lot of tests, that's still a fair sample size.

Found in 0.04% of runs (0.25% of failures) across 37359 total runs and 5522 jobs 

Interestingly, searching for the error that leads up to this, "error requesting risk analysis from sippy", leads to similar frequency.

Found in 0.04% of runs (0.25% of failures) across 37460 total runs and 5531 jobs 

If failures were completely random and only occasionally repeated enough for retries to all fail, we would expect to see the lead-up a lot more often than the final failure. This suggests that either there's something problematic about a tiny subset of requests, or that perhaps postgres or other dependency is unusually slow for several minutes at a time.

Description of problem:

https://search.dptools.openshift.org/?search=failed+to+configure+the+policy+based+routes+for+network&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 

 

See:

event happened 183 times, something is wrong: node/ip-10-0-52-0.ec2.internal hmsg/9cff2a8527 - reason/ErrorUpdatingResource error creating gateway for node ip-10-0-52-0.ec2.internal: failed to configure the policy based routes for network "default": invalid host address: 10.0.52.0/18 (17:55:20Z) result=reject

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

This comes from this bug https://issues.redhat.com/browse/OCPBUGS-29940

After applying the workaround suggested [1][2] with "oc adm must-gather --node-name", we found another issue where must-gather creates the debug pod on all master nodes and gets stuck for a while because of the loop in the gather_network_logs_basics script. Filtering out the NotReady nodes would allow us to apply the workaround.

The script gather_network_logs_basics gets the master nodes by label (node-role.kubernetes.io/master) and saves them in the CLUSTER_NODES variable. It then passes this as a parameter to the function gather_multus_logs $CLUSTER_NODES, where it loops through the list of master nodes and performs debugging for each node.

collection-scripts/gather_network_logs_basics
...
CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -oname)}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
...
collection-scripts/gather_multus_logs
...
function gather_multus_logs {
  for NODE in "$@"; do
    nodefilename=$(echo "$NODE" | sed -e 's|node/||')
    out=$(oc debug "${NODE}" -- \
    /bin/bash -c "cat $INPUT_LOG_PATH" 2>/dev/null) && echo "$out" 1> "${OUTPUT_LOG_PATH}/multus-log-$nodefilename.log"
  done
}

This could be resolved with something similar to this:

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")).metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES
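
Note that `oc get node ... -oname` emits names prefixed with node/, which gather_multus_logs both strips for the log filename and passes directly to oc debug. A variant of the proposed filter that keeps that prefix (a sketch, not a tested patch) would look like:

CLUSTER_NODES="${@:-$(oc get node -l node-role.kubernetes.io/master -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True")) | "node/" + .metadata.name')}"
/usr/bin/gather_multus_logs $CLUSTER_NODES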

[1] - https://access.redhat.com/solutions/6962230
[2] - https://issues.redhat.com/browse/OCPBUGS-29940

This is a clone of issue OCPBUGS-36261. The following is the description of the original issue:

Description of problem:

In hostedcluster installations, when the following OAuthServer service is configure without any configured hostname parameter, the oauth route is created in the management cluster with the standard hostname  which following the pattern from ingresscontroller wilcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):  

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~  

On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels: 

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT             LABELS
oauth oauth.<custom-domain>  hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~

The configured label makes the ingresscontroller does not admit the route as the following configuration is added by hypershift operator to the default ingresscontroller resource: 

~~~
$ oc get ingresscontroller -n openshift-ingress-default default -oyaml
    routeSelector:
      matchExpressions:
      - key: hypershift.openshift.io/hosted-control-plane <---
        operator: DoesNotExist <---
~~~

This configuration should be allowed as there are use-cases where the route should have a customized hostname. Currently the HCP platform is not allowing this configuration and the oauth route does not work.

Version-Release number of selected component (if applicable):

   4.15

How reproducible:

    Easily

Steps to Reproduce:

    1. Install HCP cluster 
    2. Configure OAuthServer with type Route 
    3. Add a custom hostname different than default wildcard ingress URL from management cluster
    

Actual results:

    Oauth route is not admitted

Expected results:

    Oauth route should be admitted by Ingresscontroller

Additional info:

    

Description of problem:

When a OCB is enabled, and a new MC is created, nodes are drained twice when the resulting osImage build is applied.

    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable OCB in the worker pool

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
EOF



    2. Wait for the image to be built

    3. When the opt-in image has been finished and applied create a new MC

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config-1
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/test-file-1.test

    4. Wait for the image to be built
    

Actual results:

Once the image is built it is applied to the worker nodes.

If we have a look at the drain operation, we can see that every worker node was drained twice instead of once:

oc -n openshift-machine-config-operator logs $(oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-controller -o jsonpath='{.items[0].metadata.name}') -c machine-config-controller | grep "initiating drain"
I0430 13:28:48.740300       1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain
I0430 13:30:08.330051       1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain
I0430 13:32:32.431789       1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain
I0430 13:33:50.643544       1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain
I0430 13:48:08.183488       1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain
I0430 13:49:01.379416       1 drain_controller.go:182] node ip-10-0-70-208.us-east-2.compute.internal: initiating drain
I0430 13:50:52.933337       1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain
I0430 13:52:12.191203       1 drain_controller.go:182] node ip-10-0-69-154.us-east-2.compute.internal: initiating drain


    

Expected results:

Nodes should drained only once when applying a new MC
    

Additional info:

    

Description of problem:

Disable serverless tests
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

This is a clone of issue OCPBUGS-37736. The following is the description of the original issue:

Modify the import to strip or change the bootOptions.efiSecureBootEnabled

https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319

archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
	// Open the corrupt OVA file
	f, ferr := os.Open(cachedImage)
	if ferr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), ferr)
	}
	defer f.Close()

	// Get a sha256 on the corrupt OVA file
	// and the size of the file
	h := sha256.New()
	written, cerr := io.Copy(h, f)
	if cerr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), cerr)
	}

	return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
	return fmt.Errorf("failed to parse ovf: %w", err)
}

Description of problem:

    The Installer still requires permissions to create and delete IAM roles even when the users brings existing roles.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Specify existing IAM role in the install-config
    2.
    3.
    

Actual results:

    The following permissions are required even though they are not used:
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:PutRolePolicy",
        "iam:TagInstanceProfile"

Expected results:

    Only actually needed permissions are required.

Additional info:

    I think this is tech debt from when roles were not tagged. The fix will kind of revert https://github.com/openshift/installer/pull/5286

This is a clone of issue OCPBUGS-31738. The following is the description of the original issue:

Description of problem:

The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.

These failure happen frequently in vh-mecha, for example for all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.

   

Description of problem:

monitor-add-nodes.sh returns Error: open .addnodesparams: permission denied. 

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

sometimes

Steps to Reproduce:

    1. Monitor adding a day2 node using monitor-add-nodes.sh
    2.
    3.
    

Actual results:

    Error: open .addnodesparams: permission denied. 

Expected results:

    monitor-add-nodes runs successfully

Additional info:

zhenying niu found an issue in node-joiner-monitor.sh

[core@ocp-edge49 installer]$ ./node-joiner-monitor.sh 192.168.122.6
namespace/openshift-node-joiner-mz8anfejbn created
serviceaccount/node-joiner-monitor created
clusterrole.rbac.authorization.k8s.io/node-joiner-monitor unchanged
clusterrolebinding.rbac.authorization.k8s.io/node-joiner-monitor configured
pod/node-joiner-monitor created
Now using project "openshift-node-joiner-mz8anfejbn" on server "https://api.ostest.test.metalkube.org:6443".
pod/node-joiner-monitor condition met
time=2024-05-21T09:24:19Z level=info msg=Monitoring IPs: [192.168.122.6]
Error: open .addnodesparams: permission denied
Usage:
  node-joiner monitor-add-nodes [flags]
Flags:
  -h, --help   help for monitor-add-nodes
Global Flags:
      --dir string          assets directory (default ".")
      --kubeconfig string   Path to the kubeconfig file.
      --log-level string    log level (e.g. "debug | info | warn | error") (default "info")
time=2024-05-21T09:24:19Z level=fatal msg=open .addnodesparams: permission denied
Cleaning up
Removing temporary file /tmp/nodejoiner-mZ8aNfEjbn
[~afasano@redhat.com] found the root cause: the working directory was not set, so the current working directory /output is used, and it is not writable. An easy fix would be to just use /tmp, i.e.:

command: ["/bin/sh", "-c", "node-joiner monitor-add-nodes $ipAddresses --dir=/tmp --log-level=info; sleep 5"]

The OCM-operator's imagePullSecretCleanupController attempts to prevent new pods from using an image pull secret that needs to be deleted, but this results in the OCM creating a new image pull secret in the meantime.

The overlap occurs when the OCM-operator has detected that the registry is removed: it simultaneously triggers the imagePullSecretCleanup controller to start deleting the secrets and updates the OCM config to stop creating them, but the OCM behavior change is delayed until its pods are restarted.

In 4.16 this churn is minimized due to the OCM naming the image pull secrets consistently, but the churn can occur during an upgrade given that the OCM-operator is updated first.

Description of problem:


There is one pod of the metal3 operator in a constant failure state. The cluster was acting as a hub cluster with ACM + GitOps for SNO installation. It was working well for a few days, until it reached this state and no other sites could be deployed.

oc get pods -A | grep metal3
openshift-machine-api                              metal3-64cf86fb8b-fg5b9                                           3/4     CrashLoopBackOff   35 (108s ago)   155m
openshift-machine-api                              metal3-baremetal-operator-84875f859d-6kj9s                        1/1     Running            0               155m
openshift-machine-api                              metal3-image-customization-57f8d4fcd4-996hd                       1/1     Running            0               5h

    

Version-Release number of selected component (if applicable):

OCP version: 4.16.ec5
    

How reproducible:

Once it starts to fail, it does not recover.
    

Steps to Reproduce:

    1. Unclear. Install Hub cluster with ACM+GitOps
    2. (Perhaps: update the AgentServiceConfig)
    

Actual results:

Pod crashing and installation of spoke cluster fails
    

Expected results:

Pod running and installation of the spoke cluster succeeds.
    

Additional info:

Logs of metal3-ironic-inspector:

`[kni@infra608-1 ~]$ oc logs pods/metal3-64cf86fb8b-fg5b9 -c metal3-ironic-inspector
+ CONFIG=/etc/ironic-inspector/ironic-inspector.conf
+ export IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ IRONIC_INSPECTOR_ENABLE_DISCOVERY=false
+ export INSPECTOR_REVERSE_PROXY_SETUP=true
+ INSPECTOR_REVERSE_PROXY_SETUP=true
+ . /bin/tls-common.sh
++ export IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ IRONIC_CERT_FILE=/certs/ironic/tls.crt
++ export IRONIC_KEY_FILE=/certs/ironic/tls.key
++ IRONIC_KEY_FILE=/certs/ironic/tls.key
++ export IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ IRONIC_CACERT_FILE=/certs/ca/ironic/tls.crt
++ export IRONIC_INSECURE=true
++ IRONIC_INSECURE=true
++ export 'IRONIC_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IRONIC_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export 'IPXE_SSL_PROTOCOL=-ALL +TLSv1.2 +TLSv1.3'
++ IPXE_SSL_PROTOCOL='-ALL +TLSv1.2 +TLSv1.3'
++ export IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ IRONIC_VMEDIA_SSL_PROTOCOL=ALL
++ export IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CERT_FILE=/certs/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ IRONIC_INSPECTOR_KEY_FILE=/certs/ironic-inspector/tls.key
++ export IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ IRONIC_INSPECTOR_CACERT_FILE=/certs/ca/ironic-inspector/tls.crt
++ export IRONIC_INSPECTOR_INSECURE=true
++ IRONIC_INSPECTOR_INSECURE=true
++ export IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ IRONIC_VMEDIA_CERT_FILE=/certs/vmedia/tls.crt
++ export IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ IRONIC_VMEDIA_KEY_FILE=/certs/vmedia/tls.key
++ export IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ IPXE_CERT_FILE=/certs/ipxe/tls.crt
++ export IPXE_KEY_FILE=/certs/ipxe/tls.key
++ IPXE_KEY_FILE=/certs/ipxe/tls.key
++ export RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ RESTART_CONTAINER_CERTIFICATE_UPDATED=false
++ export MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ MARIADB_CACERT_FILE=/certs/ca/mariadb/tls.crt
++ export IPXE_TLS_PORT=8084
++ IPXE_TLS_PORT=8084
++ mkdir -p /certs/ironic
++ mkdir -p /certs/ironic-inspector
++ mkdir -p /certs/ca/ironic
mkdir: cannot create directory '/certs/ca/ironic': Permission denied

    

Backport to 4.17 of AUTH-482, specifically for the openshift-*-infra namespaces.

Namespaces with workloads that need pinning:

  • openshift-kni-infra
  • openshift-openstack-infra
  • openshift-vsphere-infra

See 4.18 PR for more info on what needs pinning.
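
For reference, a minimal sketch of the workload-partitioning convention these namespaces follow (the exact set of annotations to apply should come from the 4.18 PR; openshift-kni-infra is used here only as one example from the list above):

apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kni-infra
  annotations:
    workload.openshift.io/allowed: management

# and on the pod templates of the workloads running there:
#   annotations:
#     target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'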

This is a clone of issue OCPBUGS-38515. The following is the description of the original issue:

Description of problem:

    container_network* metrics disappeared from pods

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-08-13-031847

How reproducible:

    always

Steps to Reproduce:

    1.create a pod
    2.check container_network* metrics from the pod
$oc get --raw /api/v1/nodes/jimabug02-95wr2-worker-westus-b2cpv/proxy/metrics/cadvisor  | grep container_network_transmit | grep $pod_name
 
    

Actual results:

Step 2 fails to report container_network* metrics

Expected results:

Step 2 should report container_network* metrics

Additional info:

This may be a regression issue, we hit it in 4.14 https://issues.redhat.com/browse/OCPBUGS-13741

 

This is a clone of issue OCPBUGS-41328. The following is the description of the original issue:

Description of problem:

    Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kubelet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager.

During a rotation, the kubelet-serving-ca will be updated with the new CA, which triggers the metrics-server pod to restart and use the new CA. This leads to an error in the metrics-server where it cannot scrape metrics because the kubelet has yet to pick up the new certificate.

E0808 16:57:09.829746       1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130"

rkc@rmac ~> kubectl get pods -n openshift-monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
metrics-server-594cd99645-g8bj7                          0/1     Running   0          2d20h
metrics-server-594cd99645-jmjhj                          1/1     Running   0          46h 

The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kubelet-serving-ca in the data plane. This CA bundle will contain both the new and old CA so that all data plane components can remain up during the rotation process.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in the Pending state start showing up and accumulating over time.

Version-Release number of selected component (if applicable):

4.16.0-rc.1    

How reproducible:

once so far    

Steps to Reproduce:

    1. Deploy SNO with DU profile with disabled capabilities:

    installConfigOverrides:  "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}"

2. Leave the node running tests overnight for a couple of hours

3. Check for Pending CSRs

Actual results:

oc get csr -A | grep Pending | wc -l 
27    

Expected results:

No pending CSRs    

Also oc logs will return a tls internal error:

oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks 
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error

Additional info:

Checking the machine-approver-controller container logs on the node, we can see the reconciliation is failing because it cannot find the Machine API, which is disabled in the capabilities.

I0514 13:25:09.266546       1 controller.go:120] Reconciling CSR: csr-dw9c8
E0514 13:25:09.275585       1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:09.275665       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
I0514 13:25:43.792140       1 controller.go:120] Reconciling CSR: csr-jvrvt
E0514 13:25:43.798079       1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:43.798128       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff" 

Description of problem:

    When worker nodes run DaemonSets that have PVCs attached, the nodes are not properly deleted and stay in a deleting state because the volumes are still attached.

Version-Release number of selected component (if applicable):

    1.14.19

How reproducible:

    Check mustGather for details.

compute mustGather

HCP manager mustGather

Steps to Reproduce:

 

Actual results:

    Nodes stay in a deleting state until we manually "remove" the DaemonSets.

Expected results:

Nodes should be properly deleted without manual action.

Additional info:

    The customer is not experiencing this issue on standard ROSA.

Description of problem:

For STS, an AWS creds file is injected with a credential_process entry for the installer to use. That usually points to a command that loads a Secret containing the creds necessary to assume the role.

For CAPI, the installer runs in an ephemeral envtest cluster. So when it runs that credential_process (via the black box of passing the creds file to the AWS SDK), the command ends up requesting that Secret from the envtest kube API server… where it doesn't exist.

The Installer should avoid overriding KUBECONFIG whenever possible.
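
For reference, the injected creds file is a standard AWS shared-credentials file along these lines (the command path and flags are illustrative placeholders, not the actual values):

[default]
credential_process = /path/to/creds-provider --secret-name <creds-secret> --namespace <namespace>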

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Deploy cluster with STS credentials
    2.
    3.
    

Actual results:

    Install fails with:

time="2024-06-02T23:50:17Z" level=debug msg="failed to get the service provider secret: secrets \"shawnnightly-aws-service-provider-secret\" not foundfailed to get the service provider secret: oc get events -n uhc-staging-2blaesc1478urglmcfk3r79a17n82lm3E0602 23:50:17.324137     151 awscluster_controller.go:327] \"failed to reconcile network\" err=<"
time="2024-06-02T23:50:17Z" level=debug msg="\tfailed to create new managed VPC: failed to create vpc: ProcessProviderExecutionError: error in credential_process"
time="2024-06-02T23:50:17Z" level=debug msg="\tcaused by: exit status 1"
time="2024-06-02T23:50:17Z" level=debug msg=" > controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\" namespace=\"openshift-cluster-api-guests\" name=\"shawnnightly-c8zdl\" reconcileID=\"e7524343-f598-4b71-a788-ad6975e92be7\" cluster=\"openshift-cluster-api-guests/shawnnightly-c8zdl\""
time="2024-06-02T23:50:17Z" level=debug msg="I0602 23:50:17.324204     151 recorder.go:104] \"Failed to create new managed VPC: ProcessProviderExecutionError: error in credential_process\\ncaused by: exit status 1\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AWSCluster\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"shawnnightly-c8zdl\",\"uid\":\"f20bd7ae-a8d2-4b16-91c2-c9525256bb46\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta2\",\"resourceVersion\":\"311\"} reason=\"FailedCreateVPC\""

Expected results:

    No failures

Additional info:

    

This is a clone of issue OCPBUGS-37560. The following is the description of the original issue:

Description of problem:
Console user settings are saved in a ConfigMap for each user in the namespace openshift-console-user-settings.

The console frontend uses the k8s API to read and write that ConfigMap. The console backend creates a ConfigMap with a Role and RoleBinding for each user, giving that single user read and write access to his/her own ConfigMap.

The number of Roles and RoleBindings might degrade cluster performance. This has happened in the past, especially on the Developer Sandbox, where a long-lived cluster creates new users that are then automatically removed after a month. Keeping the Role and RoleBinding results in performance issues.

The resources had an ownerReference before 4.15 so that the 3 resources (1 ConfigMap, 1 Role, 1 RoleBinding) were automatically removed when the User resource was deleted. This ownerReference was removed with 4.15 to support external OIDC providers.

The ask in this issue is to restore that ownerReference for the OpenShift auth provider.
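
Restoring it would mean the console backend again sets metadata.ownerReferences on all three resources, pointing at the User object, roughly like this for the ConfigMap (names and uid are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-settings-<user-uid>
  namespace: openshift-console-user-settings
  ownerReferences:
  - apiVersion: user.openshift.io/v1
    kind: User
    name: <username>
    uid: <user-uid>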

History:

  • User setting feature was introduced 2020 with 4.7 (ODC-4370) without a ownerReference for these resources.
  • After noticing performance issues on Dev Sandbox 2022 (BZ 2019564) we added an ownerReference in 4.11 (PR 11130) and backported this change 4.10 and 4.9.
  • The ownerReference was removed in 4.15 with CONSOLE-3829/OCPBUGS-16814/PR 13321. This is a regression.

See also:

Version-Release number of selected component (if applicable):
4.15+

How reproducible:
Always

Steps to Reproduce:

  1. Create a new user
  2. Login into the console
  3. Check for the user settings ConfigMap, Role and RoleBinding for that user.
  4. Delete the user
  5. The resources should now be removed...

Actual results:
The three resources weren't deleted after the user was deleted.

Expected results:
The three resources should be deleted after the user is deleted.

Additional info:

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/232

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  Setting the UserNamespacesSupport feature gate should put the cluster into TechPreviewNoUpgrade, not CustomNoUpgrade.
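
For illustration, this is roughly how the gate has to be enabled today, which forces the cluster into CustomNoUpgrade:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - UserNamespacesSupport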

Version-Release number of selected component (if applicable):

4.17.0    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-23080. The following is the description of the original issue:

Description of problem:

This is essentially an incarnation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1312444 that was fixed in OpenShift 3 but is now present again.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Select a template in the console web UI, try to enter a multiline value.

Actual results:

It's impossible to enter line breaks.

Expected results:

It should be possible to achieve entering a multiline parameter when creating apps from templates.

Additional info:

I also filed an issue here https://github.com/openshift/console/issues/13317.
P.S. It's happening on https://openshift-console.osci.io, not sure what version of OpenShift they're running exactly.

After fixing https://issues.redhat.com/browse/OCPBUGS-29919 by merging https://github.com/openshift/baremetal-runtimecfg/pull/301 we have lost ability to properly debug the logic of selection Node IP used in runtimecfg.

In order to preserve debugability of this component, it should be possible to selectively enable verbose logs.

Description of problem:

In the RBAC which is set up for networkTypes other than OVNKubernetes, the cluster-network-operator role allows access to a configmap named "openshift-service-ca.crt", but the configmap which is actually used is named "root-ca".
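
A minimal sketch of the rule that would match what is actually used, assuming the role keeps restricting access by resourceNames (role name and namespace are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-network-operator
  namespace: openshift-network-operator
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["root-ca"]
  verbs: ["get", "list", "watch"]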

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38011. The following is the description of the original issue:

Description of problem:

  1. The form with Name and Role in 'Dev Console -> Project -> Project Access tab' (as it existed until OCP 4.11) seems to have been changed to a form with Subject, Name, and Role through OCPBUGS-7800. Here, when the Subject is ServiceAccount, the Save button is not available unless a Project is selected.

This seems to be a requirement to set the Project/namespace. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.

$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name: monitor
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: view
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount monitor

This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome for administrators.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Login to the web console for Developer.
    2. Select Project on the left.
    3. Select 'Project Access' tab.
    4. Add access -> Select Service Account in the dropdown
   

Actual results:

   Save button is not active when no project is selected      

Expected results:

    The Save button should be enabled even when no Project is selected, so that the RoleBinding can be created just as it is handled in the CLI.

Additional info:

    

Description of problem:

 The cluster-api-provider-openstack branch used for e2e testing in cluster-capi-operator is not pinned to a release branch. As a result, the Go versions used in the two projects go out of sync, causing the tests to fail to start.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The rhel8 build, necessary for rhel8 workers, is actually a rhel9 build.

Version-Release number of selected component (if applicable):

    4.16+, where base images are now rhel9

Update the 4.17 installer to use commit c6bcd313bce0fc9866e41bb9e3487d9f61c628a3 of cluster-api-provider-ibmcloud. This includes a couple of necessary Transit Gateway fixes.

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/115

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/2185

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-35947. The following is the description of the original issue:

Description of problem:

In the install-config file, there is no zone/instance type setting under controlPlane or defaultMachinePlatform
==========================
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

Create the cluster. Master instances should be created across multiple zones, since the default instance type 'Standard_D8s_v3' supports availability zones. In fact, the master instances are not created in any zone.
$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410

How reproducible:

Always

Steps to Reproduce:

1. CAPI-based install on azure platform with default configuration
2. 
3.

Actual results:

master instances are created but not in any zone.

Expected results:

Master instances should be created per zone based on the selected instance type, keeping the same behavior as the terraform-based install.

Additional info:

When setting zones under controlPlane in install-config, master instances can be created per zone.
install-config:
===========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]

$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3

 

This is a clone of issue OCPBUGS-38217. The following is the description of the original issue:

Description of problem:

    After changing the LB type from CLB to NLB, "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" is still there, but if a new NLB ingresscontroller is created, "classicLoadBalancer" does not appear.

// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:                   <<<< 
  connectionIdleTimeout: 0s            <<<<
networkLoadBalancer: {}
type: NLB

// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB



Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-08-08-013133

How reproducible:

    100%

Steps to Reproduce:

    1. changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'

    2. create new ingresscontroller with NLB
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService

    3. check both ingresscontrollers status
    

Actual results:

// after changing default ingresscontroller to NLB 
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
 
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
 

Expected results:

    If type=NLB, then "classicLoadBalancer" should not appear in the status. and the status part should keep consistent whatever changing ingresscontroller to NLB or creating new one with NLB. 

Additional info:

    

This is a clone of issue OCPBUGS-39420. The following is the description of the original issue:

Description of problem:

ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:

  • A 4.15.28 hostedcluster with
  • A 4.15.28 nodepool
  • A 4.15.25 nodepool

Version-Release number of selected component (if applicable):

Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool

How reproducible:

100%    

Steps to Reproduce:

    1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28)
    2. Create an additional nodepool at a different version (4.15.25)
    

Actual results:

Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28)

❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8                                                                                    
NAME                     CLUSTER       DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
mshen-hyper-np-4-15-25   mshen-hyper   1               1               False         True         4.15.25   False             False            
mshen-hyper-workers      mshen-hyper   2               2               False         True         4.15.28   False             False  


❯ k get no -owide                                            
NAME                                         STATUS   ROLES    AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-129-139.us-west-2.compute.internal   Ready    worker   24m   v1.28.12+396c881   10.0.129.139   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-129-165.us-west-2.compute.internal   Ready    worker   98s   v1.28.12+396c881   10.0.129.165   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-132-50.us-west-2.compute.internal    Ready    worker   30m   v1.28.12+396c881   10.0.132.50    <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9

Expected results:

    

Additional info:

 

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/115

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

    Updating secrets using the form editor displays an unknown warning message. This is caused by an incorrect request object sent to the server in the edit Secret form.

Slack: https://redhat-internal.slack.com/archives/C6A3NV5J9/p1715693990795919?thread_ts=1715685364.476189&cid=C6A3NV5J9

 

 

Description of problem:

    

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Go to the Edit Secret form editor
    2. Click Save
    The warning notification is triggered because of the incorrect request object
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The console always sends GET CSV requests to the 'openshift' namespace even when copiedCSV is not disabled.

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-15-001800    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an operator into a specific namespace on the cluster (Operator will be available in a single Namespace only), for example, subscribe APIcast into project 'test'
    2. Check if copiedCSV is disabled
    >> window.SERVER_FLAGS.copiedCSVsDisabled 
    <- false      // copiedCSV is NOT disabled

$ oc get olmconfig cluster -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2024-05-15T23:16:50Z",
    "message": "Copied CSVs are enabled and present across the cluster",
    "reason": "CopiedCSVsEnabled",
    "status": "False",
    "type": "DisabledCopiedCSVs"
  }
]
     3. Monitor the browser console errors when checking operator details via Operators -> Installed Operators -> click on the APIcast operator
    

Actual results:

3. we can see a GET CSV request to 'openshift' namespace is sent and 404 was returned
GET
https://console-openshift-console.apps.qe-daily-416-0516.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/openshift/clusterserviceversions/apicast-community-operator.v0.7.1    

Expected results:

3. copiedCSV is not disabled, so we probably should not send a request to query CSVs from the 'openshift' namespace

Additional info:

 

Description of problem:

    When deploying nodepools on OpenStack, the Nodepool condition complains about unsupported amd64 while we actually support it.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/network-tools/pull/129

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

https://github.com/openshift/console/pull/13769 removed the bulk of packages/kubevirt-plugin, but it orphaned packages/kubevirt-plugin/locales/* as those files were added after 13769 was authored.

Description of problem:

In the use case when worker nodes require a proxy for outside access and the control plane is external (and only accessible via the internet), ovnkube-node pods never become available because the ovnkube-controller container cannot reach the Kube APIServer.

Version-Release number of selected component (if applicable):

How reproducible: Always

Steps to Reproduce:

1. Create an AWS hosted cluster with Public access and requires a proxy to access the internet.

2. Wait for nodes to become active

Actual results:

Nodes join cluster, but never become active

Expected results:

Nodes join cluster and become active

This is a clone of issue OCPBUGS-38701. The following is the description of the original issue:

Description of problem:

The 'Clear all filters' button is counted as part of the resource type count

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-08-19-002129    

How reproducible:

Always    

Steps to Reproduce:

    1. navigate to Home -> Events page, choose 3 resource types, check what's shown on page
    2. navigate to Home -> Search page, choose 3 resource types, check what's shown on page. Choose 4 resource types and check what's shown    

Actual results:

1. It shows `1 more`, but only the clear all button is shown if we click on the `1 more` button
2. The `1 more` button is only displayed when 4 resource types are selected; this is working as expected

Expected results:

1. The clear all button should not be counted in the resource number; the 'N more' count should reflect the correct number of resource types

Additional info:

    

Description of problem:

Power VS endpoint validation in the API only allows for lower case characters. However, the endpoint struct we check against is hardcoded to be PascalCase. We need to check against the lower case version of the string in the image registry operator.
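
A minimal sketch of the kind of case-insensitive comparison the image registry operator could do instead (the function name is illustrative; strings.EqualFold is the point):

// endpointMatches reports whether a user-provided Power VS endpoint name
// matches a known endpoint name, ignoring case.
func endpointMatches(configured, known string) bool {
	return strings.EqualFold(configured, known)
}

// e.g. endpointMatches("resourcecontroller", "ResourceController") == true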

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After upgrading from 4.12 to 4.14, the customer reports that the pods cannot reach their service when a NetworkAttachmentDefinition is set.

How reproducible:

    Create a NetworkAttachmentDefinition

Steps to Reproduce:

    1.Create a pod with a service.
    2. Curl the service from inside the pod. Works.
    3. Create a NetworkAttachmentDefinition.
    4. The same curl does not work     

Actual results:

Pod does not reach service    

Expected results:

Pod reaches service 

Additional info:

    Specifically updating the bug overview for posterity here, but the specific issue is that we have pods set up with an exposed port (8080 - the port doesn't matter) and a service with 1 endpoint pointing to that specific pod. We can call OTHER PODS in the same namespace via their single-endpoint service, but we cannot call OURSELVES from inside the pod.

The issue is with the hairpin loopback return path. It is not affected by NetworkPolicy and appears to be an issue with (as discovered later in this Jira) asymmetric routing in the return path to the container after it leaves the local net.

This behavior is only observed when a network-attachment-definition is added to the pod and appears to be an issue with the way route rules are defined.

A workaround is available: inject a specific route into the container, or modify the net-attach-def to ensure a loopback route is available in the container's network namespace.

KCS for this problem with workarounds + patch fix versions (when available): https://access.redhat.com/solutions/7084866 

This is a clone of issue OCPBUGS-4466. The following is the description of the original issue:

Description of problem:

Deploying a compact 3-node cluster on GCP, by setting mastersSchedulable to true and removing the worker machineset YAMLs, results in a panic.

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install 4.13.0-0.nightly-2022-12-04-194803
built from commit cc689a21044a76020b82902056c55d2002e454bd
release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. create manifests
2. set 'spec.mastersSchedulable' as 'true', in <installation dir>/manifests/cluster-scheduler-02-config.yml
3. remove the worker machineset YAML file from <installation dir>/openshift directory
4. create cluster 

Actual results:

Got "panic: runtime error: index out of range [0] with length 0".

Expected results:

The installation should succeed, or give clear error messages.

Additional info:

$ openshift-install version
openshift-install 4.13.0-0.nightly-2022-12-04-194803
built from commit cc689a21044a76020b82902056c55d2002e454bd
release image registry.ci.openshift.org/ocp/release@sha256:9e61cdf7bd13b758343a3ba762cdea301f9b687737d77ef912c6788cbd6a67ea
release architecture amd64
$ 
$ openshift-install create manifests --dir test1
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-1205a
? Pull Secret [? for help] ******
INFO Manifests created in: test1/manifests and test1/openshift 
$ 
$ vim test1/manifests/cluster-scheduler-02-config.yml
$ yq-3.3.0 r test1/manifests/cluster-scheduler-02-config.yml spec.mastersSchedulable
true
$ 
$ rm -f test1/openshift/99_openshift-cluster-api_worker-machineset-?.yaml
$ 
$ tree test1
test1
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 26 files
$ 
$ openshift-install create cluster --dir test1
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Common Manifests from target directory 
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/gcp.TFVars({{{0xc000cf6a40, 0xc}, {0x0, 0x0}, {0xc0011d4a80, 0x91d}}, 0x1, 0x1, {0xc0010abda0, 0x58}, ...})
        /go/src/github.com/openshift/installer/pkg/tfvars/gcp/gcp.go:70 +0x66f
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1daff070, 0xc000cef530?)
        /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:479 +0x6bf8
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c78870, {0x1a777f40, 0x1daff070}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffc4c21413b?, {0x1a777f40, 0x1daff070}, {0x1dadc7e0, 0x8, 0x8})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffc4c21413b, 0x5})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:259 +0x125
main.runTargetCmd.func2(0x1dae27a0?, {0xc000c702c0?, 0x2?, 0x2?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:289 +0xe7
github.com/spf13/cobra.(*Command).execute(0x1dae27a0, {0xc000c70280, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3a500)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
$ 

 

 

Description of problem:

In debugging recent cyclictest issues on OCP 4.16 (5.14.0-427.22.1.el9_4.x86_64+rt kernel), we have discovered that the "psi=1" kernel cmdline argument, which is now added by default due to cgroupsv2 being enabled, is causing latency issues (both cyclictest and timerlat are failing to meet the latency KPIs we commit to for Telco RAN DU deployments). See RHEL-42737 for reference.

Version-Release number of selected component (if applicable):

OCP 4.16

How reproducible:

Cyclictest and timerlat consistently fail on long duration runs (e.g. 12 hours).

Steps to Reproduce:

    1. Install OCP 4.16 and configure with the Telco RAN DU reference configuration.
    2. Run a long duration cyclictest or timerlat test    

Actual results:

Maximum latencies are detected above 20us.

Expected results:

All latencies are below 20us.

Additional info:

See RHEL-42737 for test results and debugging information. This was originally suspected to be an RHEL issue, but it turns out that PSI is being enabled by OpenShift code (which adds psi=1 to the kernel cmdline).
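A quick way to confirm whether PSI is active on an affected node is to check the kernel command line and the pressure interface (a hedged sketch; the node name is a placeholder):

$ oc debug -q node/<rt-worker-node> -- chroot /host grep -o 'psi=1' /proc/cmdline
$ oc debug -q node/<rt-worker-node> -- chroot /host cat /proc/pressure/cpu

If /proc/pressure/cpu exists and returns data, PSI accounting is enabled on that node.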

This is a clone of issue OCPBUGS-39398. The following is the description of the original issue:

Description of problem:

    When the console is loaded, there are errors in the browser's console about failing to fetch networking-console-plugin locales.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    The issue is also affecting console CI

This is a clone of issue OCPBUGS-39222. The following is the description of the original issue:

The on-prem-resolv-prepender.path unit is enabled in UPI setups when it should only run for IPI
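A hedged way to confirm the unit state on a UPI node (the node name is a placeholder; on UPI the unit should report disabled/inactive):

$ oc debug -q node/<upi-node> -- chroot /host systemctl is-enabled on-prem-resolv-prepender.path
$ oc debug -q node/<upi-node> -- chroot /host systemctl status on-prem-resolv-prepender.path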

Description of problem:

RWOP accessMode is a tech preview feature starting from OCP 4.14 and GA in 4.16, but on the OCP console UI there is no option available for creating a PVC with RWOP accessMode

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Login to OCP console in Administrator mode (4.14/4.15/4.16)
    2. Go to 'Storage -> PersistentVolumeClaim -> Click on Create PersistentVolumeClaim' 
    3. Check under 'Access Mode*', RWOP option is not present
    

Actual results:

    RWOP accessMode option is not present

Expected results:

    RWOP accessMode option is present

Additional info:

Storage feature: https://issues.redhat.com/browse/STOR-1171
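As a CLI workaround until the console exposes the option, a PVC with RWOP can still be created directly (a minimal hedged sketch; the name, size, and storage class are placeholders):

$ cat << EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwop-pvc
spec:
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi
  storageClassName: <storage-class>
EOF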

Description of problem:

AWS VPCs support a primary CIDR range and multiple secondary CIDR ranges: https://aws.amazon.com/about-aws/whats-new/2017/08/amazon-virtual-private-cloud-vpc-now-allows-customers-to-expand-their-existing-vpcs/ 

Let's pretend a VPC exists with:

  • Primary CIDR range: 10.0.0.0/24 (subnet-a)
  • Secondary CIDR range: 10.1.0.0/24 (subnet-b)

and a hostedcontrolplane object like:

  networking:
...
    machineNetwork:
    - cidr: 10.1.0.0/24
...
  olmCatalogPlacement: management
  platform:
    aws:
      cloudProviderConfig:
        subnet:
          id: subnet-b
        vpc: vpc-069a93c6654464f03

Even though all EC2 instances will be spun up in subnet-b (10.1.0.0/24), CPO will detect the CIDR range of the VPC as 10.0.0.0/24 (https://github.com/openshift/hypershift/blob/0d10c822912ed1af924e58ccb8577d2bb1fd68be/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go#L4755-L4765) and create security group rules only allowing inbound traffic from 10.0.0.0/24. This specifically prevents these EC2 instances from communicating with the VPC Endpoint created by the awsendpointservice CR and reaching the hosted control plane pods.

Version-Release number of selected component (if applicable):

    Reproduced on a 4.14.20 ROSA HCP cluster, but the version should not matter

How reproducible:

100%    

Steps to Reproduce:

    1. Create a VPC with at least one secondary CIDR block
    2. Install a ROSA HCP cluster providing the secondary CIDR block as the machine CIDR range and selecting the appropriate subnets within the secondary CIDR range   

Actual results:

* Observe that the default security group contains inbound security group rules allowing traffic from the VPC's primary CIDR block (not a CIDR range containing the cluster's worker nodes)

* As a result, the EC2 instances (worker nodes) fail to reach the ignition-server

Expected results:

The EC2 instances are able to reach the ignition-server and HCP pods

Additional info:

This bug seems like it could be fixed by using the machine CIDR range for the security group instead of the VPC CIDR range. Alternatively, we could duplicate rules for every secondary CIDR block, but the default AWS quota is 60 inbound security group rules/security group, so it's another failure condition to keep in mind if we go that route.
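To confirm the mismatch on an affected cluster, the inbound rules of the default security group can be compared against the VPC CIDR associations shown below (a hedged sketch; the security group ID is a placeholder):

$ aws ec2 describe-security-groups --region us-east-2 --group-ids <default-worker-sg-id> \
    --query 'SecurityGroups[].IpPermissions[].IpRanges[].CidrIp'

In the failing case this lists only the primary CIDR (10.0.0.0/24), while the worker nodes live in 10.1.0.0/24.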

 

aws ec2 describe-vpcs output for a VPC with secondary CIDR blocks:    

❯ aws ec2 describe-vpcs --region us-east-2 --vpc-id vpc-069a93c6654464f03
{
    "Vpcs": [
        {
            "CidrBlock": "10.0.0.0/24",
            "DhcpOptionsId": "dopt-0d1f92b25d3efea4f",
            "State": "available",
            "VpcId": "vpc-069a93c6654464f03",
            "OwnerId": "429297027867",
            "InstanceTenancy": "default",
            "CidrBlockAssociationSet": [
                {
                    "AssociationId": "vpc-cidr-assoc-0abbc75ac8154b645",
                    "CidrBlock": "10.0.0.0/24",
                    "CidrBlockState": {
                        "State": "associated"
                    }
                },
                {
                    "AssociationId": "vpc-cidr-assoc-098fbccc85aa24acf",
                    "CidrBlock": "10.1.0.0/24",
                    "CidrBlockState": {
                        "State": "associated"
                    }
                }
            ],
            "IsDefault": false,
            "Tags": [
                {
                    "Key": "Name",
                    "Value": "test"
                }
            ]
        }
    ]
}

Please review the following PR: https://github.com/openshift/images/pull/185

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:
To support different auth providers (SSO via OIDC), we needed to remove the ownerReference from the ConfigMap, Role, and Rolebinding we create for each user to store the user settings.

Keeping these resources after the user is deleted might degrade overall cluster performance, especially on Dev Sandbox, where users are automatically removed every month.

We should make it easier to understand which user created these resources. This will help the Dev Sandbox team and maybe other customers in the future.

Version-Release number of selected component (if applicable):
4.15+

How reproducible:
Always when a user is deleted

Steps to Reproduce:

  1. Create a cluster with some developer user accounts
  2. Log in as one of the users
  3. Login again as kubeadmin and delete the User CR in the openshift-console-user-settings namespace

Actual results:
The user settings ConfigMap, Role, and RoleBinding in the same namespace aren't deleted and can only be found via the user uid, which we might no longer know since the User CR is already deleted.

Expected results:
The user settings ConfigMap, Role, and RoleBinding should also have a label or annotation referencing the user who created these resources.

See also https://github.com/openshift/console/issues/13696

For example:

metadata:
  labels:
    console.openshift.io/user-settings: "true"
    console.openshift.io/user-settings-username: "" # escaped if the username contains characters that are not valid as label-value
    console.openshift.io/user-settings-uid: "..." # only if available
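Assuming the labels proposed above are added, orphaned user-settings resources could then be located by username rather than by uid (a hedged sketch; the username value is a placeholder):

$ oc get configmap,role,rolebinding -n openshift-console-user-settings \
    -l console.openshift.io/user-settings-username=<username>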

Additional info:

Description of problem:

Add networking-console-plugin image to CNO as an env var, so the hosted CNO can fetch the image to deploy it on the hosted cluster.
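A hedged sketch for inspecting which image environment variables the CNO deployment currently carries (the exact variable name used for networking-console-plugin is not specified here, and on HyperShift the deployment lives in the hosted control plane namespace instead):

$ oc set env deployment/network-operator -n openshift-network-operator --list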

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

release-4.17 of openshift/cluster-api-provider-openstack is missing some commits that were backported in upstream project into the release-0.10 branch.
We should import them in our downstream fork.

4.17.0-0.nightly-2024-08-05-063006 failed in part due to aws-ovn-upgrade-4.17-micro aggregation failures

Caused by

{Zero successful runs, we require at least one success to pass  (P70=3.00s failures=[1820347983690993664=56s 1820348008856817664=80s 1820347977814773760=46s 1820347658217197568=52s 1820347967748444160=70s 1820347998836625408=52s 1820347972789997568=38s 1820347993786683392=52s 1820347988728352768=72s 1820347962715279360=80s 1820348003832041472=76s])  name: kube-api-http1-localhost-new-connections disruption P70 should not be worse
Failed: Mean disruption of openshift-api-http2-localhost-new-connections is 70.20 seconds is more than the failureThreshold of the weekly historical mean from 10 days ago: historicalMean=0.00s standardDeviation=0.00s failureThreshold=1.00s historicalP95=0.00s successes=[] failures=[1820347983690993664=68s 1820348008856817664=96s 1820347972789997568=48s 1820347658217197568=62s 1820348003832041472=88s 1820347962715279360=94s 1820347998836625408=62s 1820347993786683392=60s 1820347988728352768=86s 1820347977814773760=54s 1820347967748444160=80s]  name: openshift-api-http2-localhost-new-connections mean disruption should be less
  than historical plus five standard deviations
testsuitename: aggregated-disruption

Additionally we are seeing failures on azure upgrades that show large disruption during etcd-operator updating

Opening this bug to investigate and run payload tests against to rule it out.

Description of problem:

    The valid values for installconfig.platform.vsphere.diskType are thin, thick, and eagerZeroedThick. But no matter whether diskType is set to thick or eagerZeroedThick, the disk actually provisioned is thin.

govc vm.info --json /DEVQEdatacenter/vm/wwei-511d-gtbqd/wwei-511d-gtbqd-master-1 | jq -r .VirtualMachines[].Layout.Disk[].DiskFile[]
[vsanDatastore] e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk
[fedora@preserve-wwei ~]$ govc datastore.disk.info -ds /DEVQEdatacenter/datastore/vsanDatastore e7323f66-86ef-9947-a2b9-507c6f3b795c/wwei-511d-gtbqd-master-1.vmdk | grep Type
  Type:    thin

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-07-025557

How reproducible:

    Always, when setting installconfig.platform.vsphere.diskType to thick or eagerZeroedThick and continuing the installation.

Steps to Reproduce:

    1. Set installconfig.platform.vsphere.diskType to thick or eagerZeroedThick
    2. Continue the installation

Actual results:

    The actual disk type is thin even when install-config sets diskType: thick/eagerZeroedThick

Expected results:

    The check result for disk info should match the setting in install-config

Additional info:

    

The issue was observed during testing of the k8s 1.30 rebase, in which the webhook client started using http2 for loopback IPs: kubernetes/kubernetes#122558.
It looks like the issue is caused by how an http2 client handles this invalid address; I verified this change by setting up a cluster with openshift/kubernetes#1953 and this PR.

Description of problem:

When a must-gather creation fails in the middle, the clusterrolebindings created for the must-gather remain in the cluster.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Run a must-gather command: `oc adm must-gather`
    2. Interrupt the must-gather creation
    3. Search for the clusterrolebinding: `oc get clusterrolebinding | grep -i must`
    4. Try deleting the must-gather namespace
    5. Search for the clusterrolebinding again: `oc get clusterrolebinding | grep -i must`

Actual results:

The clusterrolebindings created for the must-gather remain in the cluster

Expected results:

The clusterrolebindings created for the must-gather shouldn't remain in the cluster
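Until that is fixed, a hedged manual cleanup, assuming the leaked bindings keep the must-gather prefix matched by the grep above:

$ oc get clusterrolebinding -o name | grep must-gather | xargs -r oc delete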

Additional info:

    

Description of problem:

When we configure a userCA or a cloudCA, MCO adds those certificates to the ignition config and the nodes. Nevertheless, when we remove those certificates, MCO does not remove them from the nodes or the ignition config.

    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-01-24-133352   True        False         5h49m   Cluster version is 4.16.0-0.nightly-2024-01-24-133352

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a new certificate

$ openssl genrsa -out privateKey.pem 4096
$ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com"


    2. Configure a userCA
# Create the configmap with the certificate
$ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt
configmap/cm-test-cert created

#Configure the proxy with the new test certificate
$ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}'
proxy.config.openshift.io/cluster patched

    3. Configure a cloudCA
$ oc set data -n openshift-config ConfigMap cloud-provider-config  --from-file=ca-bundle.pem=ca-bundle.crt

    4. Check that the certificates have been added


$  oc debug -q  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt" 
$  oc debug -q  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem" 

    5. Remove the configured userCA and cloudCA certificates

$ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": ""}}}'


$ oc edit  -n openshift-config ConfigMap cloud-provider-config  ### REMOVE THE ca-bundle.pem KEY




    

Actual results:

    Even though we have removed the certificates from the cluster config those can be found in the nodes

$  oc debug -q  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt" 
$  oc debug -q  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem" 


    

Expected results:


The certificates should be removed from the nodes and the ignition config when they are removed from the cluster config
    

Additional info:

    

Description of problem:

When you pass in endpoints in install config:

    serviceEndpoints:
    - name: dnsservices
      url: https://api.dns-svcs.cloud.ibm.com
    - name: cos
      url: https://s3.us-south.cloud-object-storage.appdomain.cloud

You see the following error:

failed creating minimalPowerVS client Error: setenv: invalid argument
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Always
    

Steps to Reproduce:

    1. Install cluster
    2. Pass in above to install config yaml
    3.
    

Actual results:

Worker nodes aren't created.
    

Expected results:

Worker nodes are created.
    

Additional info:


    

This is a clone of issue OCPBUGS-37052. The following is the description of the original issue:

Description of problem:

This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing.

LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue. 
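The workaround mentioned above is to add the LDAP host to the cluster-wide noProxy list (a hedged sketch; the hostname is a placeholder, and a merge patch replaces the whole noProxy value, so existing entries must be preserved when applying this for real):

$ oc patch proxy/cluster --type merge -p '{"spec":{"noProxy":"ldap.example.com"}}'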

Version-Release number of selected component (if applicable):

4.15.11     

How reproducible:

    

Steps to Reproduce:

 (From the customer)   
    1. Configure LDAP IDP
    2. Configure Proxy
    3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
    

Actual results:

    LDAP IDP communication from the control plane oauth pod goes through proxy 

Expected results:

    LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings

Additional info:

For more information, see linked tickets.    

Description of problem:

When application grouping is unchecked in the display filters under the expand section, the topology display is distorted and the application name is also missing.
    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Have some deployments
    2. In topology unselect the application grouping in the display filter 
    3.
    

Actual results:

Topology shows distorted UI and Application name is missing.
    

Expected results:

The UI should render correctly and the application name should be present.
    

Additional info:

Screenshot:     

https://drive.google.com/file/d/1z80qLrr5v-K8ZFDa3P-n7SoDMaFtuxI7/view?usp=sharing

This is a clone of issue OCPBUGS-38599. The following is the description of the original issue:

Description of problem:

If folder is undefined and the datacenter exists in a datacenter-based folder, the installer will create the entire path of folders from the root of vCenter, which is incorrect.

This does not occur if folder is defined.

An upstream bug was identified when debugging this:

https://github.com/vmware/govmomi/issues/3523

Some AWS installs are failing to bootstrap due to an issue where CAPA may fail to create load balancer resources, but still declare that infrastructure is ready (see upstream issue for more details).

In these cases, load balancers are failing to be created due to either rate limiting:

 

time="2024-05-25T21:43:07Z" level=debug msg="E0525 21:43:07.975223     356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<"
time="2024-05-25T21:43:07Z" level=debug msg="\t[failed to modify target group attribute: Throttling: Rate exceeded" 

or in some cases another error:

time="2024-06-01T06:43:58Z" level=debug msg="E0601 06:43:58.902534     356 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<"
time="2024-06-01T06:43:58Z" level=debug msg="\t[failed to apply security groups to load balancer \"ci-op-jnqi01di-5feef-92njc-int\": ValidationError: A load balancer ARN must be specified"
time="2024-06-01T06:43:58Z" level=debug msg="\t\tstatus code: 400, request id: 77446593-03d2-40e9-93c0-101590d150c6, failed to create target group for load balancer: DuplicateTargetGroupName: A target group with the same name 'apiserver-target-1717224237' exists, but with different settings" 

We have an upstream PR in progress to retry the reconcile logic for load balancers.

 

Original component readiness report below.

=====

Component Readiness has found a potential regression in install should succeed: cluster bootstrap.

There is no significant evidence of regression

Sample (being evaluated) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-03T23:59:59Z
Success Rate: 96.60%
Successes: 227
Failures: 8
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 99.87%
Successes: 767
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=Installer%20%2F%20openshift-installer&confidence=95&environment=ovn%20no-upgrade%20amd64%20aws%20standard&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=aws&sampleEndTime=2024-06-03%2023%3A59%3A59&sampleRelease=4.16&sampleStartTime=2024-05-28%2000%3A00%3A00&testId=cluster%20install%3A6ce515c7c732a322333427bf4f5508a5&testName=install%20should%20succeed%3A%20cluster%20bootstrap&upgrade=no-upgrade&variant=standard

This is a clone of issue OCPBUGS-35036. The following is the description of the original issue:

Description of problem:

The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log

    2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186       1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference
2024-06-06T15:57:06.628216833Z goroutine 192790 [running]:
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5
2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?})
2024-06-06T15:57:06.628216833Z  runtime/panic.go:914 +0x21f
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?)
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84

Version-Release number of selected component (if applicable):

We applied this to all clusters in CI and checked 3 of them; all 3 share the same errors.

oc --context build09 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.3   True        False         3d9h    Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.2   True        False         15d     Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded

oc --context build03 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.16   True        False         34h     Error while reconciling 4.15.16: the cluster operator machine-config is degraded

How reproducible:

We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.

It breaks at least 3 of them.

"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml

Additional info:

There are lots of image imports in OpenShift CI jobs.

It feels like the registry cache server returns unexpected results to the openshift-apiserver:

2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459       1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input 

Our theory is that the image import requests from all CI clusters crashed the cache server, and it sent some unexpected data which caused the apiserver to panic.

 

The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, then it will be failed over to the next one.
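For reference, a minimal ImageDigestMirrorSet with two mirrors (a hedged sketch; registry hostnames are placeholders). With this in place the runtime is expected to try mirror-a first, fall back to mirror-b, and finally the source:

$ cat << EOF | oc apply -f -
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - source: registry.example.com/team/app
    mirrors:
    - mirror-a.example.com/team/app
    - mirror-b.example.com/team/app
EOF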

This is a clone of issue OCPBUGS-38775. The following is the description of the original issue:

Description of problem:

see from screen recording https://drive.google.com/file/d/1LwNdyISRmQqa8taup3nfLRqYBEXzH_YH/view?usp=sharing

dev console, "Observe -> Metrics" tab: when typing in the query-browser input text area, the cursor focus jumps to the project drop-down list. This issue exists in 4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129; there is no such issue with the admin console.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129

How reproducible:

always

Steps to Reproduce:

1. see the description
    

Actual results:

The cursor focus moves to the project drop-down

Expected results:

cursor should not move

Additional info:

    

This is a clone of issue OCPBUGS-36236. The following is the description of the original issue:

Description of problem:

    The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%, dependent on the order of subnets returned by the IBM Cloud APIs, however

Steps to Reproduce:

    1. Create 50+ IBM Cloud VPC Subnets
    2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml
    3. Attempt to create manifests (openshift-install create manifests)
    

Actual results:

    ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]

Expected results:

    Successful manifests and cluster creation

Additional info:

    IBM Cloud is working on a fix

Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/563

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-39189. The following is the description of the original issue:

Expected results:

networking-console-plugin deployment has the required-scc annotation   

Additional info:

The deployment does not have any annotation about it   

CI warning

# [sig-auth] all workloads in ns/openshift-network-console must set the 'openshift.io/required-scc' annotation
annotation missing from pod 'networking-console-plugin-7c55b7546c-kc6db' (owners: replicaset/networking-console-plugin-7c55b7546c); suggested required-scc: 'restricted-v2'
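A hedged check for whether the annotation is present on the pod template (namespace and deployment name taken from the CI warning above):

$ oc get deployment networking-console-plugin -n openshift-network-console \
    -o jsonpath='{.spec.template.metadata.annotations}'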

This is a clone of issue OCPBUGS-41270. The following is the description of the original issue:

Component Readiness has found a potential regression in the following test:

[sig-network] pods should successfully create sandboxes by adding pod to network

Probability of significant regression: 96.41%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-27T00:00:00Z
End Time: 2024-09-03T23:59:59Z
Success Rate: 88.37%
Successes: 26
Failures: 5
Flakes: 12

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.46%
Successes: 43
Failures: 1
Flakes: 21

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=minor&Upgrade=minor&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20unknown%20ha%20minor&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Ametal&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-03%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-27%2000%3A00%3A00&testId=openshift-tests-upgrade%3A65e48733eb0b6115134b2b8c6a365f16&testName=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network

 

Here is an example run

We see the following signature for the failure:

 

namespace/openshift-etcd node/master-0 pod/revision-pruner-11-master-0 hmsg/b90fda805a - 111.86 seconds after deletion - firstTimestamp/2024-09-02T13:14:37Z interesting/true lastTimestamp/2024-09-02T13:14:37Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-11-master-0_openshift-etcd_08346d8f-7d22-4d70-ab40-538a67e21e3c_0(d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57): error adding pod openshift-etcd_revision-pruner-11-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57" Netns:"/var/run/netns/97dc5eb9-19da-462f-8b2e-c301cfd7f3cf" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=revision-pruner-11-master-0;K8S_POD_INFRA_CONTAINER_ID=d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57;K8S_POD_UID=08346d8f-7d22-4d70-ab40-538a67e21e3c" Path:"" ERRORED: error configuring pod [openshift-etcd/revision-pruner-11-master-0] networking: Multus: [openshift-etcd/revision-pruner-11-master-0/08346d8f-7d22-4d70-ab40-538a67e21e3c]: error waiting for pod: pod "revision-pruner-11-master-0" not found  

 

The same signature has been reported for both azure and x390x as well.

 

It is worth mentioning that the sdn to ovn transition adds some complication to our analysis. From the component readiness above, you will see most of the failures are for the job periodic-ci-openshift-release-master-nightly-X.X-upgrade-from-stable-X.X-e2e-metal-ipi-ovn-upgrade. This is a new job for 4.17 and therefore misses base stats in 4.16.

 

So we ask for:

  1. An analysis of the root cause and impact of this issue
  2. Team can compare relevant 4.16 sdn jobs to see if this is really a regression.
  3. Given the current passing rate of 88%, what priority should we give to this?
  4. Since this is affecting component readiness, and management depends on a green dashboard for release decision, we need to figure out what is the best approach for handling this issue.  

 

Description of problem:

The ovnkube-sbdb route removal is missing a management cluster capabilities check and thus fails on a Kubernetes-based management cluster.

Version-Release number of selected component (if applicable):

4.15.z, 4.16.0, 4.17.0

How reproducible:

Always

Steps to Reproduce:

Deploy an OpenShift version 4.16.0-rc.6 cluster control plane using HyperShift on a Kubernetes based management cluster. 

Actual results:

Cluster control plane deployment fails because the cluster-network-operator pod is stuck in Init state due to the following error:

{"level":"error","ts":"2024-06-19T20:51:37Z","msg":"Reconciler error","controller":"hostedcontrolplane","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedControlPlane","HostedControlPlane":{"name":"cppjslm10715curja3qg","namespace":"master-cppjslm10715curja3qg"},"namespace":"master-cppjslm10715curja3qg","name":"cppjslm10715curja3qg","reconcileID":"037842e8-82ea-4f6e-bf28-deb63abc9f22","error":"failed to update control plane: failed to reconcile cluster network operator: failed to clean up ovnkube-sbdb route: error getting *v1.Route: no matches for kind \"Route\" in version \"route.openshift.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
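The missing capability check can be illustrated directly on the management cluster (a hedged check, not part of the fix):

$ kubectl api-resources --api-group=route.openshift.io

On OpenShift this lists the Route resource; on a plain Kubernetes management cluster it returns nothing, which is why the unguarded Route cleanup fails with the error above.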

Expected results:

Cluster control plane deployment succeeds.

Additional info:

https://ibm-argonauts.slack.com/archives/C01C8502FMM/p1718832205747529
openshift-install version
/root/installer/bin/openshift-install 4.17.0-0.nightly-2024-07-16-033047
built from commit 8b7d5c6fe26a70eafc47a142666b90ed6081159e
release image registry.ci.openshift.org/ocp/release@sha256:afb704dd7ab8e141c56f1da15ce674456f45c7767417e625f96a6619989e362b
release architecture amd64

openshift-install image-based create image --dir tt                                                                                                      
panic: runtime error: invalid memory address or nil pointer dereference                                                                                                                       
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x5a2ed8e] 

~/installer/bin/openshift-install image-based create image --dir tt
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x5a2ed8e]

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/imagebased/image.(*RegistriesConf).Generate(0xc00150a000, {0x5?, 0x81d2708?}, 0xc0014f8d80)
        /go/src/github.com/openshift/installer/pkg/asset/imagebased/image/registriesconf.go:38 +0x6e
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x275a7900, 0xc00150a000}, {0xc00146cdb0, 0x4})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:227 +0x6ec
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x275a7990, 0xc000161610}, {0x8189983, 0x2})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:221 +0x54c
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc0014f8270, {0x275ca770, 0xc0014b5220}, {0x7fce0489ad20, 0x2bc3eea0}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:221 +0x54c
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0xc0014f8270, {0x275ca770?, 0xc0014b5220?}, {0x7fce0489ad20, 0x2bc3eea0}, {0x2bb33070, 0x1, 0x1})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x4e
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc000aa8640, {0x275ca770, 0xc0014b5220}, {0x2bb33070, 0x1, 0x1})
        /go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x16b
main.newImageBasedCreateCmd.runTargetCmd.func3({0x7ffdb9f8638c?, 0x2?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:306 +0x6a
main.newImageBasedCreateCmd.runTargetCmd.func4(0x2bc00100, {0xc00147fd00?, 0x4?, 0x818b15a?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:320 +0x102
github.com/spf13/cobra.(*Command).execute(0x2bc00100, {0xc00147fcc0, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc000879508)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1039
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:67 +0x3c6
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:39 +0x168

Description of problem:

With the release of 4.16, the prometheus adapter [0] is deprecated and there is a new alert [1], ClusterMonitoringOperatorDeprecatedConfig; there need to be better details on how this alert can be handled, which will reduce the number of support cases.

[0] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-prometheus-adapter-removed
[1] https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html#ocp-4-16-monitoring-changes-to-alerting-rules
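A hedged starting point when the alert fires is to look for deprecated fields in the monitoring configuration, for example a leftover k8sPrometheusAdapter stanza (assuming that is what triggered the alert and that the configmap exists in the cluster):

$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml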

Version-Release number of selected component (if applicable):

4.16

How reproducible:

NA  

Steps to Reproduce:

NA

Actual results:

As per the current configuration, the alert does not come with much clarifying information.

Expected results:

  more information should be provided on how to fix the alert.

Additional info:

As per the discussion, a runbook will be added, which will help in better understanding the alert

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/542

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    According to https://github.com/openshift/enhancements/pull/1502 all managed TLS artifacts (secrets, configmaps and files on disk) should have clear ownership and other necessary metadata

`metal3-ironic-tls` is created by cluster-baremetal-operator but doesn't have ownership annotation
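A hedged check for the missing metadata (assuming the secret lives in the openshift-machine-api namespace alongside the other metal3 resources):

$ oc get secret metal3-ironic-tls -n openshift-machine-api -o jsonpath='{.metadata.annotations}'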

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

Under config.interfaces, additional config now has to be added ONLY for vlan interfaces.
Example:
spec:
  config:
    interfaces:
      - name: "eth0"
        type: ethernet
        state: up
        mac-address: "52:54:00:0A:86:94"

We didn't have to add this in 4.15, and in 4.16, for example, a bond interface still doesn't need this config and passes the deployment.
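For reference, a vlan entry in the same spec.config.interfaces list would typically carry its base interface explicitly (a hedged sketch; the interface names and VLAN ID are placeholders):

      - name: eth0.100
        type: vlan
        state: up
        vlan:
          base-iface: eth0
          id: 100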

How reproducible:

100%

Steps to reproduce:

1. Deploy a vlan spoke with a 4.15 nmstate config.

2.

3.

Actual results:

Deployment fails.

Expected results:
Deployment succeeds, or the documentation for 4.16 should be updated.

This is a clone of issue OCPBUGS-38288. The following is the description of the original issue:

Description of problem:

The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-39573. The following is the description of the original issue:

Description of problem:

Enabling the topology tests in CI
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/coredns/pull/119

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38226. The following is the description of the original issue:

Description of problem:

https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Component Readiness has found a potential regression in the following test:

operator conditions monitoring

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-07-17T00:00:00Z
End Time: 2024-07-23T23:59:59Z
Success Rate: 59.49%
Successes: 47
Failures: 32
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 147
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=vsphere&Platform=vsphere&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Monitoring&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20vsphere%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-07-23%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-07-17%2000%3A00%3A00&testId=Operator%20results%3A7e4c8db94dde9f957ea7d639cd29d6dd&testName=operator%20conditions%20monitoring

This test / pattern is actually showing up on various other variant combinations but the commonality is vsphere, so this test, and installs in general, are not going well on vsphere.

Error message:

operator conditions monitoring 	0s
{Operator unavailable (UpdatingPrometheusFailed): UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded  Operator unavailable (UpdatingPrometheusFailed): UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded}

From: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn/1815616622711279616

This is a clone of issue OCPBUGS-41920. The following is the description of the original issue:

Description of problem:

When we move one node from one custom MCP to another custom MCP, the MCPs are reporting a wrong number of nodes.

For example, we reach this situation (worker-perf MCP is not reporting the right number of nodes)

$ oc get mcp,nodes
NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4



After 20 minutes or half an hour the MCPs start reporting the right number of nodes

    

Version-Release number of selected component (if applicable):
IPI on AWS version:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101

    

How reproducible:
Always

    

Steps to Reproduce:

    1. Create a MCP
    
     oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,worker-perf]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf: ""
EOF

    
    2. Add 2 nodes to the MCP
    
   $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
   $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=

    3. Create another MCP
    oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,worker-perf,worker-perf-canary]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf-canary: ""
EOF

    3. Move one node from the MCP created in step 1 to the MCP created in step 3
    $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
    $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
    
    
    

Actual results:

The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP.
$ oc get mcp,nodes
NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4


    

Expected results:

MCPs should always report the right number of nodes
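A hedged way to spot the discrepancy while it happens is to compare the pool's reported machine count with the number of nodes actually carrying the role label:

$ oc get mcp worker-perf -o jsonpath='{.status.machineCount}'
$ oc get nodes -l node-role.kubernetes.io/worker-perf --no-headers | wc -l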
    

Additional info:

It is very similar to this other issue 
https://bugzilla.redhat.com/show_bug.cgi?id=2090436
That was discussed in this slack conversation
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
    

Description of problem:

The presubmit test that expects an inactive CPMS to be regenerated resets the state at the end of the test.
In doing so, it causes the CPMS generator to re-generate back to the original state.
Part of regeneration involves deleting and recreating the CPMS.

If the regeneration is not quick enough, the next part of the test can fail, as it is expecting the CPMS to exist.

We should change this to an eventually to avoid the race between the generator and the test.

See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-control-plane-machine-set-operator/304/pull-ci-openshift-cluster-control-plane-machine-set-operator-release-4.13-e2e-aws-operator/1801195115868327936 as an example failure
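A shell analogue of the polling the test needs (the real fix would wrap the assertion in an Eventually in the Go test; the resource and namespace names are the ones used by the operator):

$ until oc get controlplanemachineset/cluster -n openshift-machine-api &> /dev/null; do sleep 5; done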

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-42324. The following is the description of the original issue:

Description of problem:

This is a spinoff of https://issues.redhat.com/browse/OCPBUGS-38012. For additional context please see that bug.

The TLDR is that Restart=on-failure for oneshot units was only supported in systemd v244 and onwards, meaning any bootimage for 4.12 and earlier doesn't support this on firstboot, and upgraded clusters would no longer be able to scale nodes if they reference any such service.

Right now this is only https://github.com/openshift/machine-config-operator/blob/master/templates/common/openstack/units/afterburn-hostname.service.yaml#L16-L24 which isn't covered by https://issues.redhat.com/browse/OCPBUGS-38012
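A hedged way to confirm whether a node's boot image ships a new enough systemd for Restart=on-failure on oneshot units (the node name is a placeholder; the relevant threshold is v244 per the description above):

$ oc debug -q node/<node> -- chroot /host systemctl --version | head -1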

Version-Release number of selected component (if applicable):

4.16 right now

How reproducible:

Uncertain, but https://issues.redhat.com/browse/OCPBUGS-38012 is 100%

Steps to Reproduce:

    1. Install an old OpenStack cluster
    2. Upgrade to 4.16
    3. Attempt to scale a node
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    In PR https://github.com/openshift/console/pull/13676 we worked on improving the performance of the PipelineRun list page, and https://issues.redhat.com/browse/OCPBUGS-32631 was created to further improve the performance of the PLR list page. Once this is complete, we have to improve the performance of the Pipeline list page by considering the points below:

1. TaskRuns should not be fetched for all the PLRs.
2. Use pipelinerun.status.conditions.message to get the status of TaskRuns (see the sketch below).
3. For any PLR, if the string pipelinerun.status.conditions.message has data about the task statuses, use that string instead of fetching TaskRuns.
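A hedged sketch of reading the task summary directly from the PipelineRun condition instead of listing TaskRuns (the name and namespace are placeholders):

$ oc get pipelinerun <plr-name> -n <namespace> \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].message}'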

Description of problem:

`oc adm prune deployments` does not work and gives the error below when using the --replica-sets option.

    [root@weyb1525 ~]# oc adm prune deployments --orphans --keep-complete=1 --keep-failed=0 --keep-younger-than=1440m --replica-sets --v=6
I0603 09:55:39.588085 1540280 loader.go:373] Config loaded from file:  /root/openshift-install/paas-03.build.net.intra.laposte.fr/auth/kubeconfig
I0603 09:55:39.890672 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps.openshift.io/v1/deploymentconfigs 200 OK in 301 milliseconds
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
I0603 09:55:40.529367 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/deployments 200 OK in 65 milliseconds
I0603 09:55:41.369413 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/api/v1/replicationcontrollers 200 OK in 706 milliseconds
I0603 09:55:43.083804 1540280 round_trippers.go:553] GET https://api-int.paas-03.build.net.intra.laposte.fr:6443/apis/apps/v1/replicasets 200 OK in 118 milliseconds
I0603 09:55:43.320700 1540280 prune.go:58] Creating deployment pruner with keepYoungerThan=24h0m0s, orphans=true, replicaSets=true, keepComplete=1, keepFailed=0
Dry run enabled - no modifications will be made. Add --confirm to remove deployments
panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig

goroutine 1 [running]:
github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*dataSet).GetDeployment(0xc007fa9bc0, {0x5052780?, 0xc00a0b67b0?})
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/data.go:171 +0x3d6
github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*orphanReplicaResolver).Resolve(0xc006ec87f8)
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:78 +0x1a6
github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*mergeResolver).Resolve(0x55?)
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/resolvers.go:28 +0xcf
github.com/openshift/oc/pkg/cli/admin/prune/deployments.(*pruner).Prune(0x5007c40?, {0x50033e0, 0xc0083c19e0})
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/prune.go:96 +0x2f
github.com/openshift/oc/pkg/cli/admin/prune/deployments.PruneDeploymentsOptions.Run({0x0, 0x1, 0x1, 0x4e94914f0000, 0x1, 0x0, {0x0, 0x0}, {0x5002d00, 0xc000ba78c0}, ...})
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:206 +0xa03
github.com/openshift/oc/pkg/cli/admin/prune/deployments.NewCmdPruneDeployments.func1(0xc0005f4900?, {0xc0006db020?, 0x0?, 0x6?})
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/deployments/deployments.go:78 +0x118
github.com/spf13/cobra.(*Command).execute(0xc0005f4900, {0xc0006dafc0, 0x6, 0x6})
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000e5b800)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:992
k8s.io/component-base/cli.run(0xc000e5b800)
        /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:146 +0x317
k8s.io/component-base/cli.RunNoErrOutput(...)
        /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/cli/run.go:84
main.main()
        /go/src/github.com/openshift/oc/cmd/oc/oc.go:77 +0x365 
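The stack shows the unchecked type assertion happening in prune/deployments/data.go when the orphan replica resolver hands the data set a *v1.Deployment. A minimal sketch, not the actual oc fix, of the kind of type switch that avoids this class of panic (helper names are hypothetical):

```
// Sketch only; pruneTargetFor* are illustrative names, not the real oc code.
switch owner := item.(type) {
case *appsv1.DeploymentConfig: // apps.openshift.io/v1
	return pruneTargetForDeploymentConfig(owner)
case *kappsv1.Deployment: // apps/v1, reachable when pruning with --replica-sets
	return pruneTargetForDeployment(owner)
default:
	return nil, fmt.Errorf("unexpected type %T in deployment data set", item)
}
```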

Version-Release number of selected component (if applicable):

    

How reproducible:

   Run the oc adm prune deployments command with the --replica-sets option:
 #  oc adm prune deployments --keep-younger-than=168h --orphans --keep-complete=5 --keep-failed=1 --replica-sets=true

Actual results:

    It fails with the below error: panic: interface conversion: interface {} is *v1.Deployment, not *v1.DeploymentConfig

Expected results:

    It should not fail and should work as expected.

Additional info:

    Slack thread https://redhat-internal.slack.com/archives/CKJR6200N/p1717519017531979

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/705

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When attempting to install on a provider network on PSI, I get the following pre-flight validation error:

ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.openstack.controlPlanePort.network: Invalid value: "316eeb47-1498-46b4-b39e-00ddf73bd2a5": network must contain subnets

The network does contain one subnet.

install-config.yaml:

# yaml-language-server: $schema=https://raw.githubusercontent.com/pierreprinetti/openshift-installconfig-schema/release-4.16/installconfig.schema.json
apiVersion: v1
baseDomain: ${BASE_DOMAIN}
compute:
- name: worker
  platform:
    openstack:
      type: ${COMPUTE_FLAVOR}
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      type: ${CONTROL_PLANE_FLAVOR}
  replicas: 3
metadata:
  name: ${CLUSTER_NAME}
platform:
  openstack:
    controlPlanePort:
      network:
        id: 316eeb47-1498-46b4-b39e-00ddf73bd2a5
    cloud: ${OS_CLOUD}
    clusterOSImage: rhcos-4.16
publish: External
pullSecret: |
  ${PULL_SECRET}
sshKey: |
  ${SSH_PUB_KEY} 

In our vertical scaling test, after we delete a machine, we rely on the `status.readyReplicas` field of the ControlPlaneMachineSet (CPMS) to indicate that it has successfully created a new machine, which lets us scale up before we scale down.
https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L76-L87

As we've seen in the past, that status field isn't a reliable indicator of machine scale-up: status.readyReplicas might stay at 3 because the soon-to-be-removed node that is pending deletion can go Ready=Unknown in runs such as the following: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1286/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling/1808186565449486336

The test then ends up timing out waiting for status.readyReplicas=4 even though the scale-up and scale-down may already have happened.
This shows up across scaling tests on all platforms as:

fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:81]: Unexpected error:
    <*errors.withStack | 0xc002182a50>: 
    scale-up: timed out waiting for CPMS to show 4 ready replicas: timed out waiting for the condition
    {
        error: <*errors.withMessage | 0xc00304c3a0>{
            cause: <wait.errInterrupted>{
                cause: <*errors.errorString | 0xc0003ca800>{
                    s: "timed out waiting for the condition",
                },
            },
            msg: "scale-up: timed out waiting for CPMS to show 4 ready replicas",
        }, 

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling/1811686448848441344

https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522etcd-scaling%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=net_improvement

In hindsight, all we care about is whether the deleted machine's member is replaced by another machine's member; we can ignore the flapping of node and machine statuses while we wait for the scale-up and then scale-down of members to happen. So we can relax or replace the check on status.readyReplicas with just looking at the membership change, roughly as sketched below.
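A rough sketch of that relaxed check (helper names such as memberNamesOf and deletedMemberName are hypothetical, not the actual origin test code):

```
// Wait until the removed machine's etcd member is gone and membership is back
// to a full voting quorum, ignoring node/machine status flapping in the meantime.
err := wait.PollUntilContextTimeout(ctx, 30*time.Second, 30*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		members, err := memberNamesOf(ctx, etcdClientFactory) // hypothetical helper returning sets.Set[string]
		if err != nil {
			return false, nil // tolerate transient errors and keep polling
		}
		// deletedMemberName: the member backing the machine we deleted.
		return !members.Has(deletedMemberName) && members.Len() == 3, nil
	})
```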

PS: We can also update the outdated Godoc comments for the test to mention that it relies on the CPMSO to create a machine for us: https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L34-L38

This is a clone of issue OCPBUGS-38437. The following is the description of the original issue:

Description of problem:

    After branching, main branch still publishes Konflux builds to mce-2.7

Version-Release number of selected component (if applicable):

    mce-2.7

How reproducible:

    100%

Steps to Reproduce:

    1. Post a PR to main
    2. Check the jobs that run
    

Actual results:

Both mce-2.7 and main Konflux builds get triggered    

Expected results:

Only main branch Konflux builds gets triggered

Additional info:

    

This is a clone of issue OCPBUGS-41358. The following is the description of the original issue:

Description of problem:

While upgrading the cluster from the web console, the below warning message was observed.
~~~
Warning alert:Admission Webhook Warning
ClusterVersion version violates policy 299 - "unknown field \"spec.desiredUpdate.channels\"", 299 - "unknown field \"spec.desiredUpdate.url\""
~~~

There are no such fields in the ClusterVersion YAML, yet the warning message fired for them.

From the documentation here: https://docs.openshift.com/container-platform/4.16/rest_api/config_apis/clusterversion-config-openshift-io-v1.html 

It's possible to see that "spec.desiredUpdate" exists, but there is no mention of values "channels" or "url" under desiredUpdate.



Note: This is not impacting the cluster upgrade. However, the warning message is creating confusion among customers.

 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

Everytime    

Steps to Reproduce:

    1. Install cluster of version 4.16.4
    2. Upgrade the cluster from web-console to the next-minor version
    3.
    

Actual results:

    The Admission Webhook Warning about the unknown fields spec.desiredUpdate.channels and spec.desiredUpdate.url is shown during the upgrade.

Expected results:

    Upgrade should proceed with no such warnings.

Additional info:

    

Description of problem:

Running `oc exec` through a proxy doesn't work    

Version-Release number of selected component (if applicable):

4.17.1    

How reproducible:

100%

Additional info:

Looks to have been fixed upstream in https://github.com/kubernetes/kubernetes/pull/126253 which made it into 1.30.4 and should already be in 1.31.1 as used in 4.18.

Likely just needs to bump oc to that version or later.    

Description of problem:

    When using nvme disks the assisted installation fails with: error Host perf-intel-6.perf.eng.bos2.dc.redhat.com: updated status from installing-in-progress to error (Failed - failed after 3 attempts, last error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/openshift/master.ign --append-karg ip=eno12399:dhcp /dev/nvme0n1], Error exit status 1, LastOutput "Error: checking for exclusive access to /dev/nvme0n1 Caused by: 0: couldn't reread partition table: device is in use 1: EBUSY: Device or resource busy")

Version-Release number of selected component (if applicable):

   4.15 using the web assisted installer 

How reproducible:

In a PowerEdge R760 with 5 disks (3 SSDs in RAID 5 and 2 NVMe drives with no RAID), if you use the SSD disks in RAID the installer works as expected. If you disable those disks and use the NVMe storage, the installer fails with the above message. I tried other distributions booting and using only the NVMe disk and they work as expected (Fedora Rawhide and Ubuntu 22.04).

Steps to Reproduce:

    1. Try the assisted installer with nvme disks    

Actual results:

    The installer fails

Expected results:

    The installer finishes correctly

Additional info:

 

Description of problem:

non-existing oauth.config.openshift.io resource is  listed on Global Configuration page   

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-05-082646    

How reproducible:

Always    

Steps to Reproduce:

1. visit global configuration page /settings/cluster/globalconfig
2. check listed items on the page
3.
    

Actual results:

2. There are two OAuth.config.openshift.io entries; one links to /k8s/cluster/config.openshift.io~v1~OAuth/oauth-config, which returns 404: Not Found

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-06-05-082646   True        False         171m    Cluster version is 4.16.0-0.nightly-2024-06-05-082646

$ oc get oauth.config.openshift.io
NAME      AGE
cluster   3h26m
 

Expected results:

From the CLI output we can see there is only one oauth.config.openshift.io resource, but the console shows an additional 'oauth-config' entry.

Only one oauth.config.openshift.io resource should be listed.

Additional info:

    

Description of problem:

The runbook was added in https://issues.redhat.com/browse/MON-3862
The alert is more likely to fire in >=4.16

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-43655. The following is the description of the original issue:

Description of problem:

When we switched the API servers to use the /livez endpoint, we overlooked updating the audit policy to exclude this endpoint from being logged. As a result, requests to the /livez endpoint are currently being persisted in the audit log files.

The issue applies to the other API servers as well (oas and oauth-apiserver)
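For reference, health endpoints are normally dropped from auditing with a level: None rule on nonResourceURLs; a sketch of the missing entry (illustrative, not the exact policy the operators render):

```
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Do not persist liveness/readiness probe traffic, including the newer /livez endpoint.
  - level: None
    nonResourceURLs:
      - "/healthz*"
      - "/readyz*"
      - "/livez*"
```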

Version-Release number of selected component (if applicable):

    

How reproducible:

Just download must-gather and grep for /livez endpoint.

Steps to Reproduce:

Just download must-gather and grep for /livez endpoint.

Actual results:

Requests to the /livez endpoint are being recorded in the audit log files.

Expected results:

Requests to the /livez endpoint are NOT being recorded in the audit log files.

Additional info:

    

This is a clone of issue OCPBUGS-42987. The following is the description of the original issue:

It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.

This might be because the ipsec service and configure-ovs run at the same time, so it is possible that the ipsec service starts while bond offloads are not yet turned off and tricks libreswan into thinking they should be used.

The potential fix would be to run the ipsec service after configure-ovs, e.g. with a unit ordering like the sketch below.
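A sketch of that ordering as a systemd drop-in, assuming configure-ovs runs as ovs-configuration.service (both the unit name and the drop-in path are assumptions here, not the actual fix):

```
# /etc/systemd/system/ipsec.service.d/10-after-configure-ovs.conf (illustrative path)
[Unit]
# Start libreswan only after configure-ovs has finished, so bond ESP offloads are
# already turned off when it decides whether to load esp_offload.
After=ovs-configuration.service
```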

Upstream just merged https://github.com/etcd-io/etcd/pull/16246

which refactors the way the dashboards are defined. We need to see whether our own jsonnet integration still works with that and we can still display dashboards in OpenShift.

See additional challenges they faced with the helm chart:
https://github.com/prometheus-community/helm-charts/pull/3880

AC:

  • vendor upstream refactoring changes (usual update)
  • ensure openshift dashboards are not affected by upstream change
  • update the integration, if necessary

This is a clone of issue OCPBUGS-42143. The following is the description of the original issue:

Description of problem:

    There is another panic occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631 which should be fixed

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-36222. The following is the description of the original issue:

Description of problem:

The AWS Cluster API Provider (CAPA) runs a required check to resolve the DNS Name for load balancers it creates. If the CAPA controller (in this case, running in the installer) cannot resolve the DNS record, CAPA will not report infrastructure ready. We are seeing in some cases, that installations running on local hosts (we have not seen this problem in CI) will not be able to resolve the LB DNS name record and the install will fail like this:

    DEBUG I0625 17:05:45.939796    7645 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" namespace="openshift-cluster-api-guests" name="umohnani-4-16test-5ndjw" reconcileID="553beb3d-9b53-4d83-b417-9c70e00e277e" cluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" 
DEBUG Collecting applied cluster api manifests...  
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded

We do not know why some hosts cannot resolve these records, but it could be something like issues with the local DNS resolver cache, DNS records are slow to propagate in AWS, etc.

 

Version-Release number of selected component (if applicable):

    4.16, 4.17

How reproducible:

    Not reproducible / unknown -- this seems to be dependent on specific hosts and we have not determined why some hosts face this issue while others do not.

Steps to Reproduce:

n/a    

Actual results:

Install fails because CAPA cannot resolve LB DNS name 

Expected results:

    As the DNS record does exist, install should be able to proceed.

Additional info:

Slack thread:

https://redhat-internal.slack.com/archives/C68TNFWA2/p1719351032090749

Description of problem:

Install OCP with CAPI. When setting bootType: "UEFI", we got an unsupported value error; installing with Terraform did not hit this issue.

  platform:
    nutanix:
      bootType: "UEFI" 
# ./openshift-install create cluster --dir cluster --log-level debug
...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create control-plane manifest: NutanixMachine.infrastructure.cluster.x-k8s.io "sgao-nutanix-zonal-jwp6d-bootstrap" is invalid: spec.bootType: Unsupported value: "UEFI": supported values: "legacy", "uefi" 

Setting bootType: "uefi" also won't work:

# ./openshift-install create manifests --dir cluster
...
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: platform.nutanix.bootType: Invalid value: "uefi": valid bootType: "", "Legacy", "UEFI", "SecureBoot". 

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-08-222442

How reproducible:

    always

Steps to Reproduce:

    1.Create install config with bootType: "UEFI" and enable capi by setting:
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstall=true

    2.Install cluster
    

Actual results:

    Install failed

Expected results:

    Install passed

Additional info:

    

Description of problem:

    Contrary to terraform, we do not delete the S3 bucket used for ignition during bootstrapping.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Deploy cluster
    2. Check that openshift-bootstrap-data-$infraID bucket exists and is empty.
    3.
    

Actual results:

    Empty bucket left.

Expected results:

    Bucket is deleted.

Additional info:
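Until the installer cleans it up, the leftover bucket can be checked and removed manually, for example (shell sketch; INFRA_ID is the cluster's infrastructure name):

```
# Confirm the leftover ignition bucket is still there (it should be empty post-install).
$ aws s3 ls "s3://openshift-bootstrap-data-${INFRA_ID}"

# Delete it, including any remaining objects.
$ aws s3 rb "s3://openshift-bootstrap-data-${INFRA_ID}" --force
```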

    


Description of problem:

    When we run opm on RHEL 8, we hit the following errors:
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)

Note: this happened with 4.15.0-ec.3.
I tried 4.14, and it works.
I also tried compiling the latest code myself, and that also works.
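The glibc symbol versions a given binary requires can be listed directly, which makes the regression easy to spot (shell sketch):

```
# RHEL 8 ships glibc 2.28, so any GLIBC_2.32+ requirement in this list will fail to load there.
$ objdump -T ./opm | grep -o 'GLIBC_[0-9.]*' | sort -Vu
```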

Version-Release number of selected component (if applicable):

    4.15.0-ec.3

How reproducible:

    always

Steps to Reproduce:

[root@preserve-olm-env2 slavecontainer]# curl -s -k -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp-dev-preview/candidate/opm-linux-4.15.0-ec.3.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# ./opm version
./opm: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by ./opm)
./opm: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./opm)
[root@preserve-olm-env2 slavecontainer]# curl -s -l -L https://mirror2.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest-4.14/opm-linux-4.14.5.tar.gz -o opm.tar.gz && tar -xzvf opm.tar.gz
opm
[root@preserve-olm-env2 slavecontainer]# opm version
Version: version.Version{OpmVersion:"639fc1203", GitCommit:"639fc12035292dec74a16b306226946c8da404a2", BuildDate:"2023-11-21T08:03:15Z", GoOs:"linux", GoArch:"amd64"}
[root@preserve-olm-env2 kuiwang]# cd operator-framework-olm/
[root@preserve-olm-env2 operator-framework-olm]# git branch
  gs
* master
  release-4.10
  release-4.11
  release-4.12
  release-4.13
  release-4.8
  release-4.9
[root@preserve-olm-env2 operator-framework-olm]# git pull origin master
remote: Enumerating objects: 1650, done.
remote: Counting objects: 100% (1650/1650), done.
remote: Compressing objects: 100% (831/831), done.
remote: Total 1650 (delta 727), reused 1617 (delta 711), pack-reused 0
Receiving objects: 100% (1650/1650), 2.03 MiB | 12.81 MiB/s, done.
Resolving deltas: 100% (727/727), completed with 468 local objects.
From github.com:openshift/operator-framework-olm
 * branch master -> FETCH_HEAD
   639fc1203..85c579f9b master -> origin/master
Updating 639fc1203..85c579f9b
Fast-forward
 go.mod | 120 +-
 go.sum | 240 ++--
 manifests/0000_50_olm_00-pprof-secret.yaml
...
 create mode 100644 vendor/google.golang.org/protobuf/types/dynamicpb/types.go
[root@preserve-olm-env2 operator-framework-olm]# rm -fr bin/opm
[root@preserve-olm-env2 operator-framework-olm]# make build/opm
make bin/opm
make[1]: Entering directory '/data/kuiwang/operator-framework-olm'
go build -ldflags "-X 'github.com/operator-framework/operator-registry/cmd/opm/version.gitCommit=85c579f9be61aaea11e90b6c870452c72107300a' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.opmVersion=85c579f9b' -X 'github.com/operator-framework/operator-registry/cmd/opm/version.buildDate=2023-12-11T06:12:50Z'" -mod=vendor -tags "json1" -o bin/opm github.com/operator-framework/operator-registry/cmd/opm
make[1]: Leaving directory '/data/kuiwang/operator-framework-olm'
[root@preserve-olm-env2 operator-framework-olm]# which opm
/data/kuiwang/operator-framework-olm/bin/opm
[root@preserve-olm-env2 operator-framework-olm]# opm version
Version: version.Version{OpmVersion:"85c579f9b", GitCommit:"85c579f9be61aaea11e90b6c870452c72107300a", BuildDate:"2023-12-11T06:12:50Z", GoOs:"linux", GoArch:"amd64"}

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-41610. The following is the description of the original issue:

Description of problem:

After click "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines, if now click "Lightspeed" popup button at the right bottom, the highlighted rectangle lines lay above the popup modal.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-09-150616
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Clicked "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines. At the same time, click "Lightspeed" popup button at the right bottom.
    2.
    3.
    

Actual results:

1. The highlighted rectangle lines lie above the popup modal.
Screenshot: https://drive.google.com/drive/folders/15te0dbavJUTGtqRYFt-rM_U8SN7euFK5?usp=sharing
    

Expected results:

1. The Lightspeed popup modal should be on the top layer.
    

Additional info:


    

This is a clone of issue OCPBUGS-42097. The following is the description of the original issue:

Example failed test:

4/1291 Tests Failed.expand_less: user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller in ns/openshift-infra must not produce too many applies 

{had 7618 applies, check the audit log and operator log to figure out why  details in audit log}    

Description of problem:

The ci/prow/verify-crd-schema job on openshift/api fails due to missing listType tags when adding a tech preview feature to the IngressController, because feature sets branch the CRDs into separate versions.

As an example, it fails on: https://github.com/openshift/api/pull/1841.


The errors "must set x-kubernetes-list-type" need to resolved by adding:
    // +listType=atomic
or
    // +listType=map
    // +listMapKey=<key>

to the fields that are missing the tags.
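For example (illustrative struct and field names, not the actual IngressController types), the markers go directly above the list fields:

```
// Illustrative only: shows where the server-side-apply list markers are added.
type ExampleSpec struct {
	// Atomic list: the whole list is replaced on server-side apply.
	// +listType=atomic
	CaptureCookies []string `json:"captureCookies,omitempty"`

	// Map-style list: entries are merged by the "name" key on server-side apply.
	// +listType=map
	// +listMapKey=name
	Adjustments []NamedAdjustment `json:"adjustments,omitempty"`
}
```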

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%    

Steps to Reproduce:

    1. Add a techpreview field to the IngressController API
    2. make update
    3. ./hack/verify-crd-schema-checker.sh     

Actual results:

t.io version/v1 field/^.spec.httpHeaders.headerNameCaseAdjustments must set x-kubernetes-list-type
        error in operator/v1/zz_generated.crd-manifests/0000_50_ingress_00_ingresscontrollers-CustomNoUpgrade.crd.yaml: ListsMustHaveSSATags: crd/ingresscontrollers.operator.openshift.io version/v1 field/^.spec.logging.access.httpCaptureCookies must set x-kubernetes-list-type

...    

Expected results:

No errors except for any errors for embedded external fields. E.g. this error is unavoidable and must always be overridden:

error in operator/v1/zz_generated.featuregated-crd-manifests/ingresscontrollers.operator.openshift.io/IngressControllerLBSubnetsAWS.yaml: NoMaps: crd/ingresscontrollers.operator.openshift.io version/v1 field/^.spec.routeSelector.matchLabels may not be a map

Additional info:

    

This is a clone of issue OCPBUGS-36494. The following is the description of the original issue:

Description of problem:

    If the `template:` field in the vSphere platform spec is defined, the installer should not be downloading the OVA.

Version-Release number of selected component (if applicable):

    4.16.x 4.17.x

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

While creating an install configuration for PowerVS IPI, the default region is not set leading to the survey getting stuck if nothing is entered at the command line.

Description of problem:

    The coresPerSocket value set in install-config does not match the actual result. When setting controlPlane.platform.vsphere.cpus to 16 and controlPlane.platform.vsphere.coresPerSocket to 8, the actual result I checked was: "NumCPU": 16, "NumCoresPerSocket": 16. NumCoresPerSocket should match the setting in install-config instead of NumCPU.

Checking the setting in VSphereMachine-openshift-cluster-api-guests-wwei1215a-42n48-master-0.yaml, the numcorespersocket is 0:
    numcpus: 16
    numcorespersocket: 0

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-05-08-222442

How reproducible:

    See description

Steps to Reproduce:

    1.setting coresPerSocket for control plane in install-config. cpu needs to be a multiple of corespersocket.
    2.install the cluster
    

Actual results:

    The NumCoresPerSocket is equal to NumCPU. In file VSphereMachine-openshift-cluster-api-guests-xxxx-xxxx-master-0.yaml, the numcorespersocket is 0. and in vm setting: "NumCoresPerSocket": 8.

Expected results:

    The NumCoresPerSocket should match the setting in install-config.

Additional info:

installconfig setting:
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      cpus: 16
      coresPerSocket: 8
check result:
"Hardware": {
  "NumCPU": 16,
  "NumCoresPerSocket": 16,

The check result for the compute node is as expected.

installconfig setting:
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      cpus: 8
      coresPerSocket: 4

check result:
"Hardware": {
  "NumCPU": 8,
  "NumCoresPerSocket": 4,

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/117

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Request for sending data via telemetry

The goal is to collect metrics about CPU cores of ACM managed clusters because it is one of the sources to bill the customers for the product subscription usage.

acm_managed_cluster_worker_cores

acm_managed_cluster_worker_cores represents the total number of CPU cores on the worker nodes of the ACM managed clusters.

Labels

  • hub_cluster_id, the cluster ID of the ACM hub cluster
  • managed_cluster_id, the cluster ID of the ACM managed cluster. Cluster name is used if the managed cluster is a non-OpenShift cluster.

The cardinality of the metric is at most 1.
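For example, a hub-side subscription-usage query over the proposed metric could look like this (illustrative PromQL):

```
# Total worker cores across all clusters managed by a given ACM hub.
sum by (hub_cluster_id) (acm_managed_cluster_worker_cores)
```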

Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/142

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

When performing a LocalClusterImport, the following error is seen when booting from the discovery ISO created by the provided Infraenv.

```
State Info: Host does not meet the minimum hardware requirements: This host has failed to download the ignition file from http://api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com:22624/config/worker with the following error: ignition file download failed: request failed: Get "http://api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com:22624/config/worker": dial tcp: lookup api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com on 192.168.123.1:53: no such host. Please ensure the host can reach this URL
```

The URL for the discovery ISO is reported as
`api.local-cluster.ocp-edge-cluster-0.qe.lab.redhat.com`

For this use case, the URL for the discovery ISO should instead be
`api.ocp-edge-cluster-0.qe.lab.redhat.com`

Some changes to how the Day2 import is performed for the hub cluster will need to take place to ensure that when importing the local hub cluster, this issue is avoided.

Description of problem:

Mirror failed due to "manifest unknown" errors on certain images with the v2 format

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403251146.p0.g03ce0ca.assembly.stream.el9-03ce0ca", GitCommit:"03ce0ca797e73b6762fd3e24100ce043199519e9", GitTreeState:"clean", BuildDate:"2024-03-25T16:34:33Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1)  Test full==true with following imagesetconfig:
cat config-full.yaml 
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
      full: true

`oc-mirror --config config-full.yaml  file://out-full --v2`   

Actual results: 

The mirror command always failed, hitting errors like:

2024/04/08 02:50:52  [ERROR]  : [Worker] errArray initializing source docker://registry.redhat.io/3scale-mas/zync-rhel8@sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8: reading manifest sha256:8a108677b0b4100a3d58d924b2c7a47425292492df3dc6a2ebff33c58ca4e9e8 in registry.redhat.io/3scale-mas/zync-rhel8: manifest unknown

2024/04/08 09:12:55  [ERROR]  : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown

2024/04/08 09:12:55  [ERROR]  : [Worker] errArray initializing source docker://registry.redhat.io/integration/camel-k-rhel8-operator@sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0: reading manifest sha256:4796985f3efcd37b057dea0a35b526c02759f8ea63327921cdd2e504c575d3c0 in registry.redhat.io/integration/camel-k-rhel8-operator: manifest unknown

Expected results:

No error

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/302

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

https://github.com/openshift/console/pull/13964 fixed pseudolocalization, but now the user needs to know their first preferred language code in order for pseudolocalization to work.  Add information to INTERNATIONALIZATION.md on how to obtain that language code.

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/83

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/installer/pull/8449

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The add-flow-ci.feature test is flaking sporadically for both the console and console-operator repositories.

  Running:  add-flow-ci.feature                                                             (1 of 1)
[23798:0602/212526.775826:ERROR:zygote_host_impl_linux.cc(273)] Failed to adjust OOM score of renderer with pid 24169: Permission denied (13)
Couldn't determine Mocha version


  Logging in as test
  Create the different workloads from Add page
      redirect to home
      ensure perspective switcher is set to Developer
    ✓ Getting started resources on Developer perspective (16906ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Select Template category CI/CD
      You are on Topology page - Graph view
    ✓ Deploy Application using Catalog Template "CI/CD": A-01-TC02 (example #1) (27858ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Select Template category Databases
      You are on Topology page - Graph view
    ✓ Deploy Application using Catalog Template "Databases": A-01-TC02 (example #2) (29800ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Select Template category Languages
      You are on Topology page - Graph view
    ✓ Deploy Application using Catalog Template "Languages": A-01-TC02 (example #3) (38286ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Select Template category Middleware
      You are on Topology page - Graph view
    ✓ Deploy Application using Catalog Template "Middleware": A-01-TC02 (example #4) (30501ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Select Template category Other
      You are on Topology page - Graph view
    ✓ Deploy Application using Catalog Template "Other": A-01-TC02 (example #5) (35567ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Application Name "sample-app" is created
      Resource type "deployment" is selected
      You are on Topology page - Graph view
    ✓ Deploy secure image with Runtime icon from external registry: A-02-TC02 (example #1) (28896ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Application Name "sample-app" is selected
      Resource type "deployment" is selected
      You are on Topology page - Graph view
    ✓ Deploy image with Runtime icon from internal registry: A-02-TC03 (example #1) (23555ms)
      redirect to home
      ensure perspective switcher is set to Developer
      Resource type "deployment" is selected
      You are on Topology page - Graph view
      You are on Topology page - Graph view
      You are on Topology page - Graph view
    ✓ Edit Runtime Icon while Editing Image: A-02-TC05 (47438ms)
      redirect to home
      ensure perspective switcher is set to Developer
      You are on Topology page - Graph view
    ✓ Create the Database from Add page: A-03-TC01 (19645ms)
      redirect to home
      ensure perspective switcher is set to Developer
      redirect to home
      ensure perspective switcher is set to Developer
    1) Deploy git workload with devfile from topology page: A-04-TC01
      redirect to home
      ensure perspective switcher is set to Developer
      Resource type "Deployment" is selected
      You are on Topology page - Graph view
    ✓ Create a workload from Docker file with "Deployment" as resource type: A-05-TC02 (example #1) (43434ms)
      redirect to home
      ensure perspective switcher is set to Developer
      You are on Topology page - Graph view
    ✓ Create a workload from YAML file: A-07-TC01 (31905ms)
      redirect to home
      ensure perspective switcher is set to Developer
    ✓ Upload Jar file page details: A-10-TC01 (24692ms)
      redirect to home
      ensure perspective switcher is set to Developer
      You are on Topology page - Graph view
    ✓ Create Sample Application from Add page: GS-03-TC05 (example #1) (40882ms)
      redirect to home
      ensure perspective switcher is set to Developer
      You are on Topology page - Graph view
    ✓ Create Sample Application from Add page: GS-03-TC05 (example #2) (52287ms)
      redirect to home
      ensure perspective switcher is set to Developer
    ✓ Quick Starts page when no Quick Start has started: QS-03-TC02 (23439ms)
      redirect to home
      ensure perspective switcher is set to Developer
      quick start is complete
    ✓ Quick Starts page when Quick Start has completed: QS-03-TC03 (28139ms)


  17 passing (10m)
  1 failing

  1) Create the different workloads from Add page
       Deploy git workload with devfile from topology page: A-04-TC01:
     CypressError: `cy.focus()` can only be called on a single element. Your subject contained 14 elements.

https://on.cypress.io/focus
      at Context.focus (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:112944:70)
      at wrapped (https://console-openshift-console.apps.ci-op-lm9pvf4l-be832.origin-ci-int-aws.dev.rhcloud.com/__cypress/runner/cypress_runner.js:138021:19)
  From Your Spec Code:
      at Context.eval (webpack:///./support/step-definitions/addFlow/create-from-devfile.ts:10:59)
      at Context.resolveAndRunStepDefinition (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/resolveStepDefinition.js:217:0)
      at Context.eval (webpack:////go/src/github.com/openshift/console/frontend/node_modules/cypress-cucumber-preprocessor/lib/createTestFromScenario.js:26:0)



[mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_devconsole.json


  (Results)

  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Tests:        18                                                                               │
  │ Passing:      17                                                                               │
  │ Failing:      1                                                                                │
  │ Pending:      0                                                                                │
  │ Skipped:      0                                                                                │
  │ Screenshots:  2                                                                                │
  │ Video:        false                                                                            │
  │ Duration:     10 minutes, 0 seconds                                                            │
  │ Spec Ran:     add-flow-ci.feature                                                              │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘


  (Screenshots)

  -  /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree     (1280x720)
     nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo               
     y git workload with devfile from topology page A-04-TC01 (failed).png                          
  -  /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree     (1280x720)
     nshots/add-flow-ci.feature/Create the different workloads from Add page -- Deplo               
     y git workload with devfile from topology page A-04-TC01 (failed) (attempt 2).pn               
     g                                                                                              


====================================================================================================

  (Run Finished)


       Spec                                              Tests  Passing  Failing  Pending  Skipped  
  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ ✖  add-flow-ci.feature                      10:00       18       17        1        -        - │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘
    ✖  1 of 1 failed (100%)                     10:00       18       17        1        -        -  

 


discoverOpenIDURLs and checkOIDCPasswordGrantFlow fail if the endpoints are private to the data plane.

https://issues.redhat.com/browse/HOSTEDCP-421 enabled OAuth server traffic to flow through the data plane so that private endpoints (e.g. LDAP) can be reached.

https://issues.redhat.com/browse/OCPBUGS-8073 enabled a fallback to the management cluster network, so for public endpoints (e.g. GitHub) we are not blocked on having the data plane.

This issue is to make the CPO OIDC checks flow through the data plane and fall back to the management side, satisfying both cases above.

This would cover https://issues.redhat.com/browse/RFE-5638

This is a clone of issue OCPBUGS-38026. The following is the description of the original issue:

Description of problem:
There are two enhancements we could have for cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 07:59:34.884908     131 logger.go:28] logging successfully to vcenter
I0806 07:59:36.078911     131 logger.go:28] ----------- Migration Summary ------------
I0806 07:59:36.078944     131 logger.go:28] Migrated 0 volumes
I0806 07:59:36.078960     131 logger.go:28] Failed to migrate 0 volumes
I0806 07:59:36.078968     131 logger.go:28] Volumes not found 0    

Compare with the check done for the source datastore:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 08:02:08.719657     138 logger.go:28] logging successfully to vcenter
E0806 08:02:08.749709     138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter

 

 

2. If the volume-file contains a PV name that is not found (for example at the beginning of the file), the tool exits immediately and all the remaining PVs are skipped; it should continue checking the other PVs.

 

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Always

Steps to Reproduce:

    See Description     

Description of problem:

A Redfish exception occurred while provisioning a worker using a HW RAID configuration on an HP server with iLO 5:

step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000

spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
  online: true

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Provision an HP worker with iLO 5 using Redfish
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

As of OpenShift 4.16, CRD management is more complex. This is an artifact of improvements made to feature gates and feature sets. David Eads and I agreed that, to avoid confusion, we should aim to stop having CRDs installed via operator repos, and, if their types live in o/api, install them from there instead.

We started this by moving the ControlPlaneMachineSet back to o/api, which is part of the MachineAPI  capability.

Unbeknown to us at the time, the way the installer currently works is that all rendered resources get applied by the cluster-bootstrap tool, roughly here, and not by the CVO.

Cluster-bootstrap is not capability aware, so it installed the CPMS CRD, which in turn broke the check in the CSR approver that stops it from crashing on MachineAPI-less clusters.

Options for moving forward include:

  • Reverting the move (complex)
  • Making the API render somehow understand capabilities and remove any CRD from a disabled cap
  • Make the cluster-bootstrap tool filter for caps

I'm not sure presently which of the 2nd or 3rd options is better, nor how the capabilities would be made known to the "renderers"; the installer could provide them as args in bootkube.sh.template?


Original bug below, description of what's happening above


Description of problem:

After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulating over time.

Version-Release number of selected component (if applicable):

4.16.0-rc.1    

How reproducible:

once so far    

Steps to Reproduce:

    1. Deploy SNO with DU profile with disabled capabilities:

    installConfigOverrides:  "{\"capabilities\":{\"baselineCapabilitySet\": \"None\", \"additionalEnabledCapabilities\": [ \"NodeTuning\", \"ImageRegistry\", \"OperatorLifecycleManager\" ] }}"

2. Leave the node running tests overnight for a couple of hours

3. Check for Pending CSRs

Actual results:

oc get csr -A | grep Pending | wc -l 
27    

Expected results:

No pending CSRs    

Also oc logs will return a tls internal error:

oc -n openshift-cluster-machine-approver --insecure-skip-tls-verify-backend=true logs machine-approver-866c94c694-7dwks 
Defaulted container "kube-rbac-proxy" out of: kube-rbac-proxy, machine-approver-controller
Error from server: Get "https://[2620:52:0:8e6::d0]:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-866c94c694-7dwks/kube-rbac-proxy": remote error: tls: internal error

Additional info:

Checking the machine-approver-controller container logs on the node, we can see the reconciliation is failing because it cannot find the Machine API, which is disabled in the capabilities.

I0514 13:25:09.266546       1 controller.go:120] Reconciling CSR: csr-dw9c8
E0514 13:25:09.275585       1 controller.go:138] csr-dw9c8: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:09.275665       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-dw9c8" namespace="" name="csr-dw9c8" reconcileID="6f963337-c6f1-46e7-80c4-90494d21653c"
I0514 13:25:43.792140       1 controller.go:120] Reconciling CSR: csr-jvrvt
E0514 13:25:43.798079       1 controller.go:138] csr-jvrvt: Failed to list machines in API group machine.openshift.io/v1beta1: no matches for kind "Machine" in version "machine.openshift.io/v1beta1"
E0514 13:25:43.798128       1 controller.go:329] "Reconciler error" err="Failed to list machines: no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\"" controller="certificatesigningrequest" controllerGroup="certificates.k8s.io" controllerKind="CertificateSigningRequest" CertificateSigningRequest="csr-jvrvt" namespace="" name="csr-jvrvt" reconcileID="decbc5d9-fa10-45d1-92f1-1c999df956ff" 

Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/68

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-38632. The following is the description of the original issue:

Description of problem:

When we add a userCA bundle to a cluster that has MCPs with yum-based RHEL nodes, the MCPs with RHEL nodes are degraded.
    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-18-131731   True        False         101m    Cluster version is 4.17.0-0.nightly-2024-08-18-131731

    

How reproducible:

Always

In the CI we found this issue running test case "[sig-mco] MCO security Author:sregidor-NonHyperShiftHOST-High-67660-MCS generates ignition configs with certs [Disruptive] [Serial]" on prow job periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-workers-rhel8-fips-f28-destructive

    

Steps to Reproduce:

    1. Create a certificate 
    
   	$ openssl genrsa -out privateKey.pem 4096
    	$ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com"
    
    2. Add the certificate to the cluster
    
   	# Create the configmap with the certificate
	$ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt
	configmap/cm-test-cert created

	#Configure the proxy with the new test certificate
	$ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}'
	proxy.config.openshift.io/cluster patched
    
    3. Check the MCP status and the MCD logs
    

Actual results:

    
    The MCP is degraded
    $ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-3251b00997d5f49171e70f7cf9b64776   True      False      False      3              3                   3                     0                      130m
worker   rendered-worker-05e7664fa4758a39f13a2b57708807f7   False     True       True       3              0                   0                     1                      130m

    We can see this message in the MCP
      - lastTransitionTime: "2024-08-19T11:00:34Z"
    message: 'Node ci-op-jr7hwqkk-48b44-6mcjk-rhel-1 is reporting: "could not apply
      update: restarting coreos-update-ca-trust.service service failed. Error: error
      running systemctl restart coreos-update-ca-trust.service: Failed to restart
      coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.\n:
      exit status 5"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

In the MCD logs we can see:

I0819 11:38:55.089991    7239 update.go:2665] Removing SIGTERM protection
E0819 11:38:55.090067    7239 writer.go:226] Marking Degraded due to: could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.
    

Expected results:

	No degradation should happen. The certificate should be added without problems.
    

Additional info:


    

Description of problem:

The 4.17 PowerVS CI is failing due to the following issue:
https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/pull/2029

So we need to update to 9b077049 in the 4.17 release as well.
    

This is a clone of issue OCPBUGS-38183. The following is the description of the original issue:

Description of problem:

 azure-disk-csi-driver doesn't use registryOverrides

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Set the registry override on the CPO
    2. Watch that azure-disk-csi-driver continues to use the default registry
    3.
    

Actual results:

    azure-disk-csi-driver uses default registry

Expected results:

    azure-disk-csi-driver uses the mirrored registry

Additional info:

    

This is a clone of issue OCPBUGS-36293. The following is the description of the original issue:

Description of problem:

CAPA is leaking one EIP in the bootstrap life cycle when creating clusters on 4.16+ with a BYO IPv4 pool in the config.

The install logs show the message about a duplicated EIP; there is a kind of race condition where the EIP is created and an association is attempted while the instance isn't ready (not yet in the Running state):

~~~
time="2024-05-08T15:49:33-03:00" level=debug msg="I0508 15:49:33.785472 2878400 recorder.go:104] 
\"Failed to associate Elastic IP for \\\"ec2-i-03de70744825f25c5\\\": InvalidInstanceID: 
The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation.\\n\\tstatus code: 
400, request id: 7582391c-b35e-44b9-8455-e68663d90fed\" logger=\"events\" type=\"Warning\" 
object=[...]\"name\":\"mrb-byoip-32-kbcz9\",\"[...] reason=\"FailedAssociateEIP\""

time="2024-05-08T15:49:33-03:00" level=debug msg="E0508 15:49:33.803742 2878400 controller.go:329] \"Reconciler error\" err=<"

time="2024-05-08T15:49:33-03:00" level=debug msg="\tfailed to reconcile EIP: failed to associate Elastic IP 
\"eipalloc-08faccab2dbb28d4f\" to instance \"i-03de70744825f25c5\": 
InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation."
~~~

The EIP is deleted when the bootstrap node is removed after a successful installation, but the bug impacts any new machine with a public IP set from the BYO IPv4 pool provisioned by CAPA. An upstream issue has been opened: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038

Version-Release number of selected component (if applicable):

   4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Create install-config.yaml, setting platform.aws.publicIpv4Pool=poolID (see the sketch after these steps)
    2. Create the cluster
    3. Check the AWS Console EIP page, filtering by your cluster; you will see the duplicated EIP, while only one is associated with the bootstrap instance
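
For reference, a minimal install-config fragment for step 1 might look like the following (a sketch only; the base domain, cluster name, region, and pool ID are illustrative placeholders):

~~~
apiVersion: v1
baseDomain: example.com                              # illustrative
metadata:
  name: mycluster                                    # illustrative
platform:
  aws:
    region: us-east-1                                # illustrative
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0   # your BYO public IPv4 pool ID
pullSecret: '...'
~~~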
    

Actual results:

    

Expected results:

- The installer/CAPA creates only one EIP for the bootstrap node when provisioning the cluster.
- No error messages are logged for expected behavior (EC2 association errors while the instance is pending).

Additional info:

    CAPA issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038 

 

With the current v2 implementation there is no validation of the flags; for example, it is possible to pass -v2, which is not valid. The valid form is --v2, with a double dash.

So we need to add validation to check that the flags are valid where this is not already handled by the cobra framework.

In our plugin documentation, there is no discussion of how to translate messages in a dynamic plugin. The only documentation we have currently is in the enhancement and in the demo plugin readme:

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md#localization

https://github.com/openshift/console/tree/master/dynamic-demo-plugin#i18n

Without a reference, plugin developers won't know how to handle translations.

cc Ali Mobrem Joseph Caiani Cyril Ajieh 

Description of problem:

   A violation warning is not displayed for `minAvailable` in the PDB Create/Edit form

Additional info:

A maxUnavailable of 0% or 0 or a minAvailable of 100% or equal to the number of replicas is permitted but can block nodes from being drained.
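
For illustration, a PDB like the following would be accepted even though it can block node drains (a sketch; the name and selector are made up):

~~~
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb              # illustrative
spec:
  minAvailable: "100%"           # permitted, but no pod can ever be voluntarily evicted
  selector:
    matchLabels:
      app: example               # illustrative
~~~

The Create/Edit form should surface a warning for values like this, mirroring the note above.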

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38441. The following is the description of the original issue:

Description of problem:

 

Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.

Example error:

lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:

 

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

50%

Steps to Reproduce:

    1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI

Actual results:

    Flakes

Expected results:

    Shouldn't flake

Additional info:

CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB

CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations

CI Search: FAIL: TestAll/parallel/TestAWSLBSubnets

Description of problem:

    https://issues.redhat.com/browse/MGMT-15691 introduced code restructuring related to the external platform and OCI via PR https://github.com/openshift/assisted-service/pull/5787. Assisted service needs to be re-vendored in the installer in the 4.16 and 4.17 releases to make sure the assisted-service dependencies are consistent.

The master branch (4.18) does not need this revendoring, as it was recently revendored via https://github.com/openshift/installer/pull/9058 

Version-Release number of selected component (if applicable):

    4.17, 4.16

How reproducible:

   Always 

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When installing a fresh 4.16-rc.5 on AWS, the following logs are shown:

time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596147    4921 logger.go:75] \"enabling EKS controllers and webhooks\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596154    4921 logger.go:81] \"EKS IAM role creation\" logger=\"setup\" enabled=false"
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596159    4921 logger.go:81] \"EKS IAM additional roles\" logger=\"setup\" enabled=false"
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596164    4921 logger.go:81] \"enabling EKS control plane controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596184    4921 logger.go:81] \"enabling EKS bootstrap controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596198    4921 logger.go:81] \"enabling EKS managed cluster controller\" logger=\"setup\""
time="2024-06-18T16:47:23+02:00" level=debug msg="I0618 16:47:23.596215    4921 logger.go:81] \"enabling EKS managed machine pool controller\" logger=\"setup\""

That is somewhat strange and may have side effects. It seems the EKS feature of CAPA is enabled by default (see additional info).

Version-Release number of selected component (if applicable):

4.16-rc.5

How reproducible:

Always

Steps to Reproduce:

1. Install a cluster (even an SNO works) on AWS using IPI

Actual results:

EKS feature enabled

Expected results:

EKS feature not enabled

Additional info:

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/feature/feature.go#L99

Description of problem:

Deployment of a spoke cluster with the GitOps ZTP approach fails during node introspection
    
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent Traceback (most recent call last):
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent     resp = conn.urlopen(
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/connectionpool.py", line 756, in urlopen
Jul 23 15:27:45 openshift-master-0 ironic-agent[3305]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent     retries = retries.increment(
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent   File "/usr/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent     raise MaxRetryError(_pool, url, error or ResponseError(cause))
Jul 23 15:27:45 openshift-master-0 podman[3269]: 2024-07-23 15:27:45.412 1 ERROR ironic-python-agent urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.46.182.10', port=5050): Max retries exceeded with url: /v1/continue (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86e2817a60>: Failed to establish a new connection: [Errno 111] ECONNREFUSED'))

Version-Release number of selected component (if applicable):

advanced-cluster-management.v2.11.0
multicluster-engine.v2.6.0
    

How reproducible:

so far once
    

Steps to Reproduce:

    1. Deploy dualstack SNO hub cluster
    2. Install and configure hub cluster for GitOps ZTP deployment
    3. Deploy multi node cluster with GitOps ZTP workflow
    

Actual results:

Deployment fails as nodes fail to be introspected
    

Expected results:

Deployment succeeds
    

Description of problem:


When Secure Boot is in use, TuneD reports the following error because debugfs access is restricted:

tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'

This issue has been reported with the following tickets:

As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html

Expected Outcome:

  • Document that the NTO cannot leverage some TuneD features when Secure Boot is enabled.

Description of problem:

KAS labels on projects created should be consistent with OCP - enforce: privileged
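
For reference, the pod security admission labels in question look like the following when set on a namespace (a sketch showing the standard PSA label keys with enforce: privileged; the exact label set the fix applies is described in OCPBUGS-20526):

~~~
apiVersion: v1
kind: Namespace
metadata:
  name: example-namespace          # illustrative
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
~~~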

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

See https://issues.redhat.com/browse/OCPBUGS-20526.

Steps to Reproduce:

See https://issues.redhat.com/browse/OCPBUGS-20526.     

Actual results:

See https://issues.redhat.com/browse/OCPBUGS-20526.

Expected results:

See https://issues.redhat.com/browse/OCPBUGS-20526.

Additional info:

See https://issues.redhat.com/browse/OCPBUGS-20526.

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1244

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The crio test for CPU affinity is failing on the RT Kernel in 4.17. We need to investigate what changed in 4.17 to cause this to start failing.

We will need to revert https://github.com/openshift/origin/pull/28854 once a solution has been found.

Example failure:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-rt-upgrade-4.17-minor-release-openshift-release-analysis-aggregator/1797986895053983744

This is a clone of issue OCPBUGS-42671. The following is the description of the original issue:

Description of problem:

   Prometheus write_relabel_configs in remoteWrite is unable to drop a metric in Grafana

Version-Release number of selected component (if applicable):

    

How reproducible:

 The customer has tried both configurations to drop the MQ metric, with source_labels (configuration 1) and without source_labels (configuration 2), but neither works.

It seems that the drop configuration is not being applied properly and is buggy.


Configuration 1:

```
 remoteWrite:
        - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
          write_relabel_configs:
          - source_labels: ['__name__']
            regex: 'ibmmq_qmgr_uptime'
            action: 'drop'
          basicAuth:
            username:
              name: kubepromsecret
              key: username
            password:
              name: kubepromsecret
              key: password
```

Configuration 2:
```
remoteWrite:
        - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
          write_relabel_configs:
          - regex: 'ibmmq_qmgr_uptime'
            action: 'drop'
          basicAuth:
            username:
              name: kubepromsecret
              key: username
            password:
              name: kubepromsecret
              key: password
```


The customer wants to know the correct remote_write configuration to drop the metric before it reaches Grafana.

Document links:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack
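
For comparison, here is a sketch of the same drop rule written with the camelCase field names of the Prometheus Operator RemoteWriteSpec, which is what the cluster monitoring config map consumes. This is an assumption about the likely root cause (snake_case fields being ignored), not a confirmed fix; the URL and secret names are taken from the customer's example:

```
remoteWrite:
  - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    writeRelabelConfigs:
      - sourceLabels: ['__name__']
        regex: 'ibmmq_qmgr_uptime'
        action: 'drop'
    basicAuth:
      username:
        name: kubepromsecret
        key: username
      password:
        name: kubepromsecret
        key: password
```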

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The Prometheus remote_write configuration is not dropping the metric in Grafana.

Expected results:

The Prometheus remote_write configuration should drop the metric in Grafana.

Additional info:

    

This is a clone of issue OCPBUGS-38425. The following is the description of the original issue:

Description of problem:

    When a HostedCluster is upgraded to a new minor version, its OLM catalog imagestreams are not updated to use the tag corresponding to the new minor version.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Create a HostedCluster (4.15.z)
    2. Upgrade the HostedCluster to a new minor version (4.16.z)
    

Actual results:

    OLM catalog imagestreams remain at the previous version (4.15)

Expected results:

    OLM catalog imagestreams are updated to new minor version (4.16)

Additional info:

    

Dependabot has stopped running on the HyperShift repo. This task is to track the reason why & the fix to get it working again.

Description of problem:

Customers using OVN-K localnet topology networks for virtualization often do not define a "subnets" field in their NetworkAttachmentDefinitions. Examples in the OCP documentation virtualization section do not include that field either.

When a cluster with such NADs is upgraded from 4.16 to 4.17, the ovnkube-control-plane pods crash when CNO is upgraded and the upgrade hangs in a failing state. Once in the failing state, the cluster upgrade can be recovered by adding a subnets field to the localnet NADs.

Version-Release number of selected component (if applicable): 4.16.15 > 4.17.1

How reproducible:

Start with an OCP 4.16 cluster with OVN-K localnet NADs configured per the OpenShift Virtualization documentation and attempt to upgrade the cluster to 4.17.1.

Steps to Reproduce:

1. Deploy an OCP 4.16.15 cluster, the type shouldn't matter but all testing has been done on bare metal (SNO and HA topologies)

2. Configure an OVS bridge with localnet bridge mappings and create one or more NetworkAttachmentDefinitions using the localnet topology without configuring the "subnets" field

3. Observe that this is a working configuration in 4.16 although error-level log messages appear in the ovnkube-control-plane pod (see OCPBUGS-37561)

4. Delete the ovnkube-control-plane pod on 4.16 and observe that the log messages do not prevent you from starting ovnkube on 4.16

5. Trigger an upgrade to 4.17.1

6. Once ovnkube-control-plane is restarted as part of the upgrade, observe that the ovnkube-cluster-manager container is crashing with the following message where "vlan10" is the name of a NetworkAttachmentDefinition created earlier

failed to run ovnkube: failed to start cluster manager: initial sync failed: failed to sync network vlan10: [cluster-manager network manager]: failed to create network vlan10: no cluster network controller to manage topology

7. Edit all NetworkAttachmentDefinitions to include a subnets field (see the sketch after these steps)

8. Wait or delete the ovnkube-control-plane pods and observe that the pods come up and the upgrade resumes and completes normally
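
For reference, a localnet NetworkAttachmentDefinition with the subnets field from step 7 might look like the following (a sketch; the name, namespace, and CIDR are illustrative, and the network name must match the localnet bridge mapping in your environment):

~~~
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan10                           # illustrative, matching the example above
  namespace: default                     # illustrative
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "vlan10",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan10",
      "subnets": "192.168.10.0/24"
    }
~~~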

Actual results: The upgrade fails and ovnkube-control-plane is left in a crashing state

Expected results: The upgrade succeeds and ovnkube-control-plane is running

Additional info:


Affected Platforms: Tested on baremetal, but all using OVN-K localnet networks should be impacted

Is it an

  1. internal CI failure
  2. customer issue / SD (Case 03960269)
  3. internal RedHat testing failure - Reproduction steps are based on internal testing as the customer environment has been repaired with the workaround

If it is an internal RedHat testing failure:

  • Kubeconfig for an internet-reachable cluster currently in the failed state is available upon request from Andrew Austin Byrum until at least 25 October 2024

 

This shouldn't be possible; the event should have a locator pointing to the node it's from.

    {
      "level": "Info",
      "display": false,
      "source": "KubeletLog",
      "locator": {
        "type": "",
        "keys": null
      },
      "message": {
        "reason": "ReadinessFailed",
        "cause": "",
        "humanMessage": "Get \"https://10.177.121.252:6443/readyz\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)",
        "annotations": {
          "node": "ci-op-vxiib4hx-9e8b4-wwnfx-master-2",
          "reason": "ReadinessFailed"
        }
      },
      "from": "2024-05-02T16:58:06Z",
      "to": "2024-05-02T16:58:06Z",
      "filename": "e2e-events_20240502-163726.json"
    },

Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2358

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When using targetCatalog, the mirror fails with this error: 
error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized 

 

 

Version-Release number of selected component (if applicable):

oc-mirror 4.16

How reproducible:

always

Steps to Reproduce:

1) Use the following ISC to do mirror-to-mirror for v1:    
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  local:
    path: /tmp/case60597
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.13
    targetCatalog: abc/redhat-operator-index
    packages:
    - name: servicemeshoperator  
`oc-mirror --config config.yaml docker://localhost:5000 --dest-use-http`

 

Actual results: 

1) The mirror fails with the error:
info: Mirroring completed in 420ms (0B/s)
error: error rebuilding catalog images from file-based catalogs: error copying image docker://registry.redhat.io/abc/redhat-operator-index:v4.13 to docker://localhost:5000/abc/redhat-operator-index:v4.13: initializing source docker://registry.redhat.io/abc/redhat-operator-index:v4.13: (Mirrors also failed: [localhost:5000/abc/redhat-operator-index:v4.13: pinging container registry localhost:5000: Get "https://localhost:5000/v2/": http: server gave HTTP response to HTTPS client]): registry.redhat.io/abc/redhat-operator-index:v4.13: reading manifest v4.13 in registry.redhat.io/abc/redhat-operator-index: unauthorized: access to the requested resource is not authorized

Expected results:

1) no error.

Additional information:

Compared with oc-mirror 4.15.9, this issue cannot be reproduced there.

This is a clone of issue OCPBUGS-42939. The following is the description of the original issue:

Description of problem:

4.18 EFS controller and node pods are left behind after uninstalling the driver

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-08-075347

How reproducible:

Always

Steps to Reproduce:

1. Install the 4.18 EFS operator and driver on the cluster and check that the EFS pods are all up and Running
2. Uninstall the EFS driver and check whether the controller and node pods get deleted
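
The og-sub.yaml and driver.yaml files referenced in the steps below are not included in this report; for reference, driver.yaml is most likely a ClusterCSIDriver object along these lines (a sketch, not the exact file used):

~~~
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  managementState: Managed
~~~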

Execution on 4.16 and 4.18 clusters 

4.16 cluster

oc create -f og-sub.yaml
oc create -f driver.yaml

oc get pods | grep "efs"
aws-efs-csi-driver-controller-b8858785-72tp9     4/4     Running   0          4s
aws-efs-csi-driver-controller-b8858785-gvk4b     4/4     Running   0          6s
aws-efs-csi-driver-node-2flqr                    3/3     Running   0          9s
aws-efs-csi-driver-node-5hsfp                    3/3     Running   0          9s
aws-efs-csi-driver-node-kxnlv                    3/3     Running   0          9s
aws-efs-csi-driver-node-qdshm                    3/3     Running   0          9s
aws-efs-csi-driver-node-ss28h                    3/3     Running   0          9s
aws-efs-csi-driver-node-v9zwx                    3/3     Running   0          9s
aws-efs-csi-driver-operator-65b55bf877-4png9     1/1     Running   0          2m53s

oc get clustercsidrivers | grep "efs"
efs.csi.aws.com   2m26s

oc delete -f driver.yaml

oc get pods | grep "efs"
aws-efs-csi-driver-operator-65b55bf877-4png9     1/1     Running   0          4m40s

4.18 cluster
oc create -f og-sub.yaml
oc create -f driver.yaml

oc get pods | grep "efs" 
aws-efs-csi-driver-controller-56d68dc976-847lr   5/5     Running   0               9s
aws-efs-csi-driver-controller-56d68dc976-9vklk   5/5     Running   0               11s
aws-efs-csi-driver-node-46tsq                    3/3     Running   0               18s
aws-efs-csi-driver-node-7vpcd                    3/3     Running   0               18s
aws-efs-csi-driver-node-bm86c                    3/3     Running   0               18s
aws-efs-csi-driver-node-gz69w                    3/3     Running   0               18s
aws-efs-csi-driver-node-l986w                    3/3     Running   0               18s
aws-efs-csi-driver-node-vgwpc                    3/3     Running   0               18s
aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv     1/1     Running   0               2m55s

oc get clustercsidrivers 
efs.csi.aws.com   2m19s

oc delete -f driver.yaml

oc get pods | grep "efs"              
aws-efs-csi-driver-controller-56d68dc976-847lr   5/5     Running   0               4m58s
aws-efs-csi-driver-controller-56d68dc976-9vklk   5/5     Running   0               5m
aws-efs-csi-driver-node-46tsq                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-7vpcd                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-bm86c                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-gz69w                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-l986w                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-vgwpc                    3/3     Running   0               5m7s
aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv     1/1     Running   0               7m44s

oc get clustercsidrivers  | grep "efs" => Nothing is there

Actual results:

The EFS controller and node pods are left behind

Expected results:

After uninstalling the driver, the EFS controller and node pods should get deleted

Additional info:

 On a 4.16 cluster this works fine

EFS Operator logs:

oc logs aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv
E1009 07:13:41.460469       1 base_controller.go:266] "LoggingSyncer" controller failed to sync "key", err: clustercsidrivers.operator.openshift.io "efs.csi.aws.com" not found

Discussion: https://redhat-internal.slack.com/archives/C02221SB07R/p1728456279493399 

Description of problem:

A breaking API change (Catalog -> ClusterCatalog) is blocking downstreaming of operator-framework/catalogd and operator-framework/operator-controller    

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

Downstreaming script fails.
https://prow.ci.openshift.org/?job=periodic-auto-olm-v1-downstreaming   
    

Actual results:

Downstreaming fails.  

Expected results:

Downstreaming succeeds.    

Additional info:

    

This is a clone of issue OCPBUGS-39531. The following is the description of the original issue:

-> While upgrading the cluster from 4.13.38 to 4.14.18, it is stuck on CCO; ClusterVersion is reporting

"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".

While checking further we see that the CCO deployment is yet to roll out.

-> ClusterOperator status.versions[name=operator] isn't a narrow "CCO Deployment is updated", it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. Seems like you don't have a functional CCO Deployment, because logs have it stuck talking about asking for a leader lease. You don't have Kube API audit logs to say if it's stuck generating the Lease request, or waiting for a response from the Kube API server.

This is a clone of issue OCPBUGS-38012. The following is the description of the original issue:

Description of problem:

Customers are unable to scale up OCP nodes when the initial setup was done with OCP 4.8/4.9 and then upgraded to 4.15.22/4.15.23.

At first the customer observed that the node scale-up failed and /etc/resolv.conf was empty on the nodes.
As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, after which setup of the new node continued.

They then inspected the rendered MachineConfig assembled from 00-worker and suspected that something was wrong with the on-prem-resolv-prepender.service service definition.
As a workaround, the customer manually changed this service definition, which allowed them to scale up new nodes.

Version-Release number of selected component (if applicable):

4.15 , 4.16

How reproducible:

100%

Steps to Reproduce:

1. Install OCP vSphere IPI cluster version 4.8 or 4.9
2. Check "on-prem-resolv-prepender.service" service definition
3. Upgrade it to 4.15.22 or 4.15.23
4. Check if the node scaling is working 
5. Check "on-prem-resolv-prepender.service" service definition     

Actual results:

Unable to scale up a node with the default service definition. After manually changing the service definition, scaling works.

Expected results:

Node scaling should work without making any manual changes to the service definition.

Additional info:

on-prem-resolv-prepender.service content on clusters built with version 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0                -----------> this
[Service]
Type=oneshot
#Restart=on-failure                    -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling works fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

This was observed in the rendered MachineConfig, which is assembled from 00-worker.

This tracks disabling of MG LRU by writing 0 to `/sys/kernel/mm/lru_gen/enabled`

Description of problem:

Since 4.16.0, pods with memory limits tend to OOM very frequently when writing files larger than the memory limit to a PVC

Version-Release number of selected component (if applicable):

4.16.0-rc.4

How reproducible:

100% on certain types of storage
(AWS FSx, certain LVMS setups, see additional info)

Steps to Reproduce:

1. Create pod/pvc that writes a file larger than the container memory limit (attached example)
2.
3.

Actual results:

OOMKilled

Expected results:

Success

Additional info:

For simplicity, I will focus on BM setup that produces this with LVM storage.
This is also reproducible on AWS clusters with NFS backed NetApp ONTAP FSx.

Further reduced to exclude the OpenShift layer, LVM on a separate (non root) disk:

Prepare disk
lvcreate -T vg1/thin-pool-1 -V 10G -n oom-lv
mkfs.ext4 /dev/vg1/oom-lv 
mkdir /mnt/oom-lv
mount /dev/vg1/oom-lv /mnt/oom-lv

Run container
podman run -m 600m --mount type=bind,source=/mnt/oom-lv,target=/disk --rm -it quay.io/centos/centos:stream9 bash
[root@2ebe895371d2 /]# curl https://cloud.centos.org/centos/9-stream/x86_64/images/CentOS-Stream-GenericCloud-x86_64-9-20240527.0.x86_64.qcow2 -o /disk/temp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 47 1157M   47  550M    0     0   111M      0  0:00:10  0:00:04  0:00:06  111MKilled
(Notice the process gets killed, I don't think podman ever whacks the whole container over this though)

The same process on the same hardware on a 4.15 node (RHEL 9.2) does not produce an OOM
(vs 4.16, which is RHEL 9.4).

For completeness, I will provide some details about the setup behind the LVM pool, though I believe it should not impact the decision about whether this is an issue:
sh-5.1# pvdisplay 
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               vg1
  PV Size               446.62 GiB / not usable 4.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              114335
  Free PE               11434
  Allocated PE          102901
  PV UUID               <UUID>
Hardware:
SSD (INTEL SSDSC2KG480G8R) behind a RAID 0 of a PERC H330 Mini controller

At the very least, this seems like a change in behavior, but honestly I am leaning towards an outright bug.

QE Verification Steps

It's been independently verified that setting /sys/kernel/mm/lru_gen/enabled = 0 avoids the OOM kills. So verifying that nodes get this value applied is the main testing concern at this point: new installs, upgrades, and new nodes scaled after an upgrade.
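
Purely for illustration, one way such a node-level override could be expressed is a MachineConfig with a oneshot systemd unit that writes the sysfs value (a sketch; the object and unit names here are made up, and the actual mechanism shipped in the product may differ):

~~~
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-mglru          # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: disable-mglru.service    # hypothetical unit, illustration only
          enabled: true
          contents: |
            [Unit]
            Description=Disable multi-generational LRU (illustration)
            [Service]
            Type=oneshot
            RemainAfterExit=yes
            ExecStart=/bin/sh -c 'echo 0 > /sys/kernel/mm/lru_gen/enabled'
            [Install]
            WantedBy=multi-user.target
~~~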

If we want to go so far as to verify that the OOM kills don't happen, the kernel QE team has a simplified reproducer here, which involves mounting an NFS volume and using podman to create a container with a memory limit and writing data to that NFS volume.

https://issues.redhat.com/browse/RHEL-43371?focusedId=24981771&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24981771

 

This is a clone of issue OCPBUGS-43898. The following is the description of the original issue:

Description of problem:

OCP 4.17 requires permissions to tag network interfaces (ENIs) on instance creation in support of the Egress IP feature.

ROSA HCP uses managed IAM policies, which are reviewed and gated by AWS. The current policy AWS has applied does not allow us to tag ENIs out of band, only ones that have 'red-hat-managed: true`, which are going to be tagged during instance creation.

However, in order to support backwards compatibility for existing clusters, we need to roll out a CAPA patch that allows us to call `RunInstances` with or without the ability to tag ENIs.

Once we backport this to the z-streams, upgrade clusters, and roll out the updated policy with AWS, we can then go back and revert the backport.

For more information see https://issues.redhat.com/browse/SDE-4496

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/193

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/55

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

While debugging https://docs.google.com/document/d/10kcIQPsn2H_mz7dJx3lbZR2HivjnC_FAnlt2adc53TY/edit#heading=h.egy1agkrq2v1, we came across the log:

2023-07-31T16:51:50.240749863Z W0731 16:51:50.240586       1 tasks.go:72] task 3 of 15: Updating Prometheus-k8s failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]

After some searching, we understood that the log is trying to say that ValidatePrometheus timed out waiting for Prometheus to become ready.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

See here https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690892059971129?thread_ts=1690873617.023399&cid=C02BQCCFZPX for how to get the function time out. 

Actual results:

 

Expected results:

- Clearer logs.

- Some info that we are logging makes more sense to be part of the error, example: https://github.com/openshift/cluster-monitoring-operator/blob/af831de434ce13b3edc0260a468064e0f3200044/pkg/client/client.go#L890

- Make info such as "unavailable (unknown):" clearer, as we cannot understand what it means without referring to the code.

Additional info:

- Do the same for the other functions that wait for other components if using the same wait mechanism (PollUntilContextTimeout...)

- https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690873617023399 for more details.

see https://redhat-internal.slack.com/archives/C0VMT03S5/p1691069196066359?thread_ts=1690827144.818209&cid=C0VMT03S5 for the slack discussion.

Please review the following PR: https://github.com/openshift/oauth-server/pull/149

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc-mirror should fail with an error when an operator is not found

 

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202403070215.p0.gc4f8295.assembly.stream.el9-c4f8295", GitCommit:"c4f829512107f7d0f52a057cd429de2030b9b3b3", GitTreeState:"clean", BuildDate:"2024-03-07T03:46:24Z", GoVersion:"go1.21.7 (Red Hat 1.21.7-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1)  With following imagesetconfig:  
cat config-e.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
      packages:
        - name: cincinnati-operator
        - name: cluster-logging
          channels:
            - name: stable
              minVersion: 5.7.7
              maxVersion: 5.7.7

`oc-mirror --config config-e.yaml file://out2 --v2`

2)  Check the operator version
[root@preserve-fedora36 app1]# oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.14 --package cluster-logging --channel stable-5.7
VERSIONS
5.7.6
5.7.7
5.7.0
5.7.1
5.7.10
5.7.2
5.7.4
5.7.5
5.7.9
5.7.11
5.7.3
5.7.8
[root@preserve-fedora36 app1]# oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.14 --package cluster-logging --channel stable
VERSIONS
5.8.0
5.8.1
5.8.2
5.8.3
5.8.4 

Actual results: 

2) No error is reported when the operator version is not found

oc-mirror --config config-e.yaml file://out2 --v2
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used. 
2024/03/25 05:07:57  [INFO]   : mode mirrorToDisk
2024/03/25 05:07:57  [INFO]   : local storage registry will log to /app1/0321/out2/working-dir/logs/registry.log
2024/03/25 05:07:57  [INFO]   : starting local storage on localhost:55000
2024/03/25 05:07:57  [INFO]   : copying  cincinnati response to out2/working-dir/release-filters
2024/03/25 05:07:57  [INFO]   : total release images to copy 0
2024/03/25 05:07:57  [INFO]   : copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.14
2024/03/25 05:08:00  [INFO]   : manifest 6839c41621e7d3aa2be40499ed1d69d833bc34472689688d8efd4e944a32469e
2024/03/25 05:08:00  [INFO]   : label /configs
2024/03/25 05:08:16  [INFO]   : related images length 2
2024/03/25 05:08:16  [INFO]   : images to copy (before duplicates) 4
2024/03/25 05:08:16  [INFO]   : total operator images to copy 4
2024/03/25 05:08:16  [INFO]   : total additional images to copy 0
2024/03/25 05:08:16  [INFO]   : images to mirror 4
2024/03/25 05:08:16  [INFO]   : batch count 1
2024/03/25 05:08:16  [INFO]   : batch index 0 
2024/03/25 05:08:16  [INFO]   : batch size 4
2024/03/25 05:08:16  [INFO]   : remainder size 0
2024/03/25 05:08:16  [INFO]   : starting batch 0
2024/03/25 05:08:27  [INFO]   : completed batch 0
2024/03/25 05:08:42  [INFO]   : start time      : 2024-03-25 05:07:57.7405637 +0000 UTC m=+0.058744792
2024/03/25 05:08:42  [INFO]   : collection time : 2024-03-25 05:08:16.069731565 +0000 UTC m=+18.387912740
2024/03/25 05:08:42  [INFO]   : mirror time     : 2024-03-25 05:08:42.4006485 +0000 UTC m=+44.71882960

Expected results:

2) For the stable channel, version 5.7.7 of cluster-logging cannot be found, so the mirror should fail with an error.

Description of problem:

    In case the interface changes, we might miss updating AWS and not realize it.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    No issue currently but could potentially break in the future.

Expected results:

    

Additional info:

    

Under heavy load(?), crictl can fail and return errors that iptables-alerter does not handle correctly; as a result, it may accidentally end up checking for iptables rules in hostNetwork pods and then logging events about them.

This is a clone of issue OCPBUGS-37782. The following is the description of the original issue:

Description of problem:

    ci/prow/security is failing on google.golang.org/grpc/metadata

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

always    

Steps to Reproduce:

    1. Run the ci/prow/security job on a 4.15 PR
    2.
    3.
    

Actual results:

    Medium severity vulnerability found in google.golang.org/grpc/metadata

Expected results:

    

Additional info:

 

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/305

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The tooltip for the Pipeline when expression is not shown in the Pipeline visualization.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. Create a Pipeline with a whenExpression
  2. Navigate to the Pipeline details page
  3. Hover over the whenExpression diamond shape

Actual results:

The when expression tooltip is not shown on hover

Expected results:

The when expression tooltip should be shown on hover

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Workaround:

Additional info:

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/273

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-32053. The following is the description of the original issue:

Description of problem:

    The single page docs are missing the "oc adm policy add-cluster-role-to* and remove-cluster-role-from-* commands.  These options exist in these docs:

https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html

but not in these docs:

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user 

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1263

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:
The REST API returns 422 if the input doesn't match the regex defined in the swagger, whereas our code returns 400 for input errors. There may be other cases where the generated errors are inconsistent with ours. We should change 422 to 400 and review the rest. 

Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/99

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/image-registry/pull/401

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The specified network tags are not applied to control-plane machines.

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-multi-2024-06-04-172027

How reproducible:

    Always

Steps to Reproduce:

1. "create install-config"
2. edit install-config.yaml to insert network tags and featureSet setting (see [1])
3. "create cluster" and make sure it succeeds
4. using "gcloud" to check the network tags of the cluster machines (see [2])

Actual results:

    The specified network tags are not applied to control-plane machines, although the compute/worker machines do have the specified network tags.

Expected results:

    The specified network tags should be applied to both control-plane machines and compute/worker machines.

Additional info:

QE's Flexy-install job: /Flexy-install/288061/

VARIABLES_LOCATION private-templates/functionality-testing/aos-4_16/ipi-on-gcp/versioned-installer_techpreview

LAUNCHER_VARS
installer_payload_image: quay.io/openshift-release-dev/ocp-release-nightly:4.16.0-0.nightly-multi-2024-06-04-172027
num_workers: 2
control_plane_tags: ["installer-qe-tag01", "installer-qe-tag02"]
compute_tags: ["installer-qe-tag01", "installer-qe-tag03"]

This is a clone of issue OCPBUGS-37506. The following is the description of the original issue:

Description of problem:

Install an Azure fully private IPI cluster using CAPI with a payload built from cluster bot including openshift/installer#8727 and openshift/installer#8732.

install-config:
=================
platform:
  azure:
    region: eastus
    outboundType: UserDefinedRouting
    networkResourceGroupName: jima24b-rg
    virtualNetwork: jima24b-vnet
    controlPlaneSubnet: jima24b-master-subnet
    computeSubnet: jima24b-worker-subnet
publish: Internal
featureSet: TechPreviewNoUpgrade

Checking the storage account created by the installer, its allowBlobPublicAccess property is set to True.
$ az storage account list -g jima24b-fwkq8-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
jima24bfwkq8sa    True

This is not consistent with the terraform code: https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L74

At a minimum, the storage account should have no public access for a fully private cluster.

Version-Release number of selected component (if applicable):

    4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Create fully private cluster
    2. Check storage account created by installer
    3.
    

Actual results:

    The storage account has public access on a fully private cluster.

Expected results:

     The storage account should have no public access on a fully private cluster.

Additional info:

    

This is a clone of issue OCPBUGS-43041. The following is the description of the original issue:

Description of problem:

    A slice of something like

idPointers := make([]*string, len(ids))

should be corrected to 

idPointers := make([]*string, 0, len(ids))

When the capacity is not provided to make for slice creation, the slice is created with the given length (the last argument) and filled with zero values. For instance, s := make([]int, 5) creates a slice {0, 0, 0, 0, 0}. If this slice is appended to, rather than having its elements set by index, the zero values remain and the appended values follow them.

1. If we append to the slice, we leave behind the zero values (this could change the behavior of the function the slice is passed to). It also allocates more memory than needed.
2. If we don't fill the slice completely (i.e. create a length of 5 and only set 4 elements), the same issue as above comes into play.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

secrets-store-csi-driver with the AWS provider does not work in a HyperShift hosted cluster; the pod can't mount the volume successfully.

Version-Release number of selected component (if applicable):

secrets-store-csi-driver-operator.v4.14.0-202308281544 in 4.14.0-0.nightly-2023-09-06-235710 HyperShift hosted cluster.

How reproducible:

Always

Steps to Reproduce:

1. Follow test case OCP-66032 "Setup" part to install secrets-store-csi-driver-operator.v4.14.0-202308281544 , secrets-store-csi-driver and AWS provider successfully:

$ oc get po -n openshift-cluster-csi-drivers
NAME                                                READY   STATUS    RESTARTS   AGE
aws-ebs-csi-driver-node-7xxgr                       3/3     Running   0          5h18m
aws-ebs-csi-driver-node-fmzwf                       3/3     Running   0          5h18m
aws-ebs-csi-driver-node-rgrxd                       3/3     Running   0          5h18m
aws-ebs-csi-driver-node-tpcxq                       3/3     Running   0          5h18m
csi-secrets-store-provider-aws-2fm6q                1/1     Running   0          5m14s
csi-secrets-store-provider-aws-9xtw7                1/1     Running   0          5m15s
csi-secrets-store-provider-aws-q5lvb                1/1     Running   0          5m15s
csi-secrets-store-provider-aws-q6m65                1/1     Running   0          5m15s
secrets-store-csi-driver-node-4wdc8                 3/3     Running   0          6m22s
secrets-store-csi-driver-node-n7gkj                 3/3     Running   0          6m23s
secrets-store-csi-driver-node-xqr52                 3/3     Running   0          6m22s
secrets-store-csi-driver-node-xr24v                 3/3     Running   0          6m22s
secrets-store-csi-driver-operator-9cb55b76f-7cbvz   1/1     Running   0          7m16s

2. Follow test case OCP-66032 steps to create AWS secret, set up AWS IRSA successfully.

3. Follow the test case OCP-66032 steps to create the SecretProviderClass and a deployment that uses it. Then check the pod; it is stuck in ContainerCreating:

$ oc get po
NAME                               READY   STATUS              RESTARTS   AGE
hello-openshift-84c76c5b89-p5k4f   0/1     ContainerCreating   0          10m

$ oc describe po hello-openshift-84c76c5b89-p5k4f
...
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    11m   default-scheduler  Successfully assigned xxia-proj/hello-openshift-84c76c5b89-p5k4f to ip-10-0-136-205.us-east-2.compute.internal
  Warning  FailedMount  11m   kubelet            MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: 92d1ff5b-36be-4cc5-9b55-b12279edd78e
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: 50907328-70a6-44e0-9f05-80a31acef0b4
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: 617dc3bc-a5e3-47b0-b37c-825f8dd84920
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: 8ab5fc2c-00ca-45e2-9a82-7b1765a5df1a
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: b76019ca-dc04-4e3e-a305-6db902b0a863
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: b395e3b2-52a2-4fc2-80c6-9a9722e26375
  Warning  FailedMount  11m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: ec325057-9c0a-4327-80c9-a9b6233a64dd
  Warning  FailedMount  10m  kubelet  MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
           status code: 400, request id: 405492b2-ed52-429b-b253-6a7c098c26cb
  Warning  FailedMount  82s (x5 over 9m35s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[secrets-store-inline], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
  Warning  FailedMount  74s (x5 over 9m25s)  kubelet  (combined from similar events): MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod xxia-proj/hello-openshift-84c76c5b89-p5k4f, err: rpc error: code = Unknown desc = us-east-2: Failed fetching secret xxiaSecret: WebIdentityErr: failed to retrieve credentials
caused by: InvalidIdentityToken: Incorrect token audience
  status code: 400, request id: c38bbed1-012d-4250-b674-24ab40607920

Actual results:

Hit above stuck issue.

Expected results:

Pod should be Running.

Additional info:

Compared with another operator (cert-manager-operator) that also uses AWS IRSA (OCP-62500), that case works well, so secrets-store-csi-driver-operator has a bug.

OCP version: 4.15.0

We have monitoring alerts configured against a cluster in our longevity setup.
After receiving alerts for metal3, we examined the graph for the pod.

The graph indicates a continuous steady growth of memory consumption.

Open Github Security Advisory for: containers/image

https://github.com/advisories/GHSA-6wvf-f2vw-3425

The ARO SRE team became aware of this advisory against our installer fork. The upstream installer is also pinning a vulnerable version of containers/image.

The advisory recommends updating to version 5.30.1.

This is a clone of issue OCPBUGS-43520. The following is the description of the original issue:

Description of problem:

   When installing a GCP cluster with the CAPI based method, the kube-api firewall rule that is created always uses a source range of 0.0.0.0/0. In the prior terraform based method, internal published clusters were limited to the network_cidr. This change opens up the API to additional sources, which could be problematic such as in situations where traffic is being routed from a non-cluster subnet.
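One hedged way to confirm the created rule's source ranges after install (the firewall rule name pattern and filter below are assumptions, not taken from the installer):

$ gcloud compute firewall-rules list \
    --filter="name~'<infra-id>.*api'" \
    --format="table(name,sourceRanges.list())"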

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Install a cluster in GCP with publish: internal
    2.
    3.
    

Actual results:

    Kube-api firewall rule has source of 0.0.0.0/0

Expected results:

    Kube-api firewall rule has a more limited source of network_cidr

Additional info:

    

This is a clone of issue OCPBUGS-38118. The following is the description of the original issue:

Description of problem:

A customer is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.

Environment details: Nutanix versions: AOS 6.5.4, NCC 4.6.6.3, PC pc.2023.4.0.2, LCM 3.0.0.1.

During the installation process, after the bootstrap node and control planes are created, the IP addresses of the nodes shown in the Nutanix Dashboard conflict, even when infinite DHCP leases are set. The installation works successfully only when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully. The IPs of master0 and master2 are conflicting; please check the attachment. Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing The issue was reported via the Slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699

Version-Release number of selected component (if applicable):

    

How reproducible:

Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The installation will fail. 

Expected results:

    The installation succeeds to create a Nutanix OCP cluster with the DHCP network.

Additional info:

    

This is a clone of issue OCPBUGS-43448. The following is the description of the original issue:

Description of problem:

The cluster policy controller does not get the same feature flags that other components in the control plane are getting.
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create hosted cluster
    2. Get cluster-policy-controller-config configmap from control plane namespace
    

Actual results:

Default feature gates are not included in the config
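A hedged way to inspect the rendered config (the clusters-<hosted-cluster-name> namespace pattern and the config.yaml data key are assumptions):

$ oc get configmap cluster-policy-controller-config \
    -n clusters-<hosted-cluster-name> \
    -o jsonpath='{.data.config\.yaml}' | grep -i featuregate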
    

Expected results:

Feature gates are included in the config
    

Additional info:


    

Description of problem:

    When creating a serverless function in create serverless form, BuildConfig is not created

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1.Install Serverless operator
    2.Add https://github.com/openshift-dev-console/kn-func-node-cloudevents in create serverless form     
    3.Create the function and check BuildConfig page
    

Actual results:

    BuildConfig is not created

Expected results:

    Should create BuildConfig

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/38

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/169

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The self-managed hypershift cli (hcp) reports an inaccurate OCP supported version.

For example, if I have a hypershift-operator deployed which supports OCP v4.14 and I build the hcp cli from the latest source code, when I execute "hcp -v", the cli tool reports the following. 


$ hcp -v
hcp version openshift/hypershift: 02bf7af8789f73c7b5fc8cc0424951ca63441649. Latest supported OCP: 4.16.0

This makes it appear that the hcp cli is capable of deploying OCP v4.16.0, when the backend is actually limited to v4.14.0.

The cli needs to indicate what the server is capable of deploying. Otherwise it appears that v4.16.0 would be deployable in this scenario, but the backend would not allow that. 


Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    100%

Steps to Reproduce:

    1. download an HCP client that does not match the hypershift-operator backend
    2. execute 'hcp -v'
    3. the reported "Latest supported OCP" is not representative of the version the hypershift-operator actually supports     

Actual results:

   

Expected results:

     hcp cli reports a latest OCP version that is representative of what the deployed hypershift operator is capable of deploying. 

Additional info:

    

If not specified, these are defined by an environment variable in the image service, so connected users are not aware of the images that exist in the deployment. This can cause confusion when adding a new release image: the result is an error on the InfraEnv. The error is clear, but the user will not understand what needs to be done:

Failed to create image: The requested RHCOS version (4.14, arch: x86_64) does not have a matching OpenShift release image'

This is a clone of issue OCPBUGS-38249. The following is the description of the original issue:

openshift/api was bumped in CNO without running codegen. codegen needs to be run

Description of problem:

If a cluster admin creates a new MachineOSConfig that references a legacy pull secret, the canonicalized version of this secret that gets created is not updated whenever the original pull secret changes.

 

How reproducible:

Always

 

Steps to Reproduce:

  1. Create a new legacy-style Docker pull secret in the MCO namespace. Specifically, one which follows the pattern of {"hostname.com": {"username": ""...}.

  2. Create a MachineOSConfig that references this legacy pull secret. The MachineOSConfig will get updated with a different secret name with the suffix -canonical.
  3. Change the original legacy-style Docker pull secret that was created to a different secret.

Actual results:

The canonicalized version of the pull secret is never updated with the contents of the legacy-style pull secret.

 

Expected results:

Ideally, the canonicalized version of the pull secret should be updated since BuildController created it.

 

Additional info:

This occurs because when the legacy pull secret is initially detected, BuildController canonicalizes it and then updates the MachineOSConfig with the name of the canonicalized secret. The next time this secret is referenced, the original secret does not get read.
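For reference, the decoded payloads differ roughly as follows (registry name and credentials are illustrative, not from this bug):

Legacy .dockercfg payload:
{"registry.hostname.com": {"username": "user", "password": "pass", "auth": "dXNlcjpwYXNz"}}

Canonicalized .dockerconfigjson payload:
{"auths": {"registry.hostname.com": {"username": "user", "password": "pass", "auth": "dXNlcjpwYXNz"}}}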

Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/78

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/39

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the OCP upgrades from 4.13 to 4.14, the canary route configuration is changed as below: 

 

Canary route configuration in OCP 4.13:
$ oc get route -n openshift-ingress-canary canary -oyaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    ingress.openshift.io/canary: canary_controller
  name: canary
  namespace: openshift-ingress-canary
spec:
  host: canary-openshift-ingress-canary.apps.<cluster-domain>.com <---- canary route configured with .spec.host

Canary route configuration in OCP 4.14:
$ oc get route -n openshift-ingress-canary canary -oyaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  labels:
    ingress.openshift.io/canary: canary_controller
  name: canary
  namespace: openshift-ingress-canary
spec:
  port:
    targetPort: 8080
  subdomain: canary-openshift-ingress-canary <---- canary route configured with .spec.subdomain

 

After the upgrade, the following messages are printed in the ingress-operator pod: 

2024-04-24T13:16:34.637Z        ERROR   operator.init   controller/controller.go:265    Reconciler error        {"controller": "canary_controller", "object": {"name":"default","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "default", "reconcileID": "46290893-d755-4735-bb01-e8b707be4053", "error": "failed to ensure canary route: failed to update canary route openshift-ingress-canary/canary: Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable"}
 

The issue is resolved when the canary route is deleted. 
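A minimal sketch of that manual workaround (the operator then recreates the route using .spec.subdomain):

$ oc -n openshift-ingress-canary delete route canary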

See below the audit logs from the process: 

# The route can't be updated with error 422: 

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4e8bfb36-21cc-422b-9391-ef8ff42970ca","stage":"ResponseComplete","requestURI":"/apis/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"update","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingress-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93","10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","uid":"3e179946-d4e3-45ad-9380-c305baefd14e","apiGroup":"route.openshift.io","apiVersion":"v1","resourceVersion":"297888"},"responseStatus":{"metadata":{},"status":"Failure","message":"Route.route.openshift.io \"canary\" is invalid: spec.subdomain: Invalid value: \"canary-openshift-ingress-canary\": field is immutable","reason":"Invalid","details":{"name":"canary","group":"route.openshift.io","kind":"Route","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"canary-openshift-ingress-canary\": field is immutable","field":"spec.subdomain"}]},"code":422},"requestReceivedTimestamp":"2024-04-24T13:16:34.630249Z","stageTimestamp":"2024-04-24T13:16:34.636869Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-operator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}

# Route is deleted manually

"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"70821b58-dabc-4593-ba6d-5e81e5d27d21","stage":"ResponseComplete","requestURI":"/aps/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes/canary","verb":"delete","user":{"username":"system:admin","groups":["system:masters","syste:authenticated"]},"sourceIPs":["10.0.91.78","10.128.0.2"],"userAgent":"oc/4.13.0 (linux/amd64) kubernetes/7780c37","objectRef":{"resource":"routes","namespace:"openshift-ingress-canary","name":"canary","apiGroup":"route.openshift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","details":{"ame":"canary","group":"route.openshift.io","kind":"routes","uid":"3e179946-d4e3-45ad-9380-c305baefd14e"},"code":200},"requestReceivedTimestamp":"2024-04-24T1324:39.558620Z","stageTimestamp":"2024-04-24T13:24:39.561267Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

# Route is created again

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"92e6132a-aa1d-482d-a1dc-9ce021ae4c37","stage":"ResponseComplete","requestURI":"/aps/route.openshift.io/v1/namespaces/openshift-ingress-canary/routes","verb":"create","user":{"username":"system:serviceaccount:openshift-ingress-operator:ingres-operator","groups":["system:serviceaccounts","system:serviceaccounts:openshift-ingress-operator","system:authenticated"],"extra":{"authentication.kubernetesio/pod-name":["ingress-operator-746cd8598-hq2st"],"authentication.kubernetes.io/pod-uid":["f3ebccdf-f3b3-420d-8ea5-e33d98945403"]}},"sourceIPs":["10.128.0.93""10.128.0.2"],"userAgent":"Go-http-client/2.0","objectRef":{"resource":"routes","namespace":"openshift-ingress-canary","name":"canary","apiGroup":"route.opensift.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2024-04-24T13:24:39.577255Z","stageTimestamp":"2024-04-24T1:24:39.584371Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"openshift-ingress-perator\" of ClusterRole \"openshift-ingress-operator\" to ServiceAccount \"ingress-operator/openshift-ingress-operator\""}}

 

Version-Release number of selected component (if applicable):

    Ocp upgrade between 4.13 and 4.14

How reproducible:

    Upgrade the cluster from OCP 4.13 to 4.14 and check the ingress operator pod logs

Steps to Reproduce:

    1. Install cluster in OCP 4.13
    2. Upgrade to OCP 4.14
    3. Check the ingress operator logs
    

Actual results:

    Reported errors above

Expected results:

    The ingress canary route should be updated without issues

Additional info:

    

This is a clone of issue OCPBUGS-38925. The following is the description of the original issue:

Description of problem:

periodics are failing due to a change in coreos.    

Version-Release number of selected component (if applicable):

    4.15,4.16,4.17,4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Check any periodic conformance jobs
    2.
    3.
    

Actual results:

    periodic conformance fails with hostedcluster creation

Expected results:

    periodic conformance test suceeds 

Additional info:

    

This is a clone of issue OCPBUGS-42732. The following is the description of the original issue:

Description of problem:

    The operator cannot succeed in removing resources when the management state is set to Removed (after networkAccess has been set to Internal).
    It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal).
    This is either caused by weird behavior in the azure sdk, or in the azure api itself.
    The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145
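    For illustration, a minimal sketch of that handling, assuming the azblob bloberror helpers from the Azure SDK for Go (the helper name and package layout here are hypothetical, not the operator's actual code):

package azure // hypothetical package name

import "github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"

// isTolerableAuthError reports whether err carries either of the two authorization
// error codes that can be returned once the storage account is private, so the
// operator can treat both the same way during resource removal.
func isTolerableAuthError(err error) bool {
	return bloberror.HasCode(err,
		bloberror.AuthorizationPermissionMismatch,
		bloberror.AuthorizationFailure)
}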

    The error condition is the following:

status:
  conditions:
  - lastTransitionTime: "2024-09-27T09:04:20Z"
    message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE
      403: 403 This request is not authorized to perform this operation.\nERROR CODE:
      AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml
      version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This
      request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n"
    reason: AzureError
    status: Unknown
    type: StorageExists
  - lastTransitionTime: "2024-09-27T09:02:26Z"
    message: The registry is removed
    reason: Removed
    status: "True"
    type: Available 

Version-Release number of selected component (if applicable):

    4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)

How reproducible:

    Always

Steps to Reproduce:

    1. Get an Azure cluster
    2. In the operator config, set networkAccess to Internal
    3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`)
    4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge`
    5. Watch the cluster operator conditions for the error

Actual results:

    

Expected results:

    

Additional info:

    

When the custom AMI feature was introduced, the Installer didn't support machine pools. Now that it does, and has done for a while, we should deprecate the field `platform.aws.amiID`.

The same effect is now achieved by setting `platform.aws.defaultMachinePlatform.amiID`.
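For illustration, the equivalent install-config.yaml snippet (the AMI ID value is hypothetical):

platform:
  aws:
    defaultMachinePlatform:
      amiID: ami-0123456789abcdef0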

Description of problem:

Checked on 4.17.0-0.nightly-2024-08-13-031847: there are two Metrics tabs in the 4.17 developer console under the "Observe" section, see picture: https://drive.google.com/file/d/1x7Jm2Q9bVDOdFcctjG6WOUtIv_nsD9Pd/view?usp=sharing

checked, https://github.com/openshift/monitoring-plugin/pull/138 is merged to 4.17, but https://github.com/openshift/console/pull/14105 is merged to 4.18, not merged to 4.17

example, code

const expectedTabs: string[] = ['Dashboards', 'Silences', 'Events'] 

merged to 4.18

https://github.com/openshift/console/blob/release-4.18/frontend/packages/dev-console/src/components/monitoring/__tests__/MonitoringPage.spec.tsx#L35

but not merged to 4.17

https://github.com/openshift/console/blob/release-4.17/frontend/packages/dev-console/src/components/monitoring/__tests__/MonitoringPage.spec.tsx

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-13-031847

How reproducible:

always

Steps to Reproduce:

1. go to developer console, select one project and click Observe

Actual results:

Two Metrics tabs in the 4.17 developer console

Expected results:

only one

Additional info:

Should backport https://github.com/openshift/console/pull/14105 to 4.17.

Release a new stable branch of Gophercloud with the following changes:

  • Use a more recent version of Go (>=1.21)
  • accept context for tracing and cancellation across all network-bound functions
  • make error handling more ergonomic by enabling (error).Unwrap() and removing the support for deprecated alternative signaling
  • move functionality away from deprecated "extension" packages
  • remove support for OpenStack Yoga

After successfully creating a NAD of type: "OVN Kubernetes secondary localnet network", when viewing the object in the GUI, it will say that it is of type "OVN Kubernetes L2 overlay network".

When examining the objects YAML, it is still correctly configured as a NAD type of localnet.

Version-Release number of selected component:
OCP Virtualization 4.15.1

How reproducible:100%

Steps to Reproduce:
1. Create appropriate NNCP and apply
for example:

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: nncp-br-ex-vlan-101
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: '' 
  desiredState:
    ovn:
      bridge-mappings:
      - localnet: vlan-101 
        bridge: br-ex
        state: present 

2. Create localnet type NAD (from GUI or YAML)
For example:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan-101
  namespace: default
spec:
  config: |2 
    {
           "name":"br-ex",
           "type":"ovn-k8s-cni-overlay",
           "cniVersion":"0.4.0",
           "topology":"localnet",
           "vlanID":101,
           "netAttachDefName":"default/vlan-101"
     } 

3. View through the GUI by clicking Networking -> NetworkAttachmentDefinitions -> the NAD you just created

4.  When you look under type it will incorrectly display as Type: OVN Kubernetes L2 overlay Network

 

Actual results:

Type is displayed as OVN Kubernetes L2 overlay Network

If you examine the YAML for the NAD you will see that it is indeed still of type localnet

Please see attached screenshots for display of NAD type and the actual YAML of NAD.

At this point in time it looks as though this is just a display error.

Expected results:
Type should be displayed as OVN Kubernetes secondary localnet network

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/72

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Example test run failed https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-monitoring-operator/2416/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws-ovn-techpreview/1815492554490122240#1:build-log.txt%3A2457

{  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:409]: Unexpected error:
    <errors.aggregate | len:1, cap:1>: 
    promQL query returned unexpected results:
    avg_over_time(cluster:telemetry_selected_series:count[43m25s]) >= 750
    [
      {
        "metric": {
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1721688289.455,
          "752.1379310344827"
        ]
      }
    ]
    [
        <*errors.errorString | 0xc001a6d120>{
            s: "promQL query returned unexpected results:\navg_over_time(cluster:telemetry_selected_series:count[43m25s]) >= 750\n[\n  {\n    \"metric\": {\n      \"prometheus\": \"openshift-monitoring/k8s\"\n    },\n    \"value\": [\n      1721688289.455,\n      \"752.1379310344827\"\n    ]\n  }\n]",
        },
    ]
occurred
Ginkgo exit error 1: exit with code 1}

This test blocks PR merges in CMO

Description of problem:

The Perf & Scale team is running scale tests to find out the maximum supported number of egress IPs and came across this issue. When we have 55339 EgressIP objects (each with one egress IP address) in a 118 worker node baremetal cluster, the multus-admission-controller pod is stuck in CrashLoopBackOff state.

"oc describe pod" command output is copied here http://storage.scalelab.redhat.com/anilvenkata/multus-admission/multus-admission-controller-84b896c8-kmvdk.describe 

"oc describe pod" shows that the names of all 55339 egress ips are passed to container's exec command 
#cat multus-admission-controller-84b896c8-kmvdk.describe  | grep ignore-namespaces | tr ',' '\n' | grep -c egressip
55339

and the exec command fails because the argument list is too long:
# oc logs  -n openshift-multus multus-admission-controller-84b896c8-kmvdk
Defaulted container "multus-admission-controller" out of: multus-admission-controller, kube-rbac-proxy
exec /bin/bash: argument list too long

# oc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
network   4.14.16   True        True          False      35d     Deployment "/openshift-multus/multus-admission-controller" update is rolling out (1 out of 3 updated)

# oc describe pod -n openshift-multus multus-admission-controller-84b896c8-kmvdk > multus-admission-controller-84b896c8-kmvdk.describe
 
# oc get pods -n openshift-multus  | grep multus-admission-controller
multus-admission-controller-6c58c66ff9-5x9hn   2/2     Running            0                35d
multus-admission-controller-6c58c66ff9-zv9pd   2/2     Running            0                35d
multus-admission-controller-84b896c8-kmvdk     1/2     CrashLoopBackOff   26 (2m56s ago)   110m

As this environment has 55338 namespaces (each namespace with 1 pod and 1 EgressIP object), it will be hard to capture a must-gather.

Version-Release number of selected component (if applicable):

    4.14.16

How reproducible:

    always

Steps to Reproduce:

    1. use kube-burner to create 55339 egress ip obejct, each object with one egress ip address. 
    2. We will see multus-admission-controller pod stuck in CrashLoopBackOff     
    

Actual results:

    

Expected results:

    

Additional info:

    

This is a clone of issue OCPBUGS-38922. The following is the description of the original issue:

Description of problem:

With "Configuring a private storage endpoint on Azure by enabling the Image Registry Operator to discover VNet and subnet names" [1], creating a cluster with the internal image registry creates a storage account with a private endpoint, so once a new PVC uses the same skuName as this private storage account, it hits a mount permission issue.
 

[1] https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-private-cluster.html#configuring-private-storage-endpoint-azure-vnet-subnet-iro-discovery_configuring-private-cluster

Version-Release number of selected component (if applicable):

4.17

How reproducible:

Always

Steps to Reproduce:

Creating cluster with flexy job: aos-4_17/ipi-on-azure/versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm profile and specify enable_internal_image_registry: "yes"
Create pod and pvc with azurefile-csi sc     

Actual results:

The pod fails to come up due to a mount error:

mount //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount failed with mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,gid=1018570000,file_mode=0777,dir_mode=0777, //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount
  Output: mount error(13): Permission denied 

Expected results:

Pod should be up

Additional info:

There are simple workarounds, such as using a StorageClass with networkEndpointType: privateEndpoint or specifying another storage account, but using the pre-defined azurefile-csi StorageClass will fail, and the automation is not easy to work around.

I'm not sure if CSI Driver could check if the reused storage account has the private endpoint before using the existing storage account. 
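A minimal sketch of the networkEndpointType workaround mentioned above (class name and skuName are assumptions; other parameters mirror the pre-defined azurefile-csi class):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private      # hypothetical name
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS            # assumed; align with the sku in use
  networkEndpointType: privateEndpoint
reclaimPolicy: Delete
volumeBindingMode: Immediate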

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/288

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Because of a bug in upstream CAPA, the Load Balancer ingress rules are continuously revoked and then authorized, causing unnecessary AWS API calls and cluster provision delays.

Version-Release number of selected component (if applicable):

4.16+

How reproducible:

always

Steps to Reproduce:

1.
2.
3.

Actual results:

A constant loop of revoke-authorize of ingress rules.

Expected results:

Rules should be revoked only when needed (for example, when the installer removes the allow-all ssh rule). In the other cases, rules should be authorized only once.

Additional info:

Upstream issue created: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5023
PR submitted upstream: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5024

Description of problem:

In PowerVS, when I try and deploy a 4.17 cluster, I see the following ProbeError event:
Liveness probe error: Get "https://192.168.169.11:10258/healthz": dial tcp 192.168.169.11:10258: connect: connection refused

Version-Release number of selected component (if applicable):

release-ppc64le:4.17.0-0.nightly-ppc64le-2024-06-14-211304

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster

Description of problem:

The MultipleDefaultStorageClasses alert has an incorrect rule: it does not deactivate right after the user fixes the cluster to have only one default storage class, but stays active for another ~5 minutes after the fix is applied.

Version-Release number of selected component (if applicable):

OCP 4.11+

How reproducible:

always (platform independent, reproducible with any driver and storage class)

Steps to Reproduce:

Set additional storage class as default
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```

Check that prometheus metrics is now > 1
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s --data-urlencode "query=default_storage_class_count" http://localhost:9090/api/v1/query | jq -r '.data.result[0].value[1]'
2
```

Wait at least 5 minutes for alert to be `pending`, after 10 minutes the alert starts `firing`
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```

Annotate storage class as non default, making sure there's only one default now
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```

Alert is still present for 5 minutes but should have disappeared immediately - this is the actual bug
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```

After 5 minutes alert is gone
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
```

Root cause: the alerting rule uses `max_over_time` but it should be `min_over_time` here:
https://github.com/openshift/cluster-storage-operator/blob/7b4d8861d8f9364d63ad9a58347c2a7a014bff70/manifests/12_prometheusrules.yaml#L19
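Illustratively, the fix amounts to switching the aggregation function in the alert expression (the exact window and threshold below are assumptions based on the metric shown above):

# current behaviour: keeps firing for the whole window after the fix
max_over_time(default_storage_class_count[5m]) > 1
# proposed behaviour: clears as soon as only one default storage class remains
min_over_time(default_storage_class_count[5m]) > 1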

Additional info:

To verify changes follow the same procedure and verify that the alert is gone right after the settings are fixed (meaning there's only 1 default storage class again).

Changes are tricky to test: on a live cluster, changing the Prometheus rule won't work as it will get reconciled by CSO, but if CSO is scaled down to prevent this then metrics are not collected. I'd suggest testing this by editing CSO code, scaling down CSO+CVO and running CSO locally; see the README with instructions on how to do it: https://github.com/openshift/cluster-storage-operator/blob/master/README.md

Component Readiness has found a potential regression in [sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial].

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-05-24T00:00:00Z
End Time: 2024-05-30T23:59:59Z
Success Rate: 13.33%
Successes: 2
Failures: 13
Flakes: 0

Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 42
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&arch=amd64&baseEndTime=2024-02-28%2023%3A59%3A59&baseRelease=4.15&baseStartTime=2024-02-01%2000%3A00%3A00&capability=Other&component=kube-apiserver&confidence=95&environment=ovn%20no-upgrade%20amd64%20aws%20serial%2Ctechpreview&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&network=ovn&pity=5&platform=aws&platform=aws&sampleEndTime=2024-05-30%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-05-24%2000%3A00%3A00&testId=openshift-tests%3A6de2ed665ef7cc3434e216343df033db&testName=%5Bsig-api-machinery%5D%20API%20data%20in%20etcd%20should%20be%20stored%20at%20the%20correct%20location%20and%20version%20for%20all%20resources%20%5BSerial%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D&upgrade=no-upgrade&upgrade=no-upgrade&variant=serial%2Ctechpreview&variant=serial%2Ctechpreview

This issue is actively blocking payloads as no techpreview serial jobs can pass: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-overall-analysis-all/1796109729718603776

Sippy test data shows this permafailing around the time the kube rebase merged.

Test failure output:

[sig-api-machinery] API data in etcd should be stored at the correct location and version for all resources [Serial] [Suite:openshift/conformance/serial] expand_less 	46s
{  fail [github.com/openshift/origin/test/extended/etcd/etcd_storage_path.go:534]: test failed:
no test data for resource.k8s.io/v1alpha2, Kind=ResourceClaimParameters.  Please add a test for your new type to etcdStorageData.
no test data for resource.k8s.io/v1alpha2, Kind=ResourceClassParameters.  Please add a test for your new type to etcdStorageData.
no test data for resource.k8s.io/v1alpha2, Kind=ResourceSlice.  Please add a test for your new type to etcdStorageData.
etcd data does not match the types we saw:
seen but not in etcd data:
[
	resource.k8s.io/v1alpha2, Resource=resourceclassparameters 
	resource.k8s.io/v1alpha2, Resource=resourceslices 
	resource.k8s.io/v1alpha2, Resource=resourceclaimparameters]
Ginkgo exit error 1: exit with code 1}

The provisioning CR is now created with a paused annotation (since https://github.com/openshift/installer/pull/8346)

On baremetal IPI installs, this annotation is removed at the conclusion of bootstrapping.

On assisted/ABI installs there is nothing to remove it, so cluster-baremetal-operator never deploys anything.

This is a clone of issue OCPBUGS-41852. The following is the description of the original issue:

Description of problem:

Update the tested instance types for IBM Cloud.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

1. Some new instance types need to be added
2. Match the memory and CPU limitations

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://docs.openshift.com/container-platform/4.16/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installation-ibm-cloud-tested-machine-types_installing-ibm-cloud-customizations     

Description of problem:

An IBM Cloud DNS zone does not go into the "Active" state unless a permitted network is added to it. So if we try to use a DNS zone which does not have a VPC attached as a permitted network, the installer fails with the error "failed to get DNS Zone id". We already have code to attach a permitted network to a DNS zone, but it cannot be used unless the DNS zone is in the "Active" state. The zone does not even show up in the install-config survey.
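A hedged sketch of attaching the VPC as a permitted network before running the installer (command and flag names are assumptions based on the IBM Cloud DNS Services CLI plugin and may differ):

$ ibmcloud dns instance-target <dns-instance-name>
$ ibmcloud dns permitted-network-add <dns-zone-id> --type vpc --vpc-crn <vpc-crn>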

Version-Release number of selected component (if applicable):

4.16, 4.17    

How reproducible:

In the scenario where the user attempts to create a private cluster without attaching a permitted network to the DNS Zone.

Steps to Reproduce:

    1. Create an IBM Cloud DNS zone in a DNS instance.
    2. openshift-install create install-config [OPTIONS]
    3. User created DNS zone won't show up in the selection for DNS Zone
    4. Proceed anyway choosing another private DNS zone.
    5. Edit the generated install-config and change basedomain to your zone.
    6. openshift-install create manifests [OPTIONS]
    7. The above step will fail with "failed to get DNS Zone id".

Actual results:

DNS Zone is not visible in the survey and creating manifests fails.

Expected results:

The DNS zone without permitted networks shows up in the survey and the installation completes.

The default behavior of disabling automated cleaning in the non-converged flow was removed in this PR: https://github.com/openshift/assisted-service/pull/5319
This now causes issues when a customer disables the converged flow but doesn't manually set automatedCleaningMode to disabled on their BMH.
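For reference, a minimal sketch of setting that field directly on a BareMetalHost (the host name is hypothetical; only the relevant field is shown):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                      # hypothetical host
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: disabled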

Description of problem:

The doc installing_ibm_cloud_public/installing-ibm-cloud-customizations.html does not have the tested instance type list.

Version-Release number of selected component (if applicable):

4.15

How reproducible:

   Always

Steps to Reproduce:

    1. https://docs.openshift.com/container-platform/4.15/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html does not list the tested VM types

Actual results:

  The tested instance types are not listed.

Expected results:

   List the tested instance types, as in https://docs.openshift.com/container-platform/4.15/installing/installing_azure/installing-azure-customizations.html#installation-azure-tested-machine-types_installing-azure-customizations

Additional info:

    

Description of problem:

  • We're seeing [0] in two customers' environments; one of the two confirmed this issue is replicated both in a freshly installed 4.14.26 cluster and in an upgraded cluster.
  • Looking at [1] and the changes since 4.13 in the vsphere-problem-detector, I see we introduced some additional vSphere permissions checks in the checkDataStoreWithURL() [2][3] function: it was initially suspected that it was due to [4], but this was backported to 4.14.26, where the customer confirms the issue persists.

[0]

$ omc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-78cbc7fdbb-2g9mx | grep -i -e datastore.go -e E0508
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839356       1 datastore.go:329] checking datastore ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/ for permissions
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839504       1 datastore.go:125] CheckStorageClasses: thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/
2024-05-08T07:44:05.842165300Z I0508 07:44:05.839522       1 datastore.go:142] CheckStorageClasses checked 7 storage classes, 1 problems found
2024-05-08T07:44:05.848251057Z E0508 07:44:05.848212       1 operator.go:204] failed to run checks: StorageClass thin-csi: storage policy openshift-storage-policy-tc01-rpdd7: unable to find datastore with URL ds:///vmfs/volumes/vsan:526390016b19d2b5-21ae3fd76fa61150/
[...]

[1] https://github.com/openshift/vsphere-problem-detector/compare/release-4.13...release-4.14
[2] https://github.com/openshift/vsphere-problem-detector/blame/release-4.14/pkg/check/datastore.go#L328-L344
[3] https://github.com/openshift/vsphere-problem-detector/pull/119
[4] https://issues.redhat.com/browse/OCPBUGS-28879

4.17.0-0.nightly-2024-05-16-195932 and 4.16.0-0.nightly-2024-05-17-031643 both have resource quota issues like

 failed to create iam: LimitExceeded: Cannot exceed quota for OpenIdConnectProvidersPerAccount: 100
	status code: 409, request id: f69bf82c-9617-408a-b281-92c1ef0ec974 
 failed to create infra: failed to create VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
	status code: 400, request id: f90dcc5b-7e66-4a14-aa22-cec9f602fa8e 

Seth has indicated he is working to clean things up in https://redhat-internal.slack.com/archives/C01CQA76KMX/p1715913603117349?thread_ts=1715557887.529169&cid=C01CQA76KMX

Description of problem:

In 4.16.0-0.nightly-2024-05-14-095225, a "logtostderr is removed in the k8s upstream and has no effect any more." warning is logged in the kube-rbac-proxy-main/kube-rbac-proxy-self/kube-rbac-proxy-thanos containers:

$ oc -n openshift-monitoring logs -c kube-rbac-proxy-main openshift-state-metrics-7f78c76cc6-nfbl4
W0514 23:19:50.052015       1 deprecated.go:66] 
==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...

$ oc -n openshift-monitoring logs -c kube-rbac-proxy-self openshift-state-metrics-7f78c76cc6-nfbl4
...
W0514 23:19:50.177692       1 deprecated.go:66] 
==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...

$ oc -n openshift-monitoring get pod openshift-state-metrics-7f78c76cc6-nfbl4 -oyaml | grep logtostderr -C3
spec:
  containers:
  - args:
    - --logtostderr
    - --secure-listen-address=:8443
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --upstream=http://127.0.0.1:8081/
--
      name: kube-api-access-v9hzd
      readOnly: true
  - args:
    - --logtostderr
    - --secure-listen-address=:9443
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --upstream=http://127.0.0.1:8082/

$ oc -n openshift-monitoring logs -c kube-rbac-proxy-thanos prometheus-k8s-0
W0515 02:55:54.209496       1 deprecated.go:66] 
==== Removed Flag Warning ======================logtostderr is removed in the k8s upstream and has no effect any more.===============================================
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep logtostderr -C3
    - --config-file=/etc/kube-rbac-proxy/config.yaml
    - --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    - --allow-paths=/metrics
    - --logtostderr=true
    - --tls-min-version=VersionTLS12
    env:
    - name: POD_IP

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-14-095225

How reproducible:

always

Steps to Reproduce:

1. see the description

Actual results:

logtostderr is removed in the k8s upstream and has no effect any more    

Expected results:

no such info    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/150

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/162

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-42106. The following is the description of the original issue:

Description of problem:

Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized.
Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster).

Controller manager is filled with entries like: 
- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4"
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"

In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org" . We are not sure what is causing this but it leads to duplicate entry, but when actualized in a pull secret map, the double entry is reduced to one. So the controller-manager finds the cardinality mismatch on the next check.

The duplication is evident in OpenShiftControllerManager/cluster:
      dockerPullSecret:
        internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
        registryURLs:
        - registry.build01.ci.openshift.org
        - registry.build01.ci.openshift.org


But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
  routes:
  - hostname: registry.build01.ci.openshift.org
    name: public-routes
    secretName: public-route-tls
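A hedged way to compare the two sources (the field paths below are assumptions based on the snippets above):

$ oc get openshiftcontrollermanager cluster \
    -o jsonpath='{.spec.observedConfig.dockerPullSecret.registryURLs}'
$ oc get configs.imageregistry.operator.openshift.io cluster \
    -o jsonpath='{.spec.routes[*].hostname}'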

Version-Release number of selected component (if applicable):

4.17.0-rc.3

How reproducible:

Constant on build01 but not on other build farms

Steps to Reproduce:

    1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager.
    2.
    3.
    

Actual results:

- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces.
- The openshift-controller-manager is hot looping and experiencing client throttling.

Expected results:

1. Initialization of pull secrets in a namespace should take < 1 second. On build01, it can take over 1.5 minutes.
2. openshift-controller-manager should not possess duplicate entries.
3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries.
4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/74

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

In order to safely roll out the new method of calculating the CPU system-reserved values [1], we would have to introduce versioning in auto node sizing. This way, even if the new method ends up reserving more CPU, existing customers won't see any dip in the amount of CPU available for their workloads.

 

[1] https://issues.redhat.com/browse/OCPNODE-2211

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Multus is now a Pod and will be captured by the normal oc adm must-gather command.

The multus.log file has been removed since 4.16 and doesn't exist anymore.

User Story:

As a (user persona), I want to be able to:

  • Add a baremetal arm64 node to a nodepool associated with a multi-arch hosted cluster.

so that I can achieve

  • that a hosted cluster can have nodepools related to different architectures (arm64,amd64)

Acceptance Criteria:

Description of criteria:

  • A baremetal arm64 node can join a nodepool associated with a hosted cluster that has other nodepools with baremetal amd64 nodes

Description of problem:

    After deploying a node with ZTP and applying a PerformanceProfile, the PerformanceProfile continually transitions between conditions (Available, Upgradeable, Progressing, Degraded) and the cluster tuning operator logs show a cycle of Reconciling/Updating the profile every 15 minutes or so.

No apparent impact to the cluster, but it is generating a lot of noise and concern with one of our partners

Version-Release number of selected component (if applicable):

Observed in 4.14.5 and 4.14.25    

How reproducible:

    Always

Steps to Reproduce:

    1.Deploy a node via ZTP
    2.Apply a performanceprofile via ACM/policies    

Actual results:

    PerformanceProfile is applied, but logs show repeated reconcile/update attempts, generating noise in the logs

Expected results:

    PerformaneProfile is applied and reconciled, but without the Updates and state transitions.

Additional info:

  Logs show that the MachineConfig for the perf profile is getting updated every 15 minutes, but nothing has been changed i.e. no change in the applied PerformanceProfile:

I0605 18:52:08.786257       1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:08.786568       1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:08.786899       1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:08.788604       1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:08.812015       1 status.go:83] Updating the performance profile "openshift-node-performance-profile" status
I0605 18:52:08.823836       1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:08.824049       1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:08.824994       1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:08.826478       1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:29.069218       1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:29.069349       1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:29.069571       1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:29.074617       1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"
I0605 18:52:29.088866       1 status.go:83] Updating the performance profile "openshift-node-performance-profile" status
I0605 18:52:29.096390       1 performanceprofile_controller.go:390] Reconciling PerformanceProfile
I0605 18:52:29.096506       1 resources.go:41] checking staged MachineConfig "rendered-master-478a5ff5b6e20bdac08368b380ae69c8"
I0605 18:52:29.096834       1 performanceprofile_controller.go:461] using "crun" as high-performance runtime class container-runtime for profile "openshift-node-performance-profile"
I0605 18:52:29.097912       1 resources.go:109] Update machine-config "50-performance-openshift-node-performance-profile"

# oc get performanceprofile -o yaml
<...snip...>
  status:
    conditions:
    - lastHeartbeatTime: "2024-06-05T19:09:08Z"
      lastTransitionTime: "2024-06-05T19:09:08Z"
      message: cgroup=v1;
      status: "True"
      type: Available
    - lastHeartbeatTime: "2024-06-05T19:09:08Z"
      lastTransitionTime: "2024-06-05T19:09:08Z"
      status: "True"
      type: Upgradeable
    - lastHeartbeatTime: "2024-06-05T19:09:08Z"
      lastTransitionTime: "2024-06-05T19:09:08Z"
      status: "False"
      type: Progressing
    - lastHeartbeatTime: "2024-06-05T19:09:08Z"
      lastTransitionTime: "2024-06-05T19:09:08Z"
      status: "False"
      type: Degraded

Caught by the test (among others):

[sig-network] there should be reasonably few single second disruptions for kube-api-http2-localhost-new-connections

Sample job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.17-micro-release-openshift-release-analysis-aggregator/1814754751858872320

Beginning with CI payload 4.17.0-0.ci-2024-07-20-200703, and continuing into nightlies with 4.17.0-0.nightly-2024-07-21-065611, the aggregated tests started recording anomalous disruption results. Most of the runs (as in the sample above) report success, but the aggregated test does not count them all, e.g.

Test Failed! suite=[root openshift-tests], testCase=[sig-network] there should be reasonably few single second disruptions for openshift-api-http2-service-network-reused-connections Message: Passed 2 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success name: '[sig-network] there should be reasonably few single second disruptions for openshift-api-http2-service-network-reused-connections' testsuitename: openshift-tests summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success'

https://github.com/openshift/origin/pull/28277 is the sole PR in that first payload and it certainly seems related, so I will put up a revert.

Starting with payload 4.17.0-0.nightly-2024-06-25-103421 we are seeing aggregated failures on AWS due to:

[sig-network-edge][Conformance][Area:Networking][Feature:Router][apigroup:route.openshift.io][apigroup:config.openshift.io] The HAProxy router should pass the http2 tests [apigroup:image.openshift.io][apigroup:operator.openshift.io] [Suite:openshift/conformance/parallel/minimal]

This test was recently re-enabled for AWS via https://github.com/openshift/origin/pull/28515

Description of problem:

compact agent e2e jobs are consistently failing the e2e test (when they manage to install):

 [sig-node] Managed cluster should verify that nodes have no unexpected reboots [Late] [Suite:openshift/conformance/parallel]

Examining CI search, I noticed that this failure also occurs in many other jobs:

https://search.dptools.openshift.org/?search=Managed+cluster+should+verify+that+nodes+have+no+unexpected+reboots&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

In CI search, we can see failures in 4.15-4.17

How reproducible:

See CI search results

Steps to Reproduce:

1.
2.
3.

Actual results:

The e2e test fails.

Expected results:

The e2e test passes.

Additional info:

 

Description of problem:

Using IPI on AWS. When we create a new worker using a 4.1 cloud image, the "node-valid-hostname.service" unit fails with this error:

# systemctl status node-valid-hostname.service
× node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: failed (Result: timeout) since Mon 2023-10-16 08:37:50 UTC; 1h 13min ago
   Main PID: 1298 (code=killed, signal=TERM)
        CPU: 330ms
Oct 16 08:32:50 localhost.localdomain mco-hostname[1298]: waiting for non-localhost hostname to be assigned
Oct 16 08:32:50 localhost.localdomain systemd[1]: Starting Wait for a non-localhost hostname...
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: start operation timed out. Terminating.
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: Main process exited, code=killed, status=15/TERM
Oct 16 08:37:50 localhost.localdomain systemd[1]: node-valid-hostname.service: Failed with result 'timeout'.
Oct 16 08:37:50 localhost.localdomain systemd[1]: Failed to start Wait for a non-localhost hostname.


The configured hostname is:

sh-5.1# hostname
localhost.localdomain

Version-Release number of selected component (if applicable):

IPI on AWS

4.15.0-0.nightly-2023-10-15-214132

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset using a 4.1 cloud image
2. Scale the machineset to create a new worker node
3. When the worker node is added, check the hostname and the service

Actual results:

The "node-valid-hostname.service" is failed and the configured hostname is 

sh-5.1# hostname
localhost.localdomain

Expected results:

No service should fail and the new worker should have a valid hostname.

Additional info:

 

Cluster API Provider IBM (CAPI) provides the ability to override the endpoints it interacts with. When we start CAPI, we should pass along any endpoint overrides from the install config.

Service endpoints were added to the install config here

CAPI accepts endpoint overrides as described here

 

Pass any endpoint overrides we can from the installer to CAPI.
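
A rough sketch of the plumbing follows. Both the type and the flag name are hypothetical and only illustrate the mapping from install-config overrides to arguments handed to the provider; the actual installer and CAPI provider structures should be checked against the links above.

package main

import "fmt"

// ServiceEndpoint mirrors an install-config service endpoint override
// (the name of the IBM Cloud service plus the URL to use for it).
// Hypothetical type for illustration only.
type ServiceEndpoint struct {
	Name string
	URL  string
}

// capiEndpointFlag renders the overrides into a single value that a CAPI
// provider process could accept, e.g. "vpc=https://...,iam=https://...".
func capiEndpointFlag(endpoints []ServiceEndpoint) string {
	out := ""
	for i, ep := range endpoints {
		if i > 0 {
			out += ","
		}
		out += fmt.Sprintf("%s=%s", ep.Name, ep.URL)
	}
	return out
}

func main() {
	overrides := []ServiceEndpoint{
		{Name: "vpc", URL: "https://vpc.example.internal"},
		{Name: "iam", URL: "https://iam.example.internal"},
	}
	// The installer would append something like this to the provider's args;
	// the flag name is an assumption, not the provider's documented CLI.
	fmt.Println("--service-endpoint=" + capiEndpointFlag(overrides))
}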

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

 

The openshift/origin test/extended/router/http2.go tests don't run on AWS. We disabled them some time ago. Let's re-enable them to see whether the issue still exists. I have been running the http2 tests on AWS this week and have not run into the original issue highlighted by the original Bugzilla bug.

// platformHasHTTP2LoadBalancerService returns true where the default
// router is exposed by a load balancer service and can support http/2
// clients.
func platformHasHTTP2LoadBalancerService(platformType configv1.PlatformType) bool {
    switch platformType {
    case configv1.AzurePlatformType, configv1.GCPPlatformType:
        return true
    case configv1.AWSPlatformType:
        e2e.Logf("AWS support waiting on https://bugzilla.redhat.com/show_bug.cgi?id=1912413")
        fallthrough
    default:
        return false
    }
}
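
Re-enabling the tests on AWS would essentially mean letting the AWS case return true. A sketch of the change is below; the actual PR may differ, for example by dropping the AWS case entirely.

// platformHasHTTP2LoadBalancerService returns true where the default
// router is exposed by a load balancer service and can support http/2
// clients.
func platformHasHTTP2LoadBalancerService(platformType configv1.PlatformType) bool {
	switch platformType {
	// AWS is now included: the issue tracked in
	// https://bugzilla.redhat.com/show_bug.cgi?id=1912413 no longer reproduces.
	case configv1.AWSPlatformType, configv1.AzurePlatformType, configv1.GCPPlatformType:
		return true
	default:
		return false
	}
}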

Description of problem:

kube-state-metrics fails with "network is unreachable" and needs to be restarted manually

Version-Release number of selected component (if applicable):

    

How reproducible:

 I have not been able to reproduce this in a lab cluster   

Steps to Reproduce:

N/A

Actual results:

Metrics are not being served.

Expected results:

The kube-state-metrics probe should detect the failure and the pod should restart automatically.

Additional info:

    

Even if the --namespace arg is specified to hypershift install render, the openshift-config-managed-trusted-ca-bundle ConfigMap's namespace is always set to "hypershift".
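
A minimal sketch of the expected behavior, using hypothetical helper names (the real hypershift render code assembles its manifests differently): the ConfigMap should inherit whatever namespace was passed via --namespace instead of a hardcoded value.

package render

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// trustedCABundleConfigMap builds the openshift-config-managed-trusted-ca-bundle
// ConfigMap for `hypershift install render`. The namespace comes from the
// --namespace flag instead of always being "hypershift".
// Hypothetical sketch for illustration only.
func trustedCABundleConfigMap(namespace string) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "ConfigMap"},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "openshift-config-managed-trusted-ca-bundle",
			Namespace: namespace, // previously always "hypershift"
		},
	}
}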

Description of problem:

FDP released a new OVS 3.4 version that will be used on the host.

We want to maintain the same version in the container.

This is mostly needed for the OVN observability feature.

Description of problem:
Configuring mTLS on the default IngressController breaks the ingress canary check and the console health checks, which in turn puts the ingress and console cluster operators into a degraded state.

OpenShift release version:
OCP-4.9.5

Cluster Platform:
UPI on Baremetal (Disconnected cluster)

How reproducible:
Configure mutual TLS (mTLS) using the default IngressController as described in the documentation (https://docs.openshift.com/container-platform/4.9/networking/ingress-operator.html#nw-mutual-tls-auth_configuring-ingress)

Steps to Reproduce (in detail):
1. Create a config map in the openshift-config namespace.
2. Edit the IngressController resource in the openshift-ingress-operator project.
3. Add the spec.clientTLS field and subfields to configure mutual TLS:
~~~
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  clientTLS:
    clientCertificatePolicy: Required
    clientCA:
      name: router-ca-certs-default
    allowedSubjectPatterns:
    - "^/CN=example.com/ST=NC/C=US/O=Security/OU=OpenShift$"
~~~

Actual results:
Setting up mTLS using the documented steps breaks the canary and console health checks: because clientCertificatePolicy is set to Required, the health checks are asked for client certificates they do not present, so they fail and the Ingress and Console operators go into a Degraded state.

Expected results:
mTLS setup should work properly without degrading the Ingress and Console operators.

Impact of the problem:
Unstable cluster, with the Ingress and Console operators in a Degraded state.

Additional info:
The following is the Error message for your reference:
The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)

// Canary checks looking for required tls certificate.
2021-11-19T17:17:58.237Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check

{"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.bruce.openshift.local\": Get \"https://canary-openshift-ingress-canary.apps.bruce.openshift.local\": remote error: tls: certificate required"}

// Console operator:
RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.bruce.openshift.local): Get "https://console-openshift-console.apps.bruce.openshift.local": remote error: tls: certificate required


Along with disruption monitoring via an external endpoint, we should add in-cluster monitors that run the same checks over:

  • service network (kubernetes.default.svc)
  • api-int endpoint (via hostnetwork)
  • localhost (on masters only)

These monitors should be implemented as Deployments with anti-affinity so that the pods land on different nodes. Deployments are used so that the nodes can be drained properly. The pods write to the host disk, and on restart a pod picks up its existing data. When a special configmap is created, the pods stop collecting disruption data.

The external part of the test creates the Deployments (and the necessary RBAC objects) when the test starts, creates the stop configmap when it ends, and collects the data from the nodes. The test then exposes the results on the intervals chart so that the data can be used to find the source of the disruption.
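
A minimal sketch of one such monitor Deployment using the Kubernetes Go client types; the image name is hypothetical and this only illustrates the anti-affinity and host-disk pieces described above, not the actual origin implementation.

package monitor

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// inClusterDisruptionDeployment spreads its replicas across nodes via pod
// anti-affinity and persists results on the host disk so a restarted pod
// can pick up its existing data.
func inClusterDisruptionDeployment(replicas int32) *appsv1.Deployment {
	labels := map[string]string{"app": "in-cluster-disruption-monitor"}
	hostPathType := corev1.HostPathDirectoryOrCreate
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "in-cluster-disruption-monitor"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Land each replica on a different node so every node's view of
					// the service network / api-int endpoint is sampled.
					Affinity: &corev1.Affinity{
						PodAntiAffinity: &corev1.PodAntiAffinity{
							RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
								LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
								TopologyKey:   "kubernetes.io/hostname",
							}},
						},
					},
					Containers: []corev1.Container{{
						Name:  "monitor",
						Image: "example.invalid/disruption-monitor:latest", // hypothetical image
						// The monitor binary polls the in-cluster endpoints and stops
						// when the "stop" configmap appears (not shown here).
						VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/var/lib/disruption"}},
					}},
					Volumes: []corev1.Volume{{
						Name: "data",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{
								Path: "/var/lib/disruption-monitor",
								Type: &hostPathType,
							},
						},
					}},
				},
			},
		},
	}
}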

This is a clone of issue OCPBUGS-36670. The following is the description of the original issue:

Description of problem:

Using a payload built with https://github.com/openshift/installer/pull/8666/ so that master instances can be provisioned from a gen2 image, which is required when configuring a security type in the install-config.

Enable TrustedLaunch security type in install-config:
==================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure: 
      encryptionAtHost: true
      settings:
        securityType: TrustedLaunch
        trustedLaunch:
          uefiSettings:
            secureBoot: Enabled
            virtualizedTrustedPlatformModule: Enabled

Launch the CAPI-based installation; the installer failed after waiting 15 minutes for the machines to provision...
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5 
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5-gen2 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master 
INFO Waiting up to 15m0s (until 6:26AM UTC) for machines [jima08conf01-9vgq5-bootstrap jima08conf01-9vgq5-master-0 jima08conf01-9vgq5-master-1 jima08conf01-9vgq5-master-2] to provision... 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded 
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: azure infrastructure provider 
INFO Stopped controller: azureaso infrastructure provider 
INFO Local Cluster API system has completed operations 

In openshift-install.log,
time="2024-07-08T06:25:49Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jima08conf01-9vgq5-rg/jima08conf01-9vgq5-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/virtualMachines/jima08conf01-9vgq5-bootstrap"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\tRESPONSE 400: 400 Bad Request"
time="2024-07-08T06:25:49Z" level=debug msg="\tERROR CODE: BadRequest"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\t{"
time="2024-07-08T06:25:49Z" level=debug msg="\t  \"error\": {"
time="2024-07-08T06:25:49Z" level=debug msg="\t    \"code\": \"BadRequest\","
time="2024-07-08T06:25:49Z" level=debug msg="\t    \"message\": \"Use of TrustedLaunch setting is not supported for the provided image. Please select Trusted Launch Supported Gen2 OS Image. For more information, see https://aka.ms/TrustedLaunch-FAQ.\""
time="2024-07-08T06:25:49Z" level=debug msg="\t  }"
time="2024-07-08T06:25:49Z" level=debug msg="\t}"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/jima08conf01-9vgq5-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"jima08conf01-9vgq5-bootstrap\" reconcileID=\"bee8a459-c3c8-4295-ba4a-f3d560d6a68b\""

It looks like the CAPI-based installer fails to enable the security features when creating the gen2 image definition, which the Terraform code does:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L166-L169

Gen2 image definition created by terraform:
$ az sig image-definition show --gallery-image-definition jima08conf02-4mrnz-gen2 -r gallery_jima08conf02_4mrnz -g jima08conf02-4mrnz-rg --query 'features'
[
  {
    "name": "SecurityType",
    "value": "TrustedLaunch"
  }
]
The features list is empty when querying the gen2 image definition created by CAPI:
$ az sig image-definition show --gallery-image-definition jima08conf01-9vgq5-gen2 -r gallery_jima08conf01_9vgq5 -g jima08conf01-9vgq5-rg --query 'features'
$ 
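
For comparison, the gen2 image definition needs the SecurityType feature set at creation time. A sketch with the Azure SDK for Go of the properties that appear to be missing is below; the region and identifier values are placeholders, and how the installer's CAPI path actually creates the definition may differ.

package main

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5"
)

// gen2ImageDefinition shows the gallery image definition properties the
// Terraform path sets but the CAPI path omitted: without the SecurityType
// feature, TrustedLaunch (or ConfidentialVM) VMs cannot be created from it.
func gen2ImageDefinition(publisher, offer, sku string) armcompute.GalleryImage {
	return armcompute.GalleryImage{
		Location: to.Ptr("eastus"), // placeholder region
		Properties: &armcompute.GalleryImageProperties{
			OSType:           to.Ptr(armcompute.OperatingSystemTypesLinux),
			OSState:          to.Ptr(armcompute.OperatingSystemStateTypesGeneralized),
			HyperVGeneration: to.Ptr(armcompute.HyperVGenerationV2),
			Identifier: &armcompute.GalleryImageIdentifier{
				Publisher: to.Ptr(publisher),
				Offer:     to.Ptr(offer),
				SKU:       to.Ptr(sku),
			},
			// The piece missing from the CAPI-created definition:
			Features: []*armcompute.GalleryImageFeature{{
				Name:  to.Ptr("SecurityType"),
				Value: to.Ptr("TrustedLaunch"),
			}},
		},
	}
}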

Version-Release number of selected component (if applicable):

4.17 payload built from cluster-bot with PR https://github.com/openshift/installer/pull/8666/

How reproducible:

Always

Steps to Reproduce:

    1. Enable security type in install-config
    2. Create cluster by using CAPI
    3. 
    

Actual results:

    Install failed.

Expected results:

    Install succeeded.

Additional info:

   This impacts installations with the ConfidentialVM or TrustedLaunch security type enabled.

 

METAL-904 / https://github.com/openshift/cluster-baremetal-operator/pull/406 changed CBO to create a secret that contains the ironic image and service URLs.

If this secret is present, we should use it instead of relying on vendored (and possibly out-of-date) CBO code to find these values ourselves.

The ironic agent image in the secret should take precedence over detecting the image, assuming the spoke CPU architecture matches the hub's. If the architectures differ, we will need to fall back to some other solution (annotations, or the default image).

This secret will only be present in hub versions that ship this new CBO code, so we will also need to maintain our current solutions for older hub cluster versions.
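
A sketch of the intended precedence, with hypothetical parameter and function names (the actual integration with the CBO secret differs):

package ironic

// ironicAgentImage picks the ironic agent image for a spoke cluster.
// Hypothetical sketch of the precedence described above: prefer the image
// published by CBO in its secret when the spoke architecture matches the
// hub's, otherwise fall back to the existing mechanisms.
func ironicAgentImage(secretImage, hubArch, spokeArch, annotationImage, defaultImage string) string {
	// 1. Newer hubs publish the image via the CBO secret (METAL-904).
	if secretImage != "" && spokeArch == hubArch {
		return secretImage
	}
	// 2. Architecture mismatch or older hub: fall back to an explicit
	//    annotation, if provided.
	if annotationImage != "" {
		return annotationImage
	}
	// 3. Last resort: the built-in default image.
	return defaultImage
}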

This is a clone of issue OCPBUGS-42535. The following is the description of the original issue:

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info: