
4.12.41


Changes from 4.11.59

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

1. Proposed title of this feature request
Add runbook_url to alerts in the OCP UI

2. What is the nature and description of the request?
If an alert includes a runbook_url label, then it should appear in the UI for the alert as a link.

3. Why does the customer need this? (List the business requirements here)
Customers can easily reach the alert runbook and address their issues.

4. List any affected packages or components.

Epic Goal

  • Make it possible to disable the console operator at install time, while still having a supported+upgradeable cluster.

Why is this important?

  • It's possible to disable console itself using spec.managementState in the console operator config. There is no way to remove the console operator, though. For clusters where an admin wants to completely remove console, we should give the option to disable the console operator as well.

Scenarios

  1. I'm an administrator who wants to minimize my OpenShift cluster footprint and who does not want the console installed on my cluster

Acceptance Criteria

  • It is possible at install time to opt-out of having the console operator installed. Once the cluster comes up, the console operator is not running.

Dependencies (internal and external)

  1. Composable cluster installation

Previous Work (Optional):

  1. https://docs.google.com/document/d/1srswUYYHIbKT5PAC5ZuVos9T2rBnf7k0F1WV2zKUTrA/edit#heading=h.mduog8qznwz
  2. https://docs.google.com/presentation/d/1U2zYAyrNGBooGBuyQME8Xn905RvOPbVv3XFw3stddZw/edit#slide=id.g10555cc0639_0_7

Open questions::

  1. The console operator manages the downloads deployment as well. Do we disable the downloads deployment? Long term we want to move to CLI manager: https://github.com/openshift/enhancements/blob/6ae78842d4a87593c63274e02ac7a33cc7f296c3/enhancements/oc/cli-manager.md

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

In the console-operator repo we need to add the `capability.openshift.io/console` annotation to all the manifests that the operator either contains or creates on the fly.

 

Manifests are currently present in /bindata and /manifest directories.

 

Here is example of the insights-operator change.

Here is the overall enhancement doc.
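As a rough illustration (the Deployment shown is only an example, and the exact annotation key/value should follow the capabilities enhancement and the insights-operator change referenced above), an annotated manifest might look like:

```
# Illustrative sketch only; the real manifests live in /bindata and /manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: console
  namespace: openshift-console
  annotations:
    # ties the manifest to the console capability so it is skipped when the
    # capability is disabled at install time (exact key/value per the enhancement)
    capability.openshift.io/name: Console
```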

 

Feature Overview
Provide CSI drivers to replace all the intree cloud provider drivers we currently have. These drivers will probably be released as tech preview versions first before being promoted to GA.

Goals

  • Framework for rapid creation of CSI drivers for our cloud providers
  • CSI driver for AWS EBS
  • CSI driver for AWS EFS
  • CSI driver for GCP
  • CSI driver for Azure
  • CSI driver for VMware vSphere
  • CSI Driver for Azure Stack
  • CSI Driver for Alicloud
  • CSI Driver for IBM Cloud

Requirements

  • Framework for CSI driver (Notes: TBD; isMvp: Yes)
  • Drivers should be available to install in both disconnected and connected mode (isMvp: Yes)
  • Drivers should upgrade from release to release without any impact (isMvp: Yes)
  • Drivers should be installable via CVO (when an in-tree plugin exists)

Out of Scope

This work will only cover the drivers themselves; it will not include:

  • enhancements to the CSI API framework
  • the migration from the in-tree drivers to said CSI drivers
  • work to convert non-cloud-provider storage drivers (FC-SAN, iSCSI) to CSI drivers

Background, and strategic fit
In a future Kubernetes release (currently 1.21), in-tree cloud provider drivers will be deprecated and replaced with CSI equivalents. We need these drivers created so that we can continue to support those ecosystems in an appropriate way.

Assumptions

  • Storage SIG won't push the changeover out to a later Kubernetes release

Customer Considerations
Customers will need to be able to use the storage they want.

Documentation Considerations

  • Target audience: cluster admins
  • Updated content: update storage docs to show how to use these drivers (also better expose the capabilities)

This Epic is to track the GA of this feature

Goal

  • Make the Google Cloud Filestore service available via a CSI driver; it is desirable that this implementation supports dynamic provisioning
  • Without GCP Filestore support, we are limited to block / RWO storage only (GCP PD went GA in 4.8)
  • Align with what we support on other major public cloud providers.

Why is this important?

  • There is a known storage gap on Google Cloud, where only block storage is supported
  • More customers are deploying on GCE and asking for file / RWX storage.

Scenarios

  1. Install the CSI driver
  2. Remove the CSI Driver
  3. Dynamically provision a CSI Google File PV*
  4. Utilise a Google File PV
  5. Assess optional features such as resize & snapshot

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Customers::

  • Telefonica Spain
  • Deutsche Bank

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OCP user, I want images for GCP Filestore CSI Driver and Operator, so that I can install them on my cluster and utilize GCP Filestore shares.

We need to continue to maintain specific areas within storage; this epic captures that effort and tracks it across releases.

Goals

  • To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
  • When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
  • Reduce storage test flakiness so we can spot real bugs in our CI.

Requirements

  • Telemetry (isMvp: No)
  • Certification (isMvp: No)
  • API metrics (isMvp: No)

Out of Scope

n/a

Background, and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets / BZs low.

Assumptions

Customer Considerations

Documentation Considerations

  • Target audience: internal
  • Updated content: none at this time.

Notes

In progress:

  • CI flakes:
    • Configurable timeouts for e2e tests
      • Azure is slow and times out often
      • Cinder times out formatting volumes
      • AWS resize test times out

 

High prio:

  • Env. check tool for VMware - users often mis-configure permissions there and blame OpenShift. If we had a tool they could run, it might report better errors.
    • Should it be part of the installer?
    • Spike exists
  • Add / use cloud API call metrics
    • Helps customers to understand why things are slow
    • Helps build cop to understand a flake
      • With a post-install step that filters data from Prometheus that’s still running in the CI job.
    • Ideas:
      • Cloud is throttling X% of API calls longer than Y seconds
      • Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
    • Capture metrics of operations that are stuck and won’t finish.
      • Sweep operation map from executioner???
      • Report operation metric into the highest bucket after the bucket threshold (i.e. if 10minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
      • Ask the monitoring team?
    • Include in CSI drivers too.
      • With alerts too

Unsorted

  • As the number of storage operators grows, it would be useful to have a Grafana board for storage operators
    • CSI driver metrics (from CSI sidecars + the driver itself  + its operator?)
    • CSI migration?
  • Get aggregated logs in cluster
    • They're rotated too soon
    • No logs from dead / restarted pods
    • No tools to combine logs from multiple pods (e.g. 3 controller managers)
  • What storage issues do customers have? Storage was 22% of all issues.
    • Insufficient docs?
    • Probably garbage
  • Document basic storage troubleshooting for our support teams
    • What logs are useful when, what log level to use
    • This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
  • Common vSphere errors, their debugging and fixing. 
  • Document sig-storage flake handling - not all failed [sig-storage] tests are ours

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The end of general support for vSphere 6.7 is October 15, 2022, so vSphere 6.7 will be deprecated in 4.11.

We want to encourage vSphere customers to upgrade to vSphere 7 in OCP 4.11 since VMware is EOLing (general support) for vSphere 6.7 in Oct 2022.

We want the cluster to set Upgradeable=false and to have a strong alert pointing to our docs / requirements.

related slack: https://coreos.slack.com/archives/CH06KMDRV/p1647541493096729

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). Trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator 
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • manila-csi-driver-operator
  • ovirt-csi-driver-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator

 

  • cluster-storage-operator
  • csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

There is a new driver release 5.0.0 since the last rebase that includes snapshot support:

https://github.com/kubernetes-sigs/ibm-vpc-block-csi-driver/releases/tag/v5.0.0

Rebase the driver on v5.0.0 and update the deployments in ibm-vpc-block-csi-driver-operator.
There are no corresponding changes in ibm-vpc-node-label-updater since the last rebase.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

This includes ibm-vpc-node-label-updater!

(Using separate cards for each driver because these updates can be more complicated)

Epic Goal

  • Enable the migration from a storage intree driver to a CSI based driver with minimal impact to the end user, applications and cluster
  • These migrations would include, but are not limited to:
    • CSI driver for AWS EBS
    • CSI driver for GCP
    • CSI driver for Azure (file and disk)
    • CSI driver for VMware vSphere

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA; we need to have the feature fully tested and enabled in OpenShift
  • Upstream in-tree drivers are being deprecated to make way for the CSI drivers prior to in-tree driver removal

Scenarios

  1. User-initiated move from in-tree to CSI driver
  2. Upgrade-initiated move from in-tree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

On new installations, we should make the StorageClass created by the CSI operator the default one.

However, we shouldn't do that in an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.

Exit criteria:

  • New clusters get the CSI Storage Class as the default one.
  • Existing clusters don't get their default Storage Classes changed.
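For reference, a hedged sketch of what marking the CSI-provisioned class as the default typically looks like (the class name and provisioner below are illustrative, not the operator's actual values):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-csi-sc                  # illustrative name
  annotations:
    # set only on fresh installs; left untouched on upgrades so an existing
    # default (possibly carrying different quota settings) is not changed
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: example.csi.vendor.com     # illustrative provisioner
```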

This Epic tracks the GA of this feature

Epic Goal

Why is this important?

  • OpenShift needs to maintain its ability to enable PVCs and PVs of the main storage types
  • CSI Migration is getting close to GA; we need to have the feature fully tested and enabled in OpenShift
  • Upstream in-tree drivers are being deprecated to make way for the CSI drivers prior to in-tree driver removal

Scenarios

  1. User-initiated move from in-tree to CSI driver
  2. Upgrade-initiated move from in-tree to CSI driver
  3. Upgrade from EUS to EUS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

On new installations, we should make the StorageClass created by the CSI operator the default one.

However, we shouldn't do that in an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.

Exit criteria:

  • New clusters get the CSI Storage Class as the default one.
  • Existing clusters don't get their default Storage Classes changed.


Epic Goal

  • Rebase OpenShift components to k8s v1.24

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.24 release

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.
  • As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

  • Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

  • Customers and partners need to be able to download the installer
  • Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
  • Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
  • Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
  • In the MVP automation has higher priority than interactive, user-driven deployments.
  • For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
  • Installer should prioritize support for platforms None, baremetal, and VMware.
  • The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
  • The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

  • As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
  • As an Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in one or more sites by first installing the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
  • As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
  • As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
  • As a new, novice-to-intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for PoC/Demo/Research purposes.

Questions to answer…

  •  

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

  • As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by a different set of users (the end customer) [Embedded OpenShift].
  • As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

  • This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  1. The user only has access to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access, such as government infrastructure where only users with security clearance can interact with the installation, where software is allowed to enter the premises (on a USB stick, DVD, SD card, etc.) but never allowed to come back out. Users can't bring in supporting devices such as laptops or phones.
  2. The user has access to the target nodes remotely to their BMCs (e.g. iDrac, iLo) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
  3. We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

 

 

Epic Goal

As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with dual-stack IPv4/IPv6

As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with single-stack IPv6

Why is this important?

IPv6 and dual-stack clusters are requested often by customers, especially Telco customers. Working with dual-stack clusters is a requirement for many, but it is also a transition path to single-stack IPv6 clusters, which for some of our users is the final destination.

Acceptance Criteria

  • Agent-based installer can deploy IPv6 clusters
  • Agent-based installer can deploy dual-stack clusters
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work

Karim's work proving how the agent-based installer can deploy IPv6: IPv6 deploy with agent-based installer

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

For dual-stack installations, the agent-cluster-install.yaml must have both an IPv4 and an IPv6 subnet in networking.MachineNetwork, or assisted-service will throw an error. This field exists in InstallConfig, but it must also be added to agent-cluster-install in its Generate().

For single-stack IPv4 or IPv6 installs, setting the MachineNetwork is not required, but it also does not cause problems if it is set, so it should be fine to set it at all times.
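A hedged sketch of the relevant AgentClusterInstall networking stanza for a dual-stack install (the cluster name and CIDRs are examples only):

```
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: example-cluster                 # illustrative
spec:
  networking:
    machineNetwork:
      - cidr: 192.168.111.0/24          # IPv4 machine network
      - cidr: fd2e:6f44:5dd8::/64       # IPv6 machine network
```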

Epic Goal

As an OpenShift infrastructure owner, I want to deploy a cluster zero with RHACM or MCE and have the required components installed when the installation is completed

Why is this important?

BILLI makes it easier to deploy a cluster zero. BILLI users know at installation time what the purpose of their cluster is when they plan the installation. Day-2 steps are currently necessary to install operators, and users, especially when automating installations, want to finish the installation flow with their required components already installed.

Acceptance Criteria

  • A user can provide MCE manifests and have them installed without additional manual steps after the installation is completed
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a customer, I want to be able to:

  • Install MCE with the agent-installer

so that I can achieve

  • create an MCE hub with my openshift install

Acceptance Criteria:

Description of criteria:

  • Upstream documentation including examples of the extra manifests needed
  • Unit tests that include MCE extra manifests
  • Ability to install MCE using agent-installer is tested
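As a hedged illustration of the kind of extra manifests a user might provide (the channel and other specifics below are assumptions, not documented values; the real examples belong in the upstream documentation called out above):

```
# Illustrative extra manifests for installing MCE via the agent-based installer
apiVersion: v1
kind: Namespace
metadata:
  name: multicluster-engine
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: multicluster-engine
  namespace: multicluster-engine
spec:
  targetNamespaces:
    - multicluster-engine
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: multicluster-engine
  namespace: multicluster-engine
spec:
  channel: stable-2.1                   # assumed channel
  name: multicluster-engine
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```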

(optional) Out of Scope:

We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a customer, I want to be able to:

  • Install MCE with the agent-installer

so that I can achieve

  • create an MCE hub with my openshift install

Acceptance Criteria:

Description of criteria:

  • Upstream documentation including examples of the extra manifests needed
  • Unit tests that include MCE extra manifests
  • Ability to install MCE using agent-installer is tested

(optional) Out of Scope:

We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Set the ClusterDeployment CRD to deploy OpenShift in FIPS mode and make sure that after deployment the cluster is set in that mode

In order to install FIPS-compliant clusters, we need to make sure that installconfig + agentconfig based deployments take into account the FIPS config in installconfig.

This task is about passing the config to agentclusterinstall so that it makes it into the ISO. Once there, AGENT-374 will give it to assisted-service.
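A hedged sketch of the install-config side of this (illustrative excerpt; the task is to carry this flag through agent-cluster-install into the generated ISO so assisted-service can honor it):

```
# install-config.yaml (excerpt, illustrative)
apiVersion: v1
metadata:
  name: example-cluster   # illustrative
fips: true                # must be propagated to the agent-based install flow
```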

Epic Goal

  • Rebase cluster autoscaler on top of Kubernetes 1.25

Why is this important?

  • Need to pick up latest upstream changes

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a user I would like to see all the events that the autoscaler creates, even duplicates. Having the CAO set this flag will allow me to continue to see these events.

Background

We have carried a patch for the autoscaler that would enable the duplication of events. This patch can now be dropped because the upstream added a flag for this behavior in https://github.com/kubernetes/autoscaler/pull/4921

Steps

  • add the --record-duplicated-events flag to all autoscaler deployments from the CAO
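A hedged sketch of where the flag would end up in an autoscaler Deployment rendered by the CAO (the names and surrounding fields are illustrative):

```
# Illustrative excerpt of a cluster-autoscaler Deployment managed by the CAO
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-default      # illustrative
  namespace: openshift-machine-api
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          args:
            - --record-duplicated-events   # upstream flag replacing the carried patch
```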

Stakeholders

  • openshift eng

Definition of Done

  • autoscaler continues to work as expected and produces events for everything
  • Docs
    • this does not require documentation as it preserves existing behavior and provides no interface for user interaction
  • Testing
    • current tests should continue to pass

Feature Overview

Add GA support for deploying OpenShift to IBM Public Cloud

Goals

Close the existing gaps to make OpenShift on IBM Cloud VPC (Next Gen2) Generally Available

Requirements

Optional requirements

  • OpenShift can be deployed using Mint mode and STS for cloud provider credentials (future release, tbd)
  • OpenShift can be deployed in disconnected mode (https://issues.redhat.com/browse/SPLAT-737)
  • OpenShift on IBM Cloud supports User Provisioned Infrastructure (UPI) deployment method (future release, 4.14?)

Epic Goal

  • Enable installation of private clusters on IBM Cloud. This epic will track associated work.

Why is this important?

  • This is required MVP functionality to achieve GA.

Scenarios

  1. Install a private cluster on IBM Cloud.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background and Goal

Currently in OpenShift we do not support distributing hotfix packages to cluster nodes. In time-sensitive situations, a RHEL hotfix package can be the quickest route to resolving an issue. 

Acceptance Criteria

  1. Under guidance from Red Hat CEE, customers can deploy RHEL hotfix packages to MachineConfigPools.
  2. Customers can easily remove the hotfix when the underlying RHCOS image incorporates the fix.

Before we ship OCP CoreOS layering in https://issues.redhat.com/browse/MCO-165 we need to switch the format of what is currently `machine-os-content` to be the new base image.

The overall plan is:

  • Publish the new base image as `rhel-coreos-8` in the release image
  • Also publish the new extensions container (https://github.com/openshift/os/pull/763) as `rhel-coreos-8-extensions`
  • Teach the MCO to use this without also involving layering/build controller
  • Delete old `machine-os-content`

After https://github.com/openshift/os/pull/763 is in the release image, teach the MCO how to use it. This is basically:

  • Schedule the extensions container as a kubernetes service (just serves a yum repo via http)
  • Change the MCD to write a file into `/etc/yum.repos.d/machine-config-extensions.repo` that consumes it instead of what it does now in pulling RPMs from the mounted container filesystem

As an OCP CoreOS layering developer, having telemetry data about the number of clusters using a custom osImageURL will help us understand how broadly this feature is used and improve it accordingly.

Acceptance Criteria:

  • Clusters using a custom osImageURL are identifiable via telemetry
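For context, a hedged sketch of what a custom osImageURL looks like on a MachineConfig (the name and pullspec are illustrative):

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-os-image       # illustrative
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  # clusters setting this field are what the telemetry signal should count
  osImageURL: quay.io/example/custom-rhcos:tag   # illustrative pullspec
```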

 

Why?

  • Decouple control and data plane. 
    • Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
  • Improve security
    • Shift credentials out of cluster that support the operation of core platform vs workload
  • Improve cost
    • Allow a user to toggle what they don’t need.
    • Ensure a smooth path to scale to 0 workers and upgrade with 0 workers.

 

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure , and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

 

 

Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

Epic Goal

  • To improve debug-ability of ovn-k in hypershift
  • To verify the stability of ovn-k in hypershift
  • To introduce an EgressIP reachability check that will work in hypershift

Why is this important?

  • ovn-k is supposed to be GA in 4.12. We need to make sure it is stable, that we know the limitations, and that we are able to debug it similarly to a self-hosted cluster.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. This will need consultation with the people working on HyperShift

Previous Work (Optional):

  1. https://issues.redhat.com/browse/SDN-2589

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Overview 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

DoD 

Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster.

More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

 

As HyperShift Cluster Instance Admin, I want to run cluster-storage-operator (CSO) in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Tag manifests of objects that should not be deployed by CVO in HyperShift
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
    • Pass only the guest kubeconfig to the operands (AWS EBS CSI driver operator).

Exit criteria:

  • CSO and AWS EBS CSI driver operator runs in the management cluster in HyperShift
  • Storage works in the guest cluster.
  • No regressions in standalone OCP.

As an OCP support engineer, I want the same guest cluster storage-related objects in the output of "hypershift dump cluster --dump-guest-cluster" as in "oc adm must-gather", so I can debug storage issues easily.

 

must-gather collects: storageclasses, persistentvolumes, volumeattachments, csidrivers, csinodes, volumesnapshotclasses, volumesnapshotcontents

hypershift collects none of this; the relevant code is here: https://github.com/openshift/hypershift/blob/bcfade6676f3c344b48144de9e7a36f9b40d3330/cmd/cluster/core/dump.go#L276

 

Exit criteria:

  • verify that hypershift dump cluster --dump-guest-cluster has storage objects from the guest cluster.

As HyperShift Cluster Instance Admin, I want to run AWS EBS CSI driver operator + control plane of the CSI driver in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
    • Pass only the guest kubeconfig to the operand (control-plane Deployment of the CSI driver).

Exit criteria:

  • Control plane Deployment of AWS EBS CSI driver runs in the management cluster in HyperShift.
  • Storage works in the guest cluster.
  • No regressions in standalone OCP.

Overview 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumption

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure, and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

DoD 

cluster-snapshot-controller-operator is running on the CP. 

More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit 

As an OpenShift developer, I want cluster-csi-snapshot-controller-operator to use existing controllers in library-go, so I don't need to maintain yet more code that does the same thing as library-go.

  • Check and remove manifests/03_configmap.yaml, it does not seem to be useful.
  • Check and remove manifests/03_service.yaml, it does not seem to be useful (at least now).
  • Use DeploymentController from library-go to sync Deployments.
  • Get rid of common/ package? It does not seem to be useful.
  • Use StaticResourceController for static content, including the snapshot CRDs.

Note: if this refactoring introduces any new conditions, we must make sure that 4.11 snapshot controller clears them to support downgrade! This will need 4.11 BZ + z-stream update!

Similarly, if some conditions become obsolete / not managed by any controller, they must be cleared by 4.12 operator.

Exit criteria:

  • The operator code is smaller.
  • No regressions in standalone OCP.
  • Upgrade/downgrade from/to standalone OCP 4.11 works.

As HyperShift Cluster Instance Admin, I want to run cluster-csi-snapshot-controller-operator in the management cluster, so the guest cluster runs just my applications.

  • Add a new cmdline option for the guest cluster kubeconfig file location
  • Parse both kubeconfigs:
    • One from projected service account, which leads to the management cluster.
    • Second from the new cmdline option introduced above. This one leads to the guest cluster.
  • Move creation of manifests/08_webhook_service.yaml from CVO to the operator - it needs to be created in the management cluster.
  • Tag manifests of objects that should not be deployed by CVO in HyperShift
  • Only on HyperShift:
    • When interacting with Kubernetes API, carefully choose the right kubeconfig to watch / create / update objects in the right cluster.
    • Replace namespaces in all Deployments and other objects that are created in the management cluster. They must be created in the same namespace as the operator.
    • Don’t create operand’s PodDisruptionBudget?
    • Update ValidationWebhookConfiguration to point directly to URL exposed by manifests/08_webhook_service.yaml instead of a Service. The Service is not available in the guest cluster.
    • Pass only the guest kubeconfig to the operands (both the webhook and csi-snapshot-controller).
    • Update unit tests to handle two kube clients.

Exit criteria:

  • cluster-csi-snapshot-controller-operator runs in the management cluster in HyperShift
  • csi-snapshot-controller runs in the management cluster in HyperShift
  • It is possible to take & restore volume snapshot in the guest cluster.
  • No regressions in standalone OCP.

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs, as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for an increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential request for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialOperator already provides a powerful API for OpenShift's core cluster operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today ranges from cumbersome to non-existent depending on the operator in question, and is seen as an adoption blocker of OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators into my ROSA STS clusters, while keeping an STS workflow throughout.
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs), in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  • As the managed services teams, including the Hypershift team, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc.), the OpenShift platform should have as close as possible, native integration with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS), the Red Hat branded Operators that must reach the AWS API should be enabled to work with STS credentials in a consistent, automated fashion that allows customers to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kubernetes Operators (at first, Red Hat ones) on OpenShift for the "big 3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically set up the Operator, using the provided tokenized credentials, and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a users point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accommodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both of growth and retention are the effect of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going thru both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.

References

As an engineer, I want the capability to implement CI test cases that run at different intervals (daily, weekly) so as to ensure that downstream operators that depend on certain capabilities are not negatively impacted when the systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed-out e2e test path in CCO and matching e2e calling code in the release repo such that there exists a path to tests that verify CCO works in an AWS STS workflow.

oc-mirror is a GA product as of OpenShift 4.11.

The goal of this feature is to address future customer requests for new features or capabilities in oc-mirror.

Epic Goal

  • Mirror to mirror operations and custom mirroring flows required by IBM CloudPak catalog management

Why is this important?

  • IBM needs additional customization around the actual mirroring of images to enable CloudPaks to fully adopt OLM-style operator packaging and catalog management
  • IBM CloudPaks introduce additional compute architectures, increasing the download volume by two-thirds today; we need the ability to effectively filter out non-required image versions of OLM operator catalogs for other customers that only require a single architecture or a subset of the available image architectures
  • IBM CloudPaks regularly run on older OCP versions like 4.8, which require additional work to be able to read the mirrored catalog produced by oc mirror

Scenarios

  1. Customers can use the oc utility and delegate the actual image mirror step to another tool
  2. Customers can mirror between disconnected registries using the oc utility
  3. The oc utility supports filtering manifest lists in the context of multi-arch images according to the sparse manifest list proposal in the distribution spec

Acceptance Criteria

  • Customers can use the oc utility to mirror between two different air-gapped environments
  • Customers can specify the desired compute architectures, and oc mirror will create sparse manifest lists in the target registry as a result

Dependencies (internal and external)

Previous Work:

  1. WRKLDS-369
  2. Disconnected Mirroring Improvement Proposal

Related Work:

  1. https://github.com/opencontainers/distribution-spec/pull/310
  2. https://github.com/distribution/distribution/pull/3536
  3. https://docs.google.com/document/d/10ozLoV7sVPLB8msLx4LYamooQDSW-CAnLiNiJ9SER2k/edit?usp=sharing

Pre-Work Objectives

Since some of our requirements from the ACM team will not be available for the 4.12 timeframe, the team should work on anything we can get done in the scope of the console repo so that when the required items are available in 4.13, we can be more nimble in delivering GA content for the Unified Console Epic.

Overall GA Key Objective

Providing our customers with a single, simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing everything from the fleet down to deep diving into a single cluster.

Why do customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 2 Goal: Productization of the unified Console

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Cluster” Option. “All Cluster” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM Installed the user starts from the fleet overview aka “All Clusters”
  2. Share UX between views
    1. ACM Search —> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP —> Create Cluster

As a developer I would like to disable clusters like *KS that we can't support for multi-cluster (for instance because we can't authenticate). The ManagedCluster resource has a vendor label that we can use to know if the cluster is supported.

cc Ali Mobrem Sho Weimer Jakub Hadvig 

UPDATE 9/20/22: we want an allow-list with OpenShift, ROSA, ARO, ROKS, and OpenShiftDedicated

Acceptance criteria:

  • Investigate if console-operator should pass info about which cluster are supported and unsupported to the frontend
  • Unsupported clusters should not appear in the cluster dropdown
  • Unsupported clusters are determined based on:
    • the defined vendor label
    • non-4.x OCP clusters
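A hedged sketch of the labels involved on a ManagedCluster (the cluster name is illustrative, and the openshiftVersion label is an assumption about how non-4.x clusters could be detected):

```
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: spoke-cluster-1                 # illustrative
  labels:
    vendor: OpenShift                   # value the allow-list would match against
    openshiftVersion: "4.12.0"          # assumed label for the non-4.x check
```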

Feature Overview

RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.

 

Requirements

  • RHEL 9.x sources for RHCOS builds starting with OCP 4.13 and RHEL 9.2.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

  • 9.2 Preview via Layering - No longer necessary, assuming we stay the course of going all in on 9.2

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

PROBLEM

We would like to improve our signal for RHEL9 readiness by increasing internal engineering engagement and external partner engagement on our community OpenShift offering, OKD.

PROPOSAL

Adding OKD to run on SCOS (a CentOS stream for CoreOS) brings the community offering closer to what a partner or an internal engineering team might expect on OCP.

ACCEPTANCE CRITERIA

Image has been switched/included: 

DEPENDENCIES

The SCOS build payload.

RELATED RESOURCES

OKD+SCOS proposal: https://docs.google.com/presentation/d/1_Xa9Z4tSqB7U2No7WA0KXb3lDIngNaQpS504ZLrCmg8/edit#slide=id.p

OKD+SCOS work draft: https://docs.google.com/document/d/1cuWOXhATexNLWGKLjaOcVF4V95JJjP1E3UmQ2kDVzsA/edit

 

Acceptance Criteria

A stable OKD on SCOS is built and available to the community every sprint.

 

This comes up when installing ipi-on-aws on arm64 with the custom payload build at quay.io/aleskandrox/okd-release:4.12.0-0.okd-centos9-full-rebuild-arm64 that is using SCOS as the machine-os-content image.

 

```

[root@ip-10-0-135-176 core]# crictl logs c483c92e118d8
2022-08-11T12:19:39+00:00 [cnibincopy] FATAL ERROR: Unsupported OS ID=scos
```

 

The probable fix has to land on https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L41-L53

Overview 

HyperShift came to life to serve multiple goals, some are main near-term, some are secondary that serve well long-term. 

Main Goals for hosted control planes (HyperShift)

  • Optimize OpenShift for cost/footprint, which improves our competitive stance against the *KSes
  • Establish separation of concerns which makes it more resilient for SRE to manage their workload clusters (be it security, configuration management, etc).
  • Simplify and enhance multi-cluster management experience especially since multi-cluster is becoming an industry need nowadays. 

Secondary Goals

HyperShift opens up doors to penetrate the market. HyperShift enables true hybrid (CP and Workers decoupled, mixed IaaS, mixed Arch,...). An architecture that opens up more options to target new opportunities in the cloud space. For more details on this one check: Hosted Control Planes (aka HyperShift) Strategy [Live Document]

 

Hosted Control Planes (HyperShift) Map 

To bring hosted control planes to our customers, we need the means to ship it. Today MCE is how HyperShift is shipped and installed so that customers can use it. There are two main customers for hosted control planes: 

 

  • Self-managed: In that case, Red Hat would provide hosted control planes as a service that is managed and SREed by the customer for their tenants (hence “self”-managed). In this management model, our external customers are the direct consumers of the multi-cluster control plane as a service. Once MCE is installed, they can start to self-service dedicated control planes. 

 

  • Managed: This is OpenShift as a managed service; today we only “manage” the CP and share the responsibility for other system components (more info here). The goals are to reduce management costs incurred by service delivery organizations, which translates to operating profit (by reducing variable costs per control plane), to improve user experience, to lower platform overhead (allowing customers to focus mostly on writing applications rather than on infrastructure artifacts), and to improve the cluster provisioning experience. HyperShift is shipped via MCE and delivered to Red Hat managed SREs (same consumption route). However, for managed services, additional tooling needs to be refactored to support the new provisioning path. Furthermore, unlike self-managed, where customers are free to bring their own observability stack, Red Hat managed SREs need to observe the managed fleet to ensure compliance with SLOs/SLIs/…

 

As noted, MCE is the delivery mechanism for both management models. The difference between managed and self-managed is the consumer persona: for self-managed it's the customer SRE; for managed it's the RH SRE.

High-level Requirements

For us to ship HyperShift in the product (as hosted control planes) in either management model, there is a necessary readiness checklist that we need to satisfy. Below are the high-level requirements needed before GA: 

 

  • Hosted control planes fits well with our multi-cluster story (with MCE)
  • Hosted control planes APIs are stable for consumption  
  • Customers are not paying for control planes/infra components.  
  • Hosted control planes has an HA and a DR story
  • Hosted control planes is in parity with top-level add-on operators 
  • Hosted control planes reports metrics on usage/adoption
  • Hosted control planes is observable  
  • HyperShift as a backend to managed services is fully unblocked.

 

Please also have a look at our What are we missing in Core HyperShift for GA Readiness? doc. 

Hosted control planes fits well with our multi-cluster story

Multi-cluster is becoming an industry need today not because this is where the trend is going but because it’s the only viable path today to solve many of our customers’ use-cases. Below is some reasoning why multi-cluster is a NEED:

 

 

As a result, multi-cluster management is a defining category in the market where Red Hat plays a key role. Today Red Hat solves for multi-cluster via RHACM and MCE. The goal is to simplify fleet management complexity by providing a single pane of glass to observe, secure, police, govern, configure a fleet. I.e., the operand is no longer one cluster but a set, a fleet of clusters. 

HyperShift's logically centralized architecture, as well as its native separation of concerns and superior cluster lifecycle management experience, makes it a great fit as the foundation of our multi-cluster management story. 

Thus the following stories are important for HyperShift: 

  • When lifecycling OpenShift clusters (for any OpenShift form factor) on any of the supported providers from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin):
  • I want to be able to use a consistent UI so I can manage and operate (observe, govern,...) a fleet of clusters.
  • I want to specify HA constraints (e.g., deploy my clusters in different regions) while ensuring acceptable QoS (e.g., latency boundaries) to ensure/reduce any potential downtime for my workloads. 
  • When operating OpenShift clusters (for any OpenShift form factor) on any of the supported provider from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin):
  • I want to be able to backup any critical data so I am able to restore them in case of hosting service cluster (management cluster) failure. 

Refs:

Hosted control planes APIs are stable for consumption.

 

HyperShift is the core engine that will be used to provide hosted control-planes for consumption in managed and self-managed. 

 

Main user story:  When life cycling clusters as a cluster service consumer via HyperShift core APIs, I want to use a stable/backward compatible API that is less susceptible to future changes so I can provide availability guarantees. 

 

Ref: What are we missing in Core HyperShift for GA Readiness?

Customers are not paying for control planes/infra components. 

 

Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.

Assumptions

  • A customer will be able to associate a cluster as “Infrastructure only”
  • E.g. one option: management cluster has role=master, and role=infra nodes only, control planes are packed on role=infra nodes
  • OR the entire cluster is labeled infrastructure , and node roles are ignored.
  • Anything that runs on a master node by default in Standalone that is present in HyperShift MUST be hosted and not run on a customer worker node.

HyperShift - proposed cuts from data plane

HyperShift has an HA and a DR story

When operating OpenShift clusters (for any OpenShift form factor) from MCE/ACM/OCM/CLI as a Cluster Service Consumer  (RH managed SRE, or self-manage SRE/admin) I want to be able to migrate CPs from one hosting service cluster to another:

  • as means for disaster recovery in the case of total failure
  • so that scaling pressures on a management cluster can be mitigated or a management cluster can be decommissioned.

More information: 

 

Hosted control planes reports metrics on usage/adoption

To understand usage patterns and inform our decision making for the product, we need to be able to measure adoption and assess usage.

See Hosted Control Planes (aka HyperShift) Strategy [Live Document]

Hosted control plane is observable  

Whether it's managed or self-managed, it’s pertinent to report health metrics so that we can create meaningful Service Level Objectives (SLOs) and alert on failures to meet our availability guarantees. This is especially important for our managed services path. 

HyperShift is in parity with top-level add-on operators

https://issues.redhat.com/browse/OCPPLAN-8901 

Unblock HyperShift as a backend to managed services

HyperShift for managed services is a strategic company goal as it improves usability, features, and cost competitiveness against other managed solutions, and because managed services/consumption-based cloud services are where we see the market growing (customers are looking to delegate platform overhead). 

 

We should make sure our SD milestones are unblocked by the core team. 

 

Note 

This feature reflects HyperShift core readiness to be consumed. When all related EPICs and stories in this EPIC are complete, HyperShift can be considered ready to be consumed in GA form. This does not describe a date but rather the readiness of core HyperShift to be consumed in GA form, NOT the GA itself.

- GA date for self-managed will be factoring in other inputs such as adoption, customer interest/commitment, and other factors. 
- GA dates for ROSA-HyperShift are on track, tracked in milestones M1-7 (have a look at https://issues.redhat.com/browse/OCPPLAN-5771)

Epic Goal*

The goal is to split client certificate trust chains from the global Hypershift root CA.

 
Why is this important? (mandatory)

This is important to:

  • assure a workload can be run on any kind of OCP flavor
  • reduce the blast radius in case of a sensitive material leak
  • separate trust to allow more granular control over client certificate authentication

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. I would like to be able to run my workloads on any OpenShift-like platform.
    My workloads allow components to authenticate using client certificates based
    on a trust bundle that I am able to retrieve from the cluster.
  2. I don't want my users to have access to any CA bundle that would allow them
    to trust a random certificate from the cluster for client certificate authentication.

 
Dependencies (internal and external) (mandatory)

Hypershift team needs to provide us with code reviews and merge the changes we are to deliver

Contributing Teams(and contacts) (mandatory) 

  • Development - OpenShift Auth, Hypershift
  • Documentation -OpenShift Auth Docs team
  • QE - OpenShift Auth QE
  • PX - I have no idea what PX is
  • Others - others

Acceptance Criteria (optional)

The serviceaccount CA bundle automatically injected into all pods cannot be used to authenticate any client certificate generated by the control plane.

Drawbacks or Risk (optional)

Risk: there is pressing time pressure, as this should be delivered before the first stable Hypershift release

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Cloned from OCPSTRAT-377 to represent the backport to 4.12

Backport questions:

 
1) What's the impact/cost to any other critical items on the next release? 
 
Installer and edge are mostly focused on activation/retention and working the list top-to-bottom without release blockers. This is an activation item highly coveted by SD and applicable in existing versions.
 
2) Is it a breaking change to the existing fleet?
 
No.
 
 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Links:

Enhancement PR: https://github.com/openshift/enhancements/pull/1397 

API PR: https://github.com/openshift/api/pull/1460 

Ingress  Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928 

Background

Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.

The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator did not have the capability to add DNS records into an existing Route 53 hosted zone in the shared VPC.

Epic Goal

  • Add support to the ingress operator for creating DNS records in preexisting Route53 private hosted zones for Shared VPC clusters

Non-Goals

  • Ingress operator support for day-2 operations (i.e. changes to the AWS IAM Role value after installation)  
  • E2E testing (will be handled by the Installer Team) 

Design

As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, then the ingress operator will use this account to create all private hosted zone records. The API fields will be described in the Enhancement PR.

The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.

External DNS Operator Impact

See NE-1299

AWS Load Balancer Operator (ALBO) Impact

See NE-1299

Why is this important?

  • Without this ingress operator support, OpenShift users are unable to create DNS records in a preexisting Route53 private hosted zone, which means they can't share the Route53 component within a Shared VPC
  • Shared VPCs are considered an AWS best practice

Scenarios

  1. ...

Acceptance Criteria

  • Unit tests must be written and automatically run in CI (E2E tests will be handled by the Installer Team)
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ingress Operator creates DNS Records in preexisting Route53 private hosted zones for shared VPC Clusters
  • Network Edge Team has reviewed all of the related enhancements and code changes for Route53 in Shared VPC Clusters

Dependencies (internal and external)

  1. Installer Team is adding the new API fields required for enabling Route53 sharing in Shared VPCs in https://issues.redhat.com/browse/CORS-2613
  2. Testing this epic requires having access to two AWS accounts

Previous Work (Optional):

  1. Significant discussion was done in this thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1681997102492889?thread_ts=1681837202.378159&cid=C68TNFWA2
  2. Slack channel #tmp-xcmbu-114

Open questions:

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Enable/confirm installation in AWS shared VPC scenario where Private Hosted Zone belongs to an account separate from the cluster installation target account

Why is this important?

  • AWS best practices suggest this setup

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

OLM would have to support a mechanism, similar to podAffinity, that allows multiple architecture values to be specified, enabling it to pin operators to worker nodes of a matching architecture (see the sketch below).

Ref: https://github.com/openshift/enhancements/pull/1014
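
To make the idea concrete, here is a minimal pod-spec affinity sketch (not necessarily the enhancement's final mechanism) showing how multiple architecture values can pin a workload to nodes of a matching architecture:

```yaml
# Fragment of a pod template; the architecture list would be derived from the
# architectures the operator actually publishes images for.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values:
                - amd64
                - arm64
```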

 

Cut a new release of the OLM API and update OLM API dependency version (go.mod) in OLM package; then
Bring the upstream changes from OLM-2674 to the downstream olm repo.

A/C:

 - New OLM API version release
 - OLM API dependency updated in OLM Project
 - OLM Subscription API changes  downstreamed
 - OLM Controller changes  downstreamed
 - Changes manually tested on Cluster Bot

Epic Goal

  • Enabling integration of single hub cluster to install both ARM and x86 spoke clusters
  • Enabling support for heterogeneous OCP clusters
  • document requirements and deployment flows
  • support in disconnected environment

Why is this important?

  • clients request

Scenarios

  1. Users manage both ARM and x86 machines; we should not require two different hub clusters
  2. Users manage mixed-architecture clusters without requiring all the nodes to be of the same architecture

Acceptance Criteria

  • Process is well documented
  • we are able to install in a disconnected environment

We have a set of images

  • quay.io/edge-infrastructure/assisted-installer-agent:latest
  • quay.io/edge-infrastructure/assisted-installer-controller:latest
  • quay.io/edge-infrastructure/assisted-installer:latest

that should become multiarch images. This should be done both in upstream and downstream.

As a reference, we have built internally those images as multiarch and made them available as

  • registry.redhat.io/rhai-tech-preview/assisted-installer-agent-rhel8:latest
  • registry.redhat.io/rhai-tech-preview/assisted-installer-reporter-rhel8:latest
  • registry.redhat.io/rhai-tech-preview/assisted-installer-rhel8:latest

They can be consumed by the Assisted Service pod via the following env

    - name: AGENT_DOCKER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-agent-rhel8:latest
    - name: CONTROLLER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-reporter-rhel8:latest
    - name: INSTALLER_IMAGE
      value: registry.redhat.io/rhai-tech-preview/assisted-installer-rhel8:latest

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Networking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Goal: Provide queryable metrics and telemetry for cluster routes and sharding in an OpenShift cluster.

Problem: Today we test OpenShift performance and scale with best-guess or anecdotal evidence for the number of routes that our customers use. Best practices for a large number of routes in a cluster is to shard, however we have no visibility with regard to if and how customers are using sharding.

Why is this important? These metrics will inform our performance and scale testing, documented cluster limits, and how customers are using sharding for best practice deployments.

Dependencies (internal and external):

Prioritized epics + deliverables (in scope / not in scope):

Not in scope:

Estimate (XS, S, M, L, XL, XXL):

Previous Work:

Open questions:

Acceptance criteria:

Epic Done Checklist:

  • CI - CI Job & Automated tests: <link to CI Job & automated tests>
  • Release Enablement: <link to Feature Enablement Presentation> 
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Notes for Done Checklist
    • Adding links to the above checklist with multiple teams contributing; select a meaningful reference for this Epic.
    • Checklist added to each Epic in the description, to be filled out as phases are completed - tracking progress towards “Done” for the Epic.

Description:

As described in the Design Doc, the following information needs to be exported from the Cluster Ingress Operator:

  • Number of routes/shard

Design 2 will be implemented as part of this story.

 

Acceptance Criteria:

  • Support for exporting the above mentioned metrics by Cluster Ingress Operator

Description:

As described in the Metrics to be sent via telemetry section of the Design Doc, the following metrics need to be sent from OpenShift clusters to Red Hat premises (a sketch of two of them as recording rules follows the list below):

  • Minimum Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:min  : min(route_metrics_controller_routes_per_shard)
    • Gives the minimum value of Routes per Shard.
  • Maximum Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:max  : max(route_metrics_controller_routes_per_shard)
    • Gives the maximum value of Routes per Shard.
  • Average Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:avg  : avg(route_metrics_controller_routes_per_shard)
    • Gives the average value of Routes per Shard.
  • Median Routes per Shard
    • Recording Rule – cluster:route_metrics_controller_routes_per_shard:median  : quantile(0.5, route_metrics_controller_routes_per_shard)
    • Gives the median value of Routes per Shard.
  • Number of Routes summed by TLS Termination type
    • Recording Rule – cluster:openshift_route_info:tls_termination:sum : sum (openshift_route_info) by (tls_termination)
    • Gives the number of Routes for each tls_termination value. The possible values for tls_termination are edge, passthrough and reencrypt. 
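
As an illustration only (the actual rules are delivered through the monitoring stack as described in the Design Doc), a sketch of how two of these recording rules might be expressed as a PrometheusRule; the name and namespace are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: route-metrics-telemetry          # placeholder name
  namespace: openshift-ingress-operator  # placeholder namespace
spec:
  groups:
    - name: route-metrics.rules
      rules:
        - record: cluster:route_metrics_controller_routes_per_shard:min
          expr: min(route_metrics_controller_routes_per_shard)
        - record: cluster:openshift_route_info:tls_termination:sum
          expr: sum(openshift_route_info) by (tls_termination)
```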

The metrics should be allowlisted on the cluster side.

The steps described in Sending metrics via telemetry need to be followed, specifically step 5.

Depends on CFE-478.

Acceptance Criteria:

  • Support for sending the above mentioned metrics from OpenShift clusters to the Red Hat premises by allowlisting metrics on the cluster side

This is an epic bucket for all activities surrounding the creation of a declarative approach to releasing and maintaining OLM catalogs.

Epic Goal

  • Allow Operator Authors to easily change the layout of the update graph in a single location so they can version/maintain/release it via git and have more approachable controls about graph vertices than today's replaces, skips and/or skipRange taxonomy
  • Allow Operator authors to have control over channel and bundle channel membership

Why is this important?

  • The imperative catalog maintenance approach so far with opm is being moved to a declarative format (OLM-2127 and OLM-1780) moving away from bundle-level controls but the update graph properties are still attached to a bundle
  • We've received feedback from the RHT internal developer community that maintaining and reasoning about the graph in the context of a single channel is still too hard, even with visualization tools
  • making the update graph easily changeable is important to deliver on some of the promises of declarative index configuration
  • The current interface for declarative index configuration still relies on skips, skipRange and replaces to shape the graph on a per-bundle level - this becomes too complex once there are a lot of bundles in channels; we need something at the package level

Scenarios

  1. An Operator author wants to release a new version replacing the latest version published previously
  2. After additional post-GA testing an Operator author wants to establish a new update path to an existing released version from an older, released version
  3. After finding a bug post-GA an Operator author wants to temporarily remove a known to be problematic update path
  4. An automated system wants to push a bundle in between an existing update path as a result of an Operator (base) image rebuild (Freshmaker use case)
  5. A user wants to take a declarative graph definition and turn it into a graphical image for visually ensuring the graph looks like they want
  6. An Operator author wants to promote a certain bundle to an additional / different channel to indicate progress in maturity of the operator.

Acceptance Criteria

  • The declarative format has to be user readable and terse enough to make quick modifications
  • The declarative format should be machine writeable (Freshmaker)
  • The update graph is declared and modified in a text based format aligned with the declarative config
  • it has to be possible to add/remove edges at the leaves of the graph (releasing/unpublishing a new version)
  • it has to be possible to add/remove new vertices between existing edges (releasing/retracting a new update path)
  • it has to be possible to add/remove new edges in between existing vertices (releasing/unpublishing a version in between, Freshmaker use case)
  • it has to be possible to change the channel membership of a bundle after it's published (channel promotion)
  • CI - MUST be running successfully with tests automated
  • it has to be possible to add additional metadata later to implement OLM-2087 and OLM-259 if required

Dependencies (internal and external)

  1. Declarative Index Config (OLM-2127)

Previous Work:

  1. Declarative Index Config (OLM-1780)

Related work

Open questions:

  1. What other manipulation scenarios are required?
    1. Answer: deprecation of content in the spirit of OLM-2087
    2. Answer: cross-channel update hints as described in OLM-2059 if that implementation requires it

 

When working on this Epic, it's important to keep in mind this other potentially related Epic: https://issues.redhat.com/browse/OLM-2276

 

Jira Description

As an OPM maintainer, I want to downstream the PR (for OCP 4.12) and backport it to OCP 4.11 so that IIB will NOT be impacted by the changes when it upgrades the OPM version to use the next/future opm upstream release (v1.25.0).

Summary / Background

IIB (the downstream service that manages the indexes) uses the upstream version. If they bump the OPM version to the next/future (v1.25.0) release with this change before the downstream images are updated, then the process to manage the indexes downstream will face issues and it will impact the distributions. 

Acceptance Criteria

  • The changes in the PR are available for the releases which use FBC -> OCP 4.11, 4.12

Definition of Ready

  • PRs merged into downstream OCP repos branches 4.11/4.12

Definition of Done

  • We checked that the downstream images are with the changes applied (i.e.: we can try to verify in the same way that we checked if the changes were in the downstream for the fix OLM-2639 )

Enhance the veneer rendering to be able to read the input veneer data from stdin, via a pipe, in a manner similar to https://dev.to/napicella/linux-pipes-in-golang-2e8j

then the command could be used in a manner similar to many k8s examples like

```shell
opm alpha render-veneer semver -o yaml < infile > outfile
```

Upstream issue link: https://github.com/operator-framework/operator-registry/issues/1011

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Epic Goal

  • Change the default value for the spec.tuningOptions.maxConnections field in the IngressController API, which configures the HAProxy maxconn setting, to 50000 (fifty thousand).
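
For context, the field being re-defaulted can already be set explicitly today; a minimal sketch of an IngressController carrying the new default value:

```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  tuningOptions:
    maxConnections: 50000   # explicit value; with the new default this line would be unnecessary
```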

Why is this important?

  • The maxconn setting constrains the number of simultaneous connections that HAProxy accepts. Beyond this limit, the kernel queues incoming connections. 
  • Increasing maxconn enables HAProxy to queue incoming connections intelligently.  In particular, this enables HAProxy to respond to health probes promptly while queueing other connections as needed.
  • The default setting of 20000 has been in place since OpenShift 3.5 was released in April 2017 (see BZ#1405440, commit, RHBA-2017:0884). 
  • Hardware capabilities have increased over time, and the current default is too low for typical modern machine sizes. 
  • Increasing the default setting improves HAProxy's performance at an acceptable cost in the common case. 

Scenarios

  1. As a cluster administrator who is installing OpenShift on typical hardware, I want OpenShift router to be tuned appropriately to take advantage of my hardware's capabilities.

Acceptance Criteria

  • CI is passing. 
  • The new default setting is clearly documented. 
  • A release note informs cluster administrators of the change to the default setting. 

Dependencies (internal and external)

  1. None.

Previous Work (Optional):

  1. The  haproxy-max-connections-tuning enhancement made maxconn configurable without changing the default.  The enhancement document details the tradeoffs in terms of memory for various settings of nbthreads and maxconn with various numbers of routes. 

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

OCP/Telco Definition of Done

Epic Template descriptions and documentation.

Epic Goal

Why is this important?

  • This regression is a major performance and stability issue and it has happened once before.

Drawbacks

  • The E2E test may be complex due to trying to determine which DNS pods are responding to DNS requests. This becomes straightforward using the chaos plugin.

Scenarios

  • CI Testing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. SDN Team

Previous Work (Optional):

  1. N/A

Open questions::

  1. Where do these E2E test go? SDN Repo? DNS Repo?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Enable the chaos plugin https://coredns.io/plugins/chaos/ in our CoreDNS configuration so that we can use a DNS query to easily identify what DNS pods are responding to our requests.
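
A sketch of what the relevant Corefile server block might look like with the chaos plugin enabled, shown inside a ConfigMap purely for illustration (in OpenShift the Corefile is rendered by the DNS operator, so the exact delivery mechanism and values here are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-default          # illustrative; the DNS operator owns the real Corefile
  namespace: openshift-dns
data:
  Corefile: |
    .:5353 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
        }
        chaos CoreDNS-chaos "openshift-dns"   # answers CH-class queries such as version.bind / hostname.bind
        cache 30
    }
```

With chaos enabled, a CH-class TXT query against a DNS endpoint returns the configured response, which makes it easy to identify which pod answered.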

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

When OCP is performing a cluster upgrade, the user should be notified about this fact.

There are a few possibilities for how to surface the cluster upgrade to users:

  • Display a console notification throughout OCP web UI saying that the cluster is currently under upgrade.
  • Global notification throughout OCP web UI saying that the cluster is currently under upgrade.
  • Have an alert firing for all the users of OCP stating the cluster is undergoing an upgrade. 

 

AC:

  • Console-operator will create a ConsoleNotification CR when the cluster is being upgraded. Once the upgrade is done, console-operator will remove that CR (a sketch of such a CR follows below). These are the three statuses based on which we determine whether the cluster is being upgraded.
  • Add unit tests
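
A minimal sketch of the kind of CR the console-operator could create (the name and, in particular, the colors are placeholders pending the UX note below):

```yaml
apiVersion: console.openshift.io/v1
kind: ConsoleNotification
metadata:
  name: cluster-upgrade        # placeholder name chosen by console-operator
spec:
  text: This cluster is currently being upgraded.
  location: BannerTop
  color: '#ffffff'             # placeholder colors
  backgroundColor: '#0088ce'
```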

 

Note: We need to decide if we want to distinguish this particular notification by a different color? ccing Ali Mobrem 

 

Created from: https://issues.redhat.com/browse/RFE-3024

As a console user I want to have option to:

  • Restart Deployment
  • Retry latest DeploymentConfig if it failed

 

For Deployments we will add the 'Restart rollout' action button. This action will PATCH the Deployment object's 'spec.template.metadata.annotations' block, by adding 'openshift.io/restartedAt: <actual-timestamp>' annotation. This will restart the deployment, by creating a new ReplicaSet.

  • action is disabled if:
    • Deployment is paused

 

For DeploymentConfig we will add 'Retry rollout' action button.  This action will PATCH the latest revision of ReplicationController object's 'metadata.annotations' block by setting 'openshift.io/deployment/phase: "New"' and removing openshift.io/deployment.cancelled and openshift.io/deployment.status-reason.

  • action is enabled if:
    • latest revision of the ReplicationController resource is in Failed phase
  • action is disabled if:
    • latest revision of the ReplicationController resource is in Complete phase
    • DeploymentConfig does not have any rollouts
    • DeploymentConfig is paused
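
Putting the two PATCH bodies described above side by side, a minimal sketch of the annotation changes (the timestamp is illustrative, the annotation keys are taken verbatim from this story, and the cancelled/status-reason annotations are removed in the same PATCH):

```yaml
# 'Restart rollout' patch against the Deployment:
spec:
  template:
    metadata:
      annotations:
        openshift.io/restartedAt: "2023-01-01T00:00:00Z"
---
# 'Retry rollout' patch against the latest ReplicationController of the DeploymentConfig:
metadata:
  annotations:
    openshift.io/deployment/phase: "New"
```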

 

Acceptance Criteria:

  • Add the 'Restart rollout' action button for the Deployment resource to both action menu and kebab menu
  • Add the 'Retry rollout' action button for the DeploymentConfig resource to both action menu and kebab menu

 

BACKGROUND:

OpenShift console will be updated to allow rollout restart deployment from the console itself.

Currently, from the OpenShift console, for the resource “deploymentconfigs” we can only start and pause the rollout, and for the resource “deployment” we can only resume the rollout. Neither resource (Deployment or DeploymentConfig) has an option to restart the rollout, which is why the customer wants this functionality to perform the same action from the OpenShift console as from the CLI.

The customer wants developers who are not fluent with the oc tool and terminal utilities to be able to use the console instead of the terminal to restart a deployment, just as they would through the CLI with the command “oc rollout restart deploy/<deployment-name>“.
Usually when developers change the ConfigMap that a deployment uses, they have to restart the pods. Currently, the developers have to use the oc rollout restart deployment command. The customer wants a button/menu to perform the same action from the console as well.

Design
Doc: https://docs.google.com/document/d/1i-jGtQGaA0OI4CYh8DH5BBIVbocIu_dxNt3vwWmPZdw/edit

As a developer, I want to make status.HostIP for Pods visible in the Pod details page of the OCP Web Console. Currently there is no way to view the node IP for a Pod in the OpenShift Web Console.  When viewing a Pod in the console, the field status.HostIP is not visible.

 

Acceptance criteria:

  • Make pod's HostIP field visible in the pod details page, similarly to PodIP field
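
For reference, the field already exists on every scheduled Pod; a trimmed status excerpt (addresses are illustrative):

```yaml
status:
  phase: Running
  hostIP: 10.0.0.12     # node IP to surface in the details page, next to podIP
  podIP: 10.128.2.25
```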

Feature Overview (aka. Goal Summary)  

The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike. 

Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.

In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.

The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike. 

For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing. 

This includes: 

  • Conditions
  • Some Logging 
  • Possibly Some Events 

While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic.  I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state". 

 

Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing

 

https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing

 

The current property description is:

configuration represents the current MachineConfig object for the machine config pool.

But in a 4.12.0-ec.4 cluster, the actual semantics seem to be something closer to "the most recent rendered config that we completely leveled on". We should at least update the godocs to be more specific about the intended semantics. And perhaps consider adjusting the semantics?

Feature Overview

Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.

Goals

  • Simplicity The folks preparing and installing OpenShift clusters (typically SNO) at the Far Edge range in technical expertise from technician to barista. The preparation and installation phases need to be reduced to a human-readable script that can be utilized by a variety of non-technical operators. There should be as few steps as possible in both the preparation and installation phases.
  • Minimize Deployment Time A telecommunications provider technician or brick-and-mortar employee who is installing an OpenShift cluster, at the Far Edge site, needs to be able to do it quickly. The technician has to wait for the node to become in-service (CaaS and CNF provisioned and running) before they can move on to installing another cluster at a different site. The brick-and-mortar employee has other job functions to fulfill and can't stare at the server for 2 hours. The install time at the far edge site should be in the order of minutes, ideally less than 20m.
  • Utilize Telco Facilities Telecommunication providers have existing Service Depots where they currently prepare SW/HW prior to shipping servers to Far Edge sites. They have asked RH to provide a simple method to pre-install OCP onto servers in these facilities. They want to do parallelized batch installation to a set of servers so that they can put these servers into a pool from which any server can be shipped to any site. They also would like to validate and update servers in these pre-installed server pools, as needed.
  • Validation before Shipment Telecommunications Providers incur a large cost if forced to manage software failures at the Far Edge due to the scale and physical disparate nature of the use case. They want to be able to validate the OCP and CNF software before taking the server to the Far Edge site as a last minute sanity check before shipping the platform to the Far Edge site.
  • IPSec Support at Cluster Boot Some far edge deployments occur on an insecure network, and for that reason access to the host’s BMC is not allowed; additionally, an IPSec tunnel must be established before any traffic leaves the cluster once it is at the Far Edge site. It is not possible to enable IPSec on the BMC NIC, and therefore even after OpenShift has booted the BMC is still not accessible.

Requirements

  • Factory Depot: Install OCP with minimal steps
    • Telecommunications Providers don't want an installation experience, just pick a version and hit enter to install
    • Configuration w/ DU Profile (PTP, SR-IOV, see telco engineering for details) as well as customer-specific addons (Ignition Overrides, MachineConfig, and other operators: ODF, FEC SR-IOV, for example)
    • The installation cannot increase the in-service OCP compute budget (don't install anything other than what is needed for DU)
    • Provide ability to validate previously installed OCP nodes
    • Provide ability to update previously installed OCP nodes
    • 100 parallel installations at Service Depot
  • Far Edge: Deploy OCP with minimal steps
    • Provide site specific information via usb/file mount or simple interface
    • Minimize time spent at far edge site by technician/barista/installer
    • Register with desired RHACM Hub cluster for ongoing LCM
  • Minimal ongoing maintenance of solution
    • Some, but not all telco operators, do not want to install and maintain an OCP / ACM cluster at Service Depot
  • The current IPSec solution requires a libreswan container to run on the host so that all N/S OCP traffic is encrypted. With the current IPSec solution this feature would need to support provisioning host-based containers.

 

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?

 

Describe Use Cases (if needed)

Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.

 

Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.

 

Out of Scope

Q: how challenging will it be to support multi-node clusters with this feature?

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

< Are there Upgrade considerations that customers need to account for or that the feature should address on behalf of the customer?>

<Does the Feature introduce data that could be gathered and used for Insights purposes?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

< What does success look like?>

< Does this feature have doc impact?  Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact>

< If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Interoperability Considerations

< Which other products and versions in our portfolio does this feature impact?>

< What interoperability test scenarios should be factored by the layered product(s)?>

Questions

Question Outcome
   

 

 

Epic Goal

  • Install SNO within 10 minutes

Why is this important?

  • SNO installation takes around 40+ minutes.
  • This makes SNO less appealing when compared to k3s/microshift.
  • We should analyze the SNO installation, figure out why it takes so long, and come up with ways to optimize it

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. https://docs.google.com/document/d/1ULmKBzfT7MibbTS6Sy3cNtjqDX1o7Q0Rek3tAe1LSGA/edit?usp=sharing

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This is a clone of issue OCPBUGS-14416. The following is the description of the original issue:

Description of problem:

When installing SNO with bootstrap in place the cluster-policy-controller hangs for 6 minutes waiting for the lease to be acquired. 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.Run the PoC using the makefile here https://github.com/eranco74/bootstrap-in-place-poc
2.Observe the cluster-policy-controller logs post reboot

Actual results:

I0530 16:01:18.011988       1 leaderelection.go:352] lock is held by leaderelection.k8s.io/unknown and has not yet expired
I0530 16:01:18.012002       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-policy-controller-lock
I0530 16:07:31.176649       1 leaderelection.go:258] successfully acquired lease kube-system/cluster-policy-controller-lock

Expected results:

Expected the bootstrap cluster-policy-controller to release the lease so that the cluster-policy-controller running post reboot won't have to wait for the lease to expire.  

Additional info:

Suggested resolution for bootstrap in place: https://github.com/openshift/installer/pull/7219/files#diff-f12fbadd10845e6dab2999e8a3828ba57176db10240695c62d8d177a077c7161R44-R59

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Epic Goal

  • Update OpenShift components that are owned by the Builds + Jenkins Team to use Kubernetes 1.25

Why is this important?

  • Our components need to be updated to ensure that they are using the latest bug/CVE fixes, features, and that they are API compatible with other OpenShift components.

Acceptance Criteria

  • Existing CI/CD tests must be passing

This is epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.

Today the links point at a rule-scoped page, but that page lacks information about recommended resolution.  You can click through by cluster ID to your specific cluster and get that recommendation advice, but it would be more convenient and less confusing for customers if we linked directly to the cluster-scoped recommendation page.

We can implement by updating the template here to be:

fmt.Sprintf("https://console.redhat.com/openshift/insights/advisor/clusters/%s?first=%s%%7C%s", clusterID, ruleIDStr, rec.ErrorKey)

or something like that.
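
For illustration, a small sketch of the resulting link with hypothetical values; %%7C renders as the URL-encoded "|" separating the rule ID from the error key:

package main

import "fmt"

func main() {
	// Hypothetical values, only to show the shape of the resulting link;
	// in the operator these come from the recommendation being reported.
	clusterID := "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
	ruleIDStr := "some_rule_module"
	errorKey := "SOME_ERROR_KEY"

	url := fmt.Sprintf("https://console.redhat.com/openshift/insights/advisor/clusters/%s?first=%s%%7C%s",
		clusterID, ruleIDStr, errorKey)
	fmt.Println(url)
	// https://console.redhat.com/openshift/insights/advisor/clusters/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee?first=some_rule_module%7CSOME_ERROR_KEY
}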

 

unknowns

request is clear, solution/implementation to be further clarified

This epic contains all the Dynamic Plugins related stories for OCP release-4.11 

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

  •  

This story only covers API components. We will create a separate story for other utility functions.

Today we are generating documentation for Console's Dynamic Plugin SDK in
frontend/packages/dynamic-plugin-sdk. We are missing ts-doc for a set of hooks and components.

We are generating the markdown from the dynamic-plugin-sdk using

yarn generate-doc

Here is the list of the API that the dynamic-plugin-sdk is exposing:

https://gist.github.com/spadgett/0ddefd7ab575940334429200f4f7219a

Acceptance Criteria:

  • Add missing jsdocs for the API that dynamic-plugin-sdk exposes

Out of Scope:

  • This does not include work for integrating the API docs into the OpenShift docs
  • This does not cover other public utilities, only components.

This epic contains all the Dynamic Plugins related stories for OCP release-4.12

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

We should have a global notification, or the `Console plugins` page (e.g., k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins) should alert users, when the console operator's `spec.managementState` is `Unmanaged`, since changes to `enabled` for plugins will have no effect.

Move `frontend/public/components/nav` to `packages/console-app/src/components/nav` and address any issues resulting from the move.

There will be some expected lint errors relating to cyclical imports. These will require some refactoring to address.

Following https://coreos.slack.com/archives/C011BL0FEKZ/p1650640804532309, it would be useful for us (network observability team) to have access to ResourceIcon in dynamic-plugin-sdk.

Currently ResourceLink is exported but not ResourceIcon

 

AC:

  • Export the ResourceIcon from public to dynamic-plugin-sdk
  • Add the component to the dynamic-demo-plugin
  • Add a CI test to check for the ResourceIcon component

 

The console has good error boundary components that are useful for dynamic plugins.
Exposing them will enable plugins to handle React errors with the same look and feel as the console.
The minimum requirement right now is to expose the ErrorBoundaryFallbackPage component from
https://github.com/openshift/console/blob/master/frontend/packages/console-shared/src/components/error/fallbacks/ErrorBoundaryFallbackPage.tsx

During the development of https://issues.redhat.com/browse/CONSOLE-3062, it was determined additional information is needed in order to assist a user when troubleshooting a Failed plugin (see https://github.com/openshift/console/pull/11664#issuecomment-1159024959). As it stands today, there is no data available to the console to relay to the user regarding why the plugin Failed. Presumably, a message should be added to NotLoadedDynamicPlugin to address this gap.

 

AC: Add `message` property to NotLoadedDynamicPluginInfo type.

Currently the ConsolePlugins API version is v1alpha1. Since we are going GA with dynamic plugins we should be creating a v1 version.

This would require updates in following repositories:

  1. openshift/api (add the v1 version and generate a new CRD)
  2. openshift/client-go (pick up the changes in the openshift/api repo and generate clients & informers for the new v1 version)
  3. openshift/console-operator repository will use both the new v1 version and v1alpha1 in code and in the manifests folder.

AC:

  • both v1 and v1alpha1 ConsolePlugins should be passed to the console-config.yaml when the plugins are enabled and present on the cluster.

 

NOTE: This story does not include the conversion webhook change which will be created as a follow on story

We neither use nor support static plugin nav extensions anymore so we should remove the API in the static plugin SDK and get rid of related cruft in our current nav components.

 

AC: Remove static plugin nav extensions code. Check the navigation code for any references to the old API.

`@openshift-console/plugin-shared` (NPM) is a package that will contain shared components that can be upversioned separately by the Plugins so they can keep core compatibility low but upversion and support more shared components as we need them.

This isn't documented today. We need to do that.

Acceptance Criteria

  • Add a note in the "SDK packages" section of the README about the existence of this package and its purpose
    • The purpose: a static utility delivery library that is not tied to OpenShift Console versions and is compatible with multiple versions of OpenShift Console

Based on API review CONSOLE-3145, we have decided to deprecate the following APIs:

  • useAccessReviewAllowed (use useAccessReview instead)
  • useSafetyFirst

cc Andrew Ballantyne Bryan Florkiewicz 

Currently our `api.md` does not generate docs with "tags" (aka `@deprecated`) – we'll need to add that functionality to the `generate-doc.ts` script. See the code that works for `console-extensions.md`

The extension `console.dashboards/overview/detail/item` doesn't constrain the content to fit the card.

The details-card has an expectation that a <dd> item will be the last item (for spacing between items). Our static details-card items use a component called 'OverviewDetailItem'. This isn't enforced in the extension, so plugin-provided content can cause undesired padding issues if it does whatever it wants.

I feel our approach here should be to make the extension take the props of 'OverviewDetailItem', where 'children' is the new 'component'.

Acceptance Criteria:

  • Deprecate the old extension (in docs, with date/stamp)
  • Make a new extension that applies a stricter type
  • Include this new extension next to the old one (with the error boundary around it)

When defining two proxy endpoints,

apiVersion: console.openshift.io/v1alpha1
kind: ConsolePlugin
metadata:
  ...
  name: forklift-console-plugin
spec:
  displayName: Console Plugin Template
  proxy:
    - alias: forklift-inventory
      authorize: true
      service:
        name: forklift-inventory
        namespace: konveyor-forklift
        port: 8443
      type: Service
    - alias: forklift-must-gather-api
      authorize: true
      service:
        name: forklift-must-gather-api
        namespace: konveyor-forklift
        port: 8443
      type: Service
  service:
    basePath: /

I get two proxy endpoints
/api/proxy/plugin/forklift-console-plugin/forklift-inventory
and
/api/proxy/plugin/forklift-console-plugin/forklift-must-gather-api

but both proxy to the `forklift-must-gather-api` service

e.g.
curl to:
[server url]/api/proxy/plugin/forklift-console-plugin/forklift-inventory
will point to the `forklift-must-gather-api` service, instead of the `forklift-inventory` service

To align with https://github.com/openshift/dynamic-plugin-sdk, the plugin metadata field `dependencies`, as well as the `@console/pluginAPI` entry contained within, should be made optional.

If a plugin doesn't declare the @console/pluginAPI dependency, the Console release version check should be skipped for that plugin.

This epic contains all the OLM related stories for OCP release-4.12

Epic Goal

  • Track all the stories under a single epic

This enhancement introduces support for provisioning and upgrading heterogeneous architecture clusters in phases.

 

We need to scan through the compute nodes and build a set of supported architectures from those. Each node on the cluster has a label for architecture, e.g. `kubernetes.io/arch=arm64`, `kubernetes.io/arch=amd64`, etc. Based on the set of supported architectures, the console will need to surface in OperatorHub only those operators which are supported on our nodes. Each operator's PackageManifest contains labels that indicate the operator's supported architectures, e.g. `operatorframework.io/arch.s390x: supported`. An operator can be supported on multiple architectures.

AC:

  1. Implement logic in the console's backend to read the set of architecture types from console-config.yaml and set it as a SERVER_FLAG.nodeArchitectures (Change similar to https://github.com/openshift/console/commit/39aabe171a2e89ed3757ac2146d252d087fdfd33)
  2. In OperatorHub, render only operators that are supported on any given node, based on the SERVER_FLAG.nodeArchitectures field implemented in CONSOLE-3242.

 

OS and arch filtering: https://github.com/openshift/console/blob/2ad4e17d76acbe72171407fc1c66ca4596c8aac4/frontend/packages/operator-lifecycle-manager/src/components/operator-hub/operator-hub-items.tsx#L49-L86

 

@jpoulin is good to ask about heterogeneous clusters.

This enhancement introduces support for provisioning and upgrading heterogeneous architecture clusters in phases.

 

We need to scan through the compute nodes and build a set of supported architectures from those. Each node on the cluster has a label for architecture, e.g. kubernetes.io/arch=arm64, kubernetes.io/arch=amd64, etc. Based on the set of supported architectures, the console will need to surface in OperatorHub only those operators which are supported on our nodes.

 

AC: 

  1. Implement logic in the console-operator that will scan through all the nodes, build a set of all the architecture types that the cluster nodes run on, and pass it to the console-config.yaml (see the sketch below)
  2. Add unit and e2e test cases in the console-operator repository.

 

@jpoulin is good to ask about heterogeneous clusters.
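
For illustration, a minimal sketch of that scan using client-go; this is not the console-operator's actual implementation, and the function name is hypothetical:

package nodearch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// NodeArchitectures lists the cluster nodes and collects the distinct values
// of the kubernetes.io/arch label; the result would then be written into
// console-config.yaml for the frontend to consume.
func NodeArchitectures(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var archs []string
	for _, n := range nodes.Items {
		if arch, ok := n.Labels["kubernetes.io/arch"]; ok && !seen[arch] {
			seen[arch] = true
			archs = append(archs, arch)
		}
	}
	return archs, nil
}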

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As a developer, I want to be able to clean up the css markup after making the css / scss changes required for dark mode and remove any old unused css / scss content. 

 

Acceptance criteria:

  • Remove any unused scss / css content after revamping for dark mode

Epic Goal

  • Enable OpenShift IPI Installer to deploy OCP to a shared VPC in GCP.
  • The host project is where the VPC and subnets are defined. Those networks are shared to one or more service projects.
  • Objects created by the installer are created in the service project where possible. Firewall rules may be the only exception.
  • Documentation outlines the needed minimal IAM for both the host and service project.

Why is this important?

  • Shared VPCs are a feature of GCP to enable granular separation of duties for organizations that centrally manage networking but delegate other functions and separation of billing. This is used more often in larger organizations where separate teams manage subsets of the cloud infrastructure. Enterprises that use this model would also like to create IPI clusters so that they can leverage the features of IPI. Currently, organizations that use Shared VPCs must use UPI and implement the features of IPI themselves. This is repetitive engineering of little value to the customer and an increased risk of drift from upstream IPI over time. As new features are built into IPI, organizations must become aware of those changes and implement them themselves instead of getting them "for free" during upgrades.

Scenarios

  1. Deploy cluster(s) into service project(s) on network(s) shared from a host project.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, I want to be able to:

  • skip creating service accounts in Terraform when using passthrough credentialsMode.
  • pass the installer service account to Terraform to be used as the service account for instances when using passthrough credentialsMode.

so that I can achieve

  • creating an IPI cluster using Shared VPC networks using a pre-created service account with the necessary permissions in the Host Project.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

1. Proposed title of this feature request
Basic authentication for Helm Chart repository in helmchartrepositories.helm.openshift.io CRD.

2. What is the nature and description of the request?
As of v4.6.9, the HelmChartRepository CRD only supports client TLS authentication through spec.connectionConfig.tlsClientConfig.

3. Why do you need this? (List the business requirements here)
Basic authentication is widely used by many chart repository managers (Nexus OSS, Artifactory, etc.).
The Helm CLI also supports it with the helm repo add command:
https://helm.sh/docs/helm/helm_repo_add/

4. How would you like to achieve this? (List the functional requirements here)
Probably by extending the CRD:

spec:
  connectionConfig:
    username: username
    password:
      secretName: secret-name

The secret namespace should be openshift-config to align with the tlsClientConfig behavior.

5. For each functional requirement listed in question 4, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Trying to pull Helm charts from remote private chart repositories that have disabled anonymous access and offer basic authentication.
E.g.: https://github.com/sonatype/docker-nexus

Owner: Architect:

Story (Required)

As an OCP user, I would like to be able to install Helm charts from repos added to ODC with basic authentication fields populated.

Background (Required)

We need to support helm installs for Repos that have the basic authentication secret name and namespace.

Glossary

Out of scope

Updating the ProjectHelmChartRepository CRD (already done in a different story).
Supporting the HelmChartRepository CR; this feature will be scoped first to project/namespace-scoped repos.

In Scope

<Defines what is included in this story>

Approach(Required)

If the new fields for basic auth are set in the repo CR, then use those credentials when making API calls to Helm to install/upgrade charts. We will error out if the logged-in user does not have access to the secret referenced by the repo CR. If the basic auth fields are not present, we assume the repo is not authenticated.
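
A sketch of that approach under the assumptions above (the secret lookup, key names and helper are illustrative; Helm's getter package does provide a basic-auth option):

package helmrepo

import (
	"context"

	"helm.sh/helm/v3/pkg/getter"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// basicAuthOptions reads the secret referenced by the repo CR and, when
// username/password are present, returns Helm getter options carrying them.
func basicAuthOptions(ctx context.Context, client kubernetes.Interface, namespace, secretName string) ([]getter.Option, error) {
	secret, err := client.CoreV1().Secrets(namespace).Get(ctx, secretName, metav1.GetOptions{})
	if err != nil {
		// e.g. the logged-in user has no access to the referenced secret
		return nil, err
	}
	username := string(secret.Data["username"]) // illustrative key names
	password := string(secret.Data["password"])
	if username == "" && password == "" {
		// no basic auth fields: treat the repo as unauthenticated
		return nil, nil
	}
	return []getter.Option{getter.WithBasicAuth(username, password)}, nil
}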

Dependencies

None

Edge Case

NA

Acceptance Criteria

I can list, install and update charts on authenticated repos from ODC
Needs Documentation both upstream and downstream
Needs new unit test covering repo auth

INVEST Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Legend

Unknown
Verified
Unsatisfied

Epic Goal

  • Support manifest lists in image streams and the integrated registry. Clients should be able to pull/push manifest lists from/into the integrated registry. They should also be able to import images via `oc import-image` and then pull them from the internal registry.

Why is this important?

  • Manifest lists are becoming more and more popular. Customers want to mirror manifest lists into the registry and be able to pull them by digest.

Scenarios

  1. Manifest lists can be pushed into the integrated registry
  2. Imported manifest lists can be pulled from the integrated registry
  3. Image triggers work with manifest lists

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Existing functionality shouldn't change its behavior

Dependencies (internal and external)

  1. ...

Previous Work (Optional)

  1. https://github.com/openshift/enhancements/blob/master/enhancements/manifestlist/manifestlist-support.md

Open questions

  1. Can we merge creation of images without having the pruner?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ACCEPTANCE CRITERIA

  • The ImageStream object should contain a new flag indicating that it refers to a manifest list
  • openshift-controller-manager uses new openshift/api code to import image streams
  • changing `importMode` of an image stream tag triggers a new import (i.e. updates generation in the tag spec)

NOTES
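
For illustration of the last criterion, a minimal sketch of changing `importMode` on a tag, assuming the 4.12 image API exposes spec.tags[].importPolicy.importMode with the values "Legacy" and "PreserveOriginal" (the helper and tag name are hypothetical):

package imagestreams

import (
	"context"

	imageclient "github.com/openshift/client-go/image/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// setImportMode flips importMode on the "latest" tag. A merge patch replaces
// the whole tags list, so this is illustrative only; per the criteria above,
// the change should bump the tag's generation and trigger a new import.
func setImportMode(ctx context.Context, c imageclient.Interface, namespace, name string) error {
	patch := []byte(`{"spec":{"tags":[{"name":"latest","importPolicy":{"importMode":"PreserveOriginal"}}]}}`)
	_, err := c.ImageV1().ImageStreams(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}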

This is a follow up Epic to https://issues.redhat.com/browse/MCO-144, which aimed to get in-place upgrades for Hypershift. This epic aims to capture additional work to focus on using CoreOS/OCP layering into Hypershift, which has benefits such as:

 

 - removing or reducing the need for ignition

 - maintaining feature parity between self-driving and managed OCP models

 - adding additional functionality such as hotfixes

Right now in https://github.com/openshift/hypershift/pull/1258 you can only perform one upgrade at a time. Multiple upgrades will break due to controller logic

 

Properly create logic to handle manifest creation/updates and deletion, so the logic is more bulletproof

Currently not implemented, and will require the MCD hypershift mode to be adjusted to handle disruptionless upgrades like regular MCD

We plan to build Ironic Container Images using RHEL9 as base image in OCP 4.12

This is required because the ironic components have abandoned support for CentOS Stream 8 and Python 3.6/3.7 upstream during the most recent development cycle that will produce the stable Zed release, in favor of CentOS Stream 9 and Python 3.8/3.9

More info on RHEL8 to RHEL9 transition in OCP can be found at https://docs.google.com/document/d/1N8KyDY7KmgUYA9EOtDDQolebz0qi3nhT20IOn4D-xS4

Epic Goal

  • We need the installer to accept an LB type from the user so that we can set the type of LB in the following object:
    oc get ingress.config.openshift.io/cluster -o yaml
    Then we can fetch info from this object and reconcile the operator to have the NLB changes reflected.

 

This is an API change and we will consider this as a feature request.

Why is this important?

https://issues.redhat.com/browse/NE-799 Please check this for more details

 

Scenarios

https://issues.redhat.com/browse/NE-799 Please check this for more details

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. installer
  2. ingress operator

Previous Work (Optional):

 No

Open questions::

N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need tests for the ovirt-csi-driver and the cluster-api-provider-ovirt. These tests help us to

  • minimize bugs,
  • reproduce and fix them faster and
  • pin down current behavior of the driver

Also, having dedicated tests on lower levels with a smaller scope (unit, integration, ...) has the following benefits:

  • fast feedback cycle (local test execution)
  • developer in-code documentation
  • easier onboarding for new contributors
  • lower resource consumption
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

As a user, in the topology view, I would like to be updated intuitively if any of the deployments have reached quota limits.

Acceptance Criteria

  1. Show a yellow border around deployments if any of the deployments have reached the quota limit
  2. For deployments, if there are any errors associated with resource limits or quotas, include a warning alert in the side panel.
    1. If we know resource limits are the cause, include link to Edit resource limits
    2. If we know pod count is the cause, include a link to Edit pod count

Additional Details:

 

Refer below for more details 

Description

As a user, I would like to be informed in an intuitive way when quotas have been reached in a namespace.

Acceptance Criteria

  1. Show an alert banner on the Topology and Add pages for this project/namespace when there is a RQ (Resource Quota) / ACRQ (Applied Cluster Resource Quota) issue
    PF guideline: https://www.patternfly.org/v4/components/alert/design-guidelines#using-alerts 
  2. The above alert should have a CTA link to the search page listing all RQs and ACRQs; if there is just one, show its details page
  3. For the RQ and ACRQ list views, show one more column called Status with details as shown in the project view.

Additional Details:

 

Refer below for more details 

Goal

Provide a form driven experience to allow cluster admins to manage the perspectives to meet the ACs below.

Problem:

We have heard the following requests from customers and developer advocates:

  • Some admins do not want to provide access to the Developer Perspective from the console
  • Some admins do not want to provide non-priv users access to the Admin Perspective from the console

Acceptance criteria:

  1. Cluster administrator is able to "hide" the admin perspective for non-priv users
  2. Cluster administrator is able to "hide" the developer perspective for all users
  3. Be sure that User Preferences for individual users behave appropriately. If only one perspective is available, the perspective switcher is not needed.

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As an admin, I should be able to see a code snippet that shows how to add user perspectives

Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, the cluster admin can add user perspectives

To support the cluster admin in configuring the perspectives correctly, the developer console should provide a code snippet for the customization of the YAML resource (Console CRD).

Customize Perspective Enhancement PR: https://github.com/openshift/enhancements/pull/1205

Acceptance Criteria

  1. When the admin opens the Console CRD there is a snippet in the sidebar which provides a default YAML which supports the admin to add user perspectives

Additional Details:

Previous work:

  1. https://issues.redhat.com/browse/ODC-5080
  2. https://issues.redhat.com/browse/ODC-5449

Description

As an admin, I want to hide the admin perspective for non-privileged users or hide the developer perspective for all users

Based on the https://issues.redhat.com/browse/ODC-6730 enhancement proposal, it is required to extend the console configuration CRD to enable the cluster admins to configure this data in the console resource

Acceptance Criteria

  1. Extend the "customization" spec type definition for the CRD in the openshift/api project

Additional Details:

Previous customization work:

  1. https://issues.redhat.com/browse/ODC-5416
  2. https://issues.redhat.com/browse/ODC-5020
  3. https://issues.redhat.com/browse/ODC-5447

Description

As an admin, I want to hide user perspective(s) based on the customization.

Acceptance Criteria

  1. Hide perspective(s) based on the customization
    1. When the admin perspective is disabled -> we hide the admin perspective for all unprivileged users
    2. When the dev perspective is disabled -> we hide the dev perspective for all users
  2. When all the perspectives are hidden from a user or for all users, show the Admin perspective by default

Additional Details:

Description

As an admin, I want to be able to use a form driven experience  to hide user perspective(s)

Acceptance Criteria

  1. Add checkboxes with the options
    1. Hide "Administrator" perspective for non-privileged users
    2.  Hide "Developer" perspective for all users
  2. The console configuration CR should be updated as per the selected option

Additional Details:

Problem:

Customers don't want their users to have access to some/all of the items which are available in the Developer Catalog.  The request is to change access for the cluster, not per user or persona.

Goal:

Provide a form driven experience to allow cluster admins to easily disable the Developer Catalog, or one or more of the sub-catalogs in the Developer Catalog.

Why is it important?

Multiple customer requests.

Acceptance criteria:

  1. As a cluster admin, I can hide/disable access to the developer catalog for all users across all namespaces.
  2. As a cluster admin, I can hide/disable access to a specific sub-catalog in the developer catalog for all users across all namespaces.
    1. Builder Images
    2. Templates
    3. Helm Charts
    4. Devfiles
    5. Operator Backed

Notes

We need to consider how this will work with subcatalogs which are installed by operators: VMs, Event Sources, Event Catalogs, Managed Services, Cloud based services

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As an admin, I want to hide/disable access to specific sub-catalogs in the developer catalog or the complete dev catalog for all users across all namespaces.

Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, it is required to extend the console configuration CRD to enable the cluster admins to configure this data in the console resource

Acceptance Criteria

Extend the "customization" spec type definition for the CRD in the openshift/api project

Additional Details:

Previous customization work:

  1. https://issues.redhat.com/browse/ODC-5416
  2. https://issues.redhat.com/browse/ODC-5020
  3. https://issues.redhat.com/browse/ODC-5447

Description

As an admin, I want to hide sub-catalogs in the developer catalog or hide the developer catalog completely based on the customization.

Acceptance Criteria

  1. Hide all links to the sub-catalog(s) from the add page, topology actions, empty states, quick search, and the catalog itself
  2. The sub-catalog should show Not found if the user opens the sub-catalog directly
  3. The feature should not be hidden if a sub-catalog option is disabled

Additional Details:

Description

As a cluster-admin, I should be able to see a code snippet that shows how to enable sub-catalogs or the entire dev catalog.

Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, the cluster admin can add sub-catalog(s)  from the Developer Catalog or the Dev catalog as a whole.

To support the cluster-admin to configure the sub-catalog list correctly, the developer console should provide a code snippet for the customization yaml resource (Console CRD).

Acceptance Criteria

  1. When the admin opens the Console CRD there is a snippet in the sidebar which provides a default YAML, which supports the admin to add sub-catalogs/the whole dev catalog

Additional Details:

Previous work:

  1. https://issues.redhat.com/browse/ODC-5080
  2. https://issues.redhat.com/browse/ODC-5449

Epic Goal

  • Facilitate the transition for OLM and its content to PSA enforcing the `restricted` security profile
  • Use the label synch'er to enforce the required security profile
  • Current content should work out-of-the-box as is
  • Upgrades should not be blocked

Why is this important?

  • PSA helps secure the cluster by enforcing certain security restrictions that the pod must meet to be scheduled
  • 4.12 will enforce the `restricted` profile, which will affect the deployment of operators in `openshift-*` namespaces 

Scenarios

  1. Admin installs operator in an `openshift-*`namespace that is not managed by the label syncher -> label should be applied
  2. Admin installs operator in an `openshift-*` namespace that has a label asking the label syncher to not reconcile it -> nothing changes

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Done only downstream
  • Transition documentation written and reviewed

Dependencies (internal and external)

  1. label syncher (still searching for the link)

Open questions::

  1. Is this only for openshift-* namespaces?

Resources

Stakeholders

  • Daniel S...?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an admin, I would like openshift-* namespaces with an operator to be labeled with security.openshift.io/scc.podSecurityLabelSync=true to ensure the continual functioning of operators without manual intervention. The label should only be applied to openshift-* namespaces with an operator (the presence of a ClusterServiceVersion resource) IF the label is not already present. This automation will help smooth functioning of the cluster and avoid frivolous operational events.

Context: As part of the PSA migration period, OpenShift will ship with the "label sync'er" - a controller that will automatically adjust PSA security profiles in response to the workloads present in the namespace. We can assume that not all operators (produced by Red Hat, the community or ISVs) will have successfully migrated their deployments in response to upstream PSA changes. The label sync'er will sync, by default, any namespace not prefixed with "openshift-"; namespaces with that prefix require an explicit label (security.openshift.io/scc.podSecurityLabelSync=true) to be synced.

A/C:
 - OLM operator has been modified (downstream only) to label any unlabelled "openshift-" namespace in which a CSV has been created
 - If a labeled namespace containing at least one non-copied csv becomes unlabelled, it should be relabelled 
 - The implementation should be done in a way to eliminate or minimize subsequent downstream sync work (it is ok to make slight architectural changes to the OLM operator in the upstream to enable this)
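
A rough sketch of the first A/C item, not OLM's actual code (the helper and its wiring are hypothetical):

package olm

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const syncLabel = "security.openshift.io/scc.podSecurityLabelSync"

// ensureSyncLabel labels an "openshift-" namespace that contains a CSV and is
// not yet labelled, leaving any existing value (true or false) untouched.
func ensureSyncLabel(ctx context.Context, client kubernetes.Interface, ns *corev1.Namespace, hasCSV bool) error {
	if !strings.HasPrefix(ns.Name, "openshift-") || !hasCSV {
		return nil
	}
	if _, present := ns.Labels[syncLabel]; present {
		return nil
	}
	updated := ns.DeepCopy()
	if updated.Labels == nil {
		updated.Labels = map[string]string{}
	}
	updated.Labels[syncLabel] = "true"
	_, err := client.CoreV1().Namespaces().Update(ctx, updated, metav1.UpdateOptions{})
	return err
}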

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an SRE, I want the hypershift operator to expose a metric when the hosted control plane is ready.

This should allow SRE to tune (or silence) alerts occurring while the hosted control plane is spinning up. 
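
Not the actual HyperShift implementation; a minimal sketch of the kind of gauge this could be, assuming the standard Prometheus client (the metric name is hypothetical):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// hostedControlPlaneReady is a hypothetical gauge: 1 once the hosted control
// plane reports ready, 0 while it is still spinning up. SRE could use it to
// inhibit or silence alerts during that window.
var hostedControlPlaneReady = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "hypershift_hosted_control_plane_ready",
	Help: "1 when the hosted control plane is ready, 0 otherwise.",
}, []string{"namespace", "name"})

func init() {
	prometheus.MustRegister(hostedControlPlaneReady)
}

func recordReady(namespace, name string, ready bool) {
	v := 0.0
	if ready {
		v = 1.0
	}
	hostedControlPlaneReady.WithLabelValues(namespace, name).Set(v)
}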

 

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.

This epic tracks network tooling improvements for 4.12

A new framework and process should be developed to make sharing network tools with devs, support and customers convenient. We are going to add some tools for OVN troubleshooting before OVN-K goes default, some tools that we got from customer cases, and some more to help analyze and debug collected logs, based on the stable must-gather/sosreport format we get now thanks to the 4.11 Epic.

Our estimation for this Epic is 1 engineer * 2 Sprints

WHY:
This epic is important to help improve the time it takes our customers and our team to understand an issue within the cluster.
A focus of this epic is to develop tools to quickly allow debugging of a problematic cluster. This is crucial for the engineering team to help us scale. We want to provide a tool to our customers to help lower the cognitive burden to get at a root cause of an issue.

 

Alert if any of the OVN controllers is disconnected from the southbound database for a period of time, using the metric ovn_controller_southbound_database_connected.

The metric updates every 2 minutes so please be mindful of this when creating the alert.

If the controller is disconnected for 10 minutes, fire an alert.

DoD: Merged to CNO and tested by QE
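
For reference only (the exact rule will live in CNO): with the metric refreshing every 2 minutes, an expression along the lines of ovn_controller_southbound_database_connected == 0 held for 10 minutes covers roughly five consecutive samples before firing, which keeps the alert from flapping on a single missed update.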

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Come up with a consistent way to detect node down on OCP and hypershift. The current mechanism for OCP (probe port 9) does not work for hypershift, meaning hypershift node-down detection will be longer (~40 secs). We should aim to have a common mechanism for both. As well, we should consider alternatives to probing port 9, perhaps BFD or other detection.
  • Get clarification on node down detection times. Some customers have (apparently) asked for detection on the order of 100ms; the recommendation is to use multiple Egress IPs, so this may not be a hard requirement. Need clarification from PM/Customers.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add a SOCKS proxy to cluster-network-operator so EgressIP can use gRPC to reach worker nodes.
 
With the introduction of gRPC as the means for determining the state of a given egress node, hypershift should
be able to leverage the SOCKS proxy and become able to know the state of each egress node.
 
References relevant to this work:
1281-network-proxy
https://coreos.slack.com/archives/C01C8502FMM/p1658427627751939
https://github.com/openshift/hypershift/pull/1131/commits/28546dc587dc028dc8bded715847346ff99d65ea

This Epic is here to track the rebase we need to do when kube 1.25 is GA https://www.kubernetes.dev/resources/release/

Keeping this in mind can help us plan our time better. At the time of writing, GA is planned for August 23.

https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit --> this is the link for rebase help

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Placeholder epic to track spontaneous tasks which do not deserve their own epic.

AC:

We have a connectDirectlyToCloudAPIs flag in the konnectivity SOCKS5 proxy to dial directly to cloud providers without going through konnectivity.

This introduces another path for exceptions: https://github.com/openshift/hypershift/pull/1722

We should consolidate both by keeping connectDirectlyToCloudAPIs until there's a reason not to.

 

Once the HostedCluster and NodePool are paused using the PausedUntil statement, the awsprivatelink controller will continue reconciling.

 

How to test this:

  • Deploy a private cluster
  • Pause it once deployed
  • Delete the AWSEndPointService and the Service from the HCP namespace
  • Wait for a reconciliation; the result is that they should not be recreated
  • Unpause it and wait for recreation.
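
Related to the pause behaviour above, a simplified sketch of the check the controller would need to honour; as I understand the API, PausedUntil accepts either "true" or an RFC3339 date, and the helper below is illustrative, not the actual implementation:

package nodepool

import "time"

// reconciliationPaused reports whether spec.pausedUntil is in effect; while it
// is, the awsprivatelink controller should skip reconciling so the deleted
// AWSEndpointService/Service are not recreated.
func reconciliationPaused(pausedUntil *string, now time.Time) bool {
	if pausedUntil == nil {
		return false
	}
	if *pausedUntil == "true" {
		return true
	}
	if t, err := time.Parse(time.RFC3339, *pausedUntil); err == nil {
		return now.Before(t)
	}
	// unparsable values are treated as not paused in this sketch
	return false
}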

AWS has a hard limit of 100 OIDC providers globally. 
Currently each HostedCluster created by e2e creates its own OIDC provider, which results in hitting the quota limit frequently and causing the tests to fail as a result.

 
DOD:
Only a single OIDC provider should be created and shared between all e2e HostedClusters. 

DoD:

At the moment, if the input etcd KMS encryption (key and role) is invalid, we fail without surfacing the problem.

We should check that both key and role are compatible/operational for a given cluster and otherwise fail with a condition.

Changes made in METAL-1 open up opportunities to improve our handling of images by cleaning up redundant code that generates extra work for the user and extra load for the cluster.

We only need to run the image cache DaemonSet if there is a QCOW URL to be mirrored (effectively this means a cluster installed with 4.9 or earlier). We can stop deploying it for new clusters installed with 4.10 or later.

Currently, the image-customization-controller relies on the image cache running on every master to provide the shared hostpath volume containing the ISO and initramfs. The first step is to replace this with a regular volume and an init container in the i-c-c pod that extracts the images from machine-os-images. We can use the copy-metal -image-build flag (instead of -all used in the shared volume) to provide only the required images.

Once i-c-c has its own volume, we can switch the image extraction in the metal3 Pod's init container to use the -pxe flag instead of -all.

The machine-os-images init container for the image cache (not the metal3 Pod) can be removed. The whole image cache deployment is now optional and need only be started if provisioningOSDownloadURL is set (and in fact should be deleted if it is not).

Epic Goal

  • To improve the reliability of disk cleaning before installation and to provide the user with sufficient warning regarding the consequences of the cleaning

Why is this important?

  • Insufficient cleaning can lead to installation failure
  • Insufficient warning can lead to complaints of unexpected data loss

Scenarios

  1.  

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:
When running assisted-installer on a machine where there is more than one volume group per physical volume, only the first volume group will be cleaned up. This leads to problems later and will lead to errors such as:

Failed - failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- pvremove /dev/sda -y -ff], Error exit status 5, LastOutput "Can't open /dev/sda exclusively. Mounted filesystem? 

How reproducible:

Set up a VM with more than one volume group per physical volume. As an example, look at the following sample from a customer cluster.

List block devices
/usr/bin/lsblk -o NAME,MAJ:MIN,SIZE,TYPE,FSTYPE,KNAME,MODEL,UUID,WWN,HCTL,VENDOR,STATE,TRAN,PKNAME
NAME              MAJ:MIN   SIZE TYPE FSTYPE      KNAME MODEL            UUID                                   WWN                HCTL       VENDOR   STATE   TRAN PKNAME
loop0               7:0   125.9G loop xfs         loop0                  c080b47b-2291-495c-8cc0-2009ebc39839                                                       
loop1               7:1   885.5M loop squashfs    loop1                                                                                                             
sda                 8:0   894.3G disk             sda   INTEL SSDSC2KG96                                        0x55cd2e415235b2db 1:0:0:0    ATA      running sas  
|-sda1              8:1     250M part             sda1                                                          0x55cd2e415235b2db                                  sda
|-sda2              8:2     750M part ext2        sda2                   3aa73c72-e342-4a07-908c-a8a49767469d   0x55cd2e415235b2db                                  sda
|-sda3              8:3      49G part xfs         sda3                   ffc3ccfe-f150-4361-8ae5-f87b17c13ac2   0x55cd2e415235b2db                                  sda
|-sda4              8:4   394.2G part LVM2_member sda4                   Ua3HOc-Olm4-1rma-q0Ug-PtzI-ZOWg-RJ63uY 0x55cd2e415235b2db                                  sda
`-sda5              8:5     450G part LVM2_member sda5                   W8JqrD-ZvaC-uNK9-Y03D-uarc-Tl4O-wkDdhS 0x55cd2e415235b2db                                  sda
  `-nova-instance 253:0     3.1T lvm  ext4        dm-0                   d15e2de6-2b97-4241-9451-639f7b14594e                                          running      sda5
sdb                 8:16  894.3G disk             sdb   INTEL SSDSC2KG96                                        0x55cd2e415235b31b 1:0:1:0    ATA      running sas  
`-sdb1              8:17  894.3G part LVM2_member sdb1                   6ETObl-EzTd-jLGw-zVNc-lJ5O-QxgH-5wLAqD 0x55cd2e415235b31b                                  sdb
  `-nova-instance 253:0     3.1T lvm  ext4        dm-0                   d15e2de6-2b97-4241-9451-639f7b14594e                                          running      sdb1
sdc                 8:32  894.3G disk             sdc   INTEL SSDSC2KG96                                        0x55cd2e415235b652 1:0:2:0    ATA      running sas  
`-sdc1              8:33  894.3G part LVM2_member sdc1                   pBuktx-XlCg-6Mxs-lddC-qogB-ahXa-Nd9y2p 0x55cd2e415235b652                                  sdc
  `-nova-instance 253:0     3.1T lvm  ext4        dm-0                   d15e2de6-2b97-4241-9451-639f7b14594e                                          running      sdc1
sdd                 8:48  894.3G disk             sdd   INTEL SSDSC2KG96                                        0x55cd2e41521679b7 1:0:3:0    ATA      running sas  
`-sdd1              8:49  894.3G part LVM2_member sdd1                   exVSwU-Pe07-XJ6r-Sfxe-CQcK-tu28-Hxdnqo 0x55cd2e41521679b7                                  sdd
  `-nova-instance 253:0     3.1T lvm  ext4        dm-0                   d15e2de6-2b97-4241-9451-639f7b14594e                                          running      sdd1
sr0                11:0     989M rom  iso9660     sr0   Virtual CDROM0   2022-06-17-18-18-33-00                                    0:0:0:0    AMI      running usb  

Now run the assisted installer and try to install an SNO node on this machine; you will find that the installation fails with a message that indicates that it could not exclusively access /dev/sda.

Actual results:

 The installation will fail with a message that indicates that it could not exclusively access /dev/sda

Expected results:

The installation should proceed and the cluster should start to install.

Suspected Cases
https://issues.redhat.com/browse/AITRIAGE-3809
https://issues.redhat.com/browse/AITRIAGE-3802
https://issues.redhat.com/browse/AITRIAGE-3810

Description of the problem:

Cluster installation fails if the installation disk has LVM on RAID:

Host: test-infra-cluster-3cc862c9-master-0, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- mdadm --stop /dev/md0], Error exit status 1, LastOutput "mdadm: Cannot get exclusive access to /dev/md0:Perhaps a running process, mounted filesystem or active volume group?" 

How reproducible:

100%

Steps to reproduce:

1. Install a cluster while master nodes has disk with LVM on RAID (reproduces using test: https://gitlab.cee.redhat.com/ocp-edge-qe/kni-assisted-installer-auto/-/blob/master/api_tests/test_disk_cleanup.py#L97)

Actual results:

Installation failed

Expected results:

Installation success

Epic Goal

  • Increase the success rate of our CI jobs
  • Improve debuggability / visibility of tests

Why is this important?

  • Failed presubmit jobs (required or optional) can prevent an already tested+approved PR from getting in
  • Failed periodic jobs interfere with our visibility into the stability of features

Epic Goal

Why is this important?

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

This is a clone of issue MULTIARCH-3683. The following is the description of the original issue:

Flags similar to these https://github.com/openshift/hypershift/blob/main/cmd/cluster/powervs/create.go#L57toL61 from the create command are missing in the destroy command, so the infra destroy functionality does not get these flags for a proper destroy of infra with existing resources.

This is a clone of issue MULTIARCH-3708. The following is the description of the original issue:

The following issues need to be taken care of on cluster deletion with resource reuse flags.

  1. Currently it tries to remove the DHCP server on an existing PowerVS instance; we need to reuse the existing one to keep it simple.
  2. When reusing an existing VPC, the load balancer is not getting removed.

Description of problem:

check_pkt_length cannot be offloaded without
1) sFlow offload patches in Open vSwitch
2) Hardware driver support.

Since 1) will not be done anytime soon, we need a workaround for the check_pkt_length issue.

Version-Release number of selected component (if applicable):

4.11/4.12

How reproducible:

Always

Steps to Reproduce:

1. Any flow that has check_pkt_len()
  5-b: Pod -> NodePort Service traffic (Pod Backend - Different Node)
  6-b: Pod -> NodePort Service traffic (Host Backend - Different Node)
  4-b: Pod -> Cluster IP Service traffic (Host Backend - Different Node)
  10-b: Host Pod -> Cluster IP Service traffic (Host Backend - Different Node)
  11-b: Host Pod -> NodePort Service traffic (Pod Backend - Different Node)
  12-b: Host Pod -> NodePort Service traffic (Host Backend - Different Node)   

Actual results:

Poor performance due to upcalls when check_pkt_len() is not supported.

Expected results:

Good performance.

Additional info:

https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=670206692

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Run OpenShift builds that do not execute as the "root" user on the host node.

Why is this important?

  • OpenShift builds require an elevated set of capabilities to build a container image
  • Builds currently run as root to maintain adequate performance
  • Container workloads should run as non-root from the host's perspective. Containers running as root are a known security risk.
  • Builds currently run as root and require a privileged container. See BUILD-225 for removing the privileged container requirement.

Scenarios

  1. Run BuildConfigs in a multi-tenant environment
  2. Run BuildConfigs in a heightened security environment/deployment

Acceptance Criteria

  • Developers can opt into running builds in a cri-o user namespace by providing an environment variable with a specific value.
  • When the correct environment variable is provided, builds run in a cri-o user namespace, and the build pod does not require the "privileged: true" security context.
  • User namespace builds can pass basic test scenarios for the Docker and Source strategy build.
  • Steps to run unprivileged builds are documented.

Dependencies (internal and external)

  1. Buildah supports running inside a non-privileged container
  2. CRI-O allows workloads to opt into running containers in user namespaces.

Previous Work (Optional):

  1. BUILD-225 - remove privileged requirement for builds.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges

Acceptance Criteria

  • Developers can provide an environment variable to indicate the build should not use privileged containers
  • When the correct env var + value is specified, builds run in a user namespace (non-root on the host)

QE Impact

No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.

Docs Impact

We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.

PX Impact

This likely warrants an OpenShift blog post, potentially?

Notes

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We have been running into a number of problems with configure-ovs and nodeip-configuration selecting different interfaces in OVNK deployments. This causes connectivity issues, so we need some way to ensure that everything uses the same interface/IP.

Currently configure-ovs runs before nodeip-configuration, but since nodeip-configuration is the source of truth for IP selection regardless of CNI plugin, I think we need to look at swapping that order. That way configure-ovs could look at what nodeip-configuration chose and not have to implement its own interface selection logic.

I'm targeting this at 4.12 because even though there's probably still time to get it in for 4.11, changing the order of boot services is always a little risky and I'd prefer to do it earlier in the cycle so we have time to tease out any issues that arise. We may need to consider backporting the change though since this has been an issue at least back to 4.10.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Any ERRORs produced by TuneD will result in degraded Tuned profiles. Clean up upstream and NTO/PPC-shipped TuneD profiles and add ways of limiting the ERROR message count.
  • Review the policy of restarting TuneD on errors every resync period.  See: OCPBUGS-11150

Why is this important?

  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/PSAP-908

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

The CU cluster of the Mavenir deployment has cluster-node-tuning-operator in a CrashLoopBackOff state and does not apply the performance profile.

Version-Release number of selected component (if applicable):

4.14rc0 and 4.14rc1

How reproducible:

100%

Steps to Reproduce:

1. Deploy CU cluster with ZTP gitops method
2. Wait for Policies to be complient
3. Check worker nodes and cluster-node-tuning-operator status 

Actual results:

Nodes do not have the performance profile applied.
cluster-node-tuning-operator is crashing with the following in its logs:

E0920 12:16:57.820680       1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(nil), concrete:(*runtime._type)(nil), asserted:(*runtime._type)(0x1e68ec0), missingMethod:""} (interface conversion: interface is nil, not v1.Object)
goroutine 615 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c98c20?, 0xc0006b7a70})
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000d49500?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1c98c20, 0xc0006b7a70})
        /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cluster-node-tuning-operator/pkg/util.ObjectInfo({0x0?, 0x0})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/util/objectinfo.go:10 +0x39
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).machineConfigLabelsMatch(0xc000a23ca0?, 0xc000445620, {0xc0001b38e0, 0x1, 0xc0010bd480?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:374 +0xc7
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).calculateProfile(0xc000607290, {0xc000a40900, 0x33})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:208 +0x2b9
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).syncProfile(0xc000195b00, 0x0?, {0xc000a40900, 0x33})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:664 +0x6fd
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).sync(0xc000195b00, {{0x1f48661, 0x7}, {0xc000000fc0, 0x26}, {0xc000a40900, 0x33}, {0x0, 0x0}})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:371 +0x1571
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor.func1(0xc000195b00, {0x1dd49c0?, 0xc000d49500?})
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:193 +0x1de
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor(0xc000195b00)
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:212 +0x65
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x224ee20, 0xc000c48ab0}, 0x1, 0xc00087ade0)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc0004e6710?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004e67d0?, 0x91af86?, 0xc000ace0c0?)
        /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).run
        /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:1407 +0x1ba5
panic: interface conversion: interface is nil, not v1.Object [recovered]
        panic: interface conversion: interface is nil, not v1.Object

Expected results:

cluster-node-tuning-operator is functional, performance profiles applied to worker nodes

Additional info:

There is no issue on a DU node of the same deployment coming from the same repository; the DU node is configured as requested and cluster-node-tuning-operator is functioning correctly.

must gather from rc0: https://drive.google.com/file/d/1DlzrjQiKTVnQKXdcRIijBkEKjAGsOFn1/view?usp=sharing
must gather from rc1: https://drive.google.com/file/d/1qSqQtIunQe5e1hDVDYwa90L9MpEjEA4j/view?usp=sharing

performance profile: https://gitlab.cee.redhat.com/agurenko/mavenir-ztp/-/blob/airtel-4.14/policygentemplates/group-cu-mno-ranGen.yaml
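
For reference, the panic in the stack trace above comes from a type assertion on a nil interface in pkg/util.ObjectInfo (objectinfo.go:10). The sketch below is a minimal, hypothetical Go illustration of a defensive variant using the two-value assertion; it is not the actual NTO code, and the real fix also has to address why a nil object reaches this code path.

```
// Hypothetical sketch only: a defensive version of a helper like
// util.ObjectInfo that does not panic when handed a nil interface.
package util

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ObjectInfo returns a short namespace/name description of obj, or a
// placeholder when obj is nil or does not implement metav1.Object.
func ObjectInfo(obj interface{}) string {
	meta, ok := obj.(metav1.Object)
	if !ok {
		// A plain obj.(metav1.Object) assertion here would panic with
		// "interface conversion: interface is nil, not v1.Object".
		return "<nil or non-metav1 object>"
	}
	return fmt.Sprintf("%s/%s", meta.GetNamespace(), meta.GetName())
}
```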

Goal
Provide an indication that advanced features are used

Problem

Today, customers and RH don't have the information on the actual usage of advanced features.

Why is this important?

  1. Better focus upsell efforts
  2. Compliance information for customers that are not aware they are not using the right subscription

 

Prioritized Scenarios

In Scope
1. Add a boolean variable to our telemetry to mark whether the customer is using advanced features (PV encryption, encryption with KMS, external mode); a minimal sketch follows below.
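
A minimal sketch of what this could look like, assuming the signal is exposed as a Prometheus gauge that the telemetry pipeline scrapes. The metric name and the detection logic below are placeholders, not the actual ODF implementation.

```
// Sketch: expose a 0/1 gauge indicating advanced-feature usage.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// 1 if any advanced feature (PV encryption, KMS encryption, external mode)
// is in use on the cluster, 0 otherwise.
var advancedFeatureInUse = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "odf_advanced_feature_usage", // placeholder metric name
	Help: "Set to 1 when any ODF advanced feature is in use, 0 otherwise.",
})

// detectAdvancedFeatures is a stand-in for real checks of the cluster
// configuration (PV encryption, KMS-backed encryption, external mode).
func detectAdvancedFeatures() bool {
	return false // placeholder
}

func main() {
	if detectAdvancedFeatures() {
		advancedFeatureInUse.Set(1)
	} else {
		advancedFeatureInUse.Set(0)
	}
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```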

Not in Scope

Integrate with subscription watch - will be done by the subscription watch team with our help.

Customers

All

Customer Facing Story
As a compliance manager, I should be able to easily see whether all my clusters are using the right number of subscriptions.

What does success look like?

A clear indication in subscription watch for ODF usage (either essential or advanced). 

1. Proposed title of this feature request

  • Request to add a bool variable into telemetry which indicates the usage of any of the advanced features, such as PV encryption, KMS encryption, or external mode.

2. What is the nature and description of the request?

  • Today, customers and RH don't have information on the actual usage of advanced features. This feature will give RH a better indication of how many customers use the advanced features and help focus upsell efforts.

3. Why does the customer need this? (List the business requirements here)

  • As a compliance manager, I should be able to easily see whether all my clusters are using the right number of subscriptions.

4. List any affected packages or components.

  • Telemetry

_____________________

Link to main epic: https://issues.redhat.com/browse/RHSTOR-3173

 

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

A console.openshift.io/use-i18n value of "false" in the v1alpha API is converted to "" in the v1 API, which is not a valid value for the enum type declared in the code.

Version-Release number of selected component (if applicable):

 4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always

Steps to Reproduce:

1. Load a dynamic plugin with v1alpha API console.openshift.io/use-i18n set to 'false'
2. In the v1 API, loadType is set to the empty string ({"spec":{"i18n":{"loadType":""}}}), which is not among the valid values defined here: https://github.com/jhadvig/api/blob/22d69793277ffeb618d642724515f249262959a5/console/v1/types_console_plugin.go#L46
https://github.com/openshift/api/pull/1186/files# 

Actual results:

{"spec":{"i18n":{"loadType":""}}}

Expected results:

{"spec":{"i18n":{"loadType":"Lazy"}}}

Additional info:

 

Description of problem:

We got feedback from the support team that it is confusing to see a switch in the Notifications column for an alerting rule that has no alerts associated with it, since the user cannot silence the alerting rule.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. oc apply -f https://gist.githubusercontent.com/vikram-raj/727629797eb9d9bfcfa2721cae2ade86/raw/7c2305e14115a1a4f4f88ebb74cdad32cbec4132/Alerting%2520rule%2520without%2520alert 
2. navigate to the Developer perspective Observe -> Alerts
3. Try to silence the VersionAlert alerting rule; nothing will happen.

Actual results:

Silencing the alerting rule using the switch does nothing.

Expected results:

No switch to silence the alerting rule should be visible if no alerts are associated with it.

Additional info:

 

ovnkube-trace: ofproto/trace fails for IPv6

[akaris@linux go-controller (fix-ovnkube-trace-ipv6)]$ oc exec -ti ovn-trace-two -n ovn-tests-two -- ovnkube-trace -src-namespace ovn-tests-two -src ovn-trace-two -dst-ip 2404:6800:4003:c06::69 -tcp
I1021 12:16:56.478752    3356 ovs.go:90] Maximum command line arguments set to: 191102
ovn-trace from pod to IP indicates success from ovn-trace-two to 2404:6800:4003:c06::69
F1021 12:16:57.075803    3356 ovnkube-trace.go:601] ovs-appctl ofproto/trace pod to IP error command terminated with exit code 2 stdOut: 
 stdErr: Bad openflow flow syntax: in_port=73af56a18042ab9, tcp, dl_src=0a:58:17:2b:b6:42, dl_dst=0a:58:69:bd:ba:d8, nw_src=fd01:0:0:5::13, nw_dst=2404:6800:4003:c06::69, nw_ttl=64, tcp_dst=80, tcp_src=12345: bad value for nw_src (fd01:0:0:5::13: invalid IP address)
ovs-appctl: ovs-vswitchd: server returned an error
command terminated with exit code 1
[akaris@linux go-controller (fix-ovnkube-trace-ipv6)]$ oc exec -ti ovn-trace-two -n ovn-tests-two -- ovnkube-trace -src-namespace ovn-tests-two -src ovn-trace-two -dst-namespace ovn-tests -dst ovn-trace -udp
I1021 12:17:26.695325    3386 ovs.go:90] Maximum command line arguments set to: 191102
ovn-trace source pod to destination pod indicates success from ovn-trace-two to ovn-trace
ovn-trace destination pod to source pod indicates success from ovn-trace to ovn-trace-two
F1021 12:17:27.708822    3386 ovnkube-trace.go:601] ovs-appctl ofproto/trace source pod to destination pod error command terminated with exit code 2 stdOut: 
 stdErr: Bad openflow flow syntax: in_port=73af56a18042ab9, udp, dl_src=0a:58:17:2b:b6:42, dl_dst=0a:58:69:bd:ba:d8, nw_src=fd01:0:0:5::13, nw_dst=fd01:0:0:5::14, nw_ttl=64, udp_dst=80, udp_src=12345: bad value for nw_src (fd01:0:0:5::13: invalid IP address)
ovs-appctl: ovs-vswitchd: server returned an error
command terminated with exit code 1
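
A minimal Go sketch of the direction of the fix implied by these errors: when ovnkube-trace builds the ofproto/trace match, IPv6 addresses need the ipv6_src/ipv6_dst fields (and the tcp6/udp6 protocol keyword) instead of nw_src/nw_dst, which only accept IPv4. The function below is illustrative, not the actual ovnkube-trace code.

```
// Sketch: pick IPv4 or IPv6 OpenFlow match fields based on the source address.
package main

import (
	"fmt"
	"net"
)

// tcpTraceMatch renders the L3/L4 part of an ofproto/trace match for a TCP flow.
func tcpTraceMatch(src, dst string, dstPort, srcPort int) string {
	if net.ParseIP(src).To4() == nil { // IPv6 (or unparseable) source
		return fmt.Sprintf("tcp6, ipv6_src=%s, ipv6_dst=%s, tcp_dst=%d, tcp_src=%d", src, dst, dstPort, srcPort)
	}
	return fmt.Sprintf("tcp, nw_src=%s, nw_dst=%s, tcp_dst=%d, tcp_src=%d", src, dst, dstPort, srcPort)
}

func main() {
	fmt.Println(tcpTraceMatch("fd01:0:0:5::13", "2404:6800:4003:c06::69", 80, 12345))
	fmt.Println(tcpTraceMatch("10.244.0.5", "10.96.0.1", 80, 12345))
}
```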

This is a clone of issue OCPBUGS-4913. The following is the description of the original issue:

Description of problem:

Currently the Terraform code waits for 45 seconds, but anecdotal data suggest we should actually wait for 3 minutes in order to avoid "failures" due to occasional slow boots of a new VM in PowerVS.

Version-Release number of selected component (if applicable):

 

How reproducible:

often enough

Steps to Reproduce:

1. run IPI installer against PowerVS
2. look for "empty tuple" in the error message when it fails to reach `bootstrap-complete`
3.

Actual results:

 

Expected results:

VMs to always have IP address assigned by DHCP after a certain wait

Additional info:

The change has already been merged into master/4.13, but 4.12 also needs this for planned PowerVS IPI GA on the z-stream.

Not all of the errors reported by the assisted API (and shown in the wait-for bootstrap complete output) actually require user action.

Some appear when the agents first register but resolve themselves relatively quickly in the natural course of events.

Some, like the availability of NTP, don't block the installation from proceeding at all.

We need to think about the best ways of exposing this information to the user.

Create a script that gathers debug information from a host running the agent ISO and exports it in a standard format so that we can ask customers to provide it for debugging when something has gone wrong (and also use it in CI).

For now, it is fine to require the user to ssh into the host to run the script. The script should already be in place inside the agent ISO.

The output should probably be a compressed tar file. That file could be saved locally, or potentially piped to stdout so that a user only has to run a command like: ssh node0 -c agent-gather >node0.tgz. A minimal sketch of such a tool appears after the list below.

Things we need to collect:

  • systemctl status and journal for each of the systemd services created by the agent installer (ideally this should be determined programmatically so we can't forget to add any)
  • network information: ifconfig; ip -j -p addr
  • Data supplied by the agent installer in /etc/assisted/*
  • /etc/containers/registries.conf
  • /etc/assisted-service/node0 (if it exists)
  • /usr/local/share/assisted-service/*.env
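
A minimal sketch of such a gather tool, written here in Go (a shell script would work equally well). The files and globs below come from the list above; the systemd unit name, command list, and binary name are assumptions.

```
// agent-gather: hypothetical sketch. Streams a gzipped tar to stdout so it
// can be used as "ssh node0 agent-gather > node0.tgz".
package main

import (
	"archive/tar"
	"compress/gzip"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"time"
)

// addEntry writes one named blob into the tar stream.
func addEntry(tw *tar.Writer, name string, data []byte) error {
	hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(data)), ModTime: time.Now()}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err := tw.Write(data)
	return err
}

func main() {
	gz := gzip.NewWriter(os.Stdout)
	defer gz.Close()
	tw := tar.NewWriter(gz)
	defer tw.Close()

	// Command output to capture. The systemd unit list is illustrative;
	// as noted above it should really be discovered programmatically.
	commands := map[string][]string{
		"systemctl-status.txt":         {"systemctl", "status", "--no-pager"},
		"journal-assisted-service.txt": {"journalctl", "-u", "assisted-service", "--no-pager"},
		"ifconfig.txt":                 {"ifconfig"},
		"ip-addr.json":                 {"ip", "-j", "-p", "addr"},
	}
	for name, argv := range commands {
		out, _ := exec.Command(argv[0], argv[1:]...).CombinedOutput() // keep going on errors
		_ = addEntry(tw, name, out)
	}

	// Files and globs from the list above.
	globs := []string{
		"/etc/assisted/*",
		"/etc/containers/registries.conf",
		"/etc/assisted-service/node0",
		"/usr/local/share/assisted-service/*.env",
	}
	for _, g := range globs {
		matches, _ := filepath.Glob(g)
		for _, path := range matches {
			data, err := os.ReadFile(path)
			if err != nil {
				continue // skip unreadable or missing files
			}
			_ = addEntry(tw, strings.ReplaceAll(strings.TrimPrefix(path, "/"), "/", "_"), data)
		}
	}
}
```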

This is a clone of issue OCPBUGS-3706. The following is the description of the original issue:

Description of problem:

While running ./openshift-install agent wait-for install-complete --dir billi --log-level debug on a real bare-metal dual-stack compact cluster installation, it errors out with ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused, but the installation is still progressing.

DEBUG Uploaded logs for host openshift-master-1 cluster d8b0979d-3d69-4e65-874a-d1f7da79e19e 
DEBUG Host: openshift-master-1, reached installation stage Rebooting 
DEBUG Host: openshift-master-1, reached installation stage Configuring 
DEBUG Host: openshift-master-2, reached installation stage Configuring 
DEBUG Host: openshift-master-2, reached installation stage Joined 
DEBUG Host: openshift-master-1, reached installation stage Joined 
DEBUG Host: openshift-master-0, reached installation stage Waiting for bootkube 
DEBUG Host openshift-master-1: updated status from installing-in-progress to installed (Done) 
DEBUG Host: openshift-master-1, reached installation stage Done 
DEBUG Host openshift-master-2: updated status from installing-in-progress to installed (Done) 
DEBUG Host: openshift-master-2, reached installation stage Done 
DEBUG Host: openshift-master-0, reached installation stage Waiting for controller: waiting for controller pod ready event 
ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused 
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR 				The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR 				https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 

Version-Release number of selected component (if applicable):

4.12.0-rc.0

How reproducible:

100%

Steps to Reproduce:

1. ./openshift-install agent create image --dir billi --log-level debug 
2. mount resulting iso image and reboot nodes via iLO
3. /openshift-install agent wait-for install-complete --dir billi --log-level debug 

Actual results:

 ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused 

The cluster installation is not complete and needs more time to finish.

Expected results:

waits until the cluster installation completes

Additional info:

The cluster installation eventually completes fine if you keep waiting after the error.

Attaching install-config.yaml and agent-config.yaml

This is a clone of issue OCPBUGS-3621. The following is the description of the original issue:

Description of problem:

EUS-to-EUS upgrade (4.10.38 -> 4.11.13 -> 4.12.0-rc.0): after the control-plane nodes were upgraded to 4.12 successfully, the worker pool was unpaused to get the worker nodes updated. But the worker nodes failed to update, leaving the worker pool degraded:
```
# ./oc get node
NAME                                                   STATUS                     ROLES    AGE     VERSION
jliu410-6hmkz-master-0.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-1.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-2.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal   Ready,SchedulingDisabled   worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-b-9hnb8.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-c-bdv4f.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
...
# ./oc get co machine-config
machine-config   4.12.0-rc.0   True        False         True       3h41m   Failed to resync 4.12.0-rc.0 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)]
...
# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b81233204496767f2fe32fbb6cb088e1   True      False      False      3              3                   3                     0                      4h10m
worker   rendered-worker-a2caae543a144d94c17a27e56038d4c4   False     True       True       3              0                   0                     1                      4h10m
...
# ./oc describe mcp worker
Message:                   Reason:                    Status:                True    Type:                  Degraded    Last Transition Time:  2022-11-14T07:19:42Z    Message:               Node jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal is reporting: "Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1\ntime=\"2022-11-14T07:42:47Z\" level=fatal msg=\"unknown flag: --no-tags\"\n"    Reason:                1 nodes are reporting degraded status on sync    Status:                True    Type:                  NodeDegraded
...
# ./oc logs machine-config-daemon-mg2zn
E1114 08:11:27.115577  192836 writer.go:200] Marking Degraded due to: Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1
time="2022-11-14T08:11:25Z" level=fatal msg="unknown flag: --no-tags"
```

Version-Release number of selected component (if applicable):

4.12.0-rc.0

How reproducible:

 

Steps to Reproduce:

1. EUS upgrade with path 4.10.38-> 4.11.13-> 4.12.0-rc.0 with paused worker pool 
2. After the master pool upgrade succeeds, unpause the worker pool 
3.

Actual results:

Worker pool upgrade failed

Expected results:

Worker pool upgrade succeeds

Additional info:

 

This is a clone of issue OCPBUGS-3405. The following is the description of the original issue:

In case it should be used for publishing artifacts in CI jobs.

Look into whether the following things are leaked:

  • pull secret
  • ssh key
  • potentially values in journal logs

Currently the assisted installer doesn't verify that etcd is OK before rebooting the bootstrap node, because wait_for_ceo in bootkube does nothing.

In 4.13 (and backported to 4.12) the etcd team added a status that we can check in the assisted installer in order to decide whether it is safe to reboot the bootstrap node. We should check it before running the shutdown command; a sketch of such a check is shown below.
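
A hypothetical sketch of that check, assuming the new status is surfaced as a condition on the etcd ClusterOperator. The condition name "EtcdSafeToReboot", the kubeconfig path, and the polling interval are placeholders; the real field added by the etcd team should be used instead.

```
// Sketch: block the bootstrap shutdown until etcd reports it is safe.
package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForEtcdSafeToReboot(ctx context.Context, client configclient.Interface) error {
	for {
		co, err := client.ConfigV1().ClusterOperators().Get(ctx, "etcd", metav1.GetOptions{})
		if err == nil {
			for _, cond := range co.Status.Conditions {
				// Placeholder condition name; use the status the etcd team added.
				if cond.Type == "EtcdSafeToReboot" && cond.Status == configv1.ConditionTrue {
					return nil
				}
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubeconfig") // path is an assumption
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)
	if err := waitForEtcdSafeToReboot(context.Background(), client); err != nil {
		panic(err)
	}
	fmt.Println("etcd reports it is safe to reboot the bootstrap node")
}
```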

Eran Cohen Rom Freiman 

This is a clone of issue OCPBUGS-2633. The following is the description of the original issue:

Description of problem:

There are different versions and channels for the operator, but they may use the same 'latest' tag. When mirroring them as `additionalImages`, we got the error below:

[root@ip-172-31-249-209 jian]# oc-mirror --config mirror.yaml file:///root/jian/test/
...
...
sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1 file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
info: Mirroring completed in 22.48s (125.8MB/s)
error: one or more errors occurred while uploading images

Version-Release number of selected component (if applicable):

[root@ip-172-31-249-209 jian]# oc-mirror version
Client Version: version.Info{Major:"0", Minor:"1", GitVersion:"v0.1.0", GitCommit:"6ead1890b7a21b6586b9d8253b6daf963717d6c3", GitTreeState:"clean", BuildDate:"2022-08-25T05:27:39Z", GoVersion:"go1.17.12", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1. use the below config:
[cloud-user@preserve-olm-env2 mirror-tmp]$ cat mirror.yaml
apiVersion: mirror.openshift.io/v1alpha1
kind: ImageSetConfiguration
# archiveSize: 4
mirror:
  additionalImages:
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:46a62d73aeebfb72ccc1743fc296b74bf2d1f80ec9ff9771e655b8aa9874c933
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:9e549c09edc1793bef26f2513e72e589ce8f63a73e1f60051e8a0ae3d278f394
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:c16891ee9afeb3fcc61af8b2802e56605fff86a505e62c64717c43ed116fd65e
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:5c37bd168645f3d162cb530c08f4c9610919d4dada2f22108a24ecdea4911d60
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:89a6abbf10908e9805d8946ad78b98a13a865cefd185d622df02a8f31900c4c1
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:de5b339478e8e1fc3bfd6d0b6784d91f0d3fbe0a133354be9e9d65f3d7906c2d
    - name: brew.registry.redhat.io/rh-osbs/openshift-ose-cluster-kube-descheduler-operator-bundle@sha256:fdf774c4365bde48d575913d63ef3db00c9b4dda5c89204029b0840e6dc410b1
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:d90aecc425e1b2e0732d0a90bc84eb49eb1139e4d4fd8385070d00081c80b71c
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:15cc75164335fa178c80db4212d11e4a793f53d2b110c03514ce4c79a3717ca0
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:9e66db3a282ee442e71246787eb24c218286eeade7bce4d1149b72288d3878ad
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:546b14c1f3fb02b1a41ca9675ac57033f2b01988b8c65ef3605bcc7d2645be60
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:12d7061012fd823b57d7af866a06bb0b1e6c69ec8d45c934e238aebe3d4b68a5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:41025e3e3b72f94a3290532bdd6cabace7323c3086a9ce434774162b4b1dd601
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:672b4bee759f8115e5538a44c37c415b362fc24b02b0117fd4bdcc129c53e0a1
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:92542b22911fbd141fadc53c9737ddc5e630726b9b53c477f4dfe71b9767961f
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:1feb7073dec9341cadcc892df39ae45c427647fb034cf09dce1b7aa120bbb459
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:7ca05f93351959c0be07ec3af84ffe6bb5e1acea524df210b83dd0945372d432
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:c0fe8830f8fdcbe8e6d69b90f106d11086c67248fa484a013d410266327a4aed
    - name: brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06
    - name: brew.registry.redhat.io/openshift4/ose-descheduler@sha256:b386d0e1c9e12e9a3a07aa101257c6735075b8345a2530d60cf96ff970d3d21a


2. Run the 
$ oc-mirror --config mirror.yaml file:///root/jian/test/  

Actual results:

error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:f6b6a15c4477615ff202e73d77fc339977aeeca714b9667196509d53e2d2e4f5 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists
error: unable to push manifest to file://brew.registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator:latest: symlink sha256:6a1de43c60d021921973e81c702e163a49300254dc3b612fd62ed2753efe4f06 /root/jian/test/oc-mirror-workspace/src/v2/openshift4/ose-cluster-kube-descheduler-operator/manifests/latest.download: file exists

Expected results:

No error

Additional info:

 

Description of problem:

OpenShift Console does not filter the SecretList when displaying the ServiceAccount details page

When reviewing the details page of an OpenShift ServiceAccount, at the bottom of the page there is a SecretList that is intended to display all of the relevant Secrets attached to the ServiceAccount.

In OpenShift 4.8.X, this SecretList only displayed the relevant Secrets. In OpenShift 4.9+ the SecretList now displays all Secrets within the entire Namespace.

Version-Release number of selected component (if applicable):

4.8.57 < Most recent release without issue
4.9.0 < First release with issue 
4.10.46 < Issue is still present

How reproducible:

Every time

Steps to Reproduce:

1. Deploy a cluster with OpenShift 4.8.57 
      (or replace the OpenShift Console image with `sha256:9dd115a91a4261311c44489011decda81584e1d32982533bf69acf3f53e17540` )
2. Access the ServiceAccounts Page ( User Management -> ServiceAccounts)
3. Click a ServiceAccount to display the Details page
4. Scroll down and review the Secrets section
5. Repeat steps with an OpenShift 4.9 release 
   (or check using image `sha256:fc07081f337a51f1ab957205e096f68e1ceb6a5b57536ea6fc7fbcea0aaaece0` )

Actual results:

All Secrets in the Namespace are displayed

Expected results:

Only Secrets associated with the ServiceAccount are displayed

Additional info:

Lightly reviewing the code, the following links might be a good start:
- https://github.com/openshift/console/blob/master/frontend/public/components/secret.jsx#L126
- https://github.com/openshift/console/blob/master/frontend/public/components/service-account.jsx#L151:L151
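
The console itself is React/JSX (see the links above), so the fix is frontend code. The Go sketch below only illustrates the expected filtering behaviour, listing just the Secrets the ServiceAccount references via .secrets and .imagePullSecrets; the kubeconfig path and the namespace/name used in main are placeholders.

```
// Sketch: list only Secrets referenced by a given ServiceAccount.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// secretsForServiceAccount returns the Secrets referenced by the
// ServiceAccount via .secrets and .imagePullSecrets.
func secretsForServiceAccount(ctx context.Context, cs kubernetes.Interface, ns, name string) ([]corev1.Secret, error) {
	sa, err := cs.CoreV1().ServiceAccounts(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	wanted := map[string]bool{}
	for _, ref := range sa.Secrets {
		wanted[ref.Name] = true
	}
	for _, ref := range sa.ImagePullSecrets {
		wanted[ref.Name] = true
	}
	all, err := cs.CoreV1().Secrets(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var out []corev1.Secret
	for _, s := range all.Items {
		if wanted[s.Name] {
			out = append(out, s)
		}
	}
	return out, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	secrets, err := secretsForServiceAccount(context.Background(), cs, "default", "default")
	if err != nil {
		panic(err)
	}
	for _, s := range secrets {
		fmt.Println(s.Name)
	}
}
```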

Description of problem:

documentationBaseURL still points to 4.11

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-31-101631

How reproducible:

Always

Steps to Reproduce:

1.Check documentationBaseURL on 4.12 cluster: 
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.11/

2.
3.

Actual results:

1.documentationBaseURL is still pointing to 4.11

Expected results:

1.documentationBaseURL should point to 4.12

Additional info:

 

Description of problem:

Changes were introduced in 4.12 which allow the designated router (DR) IP to be something other than the .3 address in the per-node pod subnet. However, downstream ICNIv1 code was not updated. Therefore the DR may have an IP other than .3, while an ICNIv1 pod has a default route pointing to .3. This results in no egress traffic working from the pod. A sketch of the required direction for the ICNIv1 code follows the flow dump below.

 

Example pod in ICNIv1 namespace on 4.12:
[root@pod2 /]# ip route
default via 10.244.0.3 dev eth0
10.96.0.0/16 via 10.244.0.1 dev eth0
10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.5
10.244.0.0/16 via 10.244.0.1 dev eth0
^route pointing to .3

 

But the real DR IP is claimed as .4:
k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.244.0.4
.4 set in ARP flows:
cookie=0x0, duration=25.803s, table=0, n_packets=0, n_bytes=0, priority=100,arp,in_port=ext,arp_tpa=10.244.0.4,arp_op=1 actions=move:NXM_OF_ETH_SRC[]>NXM_OF_ETH_DST[],mod_dl_src:0a:58:0a:f4:00:03,load:0x2>NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]>NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]>NXM_OF_ARP_TPA[],load:0xa580af40003->NXM_NX_ARP_SHA[],load:0xaf40004->NXM_OF_ARP_SPA[],IN_PORT,resubmit(,1)
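
A minimal sketch of the direction described above: instead of hard-coding the .3 address in the per-node pod subnet as the default route, read the DR gateway IP from the node's k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip annotation. The fallback behaviour and helper name are assumptions, not the actual ICNIv1 code.

```
// Sketch: prefer the DR gateway annotation over the legacy .3 address.
package main

import (
	"fmt"
	"net"
)

const drGatewayAnnotation = "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip"

// drGatewayIP returns the DR gateway for a node: the annotation if present,
// otherwise the historical .3 address of the node's (IPv4) pod subnet.
func drGatewayIP(nodeAnnotations map[string]string, podSubnet *net.IPNet) net.IP {
	if v, ok := nodeAnnotations[drGatewayAnnotation]; ok {
		if ip := net.ParseIP(v); ip != nil {
			return ip
		}
	}
	ip4 := podSubnet.IP.To4()
	gw := make(net.IP, 4)
	copy(gw, ip4)
	gw[3] += 3 // legacy assumption: .3 in the per-node pod subnet
	return gw
}

func main() {
	_, subnet, _ := net.ParseCIDR("10.244.0.0/24")
	ann := map[string]string{drGatewayAnnotation: "10.244.0.4"}
	fmt.Println(drGatewayIP(ann, subnet)) // 10.244.0.4
	fmt.Println(drGatewayIP(nil, subnet)) // 10.244.0.3
}
```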

Description of problem:

Currently openshift-installer and the ARO installer have diverged code bases. In an effort from the ARO team to reduce/remove this divergence, we are patching openshift-installer.

ARO uses a newer version of the Azure SDK. We need to backport this change to previous versions of openshift-installer

Version-Release number of selected component (if applicable):

See affected versions

How reproducible:

N/A

Steps to Reproduce:

N/A

Actual results:

N/A

Expected results:

N/A

Additional info:

 

This is a clone of issue OCPBUGS-10846. The following is the description of the original issue:

Description of problem

CI is flaky because the TestClientTLS test fails.

Version-Release number of selected component (if applicable)

I have seen these failures in 4.13 and 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 16.07% of runs (20.93% of failures) across 56 total runs and 13 jobs (76.79% failed) in 185ms

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestClientTLS&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

=== RUN   TestAll/parallel/TestClientTLS
=== PAUSE TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [313 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:24 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:56:25 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=104beed63d6a19782a5559400bd972b6; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [8 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown (628):
        { [2 bytes data]
        * OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        Healthcheck requested
        200

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [802 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
        { [1097 bytes data]
        * TLSv1.3 (IN), TLS app data, [no content] (0):
        { [1 bytes data]
        < HTTP/1.1 200 OK
        < x-request-port: 8080
        < date: Wed, 22 Mar 2023 18:57:00 GMT
        < content-length: 22
        < content-type: text/plain; charset=utf-8
        < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None
        < cache-control: private
        <
        { [22 bytes data]

        * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact

=== CONT  TestAll/parallel/TestClientTLS
        stdout:
        000

        stderr:
        * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache
        * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/
        * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache
        *   Trying 172.30.53.236...
        * TCP_NODELAY set
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed

        * ALPN, offering h2
        * ALPN, offering http/1.1
        * successfully set certificate verify locations:
        *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
          CApath: none
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Client hello (1):
        } [512 bytes data]
        * TLSv1.3 (IN), TLS handshake, Server hello (2):
        { [122 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
        { [10 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Request CERT (13):
        { [82 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Certificate (11):
        { [1763 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, CERT verify (15):
        { [264 bytes data]
        * TLSv1.3 (IN), TLS handshake, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS handshake, Finished (20):
        { [36 bytes data]
        * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Certificate (11):
        } [799 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, CERT verify (15):
        } [264 bytes data]
        * TLSv1.3 (OUT), TLS handshake, [no content] (0):
        } [1 bytes data]
        * TLSv1.3 (OUT), TLS handshake, Finished (20):
        } [36 bytes data]
        * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
        * ALPN, server did not agree to a protocol
        * Server certificate:
        *  subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        *  start date: Mar 22 18:55:46 2023 GMT
        *  expire date: Mar 21 18:55:47 2025 GMT
        *  issuer: CN=ingress-operator@1679509964
        *  SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
        } [5 bytes data]
        * TLSv1.3 (OUT), TLS app data, [no content] (0):
        } [1 bytes data]
        > GET / HTTP/1.1
        > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com
        > User-Agent: curl/7.61.1
        > Accept: */*
        >
        { [5 bytes data]
        * TLSv1.3 (IN), TLS alert, [no content] (0):
        { [1 bytes data]
        * TLSv1.3 (IN), TLS alert, unknown CA (560):
        { [2 bytes data]
        * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

        * Closing connection 0
        curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0

=== CONT  TestAll/parallel/TestClientTLS
--- FAIL: TestAll (1538.53s)
    --- FAIL: TestAll/parallel (0.00s)
        --- FAIL: TestAll/parallel/TestClientTLS (123.10s)

Expected results

CI passes, or it fails on a different test.

Additional info

I saw that TestClientTLS failed on the test case with no client certificate and ClientCertificatePolicy set to "Required". My best guess is that the test is racy and is hitting a terminating router pod. The test uses waitForDeploymentComplete to wait until all new pods are available, but perhaps waitForDeploymentComplete should also wait until all old pods are terminated.
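If that is the cause, one way to harden the test would be to extend the wait. Below is a minimal sketch using client-go, assuming the new ReplicaSet's pod-template-hash is known; the helper name and polling intervals are hypothetical and this is not the ingress operator's actual test code.

package e2e

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForOldPodsTerminated polls until every pod selected by the deployment
// carries the current pod-template-hash, i.e. no terminating router pod from
// an older ReplicaSet can still receive test traffic.
func waitForOldPodsTerminated(ctx context.Context, cs kubernetes.Interface, deploy *appsv1.Deployment, currentHash string) error {
	selector := metav1.FormatLabelSelector(deploy.Spec.Selector)
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		pods, err := cs.CoreV1().Pods(deploy.Namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		for _, p := range pods.Items {
			if p.Labels["pod-template-hash"] != currentHash {
				fmt.Printf("old pod %s still present (phase %s)\n", p.Name, p.Status.Phase)
				return false, nil
			}
		}
		return true, nil
	})
}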

This is a clone of issue OCPBUGS-3253. The following is the description of the original issue:

It is very easy to accidentally use the traditional openshift-install wait-for <x>-complete commands instead of the equivalent openshift-install agent wait-for <x>-complete command. This will work in some stages of the install, but show much less information or fail altogether in other stages of the install.
If we can detect from the asset store that this was an agent-based install, we should issue a warning if the user uses the old command.
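A rough sketch of what such a warning could look like, assuming logrus (which the installer uses); the marker file checked here is purely illustrative and not the real detection logic:

package waitfor

import (
	"os"
	"path/filepath"

	"github.com/sirupsen/logrus"
)

// warnIfAgentBasedInstall warns when the asset directory looks like it was
// produced by the agent-based workflow. The file name used as a marker is an
// assumption for illustration only.
func warnIfAgentBasedInstall(assetDir string) {
	if _, err := os.Stat(filepath.Join(assetDir, "agent.x86_64.iso")); err == nil {
		logrus.Warn("this asset directory appears to come from an agent-based install; " +
			"use 'openshift-install agent wait-for ...' instead of the classic wait-for commands")
	}
}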

This is a clone of issue OCPBUGS-6610. The following is the description of the original issue:

Description of problem:

'Filter by resource' drop-down menu items are in English.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

 

Steps to Reproduce:

1. Navigate to Developer -> Topology -> Filter by resource
2. 'DaemonSet' and 'Deployment' are in English
3.

Actual results:

Content is in English

Expected results:

Content should be in target language.

Additional info:

 

With the CSISnapshot capability disabled, the Azure Disk CSI Driver Operator gets Degraded.

The reason is that cluster-csi-snapshot-controller-operator does not create the VolumeSnapshotClass CRD, which the operator expects to exist.
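A minimal sketch of the kind of guard that avoids this, assuming an apiextensions client; this shows the general pattern, not the operator's actual code:

package operator

import (
	"context"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// volumeSnapshotClassCRDExists reports whether the VolumeSnapshotClass CRD is
// installed; when the CSISnapshot capability is disabled it is absent, which
// should mean "skip snapshot-class syncing", not "go Degraded".
func volumeSnapshotClassCRDExists(ctx context.Context, c apiextclient.Interface) (bool, error) {
	_, err := c.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, "volumesnapshotclasses.snapshot.storage.k8s.io", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}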

Description of problem:

On the storageclass creation page, the dropdown items for "Reclaim policy" and "Volume binding type" are not marked for i18n.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-22-143022

How reproducible:

always

Steps to Reproduce:
1.Go to storageclass creation page, check if dropdown items for "Reclaim policy" and "Volume binding type" support i18n.
2.
3.

Actual results:

1. They are not marked for i18n.

Expected results:

1. Should support i18n.

Additional info:

Description of problem:

For OVNK to become CNCF compliant, we need to support the session affinity timeout feature and enable the e2e tests on the OpenShift side. This bug tracks the effort to get this into OCP 4.12.
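For context, this is the kind of Service the conformance e2e tests exercise: ClientIP session affinity with a non-default timeout, which OVN-Kubernetes needs to honour. Illustrative sketch only; names and values are made up:

package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sessionAffinityService builds a Service with a 60s ClientIP affinity
// timeout, overriding the 10800s default.
func sessionAffinityService() *corev1.Service {
	timeout := int32(60)
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "affinity-svc"},
		Spec: corev1.ServiceSpec{
			Selector:        map[string]string{"app": "affinity-backend"},
			Ports:           []corev1.ServicePort{{Port: 80}},
			SessionAffinity: corev1.ServiceAffinityClientIP,
			SessionAffinityConfig: &corev1.SessionAffinityConfig{
				ClientIP: &corev1.ClientIPConfig{TimeoutSeconds: &timeout},
			},
		},
	}
}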

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

OVS 2.17+ introduced an optimization of "weak references" to substantially speed up database snapshots. In some cases weak references may leak memory; the aforementioned commit fixes that and has been pulled into ovs2.17-62 and later.

This is a clone of issue OCPBUGS-672. The following is the description of the original issue:

Description of problem:

The redhat-operators pod, part of the marketplace, is failing regularly because the startup probe times out connecting to the registry-server container (part of the same pod) within 1 second, which in turn increases CPU/memory usage on the master nodes:

62m         Normal    Scheduled                pod/redhat-operators-zb4j7                         Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m         Normal    AddedInterface           pod/redhat-operators-zb4j7                         Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m         Normal    Pulling                  pod/redhat-operators-zb4j7                         Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m         Normal    Pulled                   pod/redhat-operators-zb4j7                         Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m         Normal    Created                  pod/redhat-operators-zb4j7                         Created container registry-server
62m         Normal    Started                  pod/redhat-operators-zb4j7                         Started container registry-server
62m         Warning   Unhealthy                pod/redhat-operators-zb4j7                         Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m         Normal    Killing                  pod/redhat-operators-zb4j7                         Stopping container registry-server


Increasing the threshold of the probe might fix the problem:
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OSD cluster using 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect redhat-operator pod in openshift-marketplace namespace
3. Observe the resource usage ( CPU and Memory ) of the pod 

Actual results:

Redhat-operator failing leading to increase to CPU and Mem usage on master nodes regularly during the startup

Expected results:

Redhat-operator startup probe succeeding and no spikes in resource on master nodes

Additional info:

Attached cpu, memory and event traces.

 

Description of problem:

A normal user cannot open the debug container for the pods (CrashLoopBackOff) they created, and gets the error message:
pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z

How reproducible:

Always

Steps to Reproduce:

1. Login OCP as a normal user
   eg: flexy-htpasswd-provider
2. Create a project, go to Developer perspective -> +Add page
3. Click "Import from Git", and provide the data below to get a Pod with CrashLoopBackOff state
   Git Repo URL: https://github.com/sclorg/nodejs-ex.git
   Name: nodejs-ex-git
   Run command: star a wktw
4. Navigate to /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to its details page -> Logs Tab
5. Click the link of "Debug container"
6. Check if the Debug container can be opened

Actual results:

6. An error message is shown on the page; the user cannot open the debug container via the UI
   pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Expected results:

6. A normal user can use the debug container without any error message

Additional info:

The debug container can be created successfully for the normal user via the command line:
 $ oc debug <crashloopbackoff pod name> -n <project name>

Clone of https://issues.redhat.com/browse/OCPBUGS-15890

Description of problem:

Openshift Console fails to render Monitoring Dashboard when there is a Proxy expected to be used. Additionally, Websocket connections fail due to not using Proxy.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Connect to a cluster using backplane and use one of IT's proxies
2. Execute "ocm backplane console -b"
3. Attempt to view the monitoring dashboard

Actual results:

Monitoring dashboard fails to load with an EOF error
Terminal is spammed with EOF errors

Expected results:

Monitoring dashboard should be rendered correctly
Terminal should not be spammed with error logs

Additional info:

When we apply changes like this PR, the monitoring dashboard works with the proxy: https://github.com/openshift/console/pull/12877
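The general fix pattern (a sketch, not the console backend's exact change) is to make the reverse-proxy transport honour the proxy environment variables so dashboard and websocket back-end requests go through the corporate proxy:

package proxy

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newProxy returns a reverse proxy whose transport reads HTTP_PROXY,
// HTTPS_PROXY and NO_PROXY from the environment instead of dialing directly.
func newProxy(target *url.URL) *httputil.ReverseProxy {
	p := httputil.NewSingleHostReverseProxy(target)
	p.Transport = &http.Transport{
		Proxy: http.ProxyFromEnvironment,
	}
	return p
}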

Description of problem:
ConfigMaps, Secrets, Deployments, and DeploymentConfigs use the edit form also for creation as of 4.11, but BuildConfigs use the edit form only for editing, not for creation.

Version-Release number of selected component (if applicable):
4.10 and above

How reproducible:
Always

Steps to Reproduce:
1. Switch to dev perspective
2. Navigate to build
3. Click on create

Actual results:
Opens a YAML editor to create a BuildConfig

Expected results:
Should open a form, with a YAML switcher to create a BuildConfig

Additional info:

Description of problem:

The Insights operator gathers each clusteroperator's related objects from the operators.openshift.io group. IngressControllers are now missing because they are a namespaced resource and the "default" name is not provided in the related objects of the ingress clusteroperator.
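The assumed shape of the missing entry (illustrative only): the ingress ClusterOperator's relatedObjects would need to name the namespaced "default" IngressController explicitly for the gatherer to pick it up.

package example

import configv1 "github.com/openshift/api/config/v1"

// ingressControllerRef is the kind of relatedObjects entry the gatherer would
// need in order to collect the default IngressController.
var ingressControllerRef = configv1.ObjectReference{
	Group:     "operator.openshift.io",
	Resource:  "ingresscontrollers",
	Namespace: "openshift-ingress-operator",
	Name:      "default",
}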

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3195. The following is the description of the original issue:

Description of problem:

The service CA controller start func seems to return that error as soon as its context is cancelled (which seems to happen the moment the first signal is received): https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f558246b8025584056/pkg/controller/starter.go#L24

That apparently triggers os.Exit(1) immediately: https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]om/openshift/library-go/pkg/controller/controllercmd/builder.go

The lock release doesn't happen until the periodic renew tick breaks out: https://github.com/openshift/service-ca-operator/blob/42088528ef8a6a4b8c99b0f55824[…]/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go

It seems unlikely that you'd reach the call to le.release() before the call to os.Exit(1) in the other goroutine.
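A minimal sketch of the pattern that avoids the long wait, assuming client-go's leader election package: release the lease on context cancellation and return from the callback instead of exiting first. Durations and names are illustrative, not the operator's actual configuration.

package app

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease runs the controllers under a leader lease and releases the
// lease promptly when ctx is cancelled, so a replacement pod does not have to
// wait for the old lease to expire.
func runWithLease(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true, // release the lock as soon as ctx is cancelled
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {}, // return and let the lease be released instead of os.Exit(1)
		},
	})
}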

Version-Release number of selected component (if applicable):

4.13.0

How reproducible:

~always

Steps to Reproduce:

1. oc delete -n openshift-service-ca pod <service-ca pod>

Actual results:

the old pod logs show:

W1103 09:59:14.370594       1 builder.go:106] graceful termination failed, controllers failed with error: stopped

and when a new pod comes up to replace it, it has to wait for a while before acquiring the leader lock

I1103 16:46:00.166173       1 leaderelection.go:248] attempting to acquire leader lease openshift-service-ca/service-ca-controller-lock...
 .... waiting ....
I1103 16:48:30.004187       1 leaderelection.go:258] successfully acquired lease openshift-service-ca/service-ca-controller-lock

Expected results:

new pod can acquire the leader lease without waiting for the old pod's lease to expire

Additional info:

 

Description of problem:

The cluster-version-operator pod crashloops during the bootstrap process, which might lead to a longer bootstrap, causing the installer to time out and fail.

The cluster-version-operator pod is continuously restarting due to a Go panic. The bootstrap process fails due to the timeout, although it would complete correctly given more time, once the cluster-version-operator pod runs correctly.

$ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8
I0919 10:25:05.790124       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4                                                                                                                    
F0919 10:25:05.791580       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused                                                        
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...})                                                                                                                   
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?})
        /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6})
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
        /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
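A sketch of the defensive pattern (assumed, not the actual CVO fix): retry the early lookup while the local apiserver endpoint is still coming up during bootstrap, rather than calling klog.Fatalf and crashlooping.

package start

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// getFeatureGatesWithRetry keeps retrying a startup lookup (e.g. the
// FeatureGate read) until it succeeds or ctx is cancelled, instead of
// treating a transient "connection refused" as fatal.
func getFeatureGatesWithRetry(ctx context.Context, get func(context.Context) error) error {
	return wait.PollImmediateUntil(5*time.Second, func() (bool, error) {
		if err := get(ctx); err != nil {
			klog.Warningf("featuregate lookup failed, retrying: %v", err)
			return false, nil
		}
		return true, nil
	}, ctx.Done())
}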

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-18-234318

How reproducible:

Most of the time, with any network type and installation type (IPI, UPI and proxy).

Steps to Reproduce:

1. Install OCP 4.12 IPI
   $ openshift-install create cluster
2. Wait until bootstrap is completed

Actual results:

[...]
level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS        AGE 
openshift-cluster-version                          cluster-version-operator-754498df8b-5gll8                    0/1     CrashLoopBackOff   7 (3m21s ago)   24m 
openshift-image-registry                           image-registry-94fd8b75c-djbxb                               0/1     Pending            0               6m44s 
openshift-image-registry                           image-registry-94fd8b75c-ft66c                               0/1     Pending            0               6m44s 
openshift-ingress                                  router-default-64fbb749b4-cmqgw                              0/1     Pending            0               13m   
openshift-ingress                                  router-default-64fbb749b4-mhtqx                              0/1     Pending            0               13m   
openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q       0/1     Pending            0               14m 
openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk       0/1     Pending            0               14m 
openshift-network-diagnostics                      network-check-source-8758bd6fc-vzf5k                         0/1     Pending            0               18m 
openshift-operator-lifecycle-manager               collect-profiles-27726375-hlq89                              0/1     Pending            0               21m 
$ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8
Name:                 cluster-version-operator-754498df8b-5gll8
Namespace:            openshift-cluster-version                                                            
Priority:             2000000000              
Priority Class Name:  system-cluster-critical                                                       
Node:                 ostest-4gtwr-master-1/10.196.0.68
Start Time:           Mon, 19 Sep 2022 10:17:41 +0000                       
Labels:               k8s-app=cluster-version-operator
                      pod-template-hash=754498df8b
Annotations:          openshift.io/scc: hostaccess 
Status:               Running                      
IP:                   10.196.0.68
IPs:                 
  IP:           10.196.0.68
Controlled By:  ReplicaSet/cluster-version-operator-754498df8b
Containers:        
  cluster-version-operator:
    Container ID:  cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27
    Image:         registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                             
    Image ID:      registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
    Port:          <none>                                                                                                                                                                                                                    
    Host Port:     <none>                                                                                                                                                                                                                    
    Args:                                                     
      start                                                                                                                                                                                                                                  
      --release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                          
      --enable-auto-update=false                                                                                                                                                                                                             
      --listen=0.0.0.0:9099                                                  
      --serving-cert-file=/etc/tls/serving-cert/tls.crt
      --serving-key-file=/etc/tls/serving-cert/tls.key                                                                                                                                                                                       
      --v=2             
    State:       Waiting 
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0919 10:33:07.798614       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:33:07.800115       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...})
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?})
  /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6})
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
  /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
  /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
      Exit Code:    255
      Started:      Mon, 19 Sep 2022 10:33:07 +0000
      Finished:     Mon, 19 Sep 2022 10:33:07 +0000
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  127.0.0.1
      NODE_NAME:                 (v1:spec.nodeName)
      CLUSTER_PROFILE:          self-managed-high-availability
    Mounts:
      /etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro)
      /etc/ssl/certs from etc-ssl-certs (ro)
      /etc/tls/service-ca from service-ca (ro)
      /etc/tls/serving-cert from serving-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  etc-ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs
    HostPathType:
  etc-cvo-updatepayloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-version-operator-serving-cert
    Optional:    false
  service-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openshift-service-ca.crt
    Optional:  false
  kube-api-access:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  25m                   default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  21m                   default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
  Normal   Scheduled         19m                   default-scheduler  Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap
  Warning  FailedMount       17m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]:
timed out waiting for the condition
  Warning  FailedMount       17m (x9 over 19m)     kubelet            MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
  Normal   Pulling           15m                   kubelet            Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69"
  Normal   Pulled            15m                   kubelet            Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s
  Normal   Started           14m (x3 over 15m)     kubelet            Started container cluster-version-operator
  Normal   Created           14m (x4 over 15m)     kubelet            Created container cluster-version-operator
  Normal   Pulled            14m (x3 over 15m)     kubelet            Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine
  Warning  BackOff           4m22s (x52 over 15m)  kubelet            Back-off restarting failed container
  
  

Expected results:

No panic?

Additional info:

Seen in most of OCP on OSP QE CI jobs.

Attached [^must-gather-install.tar.gz]

This is a clone of issue OCPBUGS-2992. The following is the description of the original issue:

Description of problem:

The metal3-ironic container image in OKD fails during steps in configure-ironic.sh that look for additional Oslo configuration entries as environment variables to configure the Ironic instance. The mechanism by which it fails in OKD but not OpenShift is that the image for OpenShift happens to have unrelated variables set which match the regex, because it is based on the builder image, but the OKD image is based only on a stream8 image without these unrelated OS_ prefixed variables set.

The metal3 pod created in response to even a provisioningNetwork: Disabled Provisioning object will therefore crashloop indefinitely.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Deploy OKD to a bare metal cluster using the assisted-service, with the OKD ConfigMap applied to podman play kube, as in :https://github.com/openshift/assisted-service/tree/master/deploy/podman#okd-configuration
2. Observe the state of the metal3 pod in the openshift-machine-api namespace.

Actual results:

The metal3-ironic container repeatedly exits with nonzero, with the logs ending here:

++ export IRONIC_URL_HOST=10.1.1.21
++ IRONIC_URL_HOST=10.1.1.21
++ export IRONIC_BASE_URL=https://10.1.1.21:6385
++ IRONIC_BASE_URL=https://10.1.1.21:6385
++ export IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ '[' '!' -z '' ']'
++ '[' -f /etc/ironic/ironic.conf ']'
++ cp /etc/ironic/ironic.conf /etc/ironic/ironic.conf_orig
++ tee /etc/ironic/ironic.extra
# Options set from Environment variables
++ echo '# Options set from Environment variables'
++ env
++ grep '^OS_'
++ tee -a /etc/ironic/ironic.extra

Expected results:

The metal3-ironic container starts and the metal3 pod is reported as ready.

Additional info:

This is the PR that introduced pipefail to the downstream ironic-image, which is not yet accepted in the upstream:
https://github.com/openshift/ironic-image/pull/267/files#diff-ab2b20df06f98d48f232d90f0b7aa464704257224862780635ec45b0ce8a26d4R3

This is the line that's failing:
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/scripts/configure-ironic.sh#L57

This is the image base that OpenShift uses for ironic-image (before rewriting in ci-operator):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.ocp#L9

Here is where the relevant environment variables are set in the builder images for OCP:
https://github.com/openshift/builder/blob/973602e0e576d7eccef4fc5810ba511405cd3064/hack/lib/build/version.sh#L87

Here is the final FROM line in the OKD image build (just stream8):
https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.okd#L9

This results in the following differences between the two images:
$ podman run --rm -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:519ac06836d972047f311de5e57914cf842716e22a1d916a771f02499e0f235c -c 'env | grep ^OS_'
OS_GIT_MINOR=11
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=97530a7
OS_GIT_VERSION=4.11.0-202210061001.p0.g97530a7.assembly.stream-97530a7
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
$ podman run --rm -it --entrypoint bash quay.io/openshift/okd-content@sha256:6b8401f8d84c4838cf0e7c598b126fdd920b6391c07c9409b1f2f17be6d6d5cb -c 'env | grep ^OS_'

Here is what the OS_ prefixed variables should be used for:
https://github.com/metal3-io/ironic-image/blob/807a120b4ce5e1675a79ebf3ee0bb817cfb1f010/README.md?plain=1#L36
https://opendev.org/openstack/oslo.config/src/commit/84478d83f87e9993625044de5cd8b4a18dfcaf5d/oslo_config/sources/_environment.py

It's worth noting that ironic.extra is not consumed anywhere, and is simply being used here to save off the variables that Oslo _might_ be consuming (it won't consume the variables that are present in the OCP builder image, though they do get caught by this regex).

With pipefail set, grep returns non-zero when it fails to find an environment variable that matches the regex, as in the case of the OKD ironic-image builds.

 

Description of problem:

vSphere privilege checking failing when providing user-defined folder and/or resource pool

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-30-054458

How reproducible:

consistently

Steps to Reproduce:

1. Provide pre-existing folder and/or resource pool to the install-config
2. Perform an installation with an account with read only privileges on the datacenter and cluster
3. The installer will fail with missing privileges for the cluster and datacenter. When a pre-existing folder and resource pool are defined, the account can hold read-only privileges on the datacenter and cluster.

Actual results:

Installer reports missing privileges

Expected results:

Installer should succeed

Additional info:

 

Probably for: 1h or some such; I don't think it needs to go off immediately. But in-cluster admins and folks monitoring submitted Insights should have a way to figure out that the cluster is trying and failing to submit Telemetry. The alert should not fire when Telemetry submission has been explicitly disabled.

There is an existing alert for PrometheusRemoteWriteBehind in a similar space, but as of today, the Telemetry submissions are happening via telemeter-client, due to concerns about the load of submitting via remote-write.

Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf

# Generated by KNI resolv prepender NM dispatcher script
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx

OpenShift 4.8.29

# Generated by KNI resolv prepender NM dispatcher script
search sepia.lab.iad2.dc.paas.redhat.com
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx
~~~

ENV: OpenStack IAD2, IPI installation. Connected cluster.

Version-Release number of selected component (if applicable):
OCP v4.9.31

How reproducible:
Always

Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.

Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.

Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.

Additional info:

  • The customer was trying to deploy secure Kerberos on the CoreOS nodes and it failed when the IPA-client install command failed. This is when the customer noticed this unusual behavior. They did not manually update the resolv.conf file to include the $search domain. They instead added the script below to /etc/NetworkManager/dispatcher.d/ and restarted NetworkManager on the node to fix this issue, and installation was successful.
    ~~~
    #!/bin/bash

set -eo pipefail

DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \
  | grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \
  | tr '\n' ' ')"

>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"

sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp

mv /etc/resolv.tmp /etc/resolv.conf
~~~

  • The customer confirms that the $search domain has been missing since the cluster was freshly installed. They even confirmed with a fresh new cluster that it was missing.
  • The fresh cluster was initially installed at v4.9.31 but was updated afterward to v4.9.43 (the latest z-stream) to see if the updates fixed anything but it didn't make any difference. The cluster is currently running v4.9.43 and shows the $search domain missing in the /etc/resolv.conf file on all nodes.

Description of problem:

During restart, egress firewall ACLs will be deleted and re-created from scratch, meaning that egress firewall rules won't be applied for some time during the restart.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem

Since the resource type option has been moved to an advanced option in both the Deploy Image and Import from Git flows, there is confusion for some existing customers who are using the feature.

The UI no longer provides transparency of the type of resource which is being created.

Version-Release number of selected component (if applicable)

How reproducible

Steps to Reproduce

1.
2.
3.

Actual results

Expected results

Remove Resource type from Advanced Options and place it back where it was previously. Resource type selection is now a dropdown, so we will put it in its previous spot, but it will use a different component than in 4.11.


Description of problem:
The console crashes when it is used with a user settings ConfigMap that was created with a 4.13+ console. That version saves "null" for the key "console.pinnedResources", which didn't happen before, and the old console version cannot handle this well.

Version-Release number of selected component (if applicable):
4.8-4.12

How reproducible:
Always, but only in the edge case that someone used a newer console first and then downgraded.

This can happen only by manually applying the user settings ConfigMap or when downgrading a cluster.

Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinnedResources" to "null" (with quotes, as all ConfigMap values need to be strings)

Or run this patch command:

oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'

Open console...

Actual results:
Console crashes

Expected results:
Console should not crash

This is a clone of issue OCPBUGS-4367. The following is the description of the original issue:

Description of problem:

The calls to log.Debugf() from image/baseiso.go and image/oc.go are not being output when the "image create" command is run.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Every time

Steps to Reproduce:

1. Run ../bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug

Actual results:

No debug log messages from log.Debugf() calls in pkg/asset/agent/image/oc.go

Expected results:

Debug log messages are output

Additional info:

Note from Zane: We should probably also use the real global logger instead of [creating a new one](https://github.com/openshift/installer/blob/2698cbb0ec7e96433a958ab6b864786c0c503c0b/pkg/asset/agent/image/baseiso.go#L109) with the default config that ignores the --log-level flag and prints weird `[0001]` stuff in the output for some reason. (The NMStateConfig manifests logging suffers from the same problem.)
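A minimal sketch of the suggested logging approach, assuming logrus (which the installer uses): write through the shared package-level logger so --log-level is honoured; the function name and message are hypothetical.

package image

import "github.com/sirupsen/logrus"

// reportBaseISO logs through logrus's standard logger, whose level the
// installer sets from --log-level, instead of a freshly constructed
// logrus.New() with default settings.
func reportBaseISO(path string) {
	logrus.Debugf("using cached base ISO %s", path)
}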

 

 

 

This is a clone of issue OCPBUGS-2824. The following is the description of the original issue:

Description of problem:

When users adjust their browser to a small size, the deployment details page on the Topology page overrides the drop-down list component, which prevents the user from using the drop-down list functionality. All content in the dropdown list would be covered.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-24-103753

How reproducible:

Always

Steps to Reproduce:

1. Login OCP, go to developer perspective -> Topology page
2. Click and open one resource (eg: deployment), make sure the resource sidebar has been opened
3. Adjust the browser windows to small size
4. Check if the dropdown list component has been covered 

Actual results:

All the dropdown list components will be covered by the deployment details page (see attachment for more details).

Expected results:

The dropdown list component should be displayed on top, and the function should work even if the window is small.

Additional info:

 

This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:

At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.

But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information when the procedure is still in progress making it to use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small" blocking the procedure itself or causing network disruption until the procedure is over.

However, this was not a problem for the CI job as it doesn't use the migration procedure as documented, for the sake of saving the limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress, hiding this issue.

This was probably not detected in QE as well for the same reason as CI.

This is a clone of issue OCPBUGS-7973. The following is the description of the original issue:

Description of problem:

After the private cluster is destroyed, the cluster's DNS records are left behind.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-02-26-022418 
4.13.0-0.nightly-2023-02-26-081527 

How reproducible:

always

Steps to Reproduce:

1. Create a private cluster
2. Destroy the cluster
3. Check the DNS records
$ibmcloud dns zones | grep private-ibmcloud.qe.devcluster.openshift.com (base_domain)
3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b   private-ibmcloud.qe.devcluster.openshift.com     PENDING_NETWORK_ADD
$zone_id=3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b
$ibmcloud dns resource-records $zone_id
CNAME:520c532f-ca61-40eb-a04e-1a2569c14a0b   api-int.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com   CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:751cf3ce-06fc-4daf-8a44-bf1a8540dc60   api.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com       CNAME   60    10a7a6c7-jp-tok.lb.appdomain.cloud   
CNAME:dea469e3-01cd-462f-85e3-0c1e6423b107   *.apps.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com    CNAME   120   395ec2b3-jp-tok.lb.appdomain.cloud 

Actual results:

The DNS records of the cluster were left behind.

Expected results:

All DNS records created by the installer are deleted after the cluster is destroyed.

Additional info:

This blocks creating private clusters later, because the maximum limit of 5 wildcard records is easily reached (QE account limitation).
Checking the *ingress-operator.log of the failed cluster shows the error: "createOrUpdateDNSRecord: failed to create the dns record: Reached the maximum limit of 5 wildcard records."

Description of problem:

In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 is upgraded to 4.9, the packageserver stays stuck at v0.17.0, which is the version in OCP 4.8, and v0.18.3, which is the version in OCP 4.9, does not roll out.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OCP 4.8

2. Upgrade to OCP 4.9 

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-31-160214   True        True          50m     Working towards 4.9.47: 619 of 738 done (83% complete)

$ oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.47    True        False         4m26s   Cluster version is 4.9.47
 

Actual results:

Check the packageserver CSV. It is at v0.17.0:

$ oc get csv
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.17.0               Succeeded

Expected results:

packageserver CSV is at 0.18.3 

Additional info:

packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12

packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8

Description of problem:

When creating many egress firewalls with many rules (up to the allowed limit of 8000 rules per object), it takes a long time to create a single object.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Use the 8000-rule egressfirewall https://gist.github.com/npinaeva/7af14f6aa3234694c6d0e082e6c58d14 and find the ovnkube-master or ovnkube-control-plane log line "Creating *v1.EgressFirewall <namespace>/default took: <n>s"
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5182. The following is the description of the original issue:

Description of problem:

Deploy an IPI cluster on Azure cloud, set region to westeurope and VM size to EC96iads_v5 or EC96ias_v5. Installation fails with the error below:

12-15 11:47:03.429  level=error msg=Error: creating Linux Virtual Machine: (Name "jima-15a-m6fzd-bootstrap" / Resource Group "jima-15a-m6fzd-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The VM size 'Standard_EC96iads_v5' is not supported for creation of VMs and Virtual Machine Scale Set with '<NULL>' security type."

Similar as https://bugzilla.redhat.com/show_bug.cgi?id=2055247.

From the Azure portal, we can see that both VM sizes EC96iads_v5 and EC96ias_v5 are of the confidential compute type.

We might also need to do a similar process for them as was done in bug 2055247.
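A hypothetical validation sketch of what such a check could look like: reject confidential-compute instance types up front instead of failing later in Terraform. The size list is illustrative, not complete, and this is not the installer's actual validation code.

package validation

import (
	"fmt"
	"strings"
)

// confidentialComputeSizes lists VM sizes that require a security type and
// therefore cannot be created the way the installer currently creates VMs.
var confidentialComputeSizes = []string{"Standard_EC96iads_v5", "Standard_EC96ias_v5"}

func validateInstanceType(instanceType string) error {
	for _, size := range confidentialComputeSizes {
		if strings.EqualFold(instanceType, size) {
			return fmt.Errorf("instance type %s is a confidential-compute size and is not supported without a security type", instanceType)
		}
	}
	return nil
}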

 

Version-Release number of selected component (if applicable):

4.12 nightly build

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml file, set region as westeurope, vm size as EC96iads_v5 or EC96ias_v5
2. Deploy IPI azure cluster
3.

Actual results:

Install failed with error in description

Expected results:

The installer should exit during validation and show the expected error message.

Additional info:

 

 

Description of problem:

- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered successful.
- This fix is temporary. After some time it appears again and Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.

Version-Release number of selected component (if applicable):

 

How reproducible:

N/A, I wasn't able to reproduce the issue.

Steps to Reproduce:

 

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios; this is known and can be mitigated by simply restarting the CEO.

A similar BZ, 2093819, with stuck controllers was fixed slightly differently in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

 

This fix was contributed by lance5890, thanks a lot!

 

This is a clone of issue OCPBUGS-5018. The following is the description of the original issue:

Description of problem:

When upgrading an IPI AWS cluster which included MachineSet and BYOH Windows nodes from 4.11 to 4.12, the upgrade hung while trying to upgrade the machine-api component:

$ oc get clusterversion                                                                              
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS                                      
version   4.11.0-0.nightly-2022-12-16-190443   True        True          117m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api

$ oc get co                                                                                                                                                                                                                              
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE                                                                                                                                   
authentication                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h47m   
baremetal                                  4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
cloud-controller-manager                   4.12.0-rc.5                          True        False         False      5h3m    
cloud-credential                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h4m                                                                                                                                              
cluster-autoscaler                         4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
config-operator                            4.12.0-rc.5                          True        False         False      5h1m    
console                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h43m   
csi-snapshot-controller                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      
dns                                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m   
etcd                                       4.12.0-rc.5                          True        False         False      4h58m         
image-registry                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h54m         
ingress                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m   
insights                                   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
kube-apiserver                             4.12.0-rc.5                          True        False         False      4h50m         
kube-controller-manager                    4.12.0-rc.5                          True        False         False      4h57m                                                                                                                                             
kube-scheduler                             4.12.0-rc.5                          True        False         False      4h57m
kube-storage-version-migrator              4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
machine-api                                4.11.0-0.nightly-2022-12-16-190443   True        True          False      4h56m   Progressing towards operator: 4.12.0-rc.5
machine-approver                           4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h
machine-config                             4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
marketplace                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m
monitoring                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m                                                                                                                                             
network                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h3m          
node-tuning                                4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h59m                                                                                                                                             
openshift-apiserver                        4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h53m         
openshift-controller-manager               4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h56m                                                                                                                                             
openshift-samples                          4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
operator-lifecycle-manager                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-12-16-190443   True        False         False      4h55m                                                                                                                                             
service-ca                                 4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h                                                                                                                                                
storage                                    4.11.0-0.nightly-2022-12-16-190443   True        False         False      5h      

When digging a little deeper into the exact component hanging, we observed that it was the machine-api-termination-handler running on the Windows worker Machines that was stuck in ImagePullBackOff state:

$ oc get pods -n openshift-machine-api                                                                                                                                                                                                   
NAME                                           READY   STATUS             RESTARTS   AGE                                                                                                                                                                               
cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h5m                                                                                                                                                                              
cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h5m                                          
machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          94m                                                                                                                                                                               
machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          97m                                           
machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
machine-api-termination-handler-gj4pf          1/1     Running            0          4h57m                                                                                                                                                                             
machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          94m                                                                                                                                                                               
machine-api-termination-handler-l95x2          1/1     Running            0          4h54m                                                                                                                                                                             
machine-api-termination-handler-p6sw6          1/1     Running            0          4h57m   

$ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api                                                                                                                                                        
Name:                 machine-api-termination-handler-fcfq2
Namespace:            openshift-machine-api
Priority:             2000001000
Priority Class Name:  system-node-critical
.....................................................................
Events:
  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               94m                    default-scheduler  Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
  Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
  Warning  FailedCreatePodSandBox  94m                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
  Normal   Pulling                 93m (x4 over 94m)      kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
  Warning  Failed                  93m (x4 over 94m)      kubelet            Error: ErrImagePull
  Normal   BackOff                 4m39s (x393 over 94m)  kubelet            Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"


$ oc get pods -n openshift-machine-api -o wide
NAME                                           READY   STATUS             RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-6ff66b6655-kpgp9   2/2     Running            0          5h8m    10.130.0.10    ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
cluster-baremetal-operator-6dbcd6f76b-d9dwd    2/2     Running            0          5h8m    10.130.0.8     ip-10-0-180-35.us-east-2.compute.internal    <none>           <none>
machine-api-controllers-cdb8d979b-79xlh        7/7     Running            0          97m     10.128.0.144   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
machine-api-operator-86bf4f6d79-g2vwm          2/2     Running            0          100m    10.128.0.143   ip-10-0-138-246.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-fcfq2          0/1     ImagePullBackOff   0          97m     10.129.0.7     ip-10-0-145-114.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-gj4pf          1/1     Running            0          5h      10.0.223.37    ip-10-0-223-37.us-east-2.compute.internal    <none>           <none>
machine-api-termination-handler-krwdg          0/1     ImagePullBackOff   0          97m     10.128.0.4     ip-10-0-143-111.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-l95x2          1/1     Running            0          4h57m   10.0.172.211   ip-10-0-172-211.us-east-2.compute.internal   <none>           <none>
machine-api-termination-handler-p6sw6          1/1     Running            0          5h      10.0.146.227   ip-10-0-146-227.us-east-2.compute.internal   <none>           <none>
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
ip-10-0-143-111.us-east-2.compute.internal   Ready    worker   4h24m   v1.24.0-2566+5157800f2a3bc3   10.0.143.111   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
ip-10-0-145-114.us-east-2.compute.internal   Ready    worker   4h18m   v1.24.0-2566+5157800f2a3bc3   10.0.145.114   <none>        Windows Server 2019 Datacenter                                  10.0.17763.3770                containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-145-114.us-east-2.compute.internal   aws:///us-east-2a/i-0b69d52c625c46a6a   running
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc   Running   m5a.large    us-east-2   us-east-2a   4h37m   ip-10-0-143-111.us-east-2.compute.internal   aws:///us-east-2a/i-05e422c0051707d16   running

This is blocking the whole upgrade process, as the upgrade is not able to move further from this component.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-12-16-190443   True        True          141m    Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc version
Client Version: 4.11.0-0.ci-2022-06-09-065118
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-12-16-190443
Kubernetes Version: v1.25.4+77bec7a

How reproducible:

Always

Steps to Reproduce:

1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
2. Perform the upgrade to 4.12
3. Wait for the upgrade to hang on the machine-api component

Actual results:

The upgrade hangs when upgrading the machine-api component.

Expected results:

The upgrade succeeds

Additional info:


Description of problem:

During an upgrade from 4.12.0 to 4.12.1 a customer has observed crashlooping ovn-master pods with the following error message

$ oc logs -n openshift-ovn-kubernetes ovnkube-master-bx99r -c ovnkube-master --tail=20 -p
:Transaction causes multiple rows in "IGMP_Group" table to have identical values (mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and []) for index on columns "address", "datapath", and "chassis".  First row, with UUID 7e9a18fa-e58c-4547-a7cb-afa934b6cdc9, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and d9755997-e909-4d0c-8770-82a902d69a90.  Second row, with UUID 84da3622-3ac7-41f0-a6b5-536a2d5f9137, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and 578d4dd9-cc02-4bcc-8a9c-08dcc3a94190. UUID:{GoUUID:} Rows:[]}] and errors []: constraint violation: Transaction causes multiple rows in "IGMP_Group" table to have identical values (mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and []) for index on columns "address", "datapath", and "chassis".  First row, with UUID 7e9a18fa-e58c-4547-a7cb-afa934b6cdc9, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and d9755997-e909-4d0c-8770-82a902d69a90.  Second row, with UUID 84da3622-3ac7-41f0-a6b5-536a2d5f9137, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and 578d4dd9-cc02-4bcc-8a9c-08dcc3a94190.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Unknown

Steps to Reproduce:

1. Upgrade from 4.12.0 to 4.12.1
2.
3.

Actual results:

crashlooping ovnkube-master pods

Expected results:

functional ovnkube-master pods

Additional info:

This cluster was upgraded from 4.11 to 4.12.0 then to 4.12.1.
The attached case has a must-gather.

Description of problem:

When providing install-config as

platform:
 baremetal:
  apiVIP: 192.168.122.10
  ingressVIP: 192.168.122.11

agent installer fails with 
bin/openshift-install agent create cluster-manifests
FATAL failed to fetch Agent Manifests: failed to load asset "Install Config": invalid install-config configuration: [Platform.Baremetal.ApiVips: Required value: apiVips must be set for baremetal platform, Platform.Baremetal.IngressVips: Required value: ingressVips must be set for baremetal platform]
 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Git clone latest installer https://github.com/openshift/installer and build it
2. Provide install-config.yaml for baremetal platform with deprecated apiVip and ingressVip set
3. Create agent image with "bin/openshift-install agent create cluster-manifests"

Actual results:

bin/openshift-install agent create cluster-manifests
FATAL failed to fetch Agent Manifests: failed to load asset "Install Config": invalid install-config configuration: [Platform.Baremetal.ApiVips: Required value: apiVips must be set for baremetal platform, Platform.Baremetal.IngressVips: Required value: ingressVips must be set for baremetal platform]

Expected results:

The agent installer should upconvert the deprecated fields instead of erroring: apiVip and ingressVip should be upconverted into apiVips and ingressVips respectively.
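A minimal sketch of the kind of upconversion intended here; the type and field names are hypothetical, not the installer's actual code:

// Hypothetical sketch only: copy deprecated singular VIP fields into the
// plural fields instead of rejecting the install-config.
package sketch

type baremetalPlatform struct {
	DeprecatedAPIVIP     string   // apiVIP
	DeprecatedIngressVIP string   // ingressVIP
	APIVIPs              []string // apiVIPs
	IngressVIPs          []string // ingressVIPs
}

func upconvertVIPs(p *baremetalPlatform) {
	if len(p.APIVIPs) == 0 && p.DeprecatedAPIVIP != "" {
		p.APIVIPs = []string{p.DeprecatedAPIVIP}
	}
	if len(p.IngressVIPs) == 0 && p.DeprecatedIngressVIP != "" {
		p.IngressVIPs = []string{p.DeprecatedIngressVIP}
	}
}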

Additional info:

 

Description of problem:

If using ingresscontroller.spec.routeSelector.matchExpressions or ingresscontroller.spec.namespaceSelector.matchExpressions, the route will not be counted in the new route_metrics_controller_routes_per_shard Prometheus metric.

This is due to the logic only using "matchLabels". The logic needs to be updated to also use "matchExpressions".
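As an illustration only (not the router operator's actual code), the upstream helper metav1.LabelSelectorAsSelector evaluates matchLabels and matchExpressions together, so a selector built this way would count the sharded route from the reproducer:

// Minimal sketch: evaluate a shard's routeSelector with a helper that honours
// both matchLabels and matchExpressions, instead of reading matchLabels only.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// routeSelector as defined on the "sharded" IngressController below.
	routeSelector := &metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{
			{Key: "type", Operator: metav1.LabelSelectorOpIn, Values: []string{"shard"}},
		},
	}

	sel, err := metav1.LabelSelectorAsSelector(routeSelector)
	if err != nil {
		panic(err)
	}

	// Labels on the route-shard Route from the reproducer.
	routeLabels := labels.Set{"type": "shard"}

	// With matchExpressions honoured, the route counts towards the shard metric.
	fmt.Println("route matches shard selector:", sel.Matches(routeLabels)) // true
}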

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Create IC with matchExpressions:
oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: reproducer.$domain
  routeSelector:
    matchExpressions:
    - key: type
      operator: In
      values:
      - shard
  replicas: 1
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
EOF

2. Create the route:
oc apply -f - <<EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: route-shard
  labels:
    type: shard
spec:
  to:
    kind: Service
    name: router-shard
EOF

3. Check route_metrics_controller_routes_per_shard{name="sharded"} in Prometheus; it is 0

Actual results:

route_metrics_controller_routes_per_shard{name="sharded"} has 0 routes

Expected results:

route_metrics_controller_routes_per_shard{name="sharded"} should have 1 route

Additional info:

 

This is a clone of issue OCPBUGS-8691. The following is the description of the original issue:

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands running management side should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource; a sketch follows the operand list below.

aws-ebs-csi-driver-controller
aws-ebs-csi-driver-operator
csi-snapshot-controller
csi-snapshot-webhook
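A minimal sketch, assuming the operands are plain Deployments and that copying from the operator Deployment is acceptable (the real fix may read the HostedControlPlane resource instead):

// Sketch only, not the HyperShift implementation: propagate the operator's
// scheduling opinions onto an operand Deployment in the same namespace.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func copyScheduling(operator, operand *appsv1.Deployment) {
	src := operator.Spec.Template.Spec
	dst := &operand.Spec.Template.Spec

	dst.Affinity = src.Affinity.DeepCopy() // nil-safe deep copy
	dst.Tolerations = append([]corev1.Toleration(nil), src.Tolerations...)
	dst.PriorityClassName = src.PriorityClassName
	dst.NodeSelector = map[string]string{}
	for k, v := range src.NodeSelector {
		dst.NodeSelector[k] = v
	}
}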


Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector

Expected results:

Operands have the same affinity rules and node selector as the operator

Additional info:

 

Description of problem:

Currently we are not gathering Machine objects. We received a nomination for a rule that will use this resource.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The component uses a daemonset, which causes failures during draining: leases are not gracefully released and instead age out as the pods are killed, potentially after losing network access because the daemonset pods are not terminated during the drain.

As pointed out in https://github.com/openshift/origin/pull/27394#discussion_r964002900 


This should be fixed when moving to a deployment and is also tracked here https://issues.redhat.com/browse/BUILD-495 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. 
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2090680](https://bugzilla.redhat.com/show_bug.cgi?id=2090680). The following is the description of the original bug:

Description of problem:

Version-Release number of the following components:
4.11.0-0.nightly-2022-05-25-123329

How reproducible:
Always

Steps to Reproduce:
1. set up a cluster in a restricted network using 4.11.0-0.nightly-2022-05-25-123329
2. mirror 4.11.0-0.nightly-2022-05-25-193227 to private registry
3. upgrade the cluster to 4.11.0-0.nightly-2022-05-25-193227 without --force option
$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd

Actual results:
Wait for 3+ hours, no any upgrade history info in clusterversion, from event log, only can see "Retrieving and verifying payload".

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-05-25-123329 True False 160m Cluster version is 4.11.0-0.nightly-2022-05-25-123329

[root@preserve-jialiu-ansible ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-05-26T03:51:28Z"
    generation: 3
    name: version
    resourceVersion: "62069"
    uid: b5674b4b-7295-4287-904c-94fe1112659b
  spec:
    channel: stable-4.11
    clusterID: 027285eb-b4ea-4127-85b6-031c1af7db72
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd
      version: ""
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-05-26T04:23:06Z"
      message: Done applying 4.11.0-0.nightly-2022-05-25-123329
      status: "True"
      type: Available
    - lastTransitionTime: "2022-05-26T04:21:21Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2022-05-26T04:23:06Z"
      message: Cluster version is 4.11.0-0.nightly-2022-05-25-123329
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.11&id=027285eb-b4ea-4127-85b6-031c1af7db72&version=4.11.0-0.nightly-2022-05-25-123329":
        dial tcp 34.228.45.157:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1
      version: 4.11.0-0.nightly-2022-05-25-123329
    history:
    - completionTime: "2022-05-26T04:23:06Z"
      image: registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1
      startedTime: "2022-05-26T03:51:31Z"
      state: Completed
      verified: false
      version: 4.11.0-0.nightly-2022-05-25-123329
    observedGeneration: 2
    versionHash: jOIXVtM5Y-g=
kind: List
metadata:
  resourceVersion: ""

[root@preserve-jialiu-ansible ~]# oc get event -n openshift-cluster-version
LAST SEEN TYPE REASON OBJECT MESSAGE
3h11m Warning FailedScheduling pod/cluster-version-operator-b4b6c5f9b-p7fjq no nodes available to schedule pods
3h9m Warning FailedScheduling pod/cluster-version-operator-b4b6c5f9b-p7fjq no nodes available to schedule pods
3h4m Normal Scheduled pod/cluster-version-operator-b4b6c5f9b-p7fjq Successfully assigned openshift-cluster-version/cluster-version-operator-b4b6c5f9b-p7fjq to jialiu411a-5nb8n-master-2 by jialiu411a-5nb8n-bootstrap
3h2m Warning FailedMount pod/cluster-version-operator-b4b6c5f9b-p7fjq MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
3h1m Warning FailedMount pod/cluster-version-operator-b4b6c5f9b-p7fjq Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[etc-ssl-certs etc-cvo-updatepayloads serving-cert service-ca kube-api-access]: timed out waiting for the condition
3h1m Normal Pulling pod/cluster-version-operator-b4b6c5f9b-p7fjq Pulling image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h1m Normal Pulled pod/cluster-version-operator-b4b6c5f9b-p7fjq Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1" in 1.384468759s
3h1m Normal Created pod/cluster-version-operator-b4b6c5f9b-p7fjq Created container cluster-version-operator
3h1m Normal Started pod/cluster-version-operator-b4b6c5f9b-p7fjq Started container cluster-version-operator
3h11m Normal SuccessfulCreate replicaset/cluster-version-operator-b4b6c5f9b Created pod: cluster-version-operator-b4b6c5f9b-p7fjq
3h11m Normal ScalingReplicaSet deployment/cluster-version-operator Scaled up replica set cluster-version-operator-b4b6c5f9b to 1
3h12m Normal LeaderElection configmap/version jialiu411a-5nb8n-bootstrap_0a3ff57f-66cf-4f93-bbe0-484effcc4383 became leader
3h12m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m Normal LoadPayload clusterversion/version Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal LeaderElection configmap/version jialiu411a-5nb8n-master-2_83752e0b-1ef4-4c69-814f-8eeb54d50781 became leader
166m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal LoadPayload clusterversion/version Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
77m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"

Expected results:
CVO and `oc adm upgrade` should clearly prompt user what issues happened there, but not pending there for a long time without any info.

Additional info:
Try the same upgrade path against a connected cluster, upgrade is kicked off soon, no such issues.

Description of problem:

In ZTP input, we can put AdditionalNTPSources in order to have assisted-service mix the provided sources with those the nodes receive from DHCP.

AdditionalNTPSources in AgentConfig needs to be generated in InfraEnv in order for it to be applied in the installation
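For illustration, and assuming the documented field names, the expectation is roughly that an agent-config entry like the first snippet results in the generated InfraEnv carrying the same sources:

apiVersion: v1beta1
kind: AgentConfig
metadata:
  name: example-cluster
additionalNTPSources:
- 0.fedora.pool.ntp.org

# expected to be propagated into the generated InfraEnv:
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: example-cluster
spec:
  additionalNTPSources:
  - 0.fedora.pool.ntp.org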

Version-Release number of selected component (if applicable):

4.11 MVP patch 2

How reproducible:

100%

Steps to Reproduce:

1. Create AgentConfig with AdditionalNTPSources like for example "0.fedora.pool.ntp.org"
2. Generate ISO
3. Deploy
4. Check the resulting cluster nodes /etc/chrony.conf

Actual results:

chrony.conf only contains DHCP-provided NTP sources (if not a static network deployment)

Expected results:

/etc/chrony.conf in all the cluster nodes should have at least a server listed:
server 0.fedora.pool.ntp.org iburst

Additional info:

 

Description of problem:

The restore size in the volumesnapshot output is not the same as the PVC request size

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create IBM cluster. 
    Flexy template: aos-4_12/ipi-on-ibmcloud/versioned-installer-  
                    private_cluster-ovn-fips-ci
    Payload: 4.12.0-0.nightly-2022-11-29-131548 
2. Create sc, pvc, dep
3. Create volumesnapshot from default volumesnapshotclass. 
4. Check the volumesnapshot output restore size 

sc_pvc_dep.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
parameters:
  profile: 10iops-tier
provisioner: vpc.block.csi.ibm.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc-csi
  namespace: testropatil
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 26Gi
  storageClassName: mysc
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mydep
  namespace: testropatil
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp-54mtso67
  template:
    metadata:
      labels:
        app: myapp-54mtso67
    spec:
      containers:
      - image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        name: mydep
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: "/mnt/storage"
          name: local
      volumes:
      - name: local
        persistentVolumeClaim:
          claimName: mypvc-csi

vss.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-snapshot-new
  namespace: testropatil
spec:
  source:
    persistentVolumeClaimName: mypvc-csi
  volumeSnapshotClassName: vpc-block-snapshot

rohitpatil@ropatil-mac Downloads % oc get sc
NAME   PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
mysc   vpc.block.csi.ibm.io   Delete          WaitForFirstConsumer   true                   2m37s

rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n testropatil
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-1a014601-8176-4c55-93cf-d408460b9359   26Gi       RWO            mysc           27s

NAME                         READY   STATUS    RESTARTS   AGE
pod/mydep-5477fd946b-w77sw   1/1     Running   0          27s

rohitpatil@ropatil-mac Downloads % oc get volumesnapshot -n testropatil
NAME              READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS        SNAPSHOTCONTENT                                    CREATIONTIME   AGE
my-snapshot-new   true         mypvc-csi                           1Gi           vpc-block-snapshot   snapcontent-a40f3a17-8697-4215-8a2f-77d3d5592c60   29s            32s

Actual results:

The volumesnapshot RESTORESIZE is 1Gi, which is not the same as the PVC request size (26Gi).

Expected results:

The volumesnapshot restore size should be the same as the PVC request size.

Additional info:

Description of problem:

Upgrade to 4.10 is stuck looping in syncEgressFirewall

We see transacting operations with context deadline exceeded.

It looks to be trying to process 2.8 million records in one go.

2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded 

2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127       1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2 

Everything is built into one operation via:
https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243

TransactAndCheck is being called with a 10s timeout and this operation never completes.
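A minimal sketch of the general mitigation (batching), not the actual ovn-kubernetes change: split the ACL UUID list into bounded chunks so no single TransactAndCheck call has to carry millions of mutations:

// Sketch only: split a large UUID list into batches before transacting.
package sketch

func batchUUIDs(uuids []string, size int) [][]string {
	var batches [][]string
	for start := 0; start < len(uuids); start += size {
		end := start + size
		if end > len(uuids) {
			end = len(uuids)
		}
		batches = append(batches, uuids[start:end])
	}
	return batches
}

// Each batch would then be passed to the existing remove-ACL helper so that no
// single OVSDB transaction has to mutate the whole 2.8 million-row set at once.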
 

Version-Release number of selected component (if applicable):

4.10.50

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Upgrade completes

Additional info:

 

This is a clone of issue OCPBUGS-15222. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9435. The following is the description of the original issue:

PRs were previously merged to add SC2S support via AWS SDK here:

However, further updates to add support for SC2S region (us-isob-east-1) and new TC2S region (us-iso-west-1) are still required.

There are still hard-coded references to the old regions in the following locations.
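Purely as an illustration of the kind of table that needs updating; the variable and its contents are hypothetical, not the installer's actual code:

// Hypothetical sketch of a hard-coded region table missing the new regions.
package sketch

var regionPartitions = map[string]string{
	"us-gov-east-1":  "aws-us-gov",
	"us-gov-west-1":  "aws-us-gov",
	"us-iso-east-1":  "aws-iso",   // existing C2S region
	"us-iso-west-1":  "aws-iso",   // new TC2S region that still needs wiring in
	"us-isob-east-1": "aws-iso-b", // SC2S region that still needs wiring in
}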

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

This is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

This is a clone of issue OCPBUGS-5547. The following is the description of the original issue:

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Knative Service and deleting it again with the enabled option "Delete other resources created by console" (only available on 4.13+ with the PR above), the secret "$name-github-webhook-secret" is not deleted.

When the user tries to create the same Knative Service again this fails with an error:

An error occurred
secrets "nodeinfo-github-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5548)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Serverless operator (tested with 1.26.0)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. In the topology select the Knative Service > "Delete Service" (not Delete App)

Actual results:
Deleted resources:

  1. Knative Service (tries it twice!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret

Expected results:
The deletion should also cover this resource:

  1. Delete of the Knative Service should be called just once
  2. Secret $name-github-webhook-secret should be removed as well

Additional info:
When delete the whole application all the resources are deleted correctly (and just once)!

  1. Knative Service (just once!) $name
  2. ImageStream $name
  3. BuildConfig $name
  4. Secret $name-generic-webhook-secret
  5. Secret $name-github-webhook-secret

During a normal installation, there are hundreds of debug logs reading:

bootstrap configmap not found: configmaps "bootstrap" not found

and dozens of the form:

Still waiting for cluster to initialize: ...

with duplicate data.

We should only log when we have some new information to report, not every time we poll.
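A minimal sketch of the intended behaviour (not the installer's actual logging code): remember the last message and skip the log call when nothing changed between polls:

// Sketch only: suppress repeated identical poll messages.
package sketch

import (
	"fmt"
	"log"
)

type dedupLogger struct {
	last string
}

func (d *dedupLogger) Debugf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	if msg == d.last {
		return // same status as the previous poll, nothing new to report
	}
	d.last = msg
	log.Print(msg)
}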

Description of problem:

While upgrading 3519 SNOs using the ZTP process from 4.12.16 to 4.13.0-rc.7, 24 clusters failed to even begin upgrading because the clusterversion object reported:

Preconditions failed for payload loaded version="4.13.0-rc.7" image="quay.io/openshift-release-dev/ocp-release@sha256:aae5131ec824c301c11d0bf11d81b3996a222be8b49ce4716e9d464229a2f92b":
        Precondition "ClusterVersionUpgradeable" failed because of "AsExpected": Cluster
        operator cloud-controller-manager should not be upgraded between minor versions: '
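One way to confirm, on an affected SNO, which condition is blocking the precondition check (commands assume direct access to the spoke cluster):

$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].message}{"\n"}'
$ oc get clusteroperator cloud-controller-manager -o yaml | grep -i -A5 upgradeable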

Version-Release number of selected component (if applicable):

Hub 4.12.14
SNOs upgrading from 4.12.16 to 4.13.0-rc.7
ACM version - 2.8.0-DOWNSTREAM-2023-04-30-18-44-29

How reproducible:

24 out of 3519 or ~0.7%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

# oc --kubeconfig /root/hv-vm/kc/vm00488/kubeconfig get no
NAME      STATUS   ROLES                         AGE     VERSION
vm00488   Ready    control-plane,master,worker   2d20h   v1.25.8+37a9a08
# oc --kubeconfig /root/hv-vm/kc/vm00488/kubeconfig get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.16   True        False         2d      Cluster version is 4.12.16
# oc --kubeconfig /root/hv-vm/kc/vm00488/kubeconfig get clusterversion version -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  creationTimestamp: "2023-05-16T23:18:55Z"
  generation: 6
  name: version
  resourceVersion: "647935"
  uid: c70fe530-450f-453d-9350-c0ac9e533d54
spec:
  channel: candidate-4.13
  clusterID: bc98b215-fd2e-47ba-acb7-42de6c5b4d4e
  desiredUpdate:
    version: 4.13.0-rc.7
  upstream: http://e27-h01-000-r650.rdu2.scalelab.redhat.com:8081/upgrade/upgrade-graph_candidate-4.13
status:
  availableUpdates:
  - channels:
    - candidate-4.13
    - candidate-4.14
    - fast-4.13
    - stable-4.13
    image: quay.io/openshift-release-dev/ocp-release@sha256:74b23ed4bbb593195a721373ed6693687a9b444c97065ce8ac653ba464375711
    url: https://access.redhat.com/errata/RHSA-2023:1326
    version: 4.13.0
  - channels:
    - candidate-4.13
    - candidate-4.14
    image: quay.io/openshift-release-dev/ocp-release@sha256:28e51227251e0b196fc51c62b57d9181d19b104ff63784d4bde7ba59260132fd
    version: 4.13.0-rc.8
  - channels:
    - candidate-4.13
    - candidate-4.14
    image: quay.io/openshift-release-dev/ocp-release@sha256:aae5131ec824c301c11d0bf11d81b3996a222be8b49ce4716e9d464229a2f92b
    version: 4.13.0-rc.7
  - channels:
    - candidate-4.12
    - candidate-4.13
    image: quay.io/openshift-release-dev/ocp-release@sha256:7ca5f8aa44bbc537c5a985a523d87365eab3f6e72abc50b7be4caae741e093f4
    url: https://access.redhat.com/errata/RHBA-2023:2699
    version: 4.12.17
  capabilities:
    enabledCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
  conditions:
  - lastTransitionTime: "2023-05-17T19:01:03Z"
    status: "True"
    type: RetrievedUpdates
  - lastTransitionTime: "2023-05-16T23:18:59Z"
    message: Capabilities match configured spec
    reason: AsExpected
    status: "False"
    type: ImplicitlyEnabledCapabilities
  - lastTransitionTime: "2023-05-18T21:15:01Z"
    message: 'Preconditions failed for payload loaded version="4.13.0-rc.7" image="quay.io/openshift-release-dev/ocp-release@sha256:aae5131ec824c301c11d0bf11d81b3996a222be8b49ce4716e9d464229a2f92b":
      Precondition "ClusterVersionUpgradeable" failed because of "AsExpected": Cluster
      operator cloud-controller-manager should not be upgraded between minor versions: '
    reason: PreconditionChecks
    status: "False"
    type: ReleaseAccepted
  - lastTransitionTime: "2023-05-16T23:52:46Z"
    message: Done applying 4.12.16
    status: "True"
    type: Available
  - lastTransitionTime: "2023-05-18T18:15:53Z"
    status: "False"
    type: Failing
  - lastTransitionTime: "2023-05-17T19:56:23Z"
    message: Cluster version is 4.12.16
    status: "False"
    type: Progressing
  - lastTransitionTime: "2023-05-18T21:08:26Z"
    message: 'Cluster operator cloud-controller-manager should not be upgraded between
      minor versions: '
    reason: AsExpected
    status: "False"
    type: Upgradeable
  desired:
    channels:
    - candidate-4.12
    - candidate-4.13
    - eus-4.12
    - fast-4.12
    - fast-4.13
    - stable-4.12
    image: quay.io/openshift-release-dev/ocp-release@sha256:5339b3c4686010dc42990e0addce5aa4fddd071d6d9504dffe08a4b5059f6f38
    url: https://access.redhat.com/errata/RHSA-2023:2110
    version: 4.12.16
  history:
  - completionTime: "2023-05-17T19:56:23Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:5339b3c4686010dc42990e0addce5aa4fddd071d6d9504dffe08a4b5059f6f38
    startedTime: "2023-05-17T19:03:27Z"
    state: Completed
    verified: true
    version: 4.12.16
  - completionTime: "2023-05-16T23:52:46Z"
    image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/ocp4/openshift4@sha256:0398462b1c54758fe83619022a98bbe65f6deed71663b6665224d3ba36e43f03
    startedTime: "2023-05-16T23:18:59Z"
    state: Completed
    verified: false
    version: 4.12.11
  observedGeneration: 6
  versionHash: Ir-CNe7dSao=

This is a clone of issue OCPBUGS-19675. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13044. The following is the description of the original issue:

Description of problem:

During cluster installations/upgrades with an imageContentSourcePolicy in place but with access to quay.io, the ICSP is not honored to pull the machine-os-content image from a private registry.
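A possible way to confirm that the ICSP was rendered onto the node even though machine-os-content is still pulled from quay.io (the node name is a placeholder):

$ oc debug node/<node> -- chroot /host cat /etc/containers/registries.conf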

Version-Release number of selected component (if applicable):

$ oc logs -n openshift-machine-config-operator ds/machine-config-daemon -c machine-config-daemon|head -1
Found 6 pods, using pod/machine-config-daemon-znknf
I0503 10:53:00.925942    2377 start.go:112] Version: v4.12.0-202304070941.p0.g87fedee.assembly.stream-dirty (87fedee690ae487f8ae044ac416000172c9576a5)

How reproducible:

100% in clusters with ICSP configured BUT with access to quay.io

Steps to Reproduce:

1. Create mirror repo:
$ cat <<EOF > /tmp/isc.yaml                                                    
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: quay.example.com/mirror/oc-mirror-metadata
    skipTLS: true
mirror:
  platform:
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: 4.12.13
    graph: true
EOF
$ oc mirror --dest-skip-tls  --config=/tmp/isc.yaml docker://quay.example.com/mirror/oc-mirror-metadata
<...>
info: Mirroring completed in 2m27.91s (138.6MB/s)
Writing image mapping to oc-mirror-workspace/results-1683104229/mapping.txt
Writing UpdateService manifests to oc-mirror-workspace/results-1683104229
Writing ICSP manifests to oc-mirror-workspace/results-1683104229

2. Confirm machine-os-content digest:
$ oc adm release info 4.12.13 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1660c8086ff85e569e10b3bc9db344e1e1f7530581d742ad98b670a81477b1b"
}
$ oc adm release info 4.12.14 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed68d04d720a83366626a11297a4f3c5761c0b44d02ef66fe4cbcc70a6854563"
}

3. Create 4.12.13 cluster with ICSP at install time:
$ grep imageContentSources -A6 ./install-config.yaml
imageContentSources:
  - mirrors:
    - quay.example.com/mirror/oc-mirror-metadata/openshift/release
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
  - mirrors:
    - quay.example.com/mirror/oc-mirror-metadata/openshift/release-images
    source: quay.io/openshift-release-dev/ocp-release


Actual results:

1. After the installation is completed, no pulls for a166 (4.12.13-x86_64-machine-os-content) are logged in the Quay usage logs whereas e.g. digest 22d2 (4.12.13-x86_64-machine-os-images) are reported to be pulled from the mirror. 

2. After upgrading to 4.12.14 no pulls for ed68 (4.12.14-x86_64-machine-os-content) are logged in the mirror-registry while the image was pulled as part of `oc image extract` in the machine-config-daemon:

[core@master-1 ~]$ sudo less /var/log/pods/openshift-machine-config-operator_machine-config-daemon-7fnjz_e2a3de54-1355-44f9-a516-2f89d6c6ab8f/machine-config-daemon/0.log
2023-05-03T10:51:43.308996195+00:00 stderr F I0503 10:51:43.308932   11290 run.go:19] Running: nice -- ionice -c 3 oc image extract -v 10 --path /:/run/mco-extensions/os-extensions-content-4035545447 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad48fe01f3e82584197797ce2151eecdfdcce67ae1096f06412e5ace416f66ce
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418008  184455 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-v4.0-art-dev
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418174  184455 round_trippers.go:466] curl -v -XGET  -H "User-Agent: oc/4.12.0 (linux/amd64) kubernetes/31aa3e8" 'https://quay.io/v2/'
2023-05-03T10:51:43.419618513+00:00 stderr F I0503 10:51:43.419517  184455 round_trippers.go:495] HTTP Trace: DNS Lookup for quay.io resolved to [{34.206.15.82 } {54.209.210.231 } {52.5.187.29 } {52.3.168.193 } {52.21.36.23 } {50.17.122.58 } {44.194.68.221 } {34.194.241.136 } {2600:1f18:483:cf01:ebba:a861:1150:e245 } {2600:1f18:483:cf02:40f9:477f:ea6b:8a2b } {2600:1f18:483:cf02:8601:2257:9919:cd9e } {2600:1f18:483:cf01:8212:fcdc:2a2a:50a7 } {2600:1f18:483:cf00:915d:9d2f:fc1f:40a7 } {2600:1f18:483:cf02:7a8b:1901:f1cf:3ab3 } {2600:1f18:483:cf00:27e2:dfeb:a6c7:c4db } {2600:1f18:483:cf01:ca3f:d96e:196c:7867 }]
2023-05-03T10:51:43.429298245+00:00 stderr F I0503 10:51:43.429151  184455 round_trippers.go:510] HTTP Trace: Dial to tcp:34.206.15.82:443 succeed

Expected results:

All images are pulled from the location as configured in the ICSP.

Additional info:

 

Description of problem:

Create network LoadBalancer service, but always get Connection time out when accessing the LB

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-27-135134

How reproducible:

100%

Steps to Reproduce:

1. create custom ingresscontroller that using Network LB service

$ Domain="nlb.$(oc get dns.config cluster -o=jsonpath='{.spec.baseDomain}')"
$ oc create -f - << EOF
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: ${Domain}
  replicas: 3
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
EOF


2. wait for the ingress NLB service is ready.

$ oc -n openshift-ingress get svc/router-nlb
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)                      AGE
router-nlb   LoadBalancer   172.30.75.134   a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com   80:31833/TCP,443:32499/TCP   117s


3. curl the network LB

$ curl a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com -I
<hang>

Actual results:

Connection time out

Expected results:

curl should return 503

Additional info:

the NLB service has the annotation:
  service.beta.kubernetes.io/aws-load-balancer-type: nlb
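For completeness, a couple of checks that could help narrow down whether the problem is on the AWS NLB side or the router side (names taken from the reproducer above):

$ oc -n openshift-ingress get svc router-nlb -o yaml | grep -A5 annotations
$ oc -n openshift-ingress get endpoints router-nlb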

 

This is a clone of issue OCPBUGS-2290. The following is the description of the original issue:

Description of problem:

If you try to deploy with the Internal publishing strategy, and you either already have a public gateway or have already permitted the VPC subnet to the DNS service, the deploy will always fail.

Version-Release number of selected component (if applicable):

 

How reproducible:

Easily

Steps to Reproduce:

1. Add a public gateway to VPC network and/or add VPC subnet to permitted DNS networks
2. Set publish strategy to Internal
3. Deploy

Actual results:

Deploy fails

Expected results:

If the resources exist simply skip trying to create them.

Additional info:

Fix here https://github.com/openshift/installer/pull/6481

Description of problem:

SYN packets for new tcp connections from inside the cluster to an external destination are dropped at random. After few seconds (i.e. few retries), they eventually succeed and no more packet drop happens. Hence, this is perceived as too long TCP connection establishment delay.
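If it helps reproduction, SYN retransmissions can usually be observed from the node hosting the pod; the node name, destination address, and tcpdump availability on the host are assumptions here:

$ oc debug node/<node> -- chroot /host tcpdump -nn -i any 'tcp[tcpflags] & tcp-syn != 0 and host <external-destination>'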

Version-Release number of selected component (if applicable):

4.10.0

How reproducible:

Frequently at a concrete cluster. Other clusters with apparently similar configuration don't show the issue.

Steps to Reproduce:

1. Establish TCP connection from pod to external destination.
2.
3.

Actual results:

SYN packets dropped, long TCP establishment time, leading to timeouts.

Expected results:

No drops

Additional info:

This becomes especially harmful because it impacts communication between openshift-apiserver (not to be confused with kube-apiserver) and etcd, because the former is inside the SDN and etcd isn't.

More details will follow in comments.

This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:

Description of problem:

When running a Hosted Cluster on Hypershift the cluster-networking-operator never progressed to Available despite all the components being up and running

Version-Release number of selected component (if applicable):

quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters
hypershift operator is quay.io/hypershift/hypershift-operator:4.11
4.11.9 management cluster

How reproducible:

Happened once

Steps to Reproduce:

1.
2.
3.

Actual results:

oc get co network reports False availability

Expected results:

oc get co network reports True availability

Additional info:

 

Description of problem:

When trying to add a Cisco UCS Rackmount server as a `baremetalhost` CR the following error comes up in the metal3 container log in the openshift-machine-api namespace.

'TransferProtocolType' property which is mandatory to complete the action is missing in the request body

Full log entry:

{"level":"info","ts":1677155695.061805,"logger":"provisioner.ironic","msg":"current provision state","host":"ucs-rackmounts~ocp-test-1","lastError":"Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://10.5.4.78/redfish/v1/Managers/CIMC/VirtualMedia/0/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.4.0.GeneralError: 'TransferProtocolType' property which is mandatory to complete the action is missing in the request body. Extended information: [{'@odata.type': 'Message.v1_0_6.Message', 'MessageId': 'Base.1.4.0.GeneralError', 'Message': "'TransferProtocolType' property which is mandatory to complete the action is missing in the request body.", 'MessageArgs': [], 'Severity': 'Critical'}].","current":"deploy failed","target":"active"}

Version-Release number of selected component (if applicable):

    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:30328143480d6598d0b52d41a6b755bb0f4dfe04c4b7aa7aefd02ea793a2c52b
    imagePullPolicy: IfNotPresent
    name: metal3-ironic

How reproducible:

Adding a Cisco UCS Rackmount with Redfish enabled as a baremetalhost to metal3

Steps to Reproduce:

1. The address to use: redfish-virtualmedia://10.5.4.78/redfish/v1/Systems/WZP22100SBV

Actual results:

[baelen@baelen-jumphost mce]$ oc get baremetalhosts.metal3.io  -n ucs-rackmounts  ocp-test-1
NAME         STATE          CONSUMER   ONLINE   ERROR                AGE
ocp-test-1   provisioning              true     provisioning error   23h

Expected results:

For the provisioning to be successful.

Additional info:

 

This is a clone of issue OCPBUGS-11719. The following is the description of the original issue:

Description of problem:

According to the slack thread attached: Cluster uninstallation is stuck when load balancers are removed before ingress controllers. This can happen when the ingress controller removal fails and the control plane operator moves on to deleting load balancers without waiting.

Code ref https://github.com/openshift/hypershift/blob/248cea4daef9d8481c367f9ce5a5e0436e0e028a/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L1505-L1520
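A minimal sketch of the expected ordering, assuming a controller-runtime client against the hosted cluster (not the actual HyperShift code):

// Sketch only: before tearing down cloud load balancers, check that the
// ingress controllers are actually gone and requeue instead of racing ahead.
package sketch

import (
	"context"
	"time"

	operatorv1 "github.com/openshift/api/operator/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func ensureIngressControllersGone(ctx context.Context, c client.Client) (ctrl.Result, error) {
	var list operatorv1.IngressControllerList
	if err := c.List(ctx, &list, client.InNamespace("openshift-ingress-operator")); err != nil {
		return ctrl.Result{}, err
	}
	if len(list.Items) > 0 {
		// Ingress controllers still exist; do not delete load balancers yet.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Safe to proceed with load balancer cleanup here.
	return ctrl.Result{}, nil
}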

Version-Release number of selected component (if applicable):

4.12.z 4.13.z

How reproducible:

Whenever the load balancer is deleted before the ingress controller

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Load balancer deletion waits for the ingress controller deletion

Additional info:

 

Slack: https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1681310121904539?thread_ts=1681216434.676009&cid=C04EUL1DRHC 

Description of problem:

The cluster-dns-operator does not reconcile the openshift-dns namespace, which has been exposed as an issue in 4.12 due to the requirement for the namespace to have pod-security labels.

If a cluster has been incrementally updated from a version less than or equal to 4.9, the openshift-dns namespace will most likely not contain the required pod-security labels since the namespace was statically created when the cluster was installed with old namespace configuration.
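Until the operator reconciles the namespace, a possible manual workaround is to add the labels directly (matching the expected labels shown below):

$ oc label namespace openshift-dns \
    pod-security.kubernetes.io/enforce=privileged \
    pod-security.kubernetes.io/audit=privileged \
    pod-security.kubernetes.io/warn=privileged --overwrite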

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always if cluster originally installed with v4.9 or less

Steps to Reproduce:

1. Install v4.9
2. Upgrade to v4.12 (incrementally if required for upgrade path)
3. openshift-dns namespace will be missing pod-security labels

Actual results:

"oc get ns openshift-dns -o yaml" will show missing pod-security labels: 

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c15,c0
    openshift.io/sa.scc.supplemental-groups: 1000210000/10000
    openshift.io/sa.scc.uid-range: 1000210000/10000
  creationTimestamp: "2020-05-21T19:36:15Z"
  labels:
    kubernetes.io/metadata.name: openshift-dns
    olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
  name: openshift-dns
  resourceVersion: "3127555382"
  uid: 0fb4571e-952f-4bea-bc45-461beec54369
spec:
  finalizers:
  - kubernetes

Expected results:

pod-security labels should exist:
 
 labels:
    kubernetes.io/metadata.name: openshift-dns
    olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: ""
    openshift.io/cluster-monitoring: "true"
    openshift.io/run-level: "0"
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged

Additional info:

Issue found in CI during upgrade

https://coreos.slack.com/archives/C03G7REB4JV/p1663676443155839 

Description of problem:

As part of deployment of our product (Submariner), we are creating a new worker node for the cluster by using MachineSet.

Recently, creation of a new worker node started to fail on clusters running version 4.12 on the Azure cloud platform.

Version-Release number of selected component (if applicable):

Openshift 4.12

How reproducible:

Deploy Openshift 4.12 and create a new worker node by using a MachineSet manifest.

Steps to Reproduce:

1. Deploy Openshift version 4.12 on Azure cloud platform
2. Create a new worker node by using MachineSet
3.

Actual results:

Creation of the worker node fails with the following errors from the machine-controller container in the machine-api-controllers pod in the openshift-machine-api namespace:

I0921 10:18:36.589819       1 actuator.go:213] subgw-central-46fe92-xrwql: actuator checking if machine existsW0921 10:18:36.731223       1 virtualmachines.go:100] vm subgw-central-46fe92-xrwql not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/subgw-central-46fe92-xrwql' under resource group 'mbabushk-azure-vvz4j-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")I0921 10:18:36.731266       1 controller.go:380] subgw-central-46fe92-xrwql: reconciling machine triggers idempotent createI0921 10:18:36.731277       1 actuator.go:85] Creating machine subgw-central-46fe92-xrwqlI0921 10:18:36.731825       1 publicips.go:58] creating public ip -subgw-central-46fe92-xrwqlI0921 10:18:37.286484       1 machine_scope.go:196] subgw-central-46fe92-xrwql: patching machineE0921 10:18:37.323111       1 actuator.go:79] Machine error: failed to reconcile machine "subgw-central-46fe92-xrwql": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label -subgw-central-46fe92-xrwql is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]W0921 10:18:37.323146       1 controller.go:382] subgw-central-46fe92-xrwql: failed to create machine: failed to reconcile machine "subgw-central-46fe92-xrwql": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label -subgw-central-46fe92-xrwql is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]I0921 10:18:37.323159       1 controller.go:422] Actuator returned invalid configuration error: failed to reconcile machine "subgw-central-46fe92-xrwql": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label -subgw-central-46fe92-xrwql is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]I0921 10:18:37.323169       1 controller.go:435] subgw-central-46fe92-xrwql: going into phase "Failed"I0921 10:18:37.324155       1 recorder.go:103] events "msg"="InvalidConfiguration: failed to reconcile machine \"subgw-central-46fe92-xrwql\": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"InvalidDomainNameLabel\" Message=\"The domain name label -subgw-central-46fe92-xrwql is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$.\" Details=[]" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"subgw-central-46fe92-xrwql","uid":"97218e2e-fc42-48c3-b834-dafc97fd2396","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"112872"} "reason"="FailedCreate" "type"="Warning"

Expected results:

The worker node should be created.

Additional info:

Since I didn't find a way to attach log files, I'm providing a link to a gdrive folder with the logs.

The following logs are attached:
- Cluster must gather
- Machine-controller and machineset-controller containers logs
- Applied Machine and MachineSet manifests

https://drive.google.com/drive/folders/1Xupus1hQC-CCtsTxh7R47RkOiiUgObd9?usp=sharing
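
For illustration only (not part of the original report): a minimal Go sketch showing why the generated public IP DNS label fails the Azure validation regex quoted in the error above. The label passed to the public IP creation begins with a dash, but the regex requires the first character to be a lowercase letter.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Azure's validation regex for public IP domain name labels, as quoted in the error.
	re := regexp.MustCompile(`^[a-z][a-z0-9-]{1,61}[a-z0-9]$`)

	// The label the machine controller tried to create starts with a dash,
	// so it can never match.
	fmt.Println(re.MatchString("-subgw-central-46fe92-xrwql")) // false
	fmt.Println(re.MatchString("subgw-central-46fe92-xrwql"))  // true
}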

Description of problem:

co/storage is not available because the CSI driver does not have the proxy settings on IBM Cloud

Version-Release number of selected component (if applicable):

4.12.0-0.ci-2022-10-13-233744

How reproducible:

Always

Steps to Reproduce:

1.Install an OCP cluster on an IBM Cloud disconnected env with an HTTP proxy
Template: private-templates/functionality-testing/aos-4_12/ipi-on-ibmcloud/versioned-installer-customer_vpc-http_proxy
2.Check co/storage
oc get co/storage
NAME      VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.12.0-0.ci-2022-10-13-233744   False       True          False      6h55m   IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment...
3.oc get pods
NAME                                                 READY   STATUS                  RESTARTS         AGE
ibm-vpc-block-csi-controller-6c4bfc9fc-6dmz7         4/5     CrashLoopBackOff        83 (113s ago)    6h55m
ibm-vpc-block-csi-driver-operator-7bd6fb5cdc-rktk2   1/1     Running                 1 (6h44m ago)    6h55m
ibm-vpc-block-csi-node-8s6dj                         0/3     Init:0/1                77 (5m34s ago)   6h52m
ibm-vpc-block-csi-node-9msld                         0/3     Init:Error              76 (5m49s ago)   6h47m
ibm-vpc-block-csi-node-fgs76                         0/3     Init:CrashLoopBackOff   76 (5m ago)      6h52m
ibm-vpc-block-csi-node-jd9fl                         0/3     Init:CrashLoopBackOff   75 (4m16s ago)   6h47m
ibm-vpc-block-csi-node-qkjxs                         0/3     Init:CrashLoopBackOff   77 (2m53s ago)   6h52m
ibm-vpc-block-csi-node-xbzm8                         0/3     Init:0/1                76 (5m13s ago)   6h47m
4.oc -n openshift-cluster-csi-drivers logs -c vpc-node-label-updater ibm-vpc-block-csi-node-xbzm8
{"level":"info","timestamp":"2022-10-14T09:18:32.436Z","caller":"nodeupdater/utils.go:57","msg":"Fetching secret configuration.","watcher-name":"vpc-node-label-updater"}
{"level":"info","timestamp":"2022-10-14T09:18:32.436Z","caller":"nodeupdater/utils.go:158","msg":"parsing conf file","watcher-name":"vpc-node-label-updater","confpath":"/etc/storage_ibmc/slclient.toml"}
{"level":"error","timestamp":"2022-10-14T09:19:02.437Z","caller":"nodeupdater/utils.go:96","msg":"Failed to Get IAM access token","watcher-name":"vpc-node-label-updater","error":"Post \"https://iam.cloud.ibm.com/oidc/token\": dial tcp 23.203.93.6:443: i/o timeout"}
{"level":"fatal","timestamp":"2022-10-14T09:19:02.437Z","caller":"cmd/main.go:140","msg":"Failed to read secret configuration from storage secret present in the cluster ","watcher-name":"vpc-node-label-updater","error":"Post \"https://iam.cloud.ibm.com/oidc/token\": dial tcp 23.203.93.6:443: i/o timeout"}

5.oc -n openshift-cluster-csi-drivers describe pod ibm-vpc-block-csi-node-xbzm8
Environment:
   ADDRESS:          /csi/csi.sock
   DRIVER_REGISTRATION_SOCK: /var/lib/kubelet/plugins/vpc.block.csi.ibm.io/csi.sock
   KUBE_NODE_NAME:       (v1:spec.nodeName)
Actual results:

Expected results:

 

Additional info:

 

As mentioned in AITRIAGE-3520, multiple attempts to grab controller logs might fail at some point and overwrite existing logs.

In the case of the ticket I mentioned, we were able to retrieve controller logs from the logs server. However, this might not always be the case for other clusters.

We need to find a way to preserve all logs, or time out log collection differently.

 

The way we thought it could be handled is by writing the logs to a file inside the container; in case kube-api is not reachable, we will read the logs from that file (see the sketch below).

Omer Tuchfeld Nir Magnezi  Mat Kowalski 
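
A minimal sketch of the proposed fallback, using hypothetical names (logSink, CollectLogs) rather than the actual assisted-installer code:

package logs

import (
	"fmt"
	"os"
)

// logSink tees controller log lines to a local file so they survive kube-api outages
// and are not lost when a later collection attempt overwrites what was uploaded.
type logSink struct {
	file *os.File
}

func (s *logSink) WriteEntry(line string) error {
	_, err := fmt.Fprintln(s.file, line)
	return err
}

// CollectLogs prefers the live kube-api stream, but falls back to the on-disk copy
// when the API server is unreachable.
func CollectLogs(fromAPI func() ([]byte, error), fallbackPath string) ([]byte, error) {
	if data, err := fromAPI(); err == nil {
		return data, nil
	}
	return os.ReadFile(fallbackPath)
}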

This is a clone of issue OCPBUGS-10890. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10649. The following is the description of the original issue:

Description of problem:

After a replace upgrade from one OCP 4.14 image to another 4.14 image, the first node is in NotReady.

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                     STATUS   ROLES  AGE   VERSION
ip-10-0-128-175.us-east-2.compute.internal  Ready   worker  72m   v1.26.2+06e8c46
ip-10-0-134-164.us-east-2.compute.internal  Ready   worker  68m   v1.26.2+06e8c46
ip-10-0-137-194.us-east-2.compute.internal  Ready   worker  77m   v1.26.2+06e8c46
ip-10-0-141-231.us-east-2.compute.internal  NotReady  worker  9m54s  v1.26.2+06e8c46

- lastHeartbeatTime: "2023-03-21T19:48:46Z"
  lastTransitionTime: "2023-03-21T19:42:37Z"
  message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
   message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
   Has your network provider started?'
  reason: KubeletNotReady
  status: "False"
  type: Ready

Events:
 Type   Reason          Age         From          Message
 ----   ------          ----        ----          -------
 Normal  Starting         11m         kubelet        Starting kubelet.
 Normal  NodeHasSufficientMemory 11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory
 Normal  NodeHasNoDiskPressure  11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
 Normal  NodeHasSufficientPID   11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID
 Normal  NodeAllocatableEnforced 11m         kubelet        Updated Node Allocatable limit across pods
 Normal  Synced          11m         cloud-node-controller Node synced successfully
 Normal  RegisteredNode      11m         node-controller    Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller
 Warning ErrorReconcilingNode   17s (x30 over 11m) controlplane      nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

ovnkube-master log:

I0321 20:55:16.270197       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270273       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:17.851497       1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal"
I0321 20:55:25.965132       1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal"
I0321 20:55:45.928694       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:55:46.270129       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270154       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270164       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:55:46.270201       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270284       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:52.916512       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received
I0321 20:56:06.910669       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received
I0321 20:56:15.928505       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:56:16.269611       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269637       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269646       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:56:16.269688       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269697       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269724       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

cluster-network-operator log:

I0321 21:03:38.487602       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.488312       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:38.499825       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.571013       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
I0321 21:03:39.450912       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.450953       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.493206       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.494050       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:39.508538       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.684429       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. management cluster 4.13
2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132
3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 
4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450 

Actual results:

First node is in NotReady

Expected results:

All nodes should be Ready

Additional info:

No issue with replace upgrade from 4.13 to 4.14

This is a clone of issue OCPBUGS-3761. The following is the description of the original issue:

Description of problem:

Events.Events: event view displays created pod
https://search.ci.openshift.org/?search=event+view+displays+created+pod&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.Run event scenario tests and note below results: 

Actual results:

{Expected '' to equal 'test-vjxfx-event-test-pod'. toEqual Error: Failed expectation
    at /go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:65:72
    at Generator.next (<anonymous>:null:null)
    at fulfilled (/go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:5:58)
    at runMicrotasks (<anonymous>:null:null)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
   }

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4701. The following is the description of the original issue:

Description of problem:

In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see a "Control plane is hosted" banner (despite the control plane not being hosted), because hasPermissionsToUpdate is false, so canPerformUpgrade is false.

Version-Release number of selected component (if applicable):

4.12.0-rc.0. Likely more. I haven't traced it out.

How reproducible:

Always.

Steps to Reproduce:

1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:

$ oc get -o yaml clusterrole sudoer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2020-05-21T19:39:09Z"
  name: sudoer
  resourceVersion: "7715"
  uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9
rules:
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:admin
  resources:
  - systemusers
  - users
  verbs:
  - impersonate
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:masters
  resources:
  - groups
  - systemgroups
  verbs:
  - impersonate

3. View /settings/cluster

Actual results:

See the "Control plane is hosted" banner.

Expected results:

Possible cases:

  • For me in my impersonate group, I can trigger updates via the command-line by using --as system:admin. I don't know if the console supports impersonation, or wants to mention the option if it does not.
  • For users with read-only access in stand-alone clusters, telling the user they are not authorized to update makes sense. Maybe mention that their cluster admins may be able to update, or just leave that unsaid.
  • For users with managed/dedicated branding, possibly point out that updates in that environment happen via OCM. And leave it up to OCM to decide if that user has access.
  • For users with externally-hosted control planes, possibly tell them this regardless of whether they have the ability to update via some external interface or not. For externally-hosted, Red-Hat-managed clusters, the interface will presumably be OCM. For externally-hosted, customer-managed clusters, there may be some ACM or other interface? I'm not sure. But the message of "this in-cluster web console is not where you configure this stuff, even if you are one of the people who can make these decisions for this cluster" will apply for all hosted situations.

This is a clone of issue OCPBUGS-6270. The following is the description of the original issue:

Similar to how the install-config validation previously made the baremetal platform require a bunch of fields that are actually ignored (OCPBUGS-3278), we also require values for the following fields in the platform.vsphere section:

  • vCenter
  • username
  • password
  • datacenter
  • defaultDatastore

None of these values are actually used in the agent-based installer at present, and they should not be required.

Users can work around this by specifying dummy values in the platform config (note that the VIP values are required and must be genuine):

platform:
  vsphere:
    apiVIP: 192.168.111.1
    ingressVIP: 192.168.111.2
    vCenter: a
    username: b
    password: c
    datacenter: d
    defaultDatastore: e

This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:

Description of problem:

When running a Hosted Cluster on HyperShift, the cluster-network-operator never progressed to Available despite all the components being up and running

Version-Release number of selected component (if applicable):

quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters
hypershift operator is quay.io/hypershift/hypershift-operator:4.11
4.11.9 management cluster

How reproducible:

Happened once

Steps to Reproduce:

1.
2.
3.

Actual results:

oc get co network reports False availability

Expected results:

oc get co network reports True availability

Additional info:

 

This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:

Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. load topology
  2. if it loads successfully, keep trying  until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

Description of problem:

When spot instances with taints are added to the cluster on AWS, machine-api-termination-handler daemonset pods do not launch on these instances because of the taints. 

machine-api-termination-handler is used for checking the notification of instance termination, so if it doesn't launch properly, application pods on spot instances could stop without normal shutdown procedures. 

It is common to use taint-toleration to specify workloads on spot instances, because it does not require changing application manifests of other workloads. 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Creating ROSA cluster
2. Adding spot instances with taints on OCM
3. oc get daemonset machine-api-termination-handler -n openshift-machine-api

Actual results:

machine-api-termination-handler pods do not launch on spot instances

Expected results:

machine-api-termination-handler pods launch on spot instances

Additional info:

Adding the following tolerations to the machine-api-termination-handler daemonset could resolve the problem.
---  
tolerations:        
- operator: Exists

Description of problem:

When providing the openshift-install agent create command with installconfig + agentconfig manifests that contain the InstallConfig Proxy section, the Proxy configuration does not get applied.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1.Define InstallConfig with Proxy section
2.openshift-install agent create image
3.Boot ISO
4.Check /etc/assisted/manifests for InfraEnv to contain its Proxy section

Actual results:

Missing proxy

Expected results:

Proxy present and matching InstallConfig's

Additional info:

 

This is a clone of issue OCPBUGS-15589. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13526. The following is the description of the original issue:

Description of problem:

During a fresh install of an operator with conversion webhooks enabled, `crd.spec.conversion.webhook.clientConfig` is dynamically updated initially, as expected, with the proper webhook ns, name, & caBundle. However, within a few seconds, those critical settings are overwritten with the bundle’s packaged CRD conversion settings. This breaks the operator and stops the installation from completing successfully.

Oddly though, if that same operator version is installed as part of an upgrade from a prior release... the dynamic clientConfig settings are retained and all works as expected.

 

Version-Release number of selected component (if applicable):

OCP 4.10.36
OCP 4.11.18

How reproducible:

Consistently

 

Steps to Reproduce:

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/0951d40f58f2f49306cc4061887e8860/raw/3c7979b58705ab3a9e008b45a4ed4abc3ef21c2b/conversionIssuesFreshInstall.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w

 

Actual results:

Eventually, the clientConfig settings will revert to the following and stay that way.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[service:map[name:dbaas-operator-webhook-service namespace:openshift-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: openshift-dbaas-operator
         name: dbaas-operator-webhook-service
         path: /convert
         port: 443
     conversionReviewVersions:
       - v1alpha1
       - v1beta1

 

Expected results:

The `crd.spec.conversion.webhook.clientConfig` should instead retain the following settings.

$ oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}'
map[caBundle:LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJpRENDQVMyZ0F3SUJBZ0lJUVA1b1ZtYTNqUG93Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEEwTWpsYUZ3MHlOREV5TVRVeE9UQTBNamxhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBVGcxaEtPWW40MStnTC9PdmVKT21jbkx5MzZNWTBEdnRGcXF3cjJFdlZhUWt2WnEzWG9ZeWlrdlFlQ29DZ3QKZ2VLK0UyaXIxNndzSmRSZ2paYnFHc3pGbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZPMWNXNFBrbDZhcDdVTVR1UGNxZWhST1gzRHZNQW9HQ0NxR1NNNDlCQU1DQTBrQU1FWUNJUURxN0pkUjkxWlgKeWNKT0hyQTZrL0M0SG9sSjNwUUJ6bmx3V3FXektOd0xiZ0loQU5ObUd6RnBqaHd6WXpVY2RCQ3llU3lYYkp3SAphYllDUXFkSjBtUGFha28xCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K service:map[name:dbaas-operator-controller-manager-service namespace:redhat-dbaas-operator path:/convert port:443]]
 conversion:
   strategy: Webhook
   webhook:
     clientConfig:
       service:
         namespace: redhat-dbaas-operator
         name: dbaas-operator-controller-manager-service
         path: /convert
         port: 443
       caBundle: >-
         LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJoekNDQVMyZ0F3SUJBZ0lJZXdhVHNLS0hhbWd3Q2dZSUtvWkl6ajBFQXdJd0dERVdNQlFHQTFVRUNoTU4KVW1Wa0lFaGhkQ3dnU1c1akxqQWVGdzB5TWpFeU1UWXhPVEF5TURkYUZ3MHlOREV5TVRVeE9UQXlNRGRhTUJneApGakFVQmdOVkJBb1REVkpsWkNCSVlYUXNJRWx1WXk0d1dUQVRCZ2NxaGtqT1BRSUJCZ2dxaGtqT1BRTUJCd05DCkFBUVRFQm8zb1BWcjRLemF3ZkE4MWtmaTBZQTJuVGRzU2RpMyt4d081ZmpKQTczdDQ2WVhOblFzTjNCMVBHM04KSXJ6N1dKVkJmVFFWMWI3TXE1anpySndTbzJFd1h6QU9CZ05WSFE4QkFmOEVCQU1DQW9Rd0hRWURWUjBsQkJZdwpGQVlJS3dZQkJRVUhBd0lHQ0NzR0FRVUZCd01CTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3SFFZRFZSME9CQllFCkZJemdWbC9ZWkFWNmltdHl5b0ZkNFRkLzd0L3BNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRUY3ZXZ0RS95OFAKRnVrTUtGVlM1VkQ3a09DRzRkdFVVOGUyc1dsSTZlNEdBaUVBZ29aNmMvYnNpNEwwcUNrRmZSeXZHVkJRa25SRwp5SW1WSXlrbjhWWnNYcHM9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K 

 

Additional info:

If the operator is, instead, installed as an upgrade... vs a fresh install... the webhook settings are properly/permanently set and everything works as expected. This can be tested in a fresh cluster like this.

1. oc apply -f https://gist.githubusercontent.com/tchughesiv/703109961f22ab379a45a401be0cf351/raw/2d0541b76876a468757269472e8e3a31b86b3c68/conversionWorksUpgrade.yaml
2. oc get crd dbaasproviders.dbaas.redhat.com --template '{{ .spec.conversion.webhook.clientConfig }}' -w

This is a clone of issue OCPBUGS-4357. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Get the below error when upgrading to OCP 4.12 from 4.9->4.10->4.11.

MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-08-24-091058   True        True          4h      Unable to apply 4.12.0-0.nightly-2022-08-24-053339: the workload openshift-operator-lifecycle-manager/package-server-manager cannot roll out
   - lastTransitionTime: "2022-08-25T04:47:36Z"
    lastUpdateTime: "2022-08-25T04:47:36Z"
    message: 'pods "package-server-manager-85b6dc4d89-sdzcc" is forbidden: violates
      PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager"
      must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure

 

Version-Release number of selected component (if applicable):

MacBook-Pro:~ jianzhang$ oc exec catalog-operator-c5c655d5c-b9lcn -- olm --version
OLM version: 0.19.0
git commit: 8a984d41acc67c0bc9bfe807fadeef23f83abd44 

How reproducible:

always

Steps to Reproduce:
1. Install OCP 4.11.0-0.nightly-2022-08-24-091058
2. Upgrade it to 4.12.0-0.nightly-2022-08-24-053339

Actual results:

The cluster upgrade is blocked with the errors described above.

Expected results:

The upgrade to 4.12 from older OCP versions (for example 4.5 or 4.9) completes successfully.

Additional info:

MacBook-Pro:~ jianzhang$ oc get deployment package-server-manager -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-08-25T00:14:08Z"
  generation: 5
  labels:
    app: package-server-manager
  name: package-server-manager
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 3fd29082-0e76-4b09-988e-78cb5fc7c8b5
  resourceVersion: "169028"
  uid: c8f7cbe2-4f82-40ce-9468-817ffefa903f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: package-server-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
      creationTimestamp: null
      labels:
        app: package-server-manager
    spec:
      containers:
      - args:
        - --name
        - $(PACKAGESERVER_NAME)
        - --namespace
        - $(PACKAGESERVER_NAMESPACE)
        command:
        - /bin/psm
        - start
        env:
        - name: PACKAGESERVER_NAME
          value: packageserver
        - name: PACKAGESERVER_IMAGE
          value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d49e1e27114f4b719bc8f3c222b2c5934d3b8028c79ec8e2bd288f6e9b5b3d5c
        - name: PACKAGESERVER_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: RELEASE_VERSION
          value: 4.12.0-0.nightly-2022-08-24-053339
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d49e1e27114f4b719bc8f3c222b2c5934d3b8028c79ec8e2bd288f6e9b5b3d5c
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: package-server-manager
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 10m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
      serviceAccount: olm-operator-serviceaccount
      serviceAccountName: olm-operator-serviceaccount
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-08-25T03:14:20Z"
    lastUpdateTime: "2022-08-25T03:14:20Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-08-25T04:47:36Z"
    lastUpdateTime: "2022-08-25T04:47:36Z"
    message: 'pods "package-server-manager-85b6dc4d89-sdzcc" is forbidden: violates
      PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager"
      must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  - lastTransitionTime: "2022-08-25T04:57:37Z"
    lastUpdateTime: "2022-08-25T04:57:37Z"
    message: ReplicaSet "package-server-manager-85b6dc4d89" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 5
  readyReplicas: 1
  replicas: 1
  unavailableReplicas: 1 

This is a clone of issue OCPBUGS-5151. The following is the description of the original issue:

Description of problem:

The customer is not able to install a new OCP bare-metal IPI cluster. During bootstrapping, the provisioning interfaces on the master nodes do not get an IPv4 DHCP address from the bootstrap DHCP server.

Please refer to the following bug: https://issues.redhat.com/browse/OCPBUGS-872. There the problem was solved by applying rd.net.timeout.carrier=30 to the kernel parameters of compute nodes via the cluster-baremetal operator. The fix also needs to be applied to the control plane.

  ref:// https://github.com/openshift/cluster-baremetal-operator/pull/286/files

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Customer should be able to install the cluster without any issue.

Additional info:

 

Searching recent 4.12 CI, there are a number of failures in the clusteroperator/machine-config should not change condition/Available test case:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit' | grep '4[.]12.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade (all) - 129 runs, 53% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview-serial (all) - 6 runs, 50% failed, 67% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-ovn-upgrade (all) - 60 runs, 50% failed, 3% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade (all) - 129 runs, 56% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 129 runs, 69% failed, 12% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-rt-upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade (all) - 60 runs, 57% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade (all) - 12 runs, 42% failed, 20% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-upgrade (all) - 60 runs, 40% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-virtualmedia (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-serial-ovn-dualstack (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-ovn-techpreview-serial (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-okd-4.12-e2e-vsphere (all) - 25 runs, 100% failed, 4% of failures match = 4% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.12 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact

Doesn't seem like reason is getting set?

$ curl -s 'https://search.ci.openshift.org/search?name=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade&search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit&context=15' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep 'clusteroperator/machine-config condition/Available status/False reason'
Aug 31 01:13:56.724 - 698s  E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-30-194744}]
Aug 31 09:09:15.460 - 1078s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-30-194744}]
Sep 01 03:31:24.808 - 1131s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:15:58.029 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]

Example runs in the job I've randomly selected to drill into:

$ curl -s 'https://search.ci.openshift.org/search?name=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade&search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1564757706458271744
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1564879945233076224
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565158084484009984
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565212566194491392

Drilling into that last run, the Available=False condition spanned the whole pool-update phase:

And details from the origin's monitor:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565212566194491392/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/machine-config
Sep 01 07:15:57.629 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.12.0-0.ci-2022-08-31-111359 because: refusing to read osImageURL version "4.12.0-0.ci-2022-09-01-053740", operator version "4.12.0-0.ci-2022-08-31-111359"
Sep 01 07:15:57.629 - 49s   E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.12.0-0.ci-2022-08-31-111359 because: refusing to read osImageURL version "4.12.0-0.ci-2022-09-01-053740", operator version "4.12.0-0.ci-2022-08-31-111359"
Sep 01 07:15:58.029 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:15:58.029 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:16:47.000 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.12.0-0.ci-2022-08-31-111359}] to [{operator 4.12.0-0.ci-2022-09-01-053740}]
Sep 01 07:16:47.377 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:16:47.377 - 1037s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:16:47.405 W clusteroperator/machine-config condition/Degraded status/False changed: 
Sep 01 07:18:02.614 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Sep 01 07:34:03.000 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.12.0-0.ci-2022-08-31-111359}] to [{operator 4.12.0-0.ci-2022-09-01-053740}]
Sep 01 07:34:03.699 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:34:03.715 W clusteroperator/machine-config condition/Upgradeable status/True changed: 
Sep 01 07:34:04.065 I clusteroperator/machine-config versions: operator 4.12.0-0.ci-2022-08-31-111359 -> 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:34:04.663 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.12.0-0.ci-2022-09-01-053740
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded

No idea if whatever was happening there is the same thing that was happening in other runs, and I haven't checked 4.11 and earlier either. The test-case is non-fatal, so it doesn't break CI, but it can cause noise like ClusterOperatorDown if it continues for 10 or more minutes, which PromeCIeus says actually fired in this run, although apparently the origin monitors didn't notice to complain:

So parallel asks (and I'm happy to shard into separate bugs, if that's helpful):

  • Set a reason when you go Available=False, so Telemetry can collect information to aggregate and hunt for frequent reasons to prioritize improvements (see the sketch after this list).
  • Figure out at least one reason why we're going Available=False in apparently healthy CI runs. If we find and fix one reason, we can circle back later to see if there are more that remain unfixed.
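
A minimal sketch of the first ask, assuming the openshift/api config types; the status-syncing plumbing around it is omitted and the reason string is illustrative only:

package mco

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unavailableCondition builds an Available=False condition that always carries a
// machine-readable Reason, so Telemetry can aggregate why the operator went unavailable.
func unavailableCondition(reason, message string) configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		Reason:             reason,  // e.g. "MachineConfigPoolsUpdating" (hypothetical)
		Message:            message, // human-readable detail
		LastTransitionTime: metav1.Now(),
	}
}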

The aws-ebs-csi-driver-operator ServiceAccount does not include the HCP pull secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains credentials that the management cluster pull secret does not have, the image pull fails.

Description of problem:

When providing the openshift-install agent create command with installconfig + agentconfig manifests that contain the InstallConfig Proxy section, the Proxy configuration is not applied cluster-wide.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1.Define InstallConfig with Proxy section
2.openshift-install agent create image
3.Boot ISO
4.Check /etc/assisted/manifests for agent-cluster-install.yaml to contain the Proxy section 

Actual results:

Missing proxy

Expected results:

Proxy should be present and match with the InstallConfig

Additional info:

 

This is a clone of issue OCPBUGS-3278. The following is the description of the original issue:

Description of problem:

When doing openshift-install agent create image, one should not need to provide platform specific data like boot MAC addresses.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1.Create install-config with only VIPs in Baremetal platform section

apiVersion: v1
metadata:
  name: foo
baseDomain: test.metalkube.org
networking:
  clusterNetwork:
    - cidr: 10.128.0.0/14
      hostPrefix: 23
  machineNetwork:
    - cidr: 192.168.122.0/23
  networkType: OpenShiftSDN
  serviceNetwork:
    - 172.30.0.0/16
compute:
  - architecture: amd64
    hyperthreading: Enabled
    name: worker
    platform: {}
    replicas: 0
controlPlane:
  name: master
  replicas: 3
  hyperthreading: Enabled
  architecture: amd64
platform:
  baremetal:
    apiVIPs:
      - 192.168.122.10
    ingressVIPs:
      - 192.168.122.11
---
apiVersion: v1beta1
metadata:
  name: foo
rendezvousIP: 192.168.122.14

2.openshift-install agent create image

Actual results:

ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors 
ERROR failed to fetch Agent Installer ISO: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: [platform.baremetal.hosts: Invalid value: []*baremetal.Host(nil): bare metal hosts are missing, platform.baremetal.Hosts: Required value: not enough hosts found (0) to support all the configured ControlPlane replicas (3)]

Expected results:

Image gets generated

Additional info:

We should go into the install-config validation code, detect that we are doing an agent-based installation, and skip the host checks (a sketch follows below).
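
A minimal sketch of that idea, using hypothetical names (Platform, ValidateHosts) rather than the installer's real validation code:

package validation

import "fmt"

// Platform is a simplified stand-in for the baremetal platform section of the install-config.
type Platform struct {
	Hosts []string // real hosts carry BMC, boot MAC, and other details
}

// ValidateHosts skips the host checks entirely for agent-based installs,
// where hosts are discovered at boot time instead of being listed up front.
func ValidateHosts(p *Platform, controlPlaneReplicas int, agentBasedInstall bool) []error {
	if agentBasedInstall {
		return nil
	}
	var errs []error
	if len(p.Hosts) == 0 {
		errs = append(errs, fmt.Errorf("bare metal hosts are missing"))
	}
	if len(p.Hosts) < controlPlaneReplicas {
		errs = append(errs, fmt.Errorf("not enough hosts found (%d) to support all the configured ControlPlane replicas (%d)", len(p.Hosts), controlPlaneReplicas))
	}
	return errs
}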

This is a clone of issue OCPBUGS-2551. The following is the description of the original issue:

Description of problem:

When a normal user selects "All namespaces" using the "Show operands in" radio button, an "Error Loading" error is shown.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-192348, 4.11

How reproducible:

Always

Steps to Reproduce:

1. Install the operator "Red Hat Integration - Camel K" in all namespaces
2. Log in to the console as a normal user
3. Navigate to the "All instances" tab for the operator
4. Check the radio button "All namespaces" is being selected
5. Check the page 

Actual results:

The "Error Loading" info is shown on the page

Expected results:

The error should not be shown

Additional info:

 

This is a clone of issue OCPBUGS-4986. The following is the description of the original issue:

We should avoid errors like:

$ oc get -o json clusterversion version | jq -r '.status.history[0].acceptedRisks'
Forced through blocking failures: Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.13.0-0.okd-2022-12-11-064650 to 4.13.0-0.okd-2022-12-13-052859 is probably neither recommended nor supported.

Instead, tweak the logic from OCPBUGS-2727, and only append the Forced through blocking failures: prefix when the forcing was required.
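A sketch of the intended behaviour (illustrative names, not the actual CVO code):

```
// Only add the prefix when the update really was forced past blocking
// precondition failures; otherwise report the accepted risks as-is.
func acceptedRisksMessage(forced bool, blockingFailures []string, details string) string {
	if forced && len(blockingFailures) > 0 {
		return "Forced through blocking failures: " + details
	}
	return details
}
```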

Description of problem:

We need to have admin-ack in 4.12 so that admins can check for deprecated APIs and approve before they move to 4.13. Refer to https://access.redhat.com/articles/6958394 for more information. As planned, we want to add the admin-ack around the 4.13 feature freeze.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Install a cluster in 4.12. 
2. Run an application which uses the deprecated API. See https://access.redhat.com/articles/6958394 for more information.
3. Upgrade to 4.13

Actual results:

The upgrade happens without asking the admin to confirm that the workloads do not use the deprecated APIs.

Expected results:

Upgrade should wait for the admin-ack.

Additional info:

This was the PR for 4.11.z https://github.com/openshift/cluster-version-operator/pull/836
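For reference, once the gate exists an administrator acknowledges it with something like the following (the gate name shown is the one documented in the linked article for the 4.12 to 4.13 path; confirm it there):

```
$ oc -n openshift-config patch cm admin-acks --type=merge \
    --patch '{"data":{"ack-4.12-kube-1.26-api-removals-in-4.13":"true"}}'
```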

Description of the problem:

When installing a cluster using the kubeapi, the installer fails to send the logs due to a missing volume mount of the caCert.

 

time="2022-07-06T08:25:59Z" level=info msg="failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log quay.io/edge-infrastructure/assisted-installer-agent@sha256:20d9e31e37f881fcd34aed44b2ee9f143382f87cbf4b634325d2260f8dffe6c2 logs_sender -cluster-id 4d4be932-42a8-4d37-b5d2-41f42a487821 -url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org -host-id 17babad0-f2d0-419f-a69b-8c6895df26f4 -infra-env-id 37c26d69-6416-4888-bd2e-aec610f241b3 -pull-secret-token <SECRET> -insecure=false -bootstrap=true -cacert=/etc/assisted-service/service-ca-cert.crt], env vars [PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin TERM=xterm container=oci http_proxy= https_proxy= NO_PROXY= OPENSHIFT_BUILD_NAME=assisted-installer PULL_SECRET_TOKEN=<SECRET> no_proxy= HTTP_PROXY= HTTPS_PROXY= OPENSHIFT_BUILD_NAMESPACE=ci-op-8wiv6td6 BUILD_LOGLEVEL=0 HOME=/root HOSTNAME=extraworker-0], error exit status 1, waitStatus 1, Output \"time=\"06-07-2022 08:25:59\" level=fatal msg=\"Failed to initialize connection: &{%!e(string=open) %!e(string=/etc/assisted-service/service-ca-cert.crt) %!e(syscall.Errno=2)}\" file=\"send_logs.go:92\"\ntime=\"2022-07-06T08:25:59Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-8b070b62a9482fc0add228b77844b2c4e0a614e2b171ca87f76f56a4305a6ee7.scope: no such file or directory\"\""
time="2022-07-06T08:25:59Z" level=error msg="upload installation logs failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log quay.io/edge-infrastructure/assisted-installer-agent@sha256:20d9e31e37f881fcd34aed44b2ee9f143382f87cbf4b634325d2260f8dffe6c2 logs_sender -cluster-id 4d4be932-42a8-4d37-b5d2-41f42a487821 -url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org -host-id 17babad0-f2d0-419f-a69b-8c6895df26f4 -infra-env-id 37c26d69-6416-4888-bd2e-aec610f241b3 -pull-secret-token <SECRET> -insecure=false -bootstrap=true -cacert=/etc/assisted-service/service-ca-cert.crt], Error exit status 1, LastOutput \"... :92\"\ntime=\"2022-07-06T08:25:59Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-8b070b62a9482fc0add228b77844b2c4e0a614e2b171ca87f76f56a4305a6ee7.scope: no such file or directory\"\"" 

How reproducible:

100%

Steps to reproduce:

1. Install a cluster using the kubeapi

2. Look for the host logs after the host reboots or the installation completes

3.

Actual results:

no host logs

Expected results:
...

Description of problem:

OVNKubernetesControllerDisconnectedSouthboundDatabase alert seems to fire in the e2e-aws-ovn-serial CI job. Note that something funny happens in the job itself: a set of ovnkube-node pods get created, then deleted, then recreated again before the test runs. But the alert gets fired for the first set of pods that got deleted. From the initial screening of artifacts alone it's not clear what happened to the old pods. This needs investigation.

Version-Release number of selected component (if applicable):

4.12 OCP

How reproducible:

Seems like always

Steps to Reproduce:

1.https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27043/pull-ci-openshift-origin-master-e2e-aws-ovn-serial/1568166237639282688
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27043/pull-ci-openshift-origin-master-e2e-aws-ovn-serial/1567913444936519680

Actual results:

Alert is fired

Expected results:

The alert shouldn't fire. If firing is expected in the serial job, then we need to silence that alert for that job, or treat it as a flake and not fail hard when it fires.

Additional info:

 

Description of problem:

METAL-256 introduced Ironic API proxy pods. The pods start with IPv4, but crash loop if IPv6 is used. Blocks Assisted ZTP flow (This was with converged flow DISABLED)

[root@ocp-edge34 opt]# oc get pods -n openshift-machine-api
NAME                                                  READY   STATUS             RESTARTS         AGE
cluster-autoscaler-operator-85b7c7c69b-2wdh9          2/2     Running            2 (14h ago)      15h
cluster-baremetal-operator-8555c9dc87-t5rm4           2/2     Running            0                15h
control-plane-machine-set-operator-6c4f7fff6f-fts4p   1/1     Running            0                15h
ironic-proxy-67wkh                                    0/1     CrashLoopBackOff   164 (108s ago)   13h
ironic-proxy-9qg6h                                    0/1     CrashLoopBackOff   163 (106s ago)   13h
ironic-proxy-hxft5                                    0/1     CrashLoopBackOff   164 (108s ago)   13h
machine-api-controllers-6b4f47899b-7xqb8              7/7     Running            0                14h
machine-api-operator-544587645d-9rv4m                 2/2     Running            0                15h
metal3-7688b65d7f-kc2mg                               5/5     Running            0                13h
metal3-image-cache-4w24m                              1/1     Running            0                14h
metal3-image-cache-q7p54                              1/1     Running            0                14h
metal3-image-cache-vhnkj                              1/1     Running            0                14h
metal3-image-customization-5dcd9f4fb7-lpmrq           1/1     Running            0                13h

Apache is used for the underlying proxy, and I believe the ipv6 address probably just needs to be surrounded in brackets to pass syntax.

+ python3 -c 'import os; import sys; import jinja2; sys.stdout.write(jinja2.Template(sys.stdin.read()).render(env=os.environ))'
+ exec /usr/sbin/httpd -DFOREGROUND
AH00526: Syntax error on line 8 of /etc/httpd/conf.d/ironic-proxy.conf:
ProxyPass Unable to parse URL: https://fd2e:6f44:5dd8::79:6388/
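For illustration, the generated httpd config would need the literal IPv6 address bracketed, along these lines (assuming the trailing 6388 is the port):

```
ProxyPass "/" "https://[fd2e:6f44:5dd8::79]:6388/"
```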
Version-Release number of selected component (if applicable):
OCP hub 4.12.0-ec.3
2.2.0-DOWNANDBACK-2022-09-26-15-59-33

 

How reproducible:
100%

 

Steps to Reproduce:

1. Deploy ocp bm compact/HA cluster with ipv6
2. Deploy MCE + Assisted Service
3. Try to deploy a spoke via full ZTP

Actual results:
Spoke BMHs on the Hub cluster do nothing:
mstat-0                 mstat-master-0-0-bmh                                                  true             10h
mstat-0                 mstat-master-0-1-bmh                                                  true             10h
mstat-0                 mstat-master-0-2-bmh                                                  true             10h
mstat-0                 mstat-worker-0-0-bmh                                                  true             10h
mstat-0                 mstat-worker-0-1-bmh                                                  true             10h

 

Expected results:
ZTP flow happens and spoke cluster deployed

 

Additional info:

 


Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

The default dns-default pod is missing the "target.workload.openshift.io/management:" annotation. 

As a result, when the workload partitioning feature is enabled on SNO, this pod's resources will not get mutated and pinned to the reserved cpuset.

This is a regression from 4.10. Pod spec from 4.10.17:

Annotations:
...
   resources.workload.openshift.io/dns: {"cpushares": 51}
   resources.workload.openshift.io/kube-rbac-proxy: {"cpushares": 10}
   target.workload.openshift.io/management: {"effect":"PreferredDuringScheduling"}

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

100%

Steps to Reproduce:

1. Install a SNO and check the annotation
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

This is a clone of issue OCPBUGS-10807. The following is the description of the original issue:

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller and other CNO-managed deployments should run with non-root security context. If Hypershift runs control plane on kubernetes (as opposed to Openshift) management cluster, it adds pod security context to its managed deployments, including CNO, with runAsUser element inside. In such a case CNO should do the same, set security context for its managed deployments, like multus-admission-controller, to meet Hypershift security rules.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Create OCP cluster using Hypershift using Kube management cluster
2.Check pod security context of multus-admission-controller

Actual results:

no pod security context is set on multus-admission-controller

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

Corresponding CNO change 
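A pod-level security context of the sort described would look roughly like this in the multus-admission-controller deployment (the UID is purely illustrative; in practice it comes from the management cluster):

```
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001   # illustrative value only
```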

Description of problem:

Custom manifest files can be placed in the /openshift folder so that they will be applied during cluster installation.
However, if a file contains more than one manifest, all but the first are ignored.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Create the following custom manifest file in the /openshift folder:

```
apiVersion: v1
kind: ConfigMap
metadata:  
  name: agent-test  
  namespace: openshift-config
data:  
  value: agent-test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test-2
  namespace: openshift-config
data:
  value: agent-test-2
```
2. Create the agent ISO image and deploy a cluster

Actual results:

ConfigMap agent-test-2 does not exist in the openshift-config namespace

Expected results:

ConfigMap agent-test-2 must exist in the openshift-config namespace

Additional info:

 

This is a clone of issue OCPBUGS-3277. The following is the description of the original issue:

I saw this occur one time when running installs in a continuous loop. This was with COMPACT_IPV4 in a non-disconnected setup.

WaitForBootstrapComplete shows it can't access the API

level=info msg=Unable to retrieve cluster metadata from Agent Rest API: no clusterID known for the cluster
level=debug msg=cluster is not registered in rest API
level=debug msg=infraenv is not registered in rest API

This is because create-cluster-and-infraenv.service failed

Failed Units: 2
  create-cluster-and-infraenv.service
  NetworkManager-wait-online.service

The agentbasedinstaller register command wasn't able to retrieve the image to get the version

Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="command 'oc adm release info -o template --template '\{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"
Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="failed to get image openshift version from release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451" error="command 'oc adm release info -o template --template '\{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"

This occurs when attempting to get the release here:
https://github.com/openshift/assisted-service/blob/master/cmd/agentbasedinstaller/register.go#L58

 

We should add a retry mechanism or restart the service to handle spurious network failures like this.
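As a sketch of the retry option, the release-version lookup could be wrapped in a simple retry loop (illustrative only, not the actual service script; the variable names are hypothetical):

```
# sketch: retry the release image query a few times before giving up
for i in 1 2 3 4 5; do
  version=$(oc adm release info -o template --template '{{.metadata.version}}' \
    --registry-config="${REGISTRY_CONFIG}" "${RELEASE_IMAGE}") && break
  sleep 30
done
```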

 

 

This is a clone of issue OCPBUGS-10220. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:

Description of problem:

When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

Consistent

Steps to Reproduce:

1. On a long lived cluster, add a new machineset

Actual results:

Machines reach "Provisioned" but don't join the cluster

Expected results:

Machines join cluster as nodes

Additional info:


We need to rebase openshift-sdn to kube 1.25's kube-proxy.

In particular, we need this to get https://github.com/kubernetes/kubernetes/pull/110334 into master because we will probably get asked to backport it.

This is a clone of issue OCPBUGS-17769. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17568. The following is the description of the original issue:

Description of problem:

 

Customer used the Agent-based installer to install 4.13.8 in their CID env, but during the install process the bootstrap machine had an OOM issue; checking the sosreport we found the init container had an OOM issue.

NOTE: Issue is not seen when testing with 4.13.6, per the customer

initContainers:
- name: machine-config-controller
  image: .Images.MachineConfigOperator
  command: ["/usr/bin/machine-config-controller"]
  args:
  - "bootstrap"
  - "--manifest-dir=/etc/mcc/bootstrap"
  - "--dest-dir=/etc/mcs/bootstrap"
  - "--pull-secret=/etc/mcc/bootstrap/machineconfigcontroller-pull-secret"
  - "--payload-version=.ReleaseVersion"
  resources:
    limits:
      memory: 50Mi

We found in the sosreport dmesg and crio logs that the machine-config-controller container was OOM-killed; the kill was done by the cgroup, so it looks like the 50Mi limit is too small.

The customer used a physical machine that had 100GB of memory

The customer had some network config in the assisted-install YAML file; maybe the issue is that they had some NIC config?

log files:
1. sosreport
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/b5501734-60be-4de4-adcf-da57e22cbb8e?usePresignedUrl=true

2. assisted installer yaml file
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/a32635cf-112d-49ed-828c-4501e95a0e7a?usePresignedUrl=true

3. bootstrap machine oom screenshot
https://attachments.access.redhat.com/hydra/rest/cases/03578865/attachments/eefe2e57-cd23-4abd-9e0b-dd45f20a34d2?usePresignedUrl=true

Description of problem:

The service project and the host project both have a private DNS zone named "ipi-xpn-private-zone". The issue is that although platform.gcp.privateDNSZone.project is set to the host project, the installer checks the zone in the service project and complains that the DNS name does not match.

Version-Release number of selected component (if applicable):

$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-25-210451
built from commit 14d496fdaec571fa97604a487f5df6a0433c0c68
release image registry.ci.openshift.org/ocp/release@sha256:d6cc07402fee12197ca1a8592b5b781f9f9a84b55883f126d60a3896a36a9b74
release architecture amd64

How reproducible:

Always, if both the service project and the host project have a private DNS zone with the same name.

Steps to Reproduce:

1. try IPI installation to a shared VPC, using "privateDNSZone" of the host project

Actual results:

$ openshift-install create cluster --dir test7
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com 
$ 

Expected results:

The installer should check the private zone in the specified project (i.e. the host project).
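A sketch of the intended lookup (field and variable names are illustrative, not the installer's actual identifiers; ManagedZones.Get is the google.golang.org/api/dns/v1 call):

```
// Resolve the private zone in the project the user specified in
// platform.gcp.privateDNSZone.project, not in the service project.
project := installConfig.GCP.ProjectID
zoneName := installConfig.GCP.PrivateDNSZone.ID
if p := installConfig.GCP.PrivateDNSZone.Project; p != "" {
	project = p
}
zone, err := dnsService.ManagedZones.Get(project, zoneName).Do()
```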

Additional info:

$ yq-3.3.0 r test7/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  createFirewallRules: Disabled
  publicDNSZone:
    id: qe-shared-vpc
    project: openshift-qe-shared-vpc
  privateDNSZone:
    id: ipi-xpn-private-zone
    project: openshift-qe-shared-vpc
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ yq-3.3.0 r test7/install-config.yaml baseDomain
qe-shared-vpc.qe.gcp.devcluster.openshift.com
$ yq-3.3.0 r test7/install-config.yaml metadata
creationTimestamp: null
name: jiwei-1027a
$ 
$ openshift-install create cluster --dir test7
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com 
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=qe-shared-vpc'
NAME           DNS_NAME                                        DESCRIPTION  VISIBILITY
qe-shared-vpc  qe-shared-vpc.qe.gcp.devcluster.openshift.com.               public
$ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=ipi-xpn-private-zone'
NAME                  DNS_NAME                                                    DESCRIPTION                         VISIBILITY
ipi-xpn-private-zone  jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
$ gcloud dns managed-zones list --filter='name=ipi-xpn-private-zone'
NAME                  DNS_NAME                                       DESCRIPTION                         VISIBILITY
ipi-xpn-private-zone  jiwei-1026a.qe1.gcp.devcluster.openshift.com.  Preserved private zone for IPI XPN  private
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones describe qe-shared-vpc
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2020-04-26T02:50:25.172Z'
description: ''
dnsName: qe-shared-vpc.qe.gcp.devcluster.openshift.com.
id: '7036327024919173373'
kind: dns#managedZone
name: qe-shared-vpc
nameServers:
- ns-cloud-b1.googledomains.com.
- ns-cloud-b2.googledomains.com.
- ns-cloud-b3.googledomains.com.
- ns-cloud-b4.googledomains.com.
visibility: public
$ 
$ gcloud --project openshift-qe-shared-vpc dns managed-zones describe ipi-xpn-private-zone         
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2022-10-27T08:05:18.332Z'
description: Preserved private zone for IPI XPN
dnsName: jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com.
id: '5506116785330943369'
kind: dns#managedZone
name: ipi-xpn-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
visibility: private
$ 
$ gcloud dns managed-zones describe ipi-xpn-private-zone
cloudLoggingConfig:
  kind: dns#managedZoneCloudLoggingConfig
creationTime: '2022-10-26T06:42:52.268Z'
description: Preserved private zone for IPI XPN
dnsName: jiwei-1026a.qe1.gcp.devcluster.openshift.com.
id: '7663537481778983285'
kind: dns#managedZone
name: ipi-xpn-private-zone
nameServers:
- ns-gcp-private.googledomains.com.
privateVisibilityConfig:
  kind: dns#managedZonePrivateVisibilityConfig
  networks:
  - kind: dns#managedZonePrivateVisibilityConfigNetwork
    networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc
visibility: private
$ 

 

 

Description of problem:

When scaling down the MachineSet for worker nodes, a PV (vmdk) file got deleted.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

N/A

Steps to Reproduce:

1. Scale down worker nodes
2. Check VMware logs and VM gets deleted with vmdk still attached

Actual results:

After scaling down nodes, volumes still attached to the VM get deleted alongside the VM

Expected results:

Worker nodes scaled down without any accidental deletion

Additional info:

 

This is a clone of issue OCPBUGS-12878. The following is the description of the original issue:

We want to add the dual-stack tests to the CNI plugin conformance test suite, for the currently supported releases.

(This has no impact on OpenShift itself. We're just modifying a test suite that OCP does not use.)

Description of problem:

After an egressFW rule is created incorrectly (missing CIDR annotation), no subsequent configuration change will fix it.

How reproducible:

Create a wrong egressFW with missing CIDR annotation, then try to update it

Steps to Reproduce:

1. Create an EgressFW with missing CIDR notations. The rules are not correctly applied:

 

[quickcluster@upi-0 ~]$ oc apply -f egressko.yaml 
egressfirewall.k8s.ovn.org/default created
[quickcluster@upi-0 ~]$ oc get egressfirewalls.k8s.ovn.org 
NAME EGRESSFIREWALL STATUS
default EgressFirewall Rules not correctly added

2. Update the rule, adding the CIDR to correct it. The rules keep not being applied:

 

[quickcluster@upi-0 ~]$ oc apply -f egressok.yaml 
egressfirewall.k8s.ovn.org/default configured
[quickcluster@upi-0 ~]$ oc get egressfirewalls.k8s.ovn.org
NAME EGRESSFIREWALL STATUS
default EgressFirewall Rules not correctly added #

 

3. Try removing the egressFW and creating it from scratch. The status is empty:

 

[quickcluster@upi-0 ~]$ oc delete egressfirewalls.k8s.ovn.org default 
egressfirewall.k8s.ovn.org "default" deleted
[quickcluster@upi-0 ~]$ oc apply -f egressok.yaml 
egressfirewall.k8s.ovn.org/default created
[quickcluster@upi-0 ~]$ oc get egressfirewalls.k8s.ovn.org 
NAME EGRESSFIREWALL STATUS
default

 

4. Last try is deleting the namespace and recreating it. The status keeps empty:

 

[quickcluster@upi-0 ~]$ oc delete project firewall-test
project.project.openshift.io "firewall-test" deleted
[quickcluster@upi-0 ~]$ oc new-project firewall-test
[quickcluster@upi-0 ~]$ oc apply -f egressok.yaml 
egressfirewall.k8s.ovn.org/default created
[quickcluster@upi-0 ~]$ oc get egressfirewalls.k8s.ovn.org -n firewall-test 
NAME EGRESSFIREWALL STATUS
default

 

If the rule is applied in a new namespace, it works fine.

 

[quickcluster@upi-0 ~]$ oc new-project firewall-test1
[quickcluster@upi-0 ~]$ oc apply -f egressok.yaml 
egressfirewall.k8s.ovn.org/default created
[quickcluster@upi-0 ~]$ oc get egressfirewalls.k8s.ovn.org -n firewall-test1
NAME EGRESSFIREWALL STATUS
default EgressFirewall Rules applied

 

Actual results:

The rule is not fixed

Expected results:

The rule is fixed
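For illustration, the corrected egressok.yaml presumably looks something like this (the CIDRs are examples):

```
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
    - type: Allow
      to:
        cidrSelector: 192.168.122.0/24
    - type: Deny
      to:
        cidrSelector: 0.0.0.0/0
```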

This is a clone of issue OCPBUGS-4207. The following is the description of the original issue:

Description of problem:


We added a line to increase debugging verbosity to aid in debugging WRKLDS-540

Version-Release number of selected component (if applicable):

13

How reproducible:

very

Steps to Reproduce:

1.just a revert
2.
3.

Actual results:

Extra debugging lines are present in the openshift-config-operator pod logs

Expected results:

Extra debugging lines no longer in the openshift-config-operator pod logs

Additional info:


This is a clone of issue OCPBUGS-3228. The following is the description of the original issue:

While starting a PipelineRun using the UI, in the process of providing the values on "Start Pipeline", the IBM Power customer (Deepak Shetty from IBM) tried creating credentials under "Advanced options" with "Image Registry Credentials" (Authentication type). When the IBM customer verified the credentials from the Secrets tab (in Workloads), the secret was found in a broken state. A screenshot of the broken secret is attached.

The issue has been observed on OCP4.8, OCP4.9 and OCP4.10.

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-5960.

Description of problem:

Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.

Customer is performing an OCP 4.10.16 IPI BareMetal install and bootstrap node provisions just fine, but when master nodes are booted for provisioning, they are not getting an ipv4 address via dhcp. As such, the install is not moving forward at this point.

Version-Release number of selected component (if applicable):

OCP 4.10.16

How reproducible:

Perform OCP 4.10.16 IPI BareMetal install.

Actual results:

provisioning interface comes up (as evidenced by ipv6 address) but is not getting an ipv4 address via dhcp. OCP install / provisioning fails at this point.

Expected results:

provisioning interface successfully received an ipv4 ip address and successfully provisioned master nodes (and subsequently worker nodes as well.)

Additional info:

As a troubleshooting measure, manually adding an ipv4 ip address did allow the coreos image on the bootstrap node to be reached via curl.

Further, the kernel boot line for the first master node was updated with a static IP address assignment to confirm that the master node would successfully image this way, which further confirms that the issue is the provisioning interface not receiving an IPv4 address from the DHCP server.

Description of problem:

While viewing resource consumption for a specific pod, several graphs are stacked that should not be. For example, cpu/memory limits are a static value and thus should be a static line across a graph. However, when viewing the Kubernetes / Compute Resources / Pod dashboard I see limits stacked above the usage. This applies to both the CPU and Memory Usage graphs on this dashboard. When viewing the graph via inspect, the visualization seems "fixed".

Version-Release number of selected component (if applicable):

OCP 4.11.19

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

As of the current version (4.12), the OpenShift console cannot mix stacked and unstacked metrics on the same chart.
The fix is to unstack metrics on charts that have limit markers such as request, limit, etc.
 

This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:

Description of problem:

Prometheus continuously restarts due to slow WAL replay

Version-Release number of selected component (if applicable):

openshift - 4.11.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Cluster running 4.10.52 had three aws-ebs-csi-driver-node pods begin to consume multiple GB of memory, causing heavy node memory pressure as the pods have no memory limit. 

All other aws-ebs-csi-driver-node pods were still in the 50-70MB range:

NAME                                            CPU(cores)   MEMORY(bytes)   
aws-ebs-csi-driver-controller-59867579b-d6s2q   0m           397Mi           
aws-ebs-csi-driver-controller-59867579b-t4wgq   0m           276Mi           
aws-ebs-csi-driver-node-4rmvk                   0m           53Mi            
aws-ebs-csi-driver-node-5799f                   0m           50Mi            
aws-ebs-csi-driver-node-6dpvg                   0m           59Mi            
aws-ebs-csi-driver-node-6ldzk                   0m           65Mi            
aws-ebs-csi-driver-node-6mbk5                   0m           54Mi            
aws-ebs-csi-driver-node-bkvsr                   0m           50Mi            
aws-ebs-csi-driver-node-c2fb2                   0m           62Mi            
aws-ebs-csi-driver-node-f422m                   0m           61Mi            
aws-ebs-csi-driver-node-lwzbb                   6m           1940Mi          
aws-ebs-csi-driver-node-mjznt                   0m           53Mi            
aws-ebs-csi-driver-node-pczsj                   0m           62Mi            
aws-ebs-csi-driver-node-pmskn                   0m           3493Mi          
aws-ebs-csi-driver-node-qft8w                   0m           68Mi            
aws-ebs-csi-driver-node-v5bpx                   11m          2076Mi          
aws-ebs-csi-driver-node-vn8km                   0m           84Mi            
aws-ebs-csi-driver-node-ws6hx                   0m           73Mi            
aws-ebs-csi-driver-node-xsk7k                   0m           59Mi            
aws-ebs-csi-driver-node-xzwlh                   0m           55Mi            
aws-ebs-csi-driver-operator-8c5ffb6d4-fk6zk     5m           88Mi            

Deleting the pods caused them to recreate, with normal memory consumption levels.

Version-Release number of selected component (if applicable):

4.10.52

How reproducible:

Unknown

Description of problem:

This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2109140 on the odf-console side.
The corresponding PR needs to be merged in console as well.
Please verify this console Jira bug and https://bugzilla.redhat.com/show_bug.cgi?id=2109140 simultaneously. The steps are exactly the same, no difference.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3032. The following is the description of the original issue:

If installation fails at an early stage (e.g. pulling release images, configuring hosts, waiting for agents to come up) there is no indication that anything has gone wrong, and the installer binary may not even be able to connect.
We should at least display what is happening on the console so that users have some avenue to figure out for themselves what is going on.

Description of problem:

metal3 pod does not come up on SNO when creating Provisioning with provisioningNetwork set to Disabled

The issue is that on SNO there is no Machine and no BareMetalHost; the operator looks for Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway.

You can work around this issue by populating provisioningMacAddresses with a dummy address, like this:

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningMacAddresses:
  - aa:aa:aa:aa:aa:aa
  provisioningNetwork: Disabled
  watchAllNamespaces: true

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled

apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningNetwork: Disabled
  watchAllNamespaces: true

Steps to Reproduce:

1.
2.
3.

Actual results:

controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"

Expected results:

metal3 pod should be deployed

Additional info:

This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307
See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599

Description of problem:

We need to include the `openshift_apps_deploymentconfigs_strategy_total` metric in the IO archive file.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` inside of it

Actual results:

 

Expected results:

You should see the `openshift_apps_deploymentconfigs_strategy_total` metric in the `config/metrics` file.
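A quick way to verify (the archive name is illustrative):

```
$ tar xzf insights-archive.tar.gz
$ grep openshift_apps_deploymentconfigs_strategy_total config/metrics
```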

Additional info:

 

This is a clone of issue OCPBUGS-4401. The following is the description of the original issue:

Description of problem:

cluster-policy-controller has unnecessary permissions and is able to operate on all leases in the KCM namespace. This also applies to namespace-security-allocation-controller, which was moved some time ago and does not need the lock mechanism.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 
 
 

 

This is a clone of issue OCPBUGS-3767. The following is the description of the original issue:

Description of problem:

Start maintenance action moved from Nodes tab to Bare Metal Hosts tab

Version-Release number of selected component (if applicable):

Cluster version is 4.12.0-0.nightly-2022-11-15-024309

How reproducible:

100%

Steps to Reproduce:

1. Install Node Maintenance operator
2. Go Compute -> Nodes
3. Start maintenance from 3dots menu of worker-0-0
see https://docs.openshift.com/container-platform/4.11/nodes/nodes/eco-node-maintenance-operator.html#eco-setting-node-maintenance-actions-web-console_node-maintenance-operator

Actual results:

No 'Start maintenance' option

Expected results:

Maintenance started successfully

Additional info:

worked for 4.11

 

 

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-2997.

This is a clone of issue OCPBUGS-3287. The following is the description of the original issue:

Description of problem:

Configure both IPv4 and IPv6 addresses in api/ingress in install-config.yaml, install the cluster using agent-based installer. The cluster provisioned has only IPv4 stack for API/Ingress

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. As description
2.
3.

Actual results:

The cluster provisioned has only IPv4 stack for API/Ingress

Expected results:

The cluster provisioned has both IPv4 and IPv6 for API/Ingress
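For reference, the relevant install-config fragment is of roughly this shape (addresses are illustrative):

```
platform:
  baremetal:
    apiVIPs:
      - 192.168.122.10
      - fd2e:6f44:5dd8::10
    ingressVIPs:
      - 192.168.122.11
      - fd2e:6f44:5dd8::11
```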

Additional info:

 

This is a clone of issue OCPBUGS-10813. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9982. The following is the description of the original issue:

Description of problem:

In assisted-installer flow bootkube service is started on Live ISO, so root FS is read-only. OKD installer attempts to pivot the booted OS to machine-os-content via `rpm-ostree rebase`. This is not necessary since we're already using SCOS in Live ISO.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:
ovnkube-trace fails on hypershift deployments:
https://bugzilla.redhat.com/show_bug.cgi?id=2066891#c8

getDatabaseURIs looks for pods with container ovnkube-master, and those don't exist in hypershift.

https://github.com/ovn-org/ovn-kubernetes/blob/6b8acf05cb6043ebdc42d9d36e700390baabea4a/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L540
~~~
// Returns nbAddress, sbAddress, protocol == "ssl", nil
func getDatabaseURIs(coreclient *corev1client.CoreV1Client, restconfig *rest.Config, ovnNamespace string) (string, string, bool, error) {
	containerName := "ovnkube-master"
	var err error

	found := false
	var podName string

	listOptions := metav1.ListOptions{}
	pods, err := coreclient.Pods(ovnNamespace).List(context.TODO(), listOptions)
	if err != nil {
		return "", "", false, err
	}

	for _, pod := range pods.Items {
		for _, container := range pod.Spec.Containers {
			if container.Name == containerName {
				found = true
				podName = pod.Name
				break
			}
		}
	}
	if !found {
		klog.V(5).Infof("Cannot find ovnkube pods with container %s", containerName)
		return "", "", false, fmt.Errorf("cannot find ovnkube pods with container: %s", containerName)
	}
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

We are seeing Windows-to-Linux networking failures across all PRs.
This is occurring across all clouds.
Example test failure

It seems this could have been due to the downstream merge; the Windows jobs did not pass before the PR was merged.
Job that failed against the downstream merge, but did not prevent it from merging

This is blocking all PRs against the WMCO repo.

This is a clone of issue OCPBUGS-6829. The following is the description of the original issue:

Description of problem:

While/after upgrading to 4.11-2023-01-14, CoreDNS has a problem with UDP overflows, so DNS lookups are very slow and cause the ingress operator upgrade to stall. We needed to work around it with force_tcp following this: https://access.redhat.com/solutions/5984291
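The workaround from that article boils down to forcing TCP for upstream DNS queries in CoreDNS, i.e. a forward stanza roughly like the following (sketch only; the exact way to apply it is described in the linked solution):

```
forward . /etc/resolv.conf {
    force_tcp
}
```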

Version-Release number of selected component (if applicable):

 

How reproducible:

100%, but seems to depend on the network environment (exact cause unknown)

Steps to Reproduce:

1. install cluster with OKD 4.11-2022-12-02 or earlier
2. initiate upgrade to OKD 4.11-2023-01-14
3. upgrade will stall after upgrading CoreDNS

Actual results:

CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size 

Expected results:

 

Additional info:

 

Description of problem:

When using the agent-based installer to zero-touch provision the cluster, if the network bandwidth is low and assisted-service-pod.service or assisted-service.service fails to pull the docker image within the timeout, the create-cluster-and-infraenv, apply-host-config, and start-cluster-installation services will be deactivated due to dependency failure. The process is blocked and requires enabling & starting the services manually.

Version-Release number of selected component (if applicable):

openshift-install 4.11.0
built from commit 863cd1ea823559116e26de327705ed72ccdede8f
release image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
release architecture amd64

How reproducible:

Install Openshift with agent based installer with local mirror.

Steps to Reproduce:

1. Stop the local registry or limit the network bandwidth so that assisted-service-pod.service or assisted-service.service fails to start within the 90s timeout.
2. Start the local registry or manually pull the image on node0.

Actual results:

When using the agent-based installer to zero-touch provision the cluster, if the network bandwidth is low and assisted-service-pod.service or assisted-service.service fails to pull the docker image within the timeout, the create-cluster-and-infraenv, apply-host-config, and start-cluster-installation services are deactivated due to dependency failure. The process is blocked and requires enabling & starting the services manually.

Expected results:

Provision start after the assisted-service started.

Additional info:

Given:
assisted-service-pod.service requires assisted-service-db.service assisted-service.service
assisted-service.service BindsTo=assisted-service-pod.service
create-cluster-and-infraenv.service Requires=assisted-service.service and PartOf=assisted-service-pod.service
apply-host-config.service Requires=create-cluster-and-infraenv.service
start-cluster-installation.service Requires=apply-host-config.service
Requires= "Configures requirement dependencies on other units. If this unit gets activated, the units listed here will be activated as well. If one of the other units gets deactivated or its activation fails, this unit will be deactivated."

When assisted-service-pod.service starts, assisted-service-db.service and assisted-service.service are also started.
Once assisted-service-pod.service fails to start, assisted-service.service also fails to start due to "BindsTo=assisted-service-pod.service".
Then the dependency fails for create-cluster-and-infraenv.service due to Requires=assisted-service.service, whose activation failed, so it is deactivated.
Then the dependency fails for apply-host-config.service due to Requires=create-cluster-and-infraenv.service, whose activation failed, so it is deactivated.
Then the dependency fails for start-cluster-installation.service due to Requires=apply-host-config.service, whose activation failed, so it is deactivated.
Then assisted-service-pod.service restarts, and assisted-service.service and assisted-service-db.service restart as well, since they are bound to assisted-service-pod.service.
However, create-cluster-and-infraenv.service, apply-host-config.service and start-cluster-installation.service were deactivated and need to be enabled and started manually.
Eventually, assisted-service starts and hangs waiting for the infraenv to be created. The provisioning is blocked.
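Until this is fixed, the manual recovery mentioned above amounts to something like:

```
$ systemctl start create-cluster-and-infraenv.service
$ systemctl start apply-host-config.service
$ systemctl start start-cluster-installation.service
```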

Description of problem:

TestEditUnmanagedPodDisruptionBudget flakes in the console-operator e2e

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Flake

Steps to Reproduce:
1. Check https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console-operator/665/pull-ci-openshift-console-operator-master-e2e-aws-operator/1562005782164148224
2.
3.

Actual results:

Expected results:

Additional info:

There is a chance that the PDB instance is not present, since prior to the Unmanaged* test cases the RemoveTest runs, which removes all the console resources (Pods, Services, PDBs, ...).

 

This is a clone of issue OCPBUGS-948. The following is the description of the original issue:

Description of problem:

OLM is setting the "openshift.io/scc" label to "anyuid" on several namespaces:

https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L23
https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L8

this label has no effect and will lead to confusion.  It should be set to emptystring for now (removing it entirely will have no effect on upgraded clusters because the CVO does not remove deleted labels, so the next best thing is to clear the value).

For bonus points, OLM should remove the label entirely from the manifest and add migration logic to remove the existing label from these namespaces to handle upgraded clusters that already have it.
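The short-term change described above would leave each affected namespace manifest looking roughly like this (one namespace shown; see the linked manifest for the real entries):

```
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-operator-lifecycle-manager
  labels:
    openshift.io/scc: ""
```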

Version-Release number of selected component (if applicable):

Not sure how long this has been an issue, but fixing it in 4.12+ should be sufficient.

How reproducible:

always

Steps to Reproduce:

1. install cluster
2. examine namespace labels

Actual results:

label is present

Expected results:


ideally label should not be present, but in the short term setting it to emptystring is the quick fix and is better than nothing.

This is a clone of issue OCPBUGS-3304. The following is the description of the original issue:

Assisted-service can use only one mirror of the release image. In the install-config, the user may specify multiple matching mirrors. Currently the last matching mirror is the one used by assisted-service. This is confusing; we should use the first matching one instead.

This is a clone of issue OCPBUGS-3973. The following is the description of the original issue:

Description of problem:

Upgrade SNO cluster from 4.12 to 4.13, the csi-snapshot-controller is degraded with message (same with log from csi-snapshot-controller-operator): 
E1122 09:02:51.867727       1 base_controller.go:272] StaticResourceController reconciliation failed: ["csi_controller_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-controller-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator", "webhook_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-webhook-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator"]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-19-191518 to 4.13.0-0.nightly-2022-11-19-182111

How reproducible:

1/1

Steps to Reproduce:

Upgrade SNO cluster from 4.12 to 4.13 

Actual results:

csi-snapshot-controller is degraded

Expected results:

csi-snapshot-controller should be healthy

Additional info:

It also happened on from scratch cluster on 4.13: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-single-node/1594946128904720384

Description of problem:

acquiring node lock for assigning ip address, node: %s, ip: %sci-ln-g470i52-1d09d-slz7m-worker-westus-6wt7k10.0.128.102

This is a clone of issue OCPBUGS-13427. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12780. The following is the description of the original issue:

Description of problem:

2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health [-] Component KuryrPortHandler is dead. Last caught exception below: openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 169, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self.k8s.get(f"{constants.K8S_API_NAMESPACES}"
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 121, in get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._raise_from_response(response)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/k8s_client.py", line 99, in _raise_from_response
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     raise exc.K8sResourceNotFound(response.text)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"mygov-tuo-microservice-dev2-59fffbc58c-l5b79\\" not found","reason":"NotFound","details":{"name":"mygov-tuo-microservice-dev2-59fffbc58c-l5b79","kind":"pods"},"code":404}\n'
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health During handling of the above exception, another exception occurred:
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health Traceback (most recent call last):
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/logging.py", line 38, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/retry.py", line 85, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self._handler(event, *args, retry_info=info, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 98, in __call__
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     self.on_finalize(obj, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 184, in on_finalize
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     pod = self._mock_cleanup_pod(kuryrport_crd)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/controller/handlers/kuryrport.py", line 160, in _mock_cleanup_pod
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     host_ip = utils.get_parent_port_ip(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/utils.py", line 830, in get_parent_port_ip
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     parent_port = os_net.get_port(port_id)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_proxy.py", line 1987, in get_port
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return self._get(_port.Port, port)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 48, in check
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     return method(self, expected, actual, *args, **kwargs)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/proxy.py", line 513, in _get
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     resource_type=resource_type.__name__, value=value))
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1472, in fetch
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/network/v2/_base.py", line 26, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     base_path=base_path, params=params)
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health   File "/usr/lib/python3.6/site-packages/openstack/resource.py", line 1156, in _prepare_request
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health     "Request requires an ID but none was found")
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health openstack.exceptions.InvalidRequest: Request requires an ID but none was found
2023-04-20 02:08:09.770 1 ERROR kuryr_kubernetes.controller.managers.health
2023-04-20 02:08:09.918 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping
2023-04-20 02:08:09.919 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworks'
2023-04-20 02:08:10.026 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/machine.openshift.io/v1beta1/machines'
2023-04-20 02:08:10.152 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/pods'
2023-04-20 02:08:10.174 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/networking.k8s.io/v1/networkpolicies'
2023-04-20 02:08:10.857 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/namespaces'
2023-04-20 02:08:10.877 1 WARNING kuryr_kubernetes.controller.drivers.utils [-] Namespace dev-health-air-ids not yet ready: kuryr_kubernetes.exceptions.K8sResourceNotFound: Resource not found: '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnetworks.openstack.org \\"dev-health-air-ids\\" not found","reason":"NotFound","details":{"name":"dev-health-air-ids","group":"openstack.org","kind":"kuryrnetworks"},"code":404}\n'
2023-04-20 02:08:11.024 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/services'
2023-04-20 02:08:11.078 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/api/v1/endpoints'
2023-04-20 02:08:11.170 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrports'
2023-04-20 02:08:11.344 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrnetworkpolicies'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] Stopped watching '/apis/openstack.org/v1/kuryrloadbalancers'
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.watcher [-] No remaining active watchers, Exiting...
2023-04-20 02:08:11.475 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopping

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Create a pod.
2. Stop kuryr-controller.
3. Delete the pod and the finalizer on it.
4. Delete pod's subport.
5. Start the controller.

Actual results:

Crash

Expected results:

Port cleaned up normally.

Additional info:


This is a clone of issue OCPBUGS-11985. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10343. The following is the description of the original issue:

Description of problem:

When deploying hosts using ironic's agent both the ironic service address and inspector address are required.

The ironic service is proxied such that it can be accessed at a consistent endpoint regardless of where the pod is running. This is not the case for the inspection service.

This means that if the inspection service moves after we find the address, provisioning will fail.

In particular this non-matching behavior is frustrating when using the CBO [GetIronicIP function|https://github.com/openshift/cluster-baremetal-operator/blob/6f0a255fdcc7c0e5c04166cb9200be4cee44f4b7/provisioning/utils.go#L95-L127] as one return value is usable forever but the other needs to somehow be re-queried every time the pod moves.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Relatively

Steps to Reproduce:

1. Retrieve the inspector IP from GetIronicIP
2. Reschedule the inspector service pod
3. Provision a host

Actual results:

Ironic python agent raises an exception

Expected results:

Host provisions

Additional info:

This was found while deploying clusters using ZTP

In this scenario specifically an image containing the ironic inspector IP is valid for an extended period of time. The same image can be used for multiple hosts and possibly multiple different spoke clusters.

Our controller shouldn't be expected to watch the ironic pod to ensure we update the image whenever it moves. The best we can do is re-query the inspector IP whenever a user makes changes to the image, but that may still not be often enough.

This is a clone of issue OCPBUGS-13970. The following is the description of the original issue:

If the kubeadmin secret is deleted successfully from the guest cluster but the deletion of the `SecretHashAnnotation` annotation on the oauthDeployment fails, the annotation is not reconciled again and is never removed.

context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1684765042825929

Description of problem:

When an egress firewall is applied in a namespace whose name is longer than 43 characters, ACL names get cropped and all ACLs for the same egress firewall object are considered equivalent. This is a known problem that we already faced for network policies.
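
A minimal, self-contained sketch of the collision (the 43-character limit is taken from the description above; the name layout and the hash-suffix remedy are illustrative assumptions, not the OVN-Kubernetes implementation):

```
package main

import (
	"crypto/sha256"
	"fmt"
)

const maxACLNameLen = 43 // assumed limit for illustration

// naiveACLName mimics building an ACL name from the namespace and cropping
// it at the limit: two long namespaces sharing a prefix collapse into the
// same name.
func naiveACLName(namespace string, priority int) string {
	name := fmt.Sprintf("%s_egressFirewall_%d", namespace, priority)
	if len(name) > maxACLNameLen {
		name = name[:maxACLNameLen]
	}
	return name
}

// hashedACLName keeps the names distinct by replacing the tail with a short
// hash of the full identifier.
func hashedACLName(namespace string, priority int) string {
	full := fmt.Sprintf("%s_egressFirewall_%d", namespace, priority)
	if len(full) <= maxACLNameLen {
		return full
	}
	sum := sha256.Sum256([]byte(full))
	suffix := fmt.Sprintf("-%x", sum[:4])
	return full[:maxACLNameLen-len(suffix)] + suffix
}

func main() {
	nsA := "team-alpha-very-long-namespace-name-for-project-one"
	nsB := "team-alpha-very-long-namespace-name-for-project-two"
	fmt.Println(naiveACLName(nsA, 1) == naiveACLName(nsB, 1))   // true: collision
	fmt.Println(hashedACLName(nsA, 1) == hashedACLName(nsB, 1)) // false: distinct
}
```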

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-15515. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13810. The following is the description of the original issue:

Description of problem

CI is flaky because the TestAWSELBConnectionIdleTimeout test fails. Example failures:

Version-Release number of selected component (if applicable)

I have seen these failures in 4.14 and 4.13 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 1.24% of runs (3.52% of failures) across 404 total runs and 34 jobs (35.15% failed)

This includes two jobs:

  • pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 40 runs, 63% failed, 16% of failures match = 10% impact
  • pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator (all) - 10 runs, 70% failed, 14% of failures match = 10% impact

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestAWSELBConnectionIdleTimeout&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails because it times out waiting for DNS to resolve:

=== RUN   TestAll/parallel/TestAWSELBConnectionIdleTimeout
    operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
    (the same "no such host" lookup failure is logged for every retry until the test times out)
    operator_test.go:2656: failed to observe expected condition: timed out waiting for the condition
    panic.go:522: deleted ingresscontroller test-idle-timeout

The above output comes from build-log.txt from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/917/pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator/1658840125502656512.

Expected results

CI passes, or it fails on a different test.

Description of problem:

Image registry pods panic while deploying OCP in ap-south-2 AWS region

Version-Release number of selected component (if applicable):

4.11.2

How reproducible:

Deploy OCP in AWS ap-south-2 region

Steps to Reproduce:

Deploy OCP in AWS ap-south-2 region 

Actual results:

panic: Invalid region provided: ap-south-2

Expected results:

Image registry pods should come up with no errors

Additional info:

 

 

 

 

 

Description of problem:

failed even trying to "create install-config" in the epic's scenario

Version-Release number of selected component (if applicable):

$ ./openshift-install version
./openshift-install 4.12.0-0.nightly-2022-09-28-204419
built from commit 9eb0224926982cdd6cae53b872326292133e532d
release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. create vpc network, subnets, and a firewall-rule to allow ssh access to the bastion host
2. create the bastion host, with setting a valid service-account and scopes of "https://www.googleapis.com/auth/cloud-platform"
3. scp pull secret to the bastion host
4. ssh to the bastion host (subsequent steps would be on the bastion host, except told explicitly)
5. get "oc", e.g. curl https://mirror2.openshift.com/pub/openshift-v4/clients/ocp/4.9.9/openshift-client-linux-4.9.9.tar.gz -o openshift-client-linux-4.9.9.tar.gz; tar zxvf openshift-client-linux-4.9.9.tar.gz
6. obtain the installation program
7. try "create install-config" of platform "gcp" 

Actual results:

[cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./openshift-install create install-config --dir work                                         
? SSH Public Key /home/cloud-user/.ssh/id_rsa.pub                                                                                       
? Platform gcp                                                                                                                          
INFO Credentials loaded from gcloud CLI defaults                                                                                        
? Project ID OpenShift QE Shared VPC (openshift-qe-shared-vpc)                                                                          
? Region us-west1                                                                                                                       
? Base Domain qe-shared-vpc.qe.gcp.devcluster.openshift.com                                                                             
? Cluster Name jiwei-0930-03                                                                                                            
? Pull Secret [? for help] ******
FATAL failed to fetch Install Config: failed to generate asset "Install Config": credentialsMode: Forbidden: environmental authentication is only supported with Manual credentials mode 
[cloud-user@jiwei-0930-02-rhel8-mirror ~]$ 

Expected results:

"create install-config" should succeed.

Additional info:

 

 

 

 

 

Description of problem
`oc-mirror` hits an error when the docker destination has no namespace for an OCI format mirror

How reproducible:
always

Steps to Reproduce:
1. Copy the operator image with OCI format to localhost:
cat copy.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.11
    packages:
    - name: multicluster-engine
      minVersion: '2.1.1'
      maxVersion: '2.1.2'

`oc-mirror --config copy.yaml oci:///home/ocmirrortest/noo --use-oci-feature --oci-feature-action=copy --continue-on-error`

2. Mirror the operator image with OCI format to a registry without a namespace:
cat mirror.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: oci:///home/ocmirrortest/noo/redhat-operator-index
    packages:
    - name: multicluster-engine
      minVersion: '2.1.1'
      maxVersion: '2.1.2'

`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`

Actual results:
2. Hit error:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`
……
info: Mirroring completed in 30ms (0B/s)
error: mirroring images "localhost:5000//multicluster-engine/mce-operator-bundle@sha256:e7519948bbcd521390d871ccd1489a49aa01d4de4c93c0b6972dfc61c92e0ca2" is not a valid image reference: invalid reference format

Expected results:
2. No error

Additional info:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000/ocmir` works well.
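
The double slash in the failing reference above points at an empty namespace segment being joined into the destination image reference. A minimal, stdlib-only sketch of that failure mode and a guard; the helper names are hypothetical and the joining logic is illustrative, not oc-mirror's actual code:

```
package main

import (
	"fmt"
	"strings"
)

// joinImageRef naively concatenates registry, namespace and repository the
// way a mirror destination might be assembled; an empty namespace yields
// "registry//repo", which is not a valid image reference.
func joinImageRef(registry, namespace, repo string) string {
	return registry + "/" + namespace + "/" + repo
}

// joinImageRefSafe skips empty path segments so "docker://localhost:5000"
// (no namespace) still produces a well-formed reference.
func joinImageRefSafe(parts ...string) string {
	var kept []string
	for _, p := range parts {
		if p != "" {
			kept = append(kept, strings.Trim(p, "/"))
		}
	}
	return strings.Join(kept, "/")
}

func main() {
	fmt.Println(joinImageRef("localhost:5000", "", "multicluster-engine/mce-operator-bundle"))
	// localhost:5000//multicluster-engine/mce-operator-bundle  (invalid)
	fmt.Println(joinImageRefSafe("localhost:5000", "", "multicluster-engine/mce-operator-bundle"))
	// localhost:5000/multicluster-engine/mce-operator-bundle   (valid)
}
```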

Description of problem:
This is a follow up on OCPBUGSM-47202 (https://bugzilla.redhat.com/show_bug.cgi?id=2110570)

While OCPBUGSM-47202 fixes the issue specifically for Set Pod Count, many other actions aren't fixed. When the user updates a Deployment with one of these options and then selects the action again, the old values are still shown.

Version-Release number of selected component (if applicable)
4.8-4.12 as well as master with the changes of OCPBUGSM-47202

How reproducible:
Always

Steps to Reproduce:

  1. Import a deployment
  2. Select the deployment to open the topology sidebar
  3. Click on actions and one of the 4 options to update the deployment with a modal
    1. Edit labels
    2. Edit annotations
    3. Edit update strategy
    4. Edit resource limits
  4. Click on the action again and check if the data in the modal reflects the changes from step 3

Actual results:
Old data (labels, annotations, etc.) was shown.

Expected results:
Latest data should be shown

Additional info:

This is a clone of issue OCPBUGS-723. The following is the description of the original issue:

Description of problem:
I have a customer who created a clusterquota for one of their namespaces; it got created, but the values were not reflected under limits and the namespace details were not displayed.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~

WORKAROUND: They recreated the clusterquota object (cache it off, delete it, create new) after which it displayed values as expected.

In the past, they saw similar behavior on their test cluster. That cluster was heavily utilized, the etcd DB was much larger in size (>2.5Gi), and it had many more objects (at that time, helm secrets were being cached for all deployments and a history of 10 was kept, so etcd was being bombarded).

On this cluster the same "symptom" was noticed, however etcd was nowhere near that size, nor were there nearly as many etcd objects and/or cached helm secrets.

Version-Release number of selected component (if applicable): OCP 4.9

How reproducible: Occurred only twice(once in test and in current cluster)

Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty

Actual results: ClusterQuota is not displaying the values

Expected results: ClusterQuota should display the values

This is a clone of issue OCPBUGS-14180. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14082. The following is the description of the original issue:

Description of problem:

Since `registry.centos.org` was shut down, all the unit tests in oc relying on this registry started failing.

Version-Release number of selected component (if applicable):

all versions

How reproducible:

trigger CI jobs and see unit tests are failing

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-13812. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13718. The following is the description of the original issue:

Description of problem:

IPI install on Azure Stack failed when setting platform.azure.osDisk.diskType as StandardSSD_LRS in install-config.yaml.

When setting controlPlane.platform.azure.osDisk.diskType as StandardSSD_LRS, get error in terraform log and some resources have been created.

level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: expected storage_os_disk.0.managed_disk_type to be one of [Premium_LRS Standard_LRS], got StandardSSD_LRS
level=error
level=error msg=  with azurestack_virtual_machine.bootstrap,
level=error msg=  on main.tf line 107, in resource "azurestack_virtual_machine" "bootstrap":
level=error msg= 107: resource "azurestack_virtual_machine" "bootstrap" {
level=error
level=error

When setting compute.platform.azure.osDisk.diskType as StandardSSD_LRS, compute machines fail to provision

$ oc get machine -n openshift-machine-api
NAME                                     PHASE     TYPE              REGION   ZONE   AGE
jima414ash03-xkq5x-master-0              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-1              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-master-2              Running   Standard_DS4_v2   mtcazs          62m
jima414ash03-xkq5x-worker-mtcazs-89mgn   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-jl5kk   Failed                                      52m
jima414ash03-xkq5x-worker-mtcazs-p5kvw   Failed                                      52m

$ oc describe machine jima414ash03-xkq5x-worker-mtcazs-jl5kk -n openshift-machine-api
...
  Error Message:           failed to reconcile machine "jima414ash03-xkq5x-worker-mtcazs-jl5kk": failed to create vm jima414ash03-xkq5x-worker-mtcazs-jl5kk: failure sending request for machine jima414ash03-xkq5x-worker-mtcazs-jl5kk: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="Storage account type 'StandardSSD_LRS' is supported by Microsoft.Compute API version 2018-04-01 and above" Target="osDisk.managedDisk.storageAccountType"
...

Based on the azure-stack doc[1], the supported disk types on ASH are Premium SSD and Standard HDD. The installer should validate diskType on Azure Stack to avoid the above errors.

[1]https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-managed-disk-considerations?view=azs-2206&tabs=az1%2Caz2#cheat-sheet-managed-disk-differences
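
A minimal sketch of the validation the report asks for; the function name is hypothetical and the allowed list is taken from the Terraform error above (Premium_LRS, Standard_LRS), not from the installer's actual validation code:

```
package main

import (
	"fmt"
	"sort"
)

// supportedAzureStackDiskTypes reflects the disk types accepted by the
// Azure Stack Hub API according to the error output above.
var supportedAzureStackDiskTypes = map[string]bool{
	"Premium_LRS":  true,
	"Standard_LRS": true,
}

// validateAzureStackDiskType returns a descriptive error for unsupported
// disk types so the install can fail fast instead of failing later in
// Terraform or machine provisioning.
func validateAzureStackDiskType(diskType string) error {
	if diskType == "" || supportedAzureStackDiskTypes[diskType] {
		return nil
	}
	var allowed []string
	for t := range supportedAzureStackDiskTypes {
		allowed = append(allowed, t)
	}
	sort.Strings(allowed)
	return fmt.Errorf("platform.azure.osDisk.diskType: Unsupported value: %q: supported values: %v", diskType, allowed)
}

func main() {
	fmt.Println(validateAzureStackDiskType("StandardSSD_LRS")) // rejected with a clear message
	fmt.Println(validateAzureStackDiskType("Premium_LRS"))     // accepted
}
```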

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-16-085836

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml, set platform.azure.osDisk.diskType as StandardSSD_LRS
2. Install IPI cluster on Azure Stack
3.

Actual results:

Installation failed

Expected results:

The installer validates diskType on Azure Stack Cloud and exits with an error message for unsupported disk types

Additional info:

 

Description of problem:

The networks field is unset in the topology of each failureDomain, but platform.vsphere.vcenters is defined.

in install-config.yaml:

    vcenters:
    - server: xxx
      user: xxx
      password: xxx
      datacenters:
      - IBMCloud
      - datacenter-2
    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        datastore: multi-zone-ds-shared
      server: ibmvcenter.vmc-ci.devcluster.openshift.com
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        datastore: multi-zone-ds-shared
      server: ibmvcenter.vmc-ci.devcluster.openshift.com
    - name: us-east-3

Launch installer to create cluster, get panic error

sh-4.4$ ./openshift-install create cluster --dir ipi --log-level debug
DEBUG OpenShift Installer 4.12.0-0.nightly-2022-09-25-071630 
DEBUG Built from commit 1fb1397635c89ff8b3645fed4c4c264e4119fa84 
DEBUG Fetching Metadata...                         
...
DEBUG       Reusing previously-fetched Master Ignition Config 
DEBUG     Generating Master Machines...            
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/machines/vsphere.getDefinedZones(0xc0003bec80)
    /go/src/github.com/openshift/installer/pkg/asset/machines/vsphere/machinesets.go:122 +0x4f8
github.com/openshift/installer/pkg/asset/machines/vsphere.Machines({0xc0011ca0b0, 0xd}, 0xc001080c80, 0xc0005cad50, {0xc000651d10, 0x13}, {0x4ab5773, 0x6}, {0x4ad49bb, 0x10})
    /go/src/github.com/openshift/installer/pkg/asset/machines/vsphere/machines.go:37 +0x250
github.com/openshift/installer/pkg/asset/machines.(*Master).Generate(0xc001118bd0, 0x5?)
 

Field platform.vsphere.failureDomains.topology.networks is not required in the documentation.

sh-4.4$ ./openshift-install explain installconfig.platform.vsphere.failureDomains.topology
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  Topology describes a given failure domain using vSphere constructs

FIELDS:
    computeCluster <string> -required-
      computeCluster as the failure domain This is required to be a path
    datacenter <string> -required-
      datacenter is the vCenter datacenter in which virtual machines will be located and defined as the failure domain.
    datastore <string> -required-
      datastore is the name or inventory path of the datastore in which the virtual machine is created/located.
    folder <string>
      folder is the name or inventory path of the folder in which the virtual machine is created/located.
    networks <[]string>
      networks is the list of networks within this failure domain
    resourcePool <string>
      resourcePool is the absolute path of the resource pool where virtual machines will be created. The absolute path is of the form /<datacenter>/host/<cluster>/Resources/<resourcepool>.
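
A minimal sketch of the defensive check implied by the panic above: treat the optional networks field as possibly empty instead of indexing element 0 unconditionally. The struct and helper here are simplified stand-ins, not the installer's actual types:

```
package main

import "fmt"

// Topology is a simplified stand-in for the vSphere failure-domain topology.
type Topology struct {
	Datacenter     string
	ComputeCluster string
	Datastore      string
	Networks       []string // optional per `openshift-install explain`
}

// primaryNetwork returns the first configured network, or an empty string
// when the optional Networks field is unset, instead of panicking with
// "index out of range [0] with length 0".
func primaryNetwork(t Topology) string {
	if len(t.Networks) == 0 {
		return ""
	}
	return t.Networks[0]
}

func main() {
	t := Topology{
		Datacenter:     "IBMCloud",
		ComputeCluster: "/IBMCloud/host/vcs-mdcnc-workload-2",
		Datastore:      "multi-zone-ds-shared",
	}
	fmt.Printf("network: %q\n", primaryNetwork(t))
}
```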

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always, when setting platform.vsphere.vcenters and leaving platform.vsphere.failureDomains.topology.networks unset.
It works if platform.vsphere.vcenters is not set and platform.vsphere.failureDomains.topology.networks is set.

Steps to Reproduce:

1. configure zones in install-config.yaml, set platform.vsphere.vcenters and unset platform.vsphere.failureDomains.topology.networks
2. install IPI cluster
3.

Actual results:

installer get panic error

Expected results:

installation is successful.

Additional info:

 

Description of problem:

Since the decommissioning of the psi cluster, and the subsequent move of the rhcos release browser, product machine-os-images builds have been failing. See e.g. https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=47565717

Version-Release number of selected component (if applicable):

4.12, 4.11, 4.10.

How reproducible:

Have ART build the image

Steps to Reproduce:

1. Have ART build the image

Actual results:

Build failure

Expected results:

Build successful

Additional info:


Description of problem:

The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether the pod is alive or not. 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s

2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h

$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found

$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h

 

Actual results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io resources are removed for each created pod by the ip-reconciler cronjob.
The "overlapping ranges" are not used. 

Expected results:

The overlappingrangeipreservations.whereabouts.cni.cncf.io should not be removed regardless of if a pod has used an IP in the overlapping ranges.
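
A minimal sketch of the liveness check this expected behavior implies: a reservation should only be released once its owning pod is confirmed gone. The helper name and the way the pod reference is obtained are assumptions; the client-go calls themselves are standard:

```
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// shouldReleaseReservation reports whether a reservation owned by the given
// pod can be garbage-collected: only when that pod no longer exists.
func shouldReleaseReservation(ctx context.Context, client kubernetes.Interface, namespace, podName string) (bool, error) {
	_, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err == nil {
		return false, nil // pod is still alive, keep the reservation
	}
	if apierrors.IsNotFound(err) {
		return true, nil // pod is gone, the reservation can be released
	}
	return false, fmt.Errorf("checking pod %s/%s: %w", namespace, podName, err)
}

func main() {
	// A fake clientset with no pods: the reservation would be releasable.
	ok, err := shouldReleaseReservation(context.Background(), fake.NewSimpleClientset(), "openshift-multus", "example-pod")
	fmt.Println(ok, err) // true <nil>
}
```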

Additional info:

 

This is a clone of issue OCPBUGS-6663. The following is the description of the original issue:

Description of problem:

When running openshift-install agent create image, and the install-config.yaml does not contain platform baremetal settings (except for VIPs), warnings are still generated as below:
DEBUG         Loading Install Config...            
WARNING Platform.Baremetal.ClusterProvisioningIP: 172.22.0.3 is ignored 
DEBUG Platform.Baremetal.BootstrapProvisioningIP: 172.22.0.2 is ignored 
WARNING Platform.Baremetal.ExternalBridge: baremetal is ignored 
WARNING Platform.Baremetal.ExternalMACAddress: 52:54:00:12:e1:68 is ignored 
WARNING Platform.Baremetal.ProvisioningBridge: provisioning is ignored 
WARNING Platform.Baremetal.ProvisioningMACAddress: 52:54:00:82:91:8d is ignored 
WARNING Platform.Baremetal.ProvisioningNetworkCIDR: 172.22.0.0/24 is ignored 
WARNING Platform.Baremetal.ProvisioningDHCPRange: 172.22.0.10,172.22.0.254 is ignored 
WARNING Capabilities: %!!(MISSING)s(*types.Capabilities=<nil>) is ignored 

It looks like these fields are populated with values from libvirt as shown in .openshift_install_state.json:
            "platform": {
                "baremetal": {
                    "libvirtURI": "qemu:///system",
                    "clusterProvisioningIP": "172.22.0.3",
                    "bootstrapProvisioningIP": "172.22.0.2",
                    "externalBridge": "baremetal",
                    "externalMACAddress": "52:54:00:12:e1:68",
                    "provisioningNetwork": "Managed",
                    "provisioningBridge": "provisioning",
                    "provisioningMACAddress": "52:54:00:82:91:8d",
                    "provisioningNetworkInterface": "",
                    "provisioningNetworkCIDR": "172.22.0.0/24",
                    "provisioningDHCPRange": "172.22.0.10,172.22.0.254",
                    "hosts": null,
                    "apiVIPs": [
                        "10.1.101.7",
                        "2620:52:0:165::7"
                    ],
                    "ingressVIPs": [
                        "10.1.101.9",
                        "2620:52:0:165::9"
                    ]

The install-config.yaml used to generate this has the following snippet:
platform:
  baremetal:
    apiVIPs:
    - 10.1.101.7
    - 2620:52:0:165::7
    ingressVIPs:
    - 10.1.101.9
    - 2620:52:0:165::9
additionalTrustBundle: |

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Happens every time

Steps to Reproduce:

1. Use install-config.yaml with no platform baremetal fields except for the VIPs
2. run openshift-install agent create image 

Actual results:

Warning messages are output

Expected results:

No warning messages

Additional info:

 

This is a clone of issue OCPBUGS-5988. The following is the description of the original issue:

Description of problem:

The etcd operator is in a degraded state because one of the masters can't connect.
The master that fails to connect was previously the bootstrap node and was pivoted to a master as part of the assisted-installer installation.

Etcd log:
2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1, https://192.168.127.11:2380, https://192.168.127.11:2379, false
2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap, https://192.168.127.12:2380, https://192.168.127.12:2379, false
2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0, https://192.168.127.10:2380, https://192.168.127.10:2379, false
2023-01-17T23:09:26.541214220Z #### attempt 0
2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-1", peerURLs=[https://192.168.127.11:2380}, clientURLs=[https://192.168.127.11:2379]
2023-01-17T23:09:26.547811132Z       member={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
2023-01-17T23:09:26.547811132Z       member={name="test-infra-cluster-04bf4418-master-0", peerURLs=[https://192.168.127.10:2380}, clientURLs=[https://192.168.127.10:2379]
2023-01-17T23:09:26.547811132Z       target={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
2023-01-17T23:09:26.547846508Z member "https://192.168.127.12:2380" dataDir has been destroyed and must be removed from the cluster

There are couple of problems that we see:
1. For an unknown reason the etcd operator's BootstrapTeardownController fails to start, as it fails to see the "openshift-etcd" namespace even though the logs show it is there.
2023-01-17T21:39:43.323928903Z E0117 21:39:43.323917       1 base_controller.go:272] BootstrapTeardownController reconciliation failed: failed to get bootstrap scaling strategy: failed to get openshift-etcd names

2. The DelayStrategy code was changed by https://github.com/openshift/cluster-etcd-operator/pull/964/files and currently requires 3 healthy members before removing one. This can create issues because etcd and cluster-bootstrap (bootkube) are not synchronized, and nothing actually blocks bootstrap from stopping etcd or blocks the removal of the bootstrap etcd member (at least as I understand the flow).


Version-Release number of selected component (if applicable):

 

How reproducible:

It is a race as far as I understand, but it reproduces fairly reliably in our CI when installing 4.12 nightlies.

Steps to Reproduce:

1.
2.
3.

Actual results:

Etcd is degraded because the third joined master's etcd can't start

Expected results:

Etcd is healthy

Additional info:

 

Description of problem:

Since openshift/cluster-ingress-operator#817 merged, the e2e-aws-operator CI job has been failing for multiple PRs in the cluster-ingress-operator repository.  In particular, the TestScopeChange test has been consistently failing. Example failures:

The operator is repeatedly logging errors like the following in those failing CI jobs:

ERROR    operator.dns_controller    controller/controller.go:121    failed to delete dnsrecord; will retry    \{"dnsrecord": {"metadata":{"name":"scope-wildcard","namespace":"openshift-ingress-operator","uid":"2cb9936f-d6a0-4377-b3ed-c5167c5e9e4d","resourceVersion":"42217","generation":2,"creationTimestamp":"2022-10-13T16:19:23Z","deletionTimestamp":"2022-10-13T16:20:27Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"scope"},"ownerReferences":[\{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"scope","uid":"713ac1c5-451b-42d1-89fd-c3910eb80fe3","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":\{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":\{".":{},"k:\{\"uid\":\"713ac1c5-451b-42d1-89fd-c3910eb80fe3\"}":{}}},"f:spec":\{".":{},"f:dnsManagementPolicy":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}}}},\{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}},"subresource":"status"}]},"spec":\{"dnsName":"*.scope.ci-op-x1j7dsgt-43abb.origin-ci-int-aws.dev.rhcloud.com.","targets":["af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"},"status":\{"zones":[{"dnsZone":{"tags":{"Name":"ci-op-x1j7dsgt-43abb-45zhd-int","kubernetes.io/cluster/ci-op-x1j7dsgt-43abb-45zhd":"owned"}},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:23Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},\{"dnsZone":{"id":"Z2GYOLTZHS5VK"},"conditions":[\{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:24Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com", "errorCauses": [\{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}, \{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}]}}}

The scope-wildcard dnsrecord is created for the TestScopeChange test.

Using search.ci, it seems that the failures occurred many times on #817 before it merged and then started occurring for the other PRs after #817 merged.

I filed a PR, openshift/cluster-ingress-operator#838, that reverts #817. I have run the e2e-aws-operator CI job on this PR twice. While the job has failed both times, the TestScopeChange test did not fail either time.

At this point, we have strong evidence that #817 is causing TestScopeChange to fail.

Grant Spence did some testing and determined that there is some interaction between TestAllowedSourceRangesStatus and TestScopeChange. It may suffice to serialize some tests (TestScopeChange is currently a parallel test, as are TestAllowedSourceRangesStatus and two other tests that #817 adds).

If the problem cannot be resolved by serializing tests, it may be necessary to revert #817 to unblock CI.

Note that this issue is blocking NE-942, NE-1072, and NE-682, as well as any bugfix PRs for the master branch in openshift/cluster-ingress-operator.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Consistently.

Steps to Reproduce:

1. Run CI on a PR against the master branch of cluster-ingress-operator.

Actual results:

The TestScopeChange test fails as described.

Expected results:

TestScopeChange should not fail.

 

 

Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320

It appears the guard pod names are too long and are being truncated down to the point where they collide with those from the other masters.

From kubelet logs in this run:

❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346    1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671    1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041    1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"

This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.

Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98

Unsure of the implications here, it looks a little scary.
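
A minimal, self-contained sketch of the collision and of truncating further to keep the names distinguishable. The 63-character limit matches the kubelet messages above; the hash-suffix scheme is illustrative, not the library-go fix:

```
package main

import (
	"crypto/sha256"
	"fmt"
)

const hostnameMaxLen = 63 // kubelet's limit, per the log messages above

// truncate mirrors the behavior in the kubelet logs: chop at 63 characters,
// which makes the three per-master guard pod names identical.
func truncate(name string) string {
	if len(name) <= hostnameMaxLen {
		return name
	}
	return name[:hostnameMaxLen]
}

// truncateWithSuffix keeps a short hash of the full name so per-node guard
// pods remain distinguishable after truncation.
func truncateWithSuffix(name string) string {
	if len(name) <= hostnameMaxLen {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	suffix := fmt.Sprintf("-%x", sum[:3])
	return name[:hostnameMaxLen-len(suffix)] + suffix
}

func main() {
	names := []string{
		"openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0",
		"openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1",
		"openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2",
	}
	for _, n := range names {
		// plain truncation collapses all three; hashed truncation keeps them apart
		fmt.Printf("%s -> %s\n", truncate(n), truncateWithSuffix(n))
	}
}
```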

This is a clone of issue OCPBUGS-5458. The following is the description of the original issue:

reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479

Description of problem:

Hey guys, I have a openshift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. and it gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-15720. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14874. The following is the description of the original issue:

Description of problem:

Deploying a helm chart that features a values.schema.json using either the 2019-09 or the 2020-12 (latest) revision of JSON Schema results in the UI hanging on create with three loading dots... This is not the case if the YAML view is used, since I suppose that view does not try to be clever and lets Helm validate the chart values against the schema itself.

Version-Release number of selected component (if applicable):

Reproduced in 4.13, probably affects other versions as well.

How reproducible:

100%

Steps to Reproduce:

1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: 'https://raw.githubusercontent.com/tumido/helm-backstage/blog2'

4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.

Actual results:

Expected results:

It installs and deploys the chart

Additional info:

This is caused by a JSON Schema containing a $schema key pointing which revision of the JSON Schema standard should be used:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
}

I've managed to trace this back to this react-jsonschema-form issue:

https://github.com/rjsf-team/react-jsonschema-form/issues/2241

It seems the library used here for validation doesn't support the 2019-09 draft or the most recent 2020-12 revision.

It happens only if the chart follows the JSON Schema standard and declares the revision properly.

Workarounds:

IMO best solution:
Helm form renderer should NOT do any validation, since it can't handle the schema properly. Instead, it should leave this job to the Helm backend. Helm validates the values against the schema when installing the chart anyways. The YAML view also does no validation. That one seems to do the job properly.
 
Currently, there is no formal requirement for charts admitted to the helm curated catalog stating that only an older JSON Schema revision (now 4 years old) is supported and that the two later revisions are not.

Also, the Form UI should not just hang on submit. Instead, it should at least fail gracefully.

 

Related to:

https://github.com/janus-idp/helm-backstage/issues/64#issuecomment-1587678319

This is a clone of issue OCPBUGS-16390. The following is the description of the original issue:

When running an agent based deployment, and in particular launching `openshift-install agent create image`, setting the network type to Contrail in the install config works fine, but then creating the cluster and infraenv fails with the following:

```

May 13 18:25:24 agent-based-0 create-cluster-and-infraenv[3396]: time="2023-05-13T18:25:24Z" level=fatal msg="Failed to register cluster with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}"

```

 

The code creating the cluster should instead:

  • create it with OVNKubernetes
  • update the install config by setting Contrail as network type

 

 

Description of problem:
When opening the Devfile sample developer catalog, switching the project in another browser tab, and then opening a devfile sample link in a new tab, the current project context gets lost.

Version-Release number of selected component (if applicable):
4.12, expecting that this happen also in older versions

How reproducible:
Always

Steps to Reproduce:
1. Switch to the developer perspective, navigate to Add > Samples
2. Open a new browser tab and create a new project
3. Ctrl+click a sample in the first tab.

Actual results:
The project has also changed in the "Import sample" page

Expected results:
The project should be used also for the new "Import sample" page

Additional info:
We had this issue earlier for other catalog entries. Other samples works already fine, just the Devfile sample links doesn't contain the current namespace.

Description of problem:

There were 4 ingress-controllers and a total of 15 routes. On the web console, try to query "route_metrics_controller_routes_per_shard" in the Observe >> Metrics page. The stat for 3 of the ingress-controllers is 15, and it is 1 for the last ingress-controller

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-23-154914

How reproducible:

Create pods, services, ingress-controllers, routes, then check  "route_metrics_controller_routes_per_shard" on web console

Steps to Reproduce:

1. get cluster's base domain
% oc get dnses.config/cluster -oyaml | grep -i domain
  baseDomain: shudi-412gcpop36.qe.gcp.devcluster.openshift.com

2. create 3 additional ingress-controllers
% oc -n openshift-ingress-operator get ingresscontroller
NAME         AGE
default      7h5m
extertest3   120m
internal1    120m
internal2    120m
% 

3. check the spec of the 4 ingress-controllers
a, default

b, extertest3
spec:
  domain: extertest3.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: External
    type: LoadBalancerService
c, internal1
spec:
  domain: internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
d, internal2
spec:
  domain: internal2.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
  routeSelector:
    matchLabels:
      shard: alpha

4. check the route, there are 15 routes
% oc get route -A | awk '{print $3}'
HOST/PORT
oauth-openshift.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
downloads-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
canary-openshift-ingress-canary.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
alertmanager-main-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-federate-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
thanos-querier-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
edge1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
int1reen2-test.internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
pass1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
reen1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
service-unsecure-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
int1edge2-test.internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
test.shudi.com
%

% oc get route -A | awk '{print $3}' | grep apps.shudi
oauth-openshift.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
downloads-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
canary-openshift-ingress-canary.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
alertmanager-main-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-federate-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
thanos-querier-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
edge1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
pass1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
reen1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
service-unsecure-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
%

% oc get route -A | awk '{print $3}' | grep apps.shudi | wc -l
      12
% oc get route -A | awk '{print $3}' | grep internal1 | wc -l 
       2
% oc get route -A | awk '{print $3}' | grep shudi.com | wc -l
       1
%

5. only route unsvc5 had the shard=alpha label
 % oc get route unsvc5  -oyaml | grep labels: -A2
  labels:
    name: unsvc5
    shard: alpha
 % oc get route unsvc5 -oyaml | grep spec: -A1
  spec:
    host: test.shudi.com

6. login to the web console (https://console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com/monitoring/query-browser), then navigate to Observe >> Metrics

7. input"route_metrics_controller_routes_per_shard ", then click the "Run queries" button. As the attached picture showed:
name                           value
default                        15
extertest3                     15
internal1                      15      
internal2                      1

8. Also there was a minor issue: As the attached picture showed, there were two name columns in the header line

Name                                           name      value                              
route_metrics_controller_routes_per_shard     default    15
route_metrics_controller_routes_per_shard     extertest3 15
route_metrics_controller_routes_per_shard     internal1  15
route_metrics_controller_routes_per_shard     internal2  1

Actual results:

name                         value
default                      15
extertest3                   15 
internal1                    15
internal2                    1

Expected results:

name                         value
default                      12
extertest3                   0
internal1                    2 
internal2                    1

Additional info:

 

An RW mutex was introduced to the project auth cache with https://github.com/openshift/openshift-apiserver/pull/267, taking exclusive access during cache syncs. On clusters with extremely high object counts for namespaces and RBAC, syncs appear to be extremely slow (on the order of several minutes). The project LIST handler acquires the same mutex in shared mode as part of its critical path.
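
A minimal, self-contained sketch of the contention pattern: a sync that holds the write lock for its whole pass stalls every LIST call that needs the read lock. The names are illustrative, not the openshift-apiserver code:

```
package main

import (
	"fmt"
	"sync"
	"time"
)

type projectCache struct {
	mu       sync.RWMutex
	projects []string
}

// sync simulates a slow full resync: exclusive access is held for the whole
// pass, so concurrent list calls stall until it finishes.
func (c *projectCache) sync(d time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(d) // stand-in for walking huge namespace/RBAC object counts
	c.projects = []string{"project-a", "project-b"}
}

// list is the read path used by the project LIST handler; it only needs
// shared access but still waits out any in-flight sync.
func (c *projectCache) list() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.projects...)
}

func main() {
	c := &projectCache{}
	go c.sync(2 * time.Second)
	time.Sleep(100 * time.Millisecond) // let the sync grab the write lock

	start := time.Now()
	_ = c.list()
	fmt.Printf("LIST waited %v for the sync to release the lock\n", time.Since(start).Round(100*time.Millisecond))
}
```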

This is a clone of issue OCPBUGS-12910. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12904. The following is the description of the original issue:

Description of problem:

In order to test proxy installations, the CI base image for OpenShift on OpenStack needs netcat.

This is a clone of issue OCPBUGS-12186. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6770. The following is the description of the original issue:

When displaying my pipeline it is not rendered correctly with overlapping segments between parallel branches. However if I edit the pipeline then it appears fine. I have attached screenshots showing the issue.

This is a regression from 4.11 where it rendered fine.

This is a clone of issue OCPBUGS-3085. The following is the description of the original issue:

Description of problem:

IPI on BareMetal Dual stack deployment failed and Bootstrap timed out before completion

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Always

Steps to Reproduce:

1. Deploy IPI on BM using Dual stack 
2.
3.

Actual results:

Deployment failed

Expected results:

Should pass

Additional info:

Same deployment works fine on 4.11

This is a clone of issue OCPBUGS-4874. The following is the description of the original issue:

OCPBUGS-3278 is supposed to fix the issue where the user was required to provide data about the baremetal hosts (including MAC addresses) in the install-config, even though this data is ignored.

However, we determine whether we should disable the validation by checking the second CLI arg to see if it is agent.

This works when the command is:

openshift-install agent create image --dir=whatever

But it fails when the arguments are ordered differently, e.g. as in dev-scripts:

openshift-install --log-level=debug --dir=whatever agent create image
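A minimal sketch of why the positional check breaks and what a flag-tolerant scan would look like; the helper name here is hypothetical and this is not the installer's actual code.

~~~
// With "openshift-install --log-level=debug --dir=whatever agent create image",
// os.Args[1] is "--log-level=debug", so checking the second CLI arg misses "agent".
// This sketch skips "--flag=value" style arguments and inspects the first
// non-flag token instead (hypothetical helper, not the installer's code).
package main

import (
	"fmt"
	"strings"
)

func isAgentCommand(args []string) bool {
	for _, a := range args[1:] {
		if strings.HasPrefix(a, "-") {
			continue // skip flags such as --log-level=debug or --dir=whatever
		}
		return a == "agent" // first positional token is treated as the subcommand
	}
	return false
}

func main() {
	fmt.Println(isAgentCommand([]string{"openshift-install", "agent", "create", "image"}))
	fmt.Println(isAgentCommand([]string{"openshift-install", "--log-level=debug", "--dir=x", "agent", "create", "image"}))
}
~~~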

Description of problem:

The path used by --rotated-pod-logs to gather the rotated pod logs from the /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods, not for static pods.

The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).
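As an illustration of a workaround, a small Go sketch (hypothetical helper, not the oc code) that matches log directories by the pod-name prefix, which covers both regular pods (UID suffix) and static pods (config-hash suffix).

~~~
// Hypothetical helper, not the oc code: match /var/log/pods entries by the
// ${POD_NAME}_ prefix so both ${POD_NAME}_${UID} and ${POD_NAME}_${CONFIG_HASH}
// directories are found.
package main

import (
	"fmt"
	"path"
	"strings"
)

func logDirsForPod(entries []string, podName string) []string {
	prefix := podName + "_"
	var dirs []string
	for _, e := range entries {
		if strings.HasPrefix(e, prefix) {
			dirs = append(dirs, path.Join("/var/log/pods", e))
		}
	}
	return dirs
}

func main() {
	// example listing as it might be returned by /api/v1/nodes/<node>/proxy/logs/pods/
	entries := []string{"etcd-master-0.example.net_0123abcd"}
	fmt.Println(logDirsForPod(entries, "etcd-master-0.example.net"))
}
~~~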

The visible results of that are:

  • Spurious errors of not found resources related to the pods.
  • Rotated pod logs are not gathered even if present.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always if there are static pods.

Steps to Reproduce:

1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).

Actual results:

  • Rotated pods not gathered.
  • Errors like these
    error: errors occurred while gathering data:
        one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd
    
        [one or more errors occurred while gathering container data for pod etcd-master-0.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net:
    
        the server could not find the requested resource]
    

Expected results:

No errors like the ones above and rotated pod logs to be gathered, if present.

Additional info:

Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.

This is a clone of issue OCPBUGS-4906. The following is the description of the original issue:

These commented out tests https://github.com/openshift/origin/blob/master/test/extended/testdata/cmd/test/cmd/templates.sh#L130-L149 are problematic, because they are testing rather important functionality of cross-namespace template processing.

This problem recently escalated after landing k8s 1.25, where there was a suspicion that the new version of kube-apiserver removed that functionality. We need to bring back this test, as well as similar tests which touch the login functionality. https://github.com/openshift/origin/blob/master/test/extended/testdata/cmd/test/cmd/authentication.sh is another similar test being skipped for similar reasons.

Based on my search: https://github.com/openshift/origin/blob/master/test/extended/oauth/helpers.go#L18 we could deploy a Basic Auth provider (i.e. password based) and group all tests relying on this functionality under a single umbrella.

The biggest question to answer is how we can properly deal with multiple IdentityProviders, so I'd suggest reaching out to Auth team for help.

The second problem that was identified is variation across cloud providers, so we've agreed to run this test initially only on AWS and GCP.

This is a clone of issue OCPBUGS-5136. The following is the description of the original issue:

Description of problem:

Provisioning on ilo4-virtualmedia BMC driver fails with error: "Creating vfat image failed: Unexpected error while running command"

Version-Release number of selected component (if applicable):

4.13 (but will apply to older OpenShift versions too)

How reproducible:

Always

Steps to Reproduce:

1.configure some nodes with ilo4-virtualmedia://
2.attempt provisioning
3.

Actual results:

provisioning fails with error similar to  Failed to inspect hardware. Reason: unable to start inspection: Validation of image href https://10.1.235.67:6183/ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso failed, reason: HTTPSConnectionPool(host='10.1.235.67', port=6183): Max retries exceeded with url: /ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)')))

Expected results:

Provisioning succeeds

Additional info:

This happens after a preceding issue with missing iLO driver configuration has been fixed (https://github.com/metal3-io/ironic-image/pull/402)

Description of problem:

We have 6 runs of techpreview jobs where the jobs fail due to the MCO:

{Operator degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci.test-2022-09-21-183414-ci-op-qd6plyhc-latest: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)] Operator degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci.test-2022-09-21-183414-ci-op-qd6plyhc-latest: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)]}
 

 

Looking at the MCD logs, the master seems to go degraded in bootstrap due to the rendered config not being found?

 
I0921 18:49:47.091804 8171 daemon.go:444] Node ci-op-qd6plyhc-6dd9a-bfmjd-master-1 is part of the control plane I0921 18:49:49.213556 8171 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ci-op-qd6plyhc-6dd9a-bfmjd-master-1: map[csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/openshift-gce-devel-ci-2/zones/us-central1-b/instances/ci-op-qd6plyhc-6dd9a-bfmjd-master-1"}
volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json I0921 18:49:49.215186 8171 node.go:45] Setting initial node config: rendered-master-2dde32327e4e5d15092fccbac1dcec49 I0921 18:49:49.253706 8171 daemon.go:1184] In bootstrap mode E0921 18:49:49.254046 8171 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-2dde32327e4e5d15092fccbac1dcec49" not found I0921 18:49:51.232610 8171 daemon.go:499] Transitioned from state: Done -> Degraded I0921 18:49:51.249618 8171 daemon.go:1184] In bootstrap mode E0921 18:49:51.249906 8171 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-2dde32327e4e5d15092fccbac1dcec49" not found

However, looking at the controller, a rendered config was generated correctly, but it's not the missing config from above:

I0921 18:54:06.736984 1 render_controller.go:506] Generated machineconfig rendered-master-acc8491aafab8ef511a40b76372325ee from 6 configs: [{MachineConfig 00-master machineconfiguration.openshift.io/v1 } {MachineConfig 01-master-container-runtime machineconfiguration.openshift.io/v1 } {MachineConfig 01-master-kubelet machineconfiguration.openshift.io/v1 } {MachineConfig 98-master-generated-kubelet machineconfiguration.openshift.io/v1 } {MachineConfig 99-master-generated-registries machineconfiguration.openshift.io/v1 } {MachineConfig 99-master-ssh machineconfiguration.openshift.io/v1 }] I0921 18:54:06.737226 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"b2084ca6-4b33-46bf-b83b-9e98010ff085", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"5648", FieldPath:""}): type: 'Normal' reason: 'RenderedConfigGenerated' rendered-master-acc8491aafab8ef511a40b76372325ee successfully generated (release version: 4.12.0-0.ci.test-2022-09-21-183220-ci-op-9ksj7d7g-latest, controller version: a627415c240b4c7dd2f9e90f659690d9c0f623f3) I0921 18:54:06.742053 1 render_controller.go:532] Pool master: now targeting: rendered-master-acc8491aafab8ef511a40b76372325ee

 

So far I see this in the following techpreview jobs:
GCP techpreview
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-gcp-sdn-techpreview/1572638837954318336
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-gcp-sdn-techpreview-serial/1572638838793179136

Vsphere techpreview
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-techpreview/1572638854794448896
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-techpreview-serial/1572638855574589440

AWS Techpreview:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-aws-sdn-techpreview/1572638828672323584
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-aws-sdn-techpreview-serial/1572638829217583104

 

The above jobs affect the k8s 1.25 bump and are blocking the job.

There are also other occurrences not in our PR:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/31965/rehearse-31965-pull-ci-openshift-openshift-controller-manager-master-openshift-e2e-aws-builds-techpreview/1572581504297472000

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_builder/307/pull-ci-openshift-builder-master-e2e-aws-builds-techpreview/1572599746021822464

 

Also see a quick search:
https://search.ci.openshift.org/?search=timed+out+waiting+for+the+condition%2C+error+pool+master+is+not+ready&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Did something change that would affect tech preview jobs?

Also note, this seems like a new failure. I have some of these jobs passing in the last ~ 8 days.

Prow job example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/824/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1567689639479742464

Test output:

=== RUN TestAll/serial/TestCanaryRoute
canary_test.go:78: failed to create pod openshift-ingress-canary/canary-route-check: pods "canary-route-check" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "curl" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "curl" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "curl" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "curl" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
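For reference, a short Go sketch of a container securityContext that satisfies the fields listed in the failure above; this is illustrative only and not the canary test's actual fix.

~~~
// Illustrative only: a securityContext covering the fields the "restricted"
// PodSecurity profile complains about in the failure above.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func restrictedSecurityContext() *corev1.SecurityContext {
	runAsNonRoot := true
	allowPrivilegeEscalation := false
	return &corev1.SecurityContext{
		AllowPrivilegeEscalation: &allowPrivilegeEscalation, // allowPrivilegeEscalation=false
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"}, // drop ["ALL"]
		},
		RunAsNonRoot: &runAsNonRoot, // runAsNonRoot=true
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault, // seccompProfile.type=RuntimeDefault
		},
	}
}

func main() {
	fmt.Println(restrictedSecurityContext())
}
~~~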

Description of problem:

When enabling OvS HWOL on 4.12.0 nightly, traffic does not pass between pods.

Version-Release number of selected component (if applicable):

4.12.0 nightly

How reproducible:

Always

Steps to Reproduce:

1. Create 2 pods with sriov and try to ping between them (same node or different node)

Actual results:

No Traffic Passes (Ping or other)

Expected results:

Traffic Passes (Ping or other)

Additional info:

Missing this commit in 4.12 branch
https://github.com/openshift/ovn-kubernetes/commit/37c6c1d7039fd4c8f3cca560691a254e720172de

This is a clone of issue OCPBUGS-17430. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13825. The following is the description of the original issue:

Description of problem:
As part of Chaos Monkey testing we tried to delete the machine-config-controller pod in SNO+1. Restarting the machine-config-controller pod results in a restart of the daemonset/sriov-network-config-daemon and linuxptp-daemon pods as well.

      

1m47s       Normal   Killing            pod/machine-config-controller-7f46c5d49b-w4p9s    Stopping container machine-config-controller
1m47s       Normal   Killing            pod/machine-config-controller-7f46c5d49b-w4p9s    Stopping container oauth-proxy

openshift-sriov-network-operator   23m         Normal   Killing            pod/sriov-network-config-daemon-pv4tr   Stopping container sriov-infiniband-cni
openshift-sriov-network-operator   23m         Normal   SuccessfulDelete   daemonset/sriov-network-config-daemon   Deleted pod: sriov-network-config-daemon-pv4tr 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Steps to Reproduce:

Restart the machine-config-controller pod in openshift-machine-config-operator namespace. 
1. oc get pod -n openshift-machine-config-operator 
2. oc delete  pod/machine-config-controller-xxx -n openshift-machine-config-operator

Actual results:

It restarts the daemonset/sriov-network-config-daemon and linuxptp-daemon pods.

Expected results:

It should not restart these pods.

Additional info:

logs : https://drive.google.com/drive/folders/1XxYen8tzENrcIJdde8sortpyY5ZFZCPW?usp=share_link

This is a clone of issue OCPBUGS-17876. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16374. The following is the description of the original issue:

Description of problem:

The topology page crashes.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Visit developer console
2. Topology view
3.

Actual results:

Error message:
TypeError
Description:
e is null
Component trace:
f@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~app/code-refs/actions~delete-revision~dev-console-add~dev-console-deployImage~dev-console-ed~cf101ec3-chunk-5018ae746e2320e4e737.min.js:26:14244
5363/t.a@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:177913
u@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:275718
8248/t.a<@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:475504
i@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:470135
withFallback()
5174/t.default@https://console-openshift-console.apps.cl2.cloud.local/static/dev-console-topology-chunk-492be609fb2f16849dfa.min.js:1:78258
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:237096
[...]
ne<@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592411
r@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:36:125397
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:58042
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:60087
t@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:21:54647
re@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1592722
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:791129
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:1062384
s@https://console-openshift-console.apps.cl2.cloud.local/static/main-chunk-378881319405723c0627.min.js:1:613567
t.a@https://console-openshift-console.apps.cl2.cloud.local/static/vendors~main-chunk-12b31b866c0a4fea4c58.min.js:141:244663

Expected results:

No error should be there

Additional info:

Cloud Pak Operator is installed 

This is a clone of issue OCPBUGS-3426. The following is the description of the original issue:

Description of problem:

We need to update the operator to be synced with the K8s API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update.

However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.
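A minimal Go sketch of the idempotent-teardown idea (hypothetical names, not ovn-kubernetes code): treat the delete event as authoritative even for completed pods, so a missed update cannot leak the port and IP.

~~~
// Hypothetical names, not ovn-kubernetes code: always attempt teardown on delete
// and make teardown idempotent, so a missed "completed" update cannot leave the
// logical switch port and IP allocated.
package main

import "fmt"

type controller struct {
	released map[string]bool // pods whose port/IP have already been released
}

func (c *controller) teardown(pod string) {
	if c.released[pod] {
		return // already cleaned up by the update handler; nothing to do
	}
	// release the OVN logical switch port and return the IP to the allocator here
	c.released[pod] = true
}

// onDelete no longer skips pods that are marked completed.
func (c *controller) onDelete(pod string) {
	c.teardown(pod)
}

func main() {
	c := &controller{released: map[string]bool{}}
	c.onDelete("demo-pod")
	fmt.Println(c.released["demo-pod"])
}
~~~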

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:

Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.

Steps to Reproduce:

1.
2.
3.

Actual results:

Port still exists in OVN, IP remains allocated for a deleted pod.

Expected results:

IP should be freed, port should be removed from OVN.

Additional info:

 

This is a copy of OCPBUGS-6784 for backport to 4.12.

Original Text:

Description of problem:

SNO installation performed with the assisted-installer failed 

Version-Release number of selected component (if applicable):

4.10.32
# oc get co authentication -o yaml
- lastTransitionTime: '2023-01-30T00:51:11Z'
    message: 'IngressStateEndpointsDegraded: No subsets found for the endpoints of
      oauth-server      OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs"
      not found      OAuthServerDeploymentDegraded: 1 of 1 requested instances are unavailable for
      oauth-openshift.openshift-authentication (container is waiting in pending oauth-openshift-58b978d7f8-s6x4b
      pod)      OAuthServerRouteEndpointAccessibleControllerDegraded: secret "v4-0-config-system-router-certs"

# oc logs ingress-operator-xxx-yyy -c ingress-operator 
2023-01-30T08:14:13.701799050Z 2023-01-30T08:14:13.701Z ERROR   operator.certificate_publisher_controller       certificate-publisher/controller.go:80  failed to list ingresscontrollers for secret    {"related": "", "error": "Index with name field:defaultCertificateName does not exist"}

Restarting the ingress-operator pod helped fix the issue, but a permanent fix is required.

The Bug (https://bugzilla.redhat.com/show_bug.cgi?id=2005351) was filed earlier but closed due to inactivity.

Description of problem:

Metrics page is broken

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.0 on 4.12

How reproducible:

Always

Steps to Reproduce:

1. Install Openshift Pipelines 1.9.0
2. Create a pipeline and run it several times
3. Update metrics.pipelinerun.duration-type and metrics.taskrun.duration-type to lastvalue
4. Navigate to created pipeline 
5. Switch to Metrics tab

Actual results:

The Metrics page is showing an error.

Expected results:

Metrics of the pipeline should be shown

Additional info:

 

Description of problem:
If the cluster install failed and no tag is attached to the VM, running ./openshift-install destroy cluster gets stuck; for details please see openshift-install.log
...
time="2022-09-28T08:19:14-04:00" level=debug msg="Delete Folder"
time="2022-09-28T08:19:14-04:00" level=debug msg="Find attached Folder on tag"
time="2022-09-28T08:19:15-04:00" level=debug msg="Folder: Expected Folder sgao-rtf6v to be empty"
time="2022-09-28T08:19:25-04:00" level=debug msg="Power Off Virtual Machines"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Delete Virtual Machines"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Delete Folder"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached Folder on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Folder: Expected Folder sgao-rtf6v to be empty"
time="2022-09-28T08:19:35-04:00" level=debug msg="Power Off Virtual Machines"
time="2022-09-28T08:19:35-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:35-04:00" level=debug msg="Delete Virtual Machines"
time="2022-09-28T08:19:35-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:35-04:00" level=debug msg="Delete Folder"

Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630

How reproducible:
Always, when the cluster install failed and no tag is attached to the VM

Steps to Reproduce:
1. cluster install failed and no tag attached to vm
2. run ./openshift-install destroy cluster
3.

Actual results:
Installer destroy gets stuck.

Expected results:
Installer destroy should set a timeout and be able to quit in such a situation.

Additional info:
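As additional context, a minimal Go sketch (not the installer's code) of bounding such a polling loop with a deadline, so destroy can give up instead of retrying forever when no tagged resources are ever found.

~~~
// Not the installer's code: bound the tag-search/delete polling loop with a
// context deadline so it exits instead of looping forever.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func destroyLoop(ctx context.Context, remaining func() (int, error)) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		n, err := remaining()
		if err == nil && n == 0 {
			return nil // nothing tagged is left, destroy is done
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for tagged vSphere resources to be deleted")
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(destroyLoop(ctx, func() (int, error) { return 0, nil }))
}
~~~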

Description of problem:

catsrc is not ready due to "compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied"

Version-Release number of selected component (if applicable):

zhaoxia@xzha-mac test % ../bin/opm version  
Version: version.Version{OpmVersion:"b94e073b5", GitCommit:"b94e073b5187ecaa687c322beccf76f1d1f26d54", BuildDate:"2022-08-29T06:30:05Z", GoOs:"darwin", GoArch:"amd64"}
zhaoxia@xzha-mac test % oc exec catalog-operator-79d885b755-6cnbp  -- olm --version
OLM version: 0.19.0
git commit: dfa7f0e70578432117e63867706630cda5366fb7

How reproducible:

always

Steps to Reproduce:

1. generate index image
zhaoxia@xzha-mac test % mkdir catalog
zhaoxia@xzha-mac test % ../bin/opm generate dockerfile catalog
zhaoxia@xzha-mac test % cat catalog.Dockerfile 
# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM quay.io/operator-framework/opm:latest


# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs", "--cache-dir=/tmp/cache"]


# Copy declarative config root into image at /configs and pre-populate serve cache
ADD catalog /configs
RUN ["/bin/opm", "serve", "/configs", "--cache-dir=/tmp/cache", "--cache-only"]


# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

zhaoxia@xzha-mac test % docker build . -f catalog.Dockerfile -t quay.io/olmqe/nginxolm-operator-index:2726 
zhaoxia@xzha-mac test % docker push quay.io/olmqe/nginxolm-operator-index:2726

2. create catsrc
zhaoxia@xzha-mac test % cat catsrc.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index
  namespace: test-1
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: quay.io/olmqe/nginxolm-operator-index:2726
  updateStrategy:
    registryPoll:
      interval: 10m

oc new-project test-1
oc apply -f catsrc.yaml 
 3. check pod status
zhaoxia@xzha-mac test % oc get pod
NAME               READY   STATUS             RESTARTS        AGE
test-index-hbqlv   0/1     Error              8 (5m13s ago)   16m
test-index-l6mzq   0/1     CrashLoopBackOff   10 (59s ago)    27m

zhaoxia@xzha-mac test % oc get pod test-index-hbqlv -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.131.0.84"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.131.0.84"
          ],
          "default": true,
          "dns": {}
      }]
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"test-index","namespace":"test-1"},"spec":{"displayName":"Test","image":"quay.io/olmqe/nginxolm-operator-index:2726","publisher":"OLM-QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"10m"}}}}
    openshift.io/scc: restricted-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
  creationTimestamp: "2022-08-29T06:57:55Z"
  generateName: test-index-
  labels:
    catalogsource.operators.coreos.com/update: test-index
    olm.catalogSource: ""
    olm.pod-spec-hash: 777849c67c
  name: test-index-hbqlv
  namespace: test-1
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: test-index
    uid: 5ef60ce9-6ade-43e1-bae4-7d69f6c9d5e0
  resourceVersion: "218774"
  uid: 7606a54a-6a7d-4979-833a-97c2f87a88b8
spec:
  containers:
  - image: quay.io/olmqe/nginxolm-operator-index:2726
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: false
      runAsNonRoot: true
      runAsUser: 1001130000
    startupProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-bfzvh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: test-index-dockercfg-wp8s4
  nodeName: qe-daily-412-0829-qf9lx-worker-1-djpwq
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1001130000
    seLinuxOptions:
      level: s0:c34,c4
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: test-index
  serviceAccountName: test-index
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: kube-api-access-bfzvh
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-08-29T06:57:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-08-29T06:57:55Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-08-29T06:57:55Z"
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-08-29T06:57:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://54d7a5ba94c061fb86ad056ad964dbda2824c864c6fdcd2d7d5a7ada515bc70e
    image: quay.io/olmqe/nginxolm-operator-index:2726
    imageID: quay.io/olmqe/nginxolm-operator-index@sha256:d70f38fa773ea5030b5b80bfe34d9168aabff5039ead44b7f7e7cd76f8705eb1
    lastState:
      terminated:
        containerID: cri-o://54d7a5ba94c061fb86ad056ad964dbda2824c864c6fdcd2d7d5a7ada515bc70e
        exitCode: 1
        finishedAt: "2022-08-29T07:14:23Z"
        message: |+
          Error: compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied
          Usage:
            opm serve <source_path> [flags]


          Flags:
                --cache-dir string         if set, sync and persist server cache directory
                --cache-only               sync the serve cache and exit without serving
                --debug                    enable debug logging
            -h, --help                     help for serve
            -p, --port string              port number to serve on (default "50051")
                --pprof-addr string        address of startup profiling endpoint (addr:port format)
            -t, --termination-log string   path to a container termination log file (default "/dev/termination-log")


          Global Flags:
                --skip-tls-verify   skip TLS certificate verification for container image registries while pulling bundles
                --use-http          use plain HTTP for container image registries while pulling bundles


        reason: Error
        startedAt: "2022-08-29T07:14:23Z"
    name: registry-server
    ready: false
    restartCount: 8
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=registry-server pod=test-index-hbqlv_test-1(7606a54a-6a7d-4979-833a-97c2f87a88b8)
        reason: CrashLoopBackOff
  hostIP: 10.242.0.4
  phase: Running
  podIP: 10.131.0.84
  podIPs:
  - ip: 10.131.0.84
  qosClass: Burstable
  startTime: "2022-08-29T06:57:55Z" 

Actual results:

the status of pod for catsrc is not running

Expected results:

the status of pod for catsrc is running

Additional info:

When using the openshift-marketplace project, the same error is raised.

Error: compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied

Description of problem:

Each LB created for a Service of type LoadBalancer results in 1 client rule and <# of public subnets> health rules being created.  The rules-per-SG quota in AWS is quite small: 60 by default, and 200 hard max.  OCP has about 40 rules OOTB. Assuming an HA cluster in 3 AZs, that is 4 rules per LB.  With the default AWS quota, only ~5 LBs can be created ((60 - 40) / 4 = 5), and with the hard max of 200, only ~40 LBs can be created ((200 - 40) / 4 = 40).

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1.  Create a Service of type LoadBalancer and observe the increase in the master-sg and worker-sg rule sets
2.
3.

Actual results:

4 rules are created

Expected results:

1 rule is created when the client rule is a superset of the per-subnet health rules

Additional info:

This would allow roughly 4x the number of Services of type LoadBalancer.  This is required for Hypershift.

Currently the controller sets the status to Done each time it sees a host that is ready in k8s, without checking whether the status was already set.

time="2022-09-13T19:03:45Z" level=info msg="Found new ready node ocp-2.cluster1.kpsalerno.us.ibm.com with inventory id 2da64d56-5057-78c6-ea6e-bf74a783bd79, kubernetes id 2da64d56-5057-78c6-ea6e-bf74a783bd79, updating its status to Done" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:255" request_id=6258e5a2-4e78-4148-a913-45d704a0fa1d

time="2022-09-13T19:04:05Z" level=info msg="Found new ready node ocp-2.cluster1.kpsalerno.us.ibm.com with inventory id 2da64d56-5057-78c6-ea6e-bf74a783bd79, kubernetes id 2da64d56-5057-78c6-ea6e-bf74a783bd79, updating its status to Done" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:255" request_id=49e4e63f-cf4f-4b9f-b1f3-923c473c09dd

 

 

Description of problem:

The ovn-kubernetes ovnkube-master containers are continuously crashlooping since we updated to 4.11.0-0.okd-2022-10-15-073651.

Log Excerpt:

] [] []  [{kubectl-client-side-apply Update networking.k8s.io/v1 2022-09-12 12:25:06 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:kubectl.kubernetes.io/last-applied-configuration":{}}},"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:&v1.LabelSelector{MatchLabels:map[string]string{access: true,},MatchExpressions:[]LabelSelectorRequirement{},},NamespaceSelector:nil,IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},} &NetworkPolicy{ObjectMeta:{allow-from-openshift-ingress  compsci-gradcentral  a405f843-c250-40d7-8dd4-a759f764f091 217304038 1 2022-09-22 14:36:38 +0000 UTC <nil> <nil> map[] map[] [] []  [{openshift-apiserver Update networking.k8s.io/v1 2022-09-22 14:36:38 +0000 UTC FieldsV1 {"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:nil,NamespaceSelector:&v1.LabelSelector{MatchLabels:map[string]string{policy-group.network.openshift.io/ingress: ,},MatchExpressions:[]LabelSelectorRequirement{},},IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},}]: cannot clean up egress default deny ACL name: error in transact with ops [{Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {ccdd01bf-3009-42fb-9672-e1df38190cd7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {10bbf229-8c1b-4c62-b36e-4ba0097722db}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7b55ba0c-150f-4a63-9601-cfde25f29408}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {60cb946a-46e9-4623-9ba4-3cb35f018ed6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s)

Additional info:

https://github.com/okd-project/okd/issues/1372

Issue persisted through update to 4.11.0-0.okd-2022-10-28-153352

must-gather: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.2859117512952590880.zip

This is a clone of issue OCPBUGS-11636. The following is the description of the original issue:

Description of problem:

ACLs are disabled for all newly created S3 buckets; this causes all OCP installs to fail because the bootstrap ignition cannot be uploaded:

level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating S3 bucket ACL for yunjiang-acl413-4dnhx-bootstrap: AccessControlListNotSupported: The bucket does not allow ACLs
level=error msg=	status code: 400, request id: HTB2HSH6XDG0Q3ZA, host id: V6CrEgbc6eyfJkUbLXLxuK4/0IC5hWCVKEc1RVonSbGpKAP1RWB8gcl5dfyKjbrLctVlY5MG2E4=
level=error
level=error msg=  with aws_s3_bucket_acl.ignition,
level=error msg=  on main.tf line 62, in resource "aws_s3_bucket_acl" "ignition":
level=error msg=  62: resource "aws_s3_bucket_acl" ignition {


Version-Release number of selected component (if applicable):

4.11+
 

How reproducible:

Always
 

Steps to Reproduce:

1. Create a cluster via IPI

Actual results:

Install fails
 

Expected results:

Install succeeds
 

Additional info:

Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023 - https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-ownership-error-responses.html - After you apply the bucket owner enforced setting for Object Ownership, ACLs are disabled.

 

Hypershift does not use kubernetes.default.svc as the api audience on the KAS. It is set to the URL of the OIDC provider. ROSA also does this so I don't imagine this test passes for it either at the moment.

Explicit setting of the Audiences on the TokenRequest is not required. If not set, it will just default to the audiences configured in the KAS.

This is causing a conformance failure for hypershift:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-hypershift-main-periodics-4.13-conformance-aws-ovn/1620240601058381824

In order to have more info to be able to debug the router issue in SNO, we want to see if the router is healthy from the node network point of view, and enable router access logs.

Let's revert this once the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=2097041 is found.

This is a clone of issue OCPBUGS-4181. The following is the description of the original issue:

Description of problem:

After configuring a webhook receiver in Alertmanager to send alerts to an external tool, a customer noticed that received alerts have "https:///<console-url>" as their source (notice the 3 slashes).

Version-Release number of selected component (if applicable):

OCP 4.10

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

https:///<console-url>

Expected results:

https://<console-url>

Additional info:

After investigating I discovered that the problem might be in the CMO code:

→ oc get Alertmanager main -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
→ oc get Prometheus k8s -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
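For illustration (not the CMO code): constructing the external URL with net/url instead of string concatenation avoids collapsed or duplicated slashes such as "https:/host" or "https:///host".

~~~
// Not the CMO code: build the external URL with net/url so the scheme separator
// is always exactly "://" regardless of leading/trailing slashes in the inputs.
package main

import (
	"fmt"
	"net/url"
)

func externalURL(host, path string) string {
	u := url.URL{Scheme: "https", Host: host, Path: path}
	return u.String()
}

func main() {
	fmt.Println(externalURL("console-openshift-console.apps.example.com", "/monitoring"))
	// prints: https://console-openshift-console.apps.example.com/monitoring
}
~~~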

Description of problem:

The original issue is from SDB-3484. When a customer wants to update their pull secret, we find that sometimes the Insights Operator does not execute the cluster transfer process, with the message 'no available accepted cluster transfer'. The root cause is that the Insights Operator runs the cluster transfer process every 24 hours and telemetry runs the registration process every 24 hours; on the AMS side, both call /cluster_registration and perform the same process, which means telemetry can complete the cluster transfer before the Insights Operator does.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Create two OCP clusters.
2. Create a PSR that will help create two 'pending' CTs. The pending CTs will be accepted after ~6 hours.
3. Wait for ~24 hours, check the PSR, and check the logs in IO, and also check the pull-secrets in the clusters.

Actual results:

The PSR is completed, but there are no successful transfer logs in IO, and the pull-secrets in the clusters are not updated.

Expected results:

The transfer process is executed successfully, and the pull-secrets are updated on the clusters.

Additional info:


This is a clone of issue OCPBUGS-19369. The following is the description of the original issue:

This bug has been seen during the analysis of another issue

If the Server Internal IP is not defined, CBO crashes as nil is not handled in https://github.com/openshift/cluster-baremetal-operator/blob/release-4.12/provisioning/utils.go#L99

 

I0809 17:33:09.683265       1 provisioning_controller.go:540] No Machines with cluster-api-machine-role=master found, set provisioningMacAddresses if the metal3 pod fails to start

I0809 17:33:09.690304       1 clusteroperator.go:217] "new CO status" reason=SyncingResources processMessage="Applying metal3 resources" message=""

I0809 17:33:10.488862       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.1779c769624884f4  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ValidatingWebhookConfigurationUpdated,Message:Updated ValidatingWebhookConfiguration.admissionregistration.k8s.io/baremetal-operator-validating-webhook-configuration because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,LastTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}

panic: runtime error: invalid memory address or nil pointer dereference

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1768fd4]

 

goroutine 574 [running]:

github.com/openshift/cluster-baremetal-operator/provisioning.getServerInternalIP({0x1e774d0?, 0xc0001e8fd0?})

        /go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:75 +0x154

github.com/openshift/cluster-baremetal-operator/provisioning.GetIronicIP({0x1ea2378?, 0xc000856840?}, {0x1bc1f91, 0x15}, 0xc0004c4398, {0x1e774d0, 0xc0001e8fd0})

        /go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:98 +0xfb
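A minimal sketch of the missing nil handling (illustrative types, not the cluster-baremetal-operator code): return an error when the internal API server IP has not been populated instead of dereferencing nil.

~~~
// Illustrative types, not the cluster-baremetal-operator code: handle the
// "Server Internal IP not defined yet" case explicitly instead of panicking.
package main

import (
	"errors"
	"fmt"
)

type infrastructureStatus struct {
	APIServerInternalIP *string
}

func getServerInternalIP(status *infrastructureStatus) (string, error) {
	if status == nil || status.APIServerInternalIP == nil {
		return "", errors.New("infrastructure status has no internal API server IP yet")
	}
	return *status.APIServerInternalIP, nil
}

func main() {
	if _, err := getServerInternalIP(&infrastructureStatus{}); err != nil {
		fmt.Println(err) // handled instead of a nil pointer dereference panic
	}
}
~~~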

Clone of https://issues.redhat.com/browse/OCPBUGSM-44162.

Cannot use the original as the bot won't accept a security bug:

When the change merges, the Bugzilla associated with the CVE must be set to MODIFIED. Since the DPTP bugzilla bot is not permitted to scan bugs with the SECURITY group in Bugzilla, The REP will not be able to use the bot's public functionality of moving their bug to MODIFIED.

https://docs.google.com/document/d/1KuenDafC3Ukw19jY55tkVeH8nNVVAi8TEAfqynoVfzY/edit#heading=h.ikdk6suc575k

This is a clone of issue OCPBUGS-4166. The following is the description of the original issue:

Description of problem:

This is a wrapper bug for the library sync of 4.12.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
The OpenShift installer hits an error when a failureDomain is missing its topology section in install-config.yaml, as in this example:

    - name: us-east-1
      region: us-east
      zone: us-east-1a
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      topology:
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - ci-segment-154
        datastore: workload_share_vcsmdcncworkload2_vyC6a

Version-Release number of selected component (if applicable):

Build from latest master (4.12)

How reproducible:

Each time

Steps to Reproduce:

1. Create install-config.yaml for vsphere multi-zone
2. Leave out a topology section (under failureDomains)
3. Attempt to create cluster

Actual results:

FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.vsphere.failureDomains.topology.resourcePool: Invalid value: "//Resources": resource pool '//Resources' not found 

Expected results:

Validation of topology before attempting to create any resources

This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:

Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)


Version-Release number of selected component (if applicable):

 

How reproducible:

100

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648 

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

 

Description of problem:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

Version-Release number of selected component (if applicable): 4.10.22

How reproducible:  Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Steps to Reproduce:

Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Actual results: master nodes do not come up.

Expected results: master nodes should come up despite the text "master" not being in their hostname.

Additional info:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

The code for the cluster-baremetal-operator at the following link: 

https://github.com/openshift/cluster-baremetal-operator/blob/49d7b249c5dcef8228f206eff4530a25f03b201f/controllers/provisioning_controller.go#L441

The following condition is concerning:

if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0

The packages reveal that bmh.Name references the name inside the metadata of the BMH object. 

Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so the following slice is never created:

macs = append(macs, bmh.Spec.BootMACAddress)
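A sketch of a more robust selection; the label key used here is an assumption for illustration, not necessarily what the operator uses. The point is to select control-plane BMHs by a role label rather than by a substring of the host name.

~~~
// Sketch only: the "installer.openshift.io/role" label key here is an assumption
// for illustration; the point is to avoid depending on "master" in the host name.
package main

import "fmt"

type bareMetalHost struct {
	Name           string
	Labels         map[string]string
	BootMACAddress string
}

func controlPlaneMACs(hosts []bareMetalHost) []string {
	var macs []string
	for _, bmh := range hosts {
		if bmh.Labels["installer.openshift.io/role"] == "control-plane" && bmh.BootMACAddress != "" {
			macs = append(macs, bmh.BootMACAddress)
		}
	}
	return macs
}

func main() {
	hosts := []bareMetalHost{{
		Name:           "cp-0", // no "master" in the name
		Labels:         map[string]string{"installer.openshift.io/role": "control-plane"},
		BootMACAddress: "52:54:00:00:00:01",
	}}
	fmt.Println(controlPlaneMACs(hosts))
}
~~~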

 

 

This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:

Description of problem:

Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a warning Event will be generated, causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" test to fail.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4359. The following is the description of the original issue:

Description of problem:

When the filesystem is full, update-dns-resolver fails to build a proper /etc/hosts, resulting in /etc/hosts only containing the openshift-generated-node-resolver lines and missing the localhost lines.

This causes issues on pods having hostNetwork: true, like openstack-cinder-csi-driver-controller.

Version-Release number of selected component (if applicable):

OpenShift 4.10.39

How reproducible:

See: https://github.com/openshift/cluster-dns-operator/blob/a5ea3fcb7be49a12115bd6648403df3d65661542/assets/node-resolver/update-node-resolver.sh

Steps to Reproduce:

1. make sure the file system is full when running the cp at line 13
2. 
3.

Actual results:

/etc/hosts is missing the localhost lines

Expected results:

/etc/hosts should contain the localhost lines

Additional info:


In https://github.com/openshift/installer/pull/6237 we are setting the version to v1alpha1, since we are not committing to not making further changes.

Before shipping in an official release we must update to at least v1beta1, or preferably v1.

This is a clone of issue OCPBUGS-18305. The following is the description of the original issue:

Description of problem:

It appears it may be possible to have invalid CSV entries in the resolver cache, resulting in the inability to reinstall an Operator.

The situation:
--------------
A customer has removed the CSV, InstallPlan and Subscription for the GitOps Operator from the cluster, but upon attempting to reinstall the Operator, OLM reported a conflict with an existing CSV.

This CSV was not in the etcd instance and was removed previously. Upon deleting the `operator-catalog` and `operator-lifecycle-manager` Pods, the collision was resolved and the Operator was able to be installed again.
~~~
'Warning' reason: 'ResolutionFailed' constraints not satisfiable: subscription openshift-gitops-operator exists, subscription openshift-gitops-operator requires redhat-operators/openshift-marketplace/stable/openshift-gitops-operator.v1.5.8, redhat-operators/openshift-marketplace/stable/openshift-gitops-operator.v1.5.8 and @existing/openshift-operators//openshift-gitops-operator.v1.5.6-0.1664915551.p originate from package openshift-gitops-operator, clusterserviceversion openshift-gitops-operator.v1.5.6-0.1664915551.p exists and is not referenced by a subscription
~~~

Version-Release number of selected component (if applicable):

4.9.31

How reproducible:

Very intermittent; however, once the issue has occurred it was impossible to avoid without deleting the Pods.

Steps to Reproduce:

1. Add Operator with manual approval InstallPlan
2. Remove Operator (Subscription, CSV, InstallPlan)
3. Attempt to reinstall Operator 

Actual results:

Very intermittent failure

Expected results:

Operators do not have conflicts with CSVs which have already been removed.

Additional info:

Briefly reviewing the OLM code, it appears an internal resolver cache is populated and used for checking constraints when an operator is installed. If there are stale entries in the cache, this would result in the described issue.
The cache appears to have been rearchitected (moved to a dedicated object) since OCP 4.9.31. Due to the nature of this issue, there are no clear reproduction steps, so if the issue cannot be reproduced, I would like instructions on how to dump the contents of the cache should the issue arise again.

This is a clone of issue OCPBUGS-1769. The following is the description of the original issue:

Description of problem:

The installer, as used with AWS during a cluster destroy, does a get-all-roles and deletes roles based on a tag. If a customer is using AWS SEA, which denies any role doing a get-all-roles in the AWS account, the installer fails.

Instead of erroring out, the installer should gracefully handle being denied get-all-roles and move onward, so that a denying SCP does not get in the way of a successful cluster destroy on AWS.
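A minimal Go sketch of the graceful-handling idea (hypothetical helpers, not the installer's code): skip roles whose tags cannot be read because of an SCP AccessDenied instead of failing the whole destroy.

~~~
// Hypothetical helpers, not the installer's code: when filtering IAM roles by the
// cluster-owned tag, tolerate AccessDenied on individual roles instead of aborting.
package main

import (
	"fmt"
	"strings"
)

func isAccessDenied(err error) bool {
	return err != nil && strings.Contains(err.Error(), "AccessDenied")
}

// rolesOwnedByCluster keeps only roles carrying the cluster tag, skipping roles
// whose tags cannot be read because an SCP explicitly denies access to them.
func rolesOwnedByCluster(roles []string, tagsFor func(string) (map[string]string, error), clusterTag string) ([]string, error) {
	var owned []string
	for _, r := range roles {
		tags, err := tagsFor(r)
		if isAccessDenied(err) {
			continue // an SCP forbids reading this role; it cannot be ours, move on
		}
		if err != nil {
			return nil, err
		}
		if tags[clusterTag] == "owned" {
			owned = append(owned, r)
		}
	}
	return owned, nil
}

func main() {
	fmt.Println("sketch only")
}
~~~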

Version-Release number of selected component (if applicable):

[ec2-user@ip-172-16-32-144 ~]$ rosa version
1.2.6

How reproducible:

1. Deploy ROSA STS, private with PrivateLink with AWS SEA
2. rosa delete cluster --debug
3. watch the debug logs of the installer to see it try to get-all-roles
4. installer fails when the SCP from AWS SEA denies the get-all-roles task

Steps to Reproduce:  Philip Thomson Would you please fill out the below?

Steps are listed above.

Actual results:

time="2022-09-01T00:10:40Z" level=error msg="error after waiting for command completion" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error provisioning cluster" error="exit status 4" installID=zp56pxql
time="2022-09-01T00:10:40Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 4" installID=zp56pxql


time="2022-09-01T00:12:47Z" level=info msg="copied /installconfig/install-config.yaml to /output/install-config.yaml" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=info msg="cleaning up resources from previous provision attempt" installID=55h2cvl5
time="2022-09-01T00:12:47Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:48Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:49Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6b4b5144-2f4e-4fde-ba1a-04ed239b84c2" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6152e9c2-9c1c-478b-a5e3-11ff2508684e" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8636f0ff-e984-4f02-870e-52170ab4e7bb" installID=55h2cvl5
time="2022-09-01T00:12:52Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2385a980-dc9b-480f-955a-62ac1aaa6718" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 02ccef62-14e7-4310-b254-a0731995bd45" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: eca2081d-abd7-4c9b-b531-27ca8758f933" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6bda17e9-83e5-4688-86a0-2f84c77db759" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 827afa4a-8bb9-4e1e-af69-d5e8d125003a" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 8dcd0480-6f9e-49cb-a0dd-0c5f76107696" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5095aed7-45de-4ca0-8c41-9db9e78ca5a6" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 04f7d0e0-4139-4f74-8f67-8d8a8a41d6b9" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 115f9514-b78b-42d1-b008-dc3181b61d33" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 68da4d93-a93e-410a-b3af-961122fe8df0" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 012221ea-2121-4b04-91f2-26c31c8458b1" installID=55h2cvl5
time="2022-09-01T00:12:53Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e6c9328d-a4b9-4e69-8194-a68ed7af6c73" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 214ca7fb-d153-4d0d-9f9c-21b073c5bd35" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: 63b54e82-e2f6-48d4-bd0f-d2663bbc58bf" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: d24982b6-df65-4ba2-a3c0-5ac8d23947e1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: e2c5737a-5014-4eb5-9150-1dd1939137c0" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7793fa7c-4c8d-4f9f-8f23-d393b85be97c" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: bef2c5ab-ef59-4be6-bf1a-2d89fddb90f1" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: ff04eb1b-9cf6-4fff-a503-d9292ff17ccd" installID=55h2cvl5
time="2022-09-01T00:12:54Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: 85e05de8-ba16-4366-bc86-721da651d770" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a9d864e4-cfdf-483d-a0d2-9b48a117abc4" installID=55h2cvl5
time="2022-09-01T00:12:56Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=info msg=Disassociated id=i-03d7570547d32071d installID=55h2cvl5 name=rosa-mv9dx3-xls7g-master-profile role=ROSA-ControlPlane-Role
time="2022-09-01T00:12:57Z" level=info msg=Deleted InstanceProfileName=rosa-mv9dx3-xls7g-master-profile arn="arn:aws:iam::646284873784:instance-profile/rosa-mv9dx3-xls7g-master-profile" id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:57Z" level=debug msg=Terminating id=i-03d7570547d32071d installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-08bee3857e5265ba4 installID=55h2cvl5
time="2022-09-01T00:12:58Z" level=debug msg=Terminating id=i-00df6e7b34aa65c9b installID=55h2cvl5
time="2022-09-01T00:13:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:13:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:14:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:15:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:16:58Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:08Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:18Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:28Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:38Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:48Z" level=debug msg="search for instances by tag matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-sint/2e99b98b94304d80 installID=55h2cvl5
time="2022-09-01T00:17:49Z" level=info msg=Deleted id=eni-0e4ee5cf8f9a8fdd2 installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked ingress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="Revoked egress permissions" id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:17:50Z" level=debug msg="DependencyViolation: resource sg-03265ad2fae661b8c has a dependent object\n\tstatus code: 400, request id: f7c35709-a23d-49fd-ac6a-f092661f6966" arn="arn:aws:ec2:ca-central-1:646284873784:security-group/sg-03265ad2fae661b8c" installID=55h2cvl5
time="2022-09-01T00:17:51Z" level=info msg=Deleted id=eni-0e592a2768c157360 installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"rosa-mv9dx3.0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:52Z" level=debug msg="listing AWS hosted zones \"0ffs.p1.openshiftapps.com.\" (page 0)" id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=info msg=Deleted id=Z072427539WBI718F6BCC installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked ingress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:53Z" level=debug msg="Revoked egress permissions" id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=sg-08bfbb32ea92f583e installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=rosa-mv9dx3-xls7g-aint/635162452c08e059 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=info msg=Deleted id=eni-049f0174866d87270 installID=55h2cvl5
time="2022-09-01T00:17:54Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="no deletions from us-east-1, removing client" installID=55h2cvl5
time="2022-09-01T00:17:55Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 06b804ae-160c-4fa7-92de-fd69adc07db2" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 2a5dd4ad-9c3e-40ee-b478-73c79671d744" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: e61daee8-6d2c-4707-b4c9-c4fdd6b5091c" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1b743447-a778-4f9e-8b48-5923fd5c14ce" installID=55h2cvl5
time="2022-09-01T00:17:56Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da8c8a42-8e79-48e5-b548-c604cb10d6f4" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d7840e4-a1b4-4ea2-bb83-9ee55882de54" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7f2e04ed-8c49-42e4-b35e-563093a57e5b" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: cd2b4962-e610-4cc4-92bc-827fe7a49b48" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: be005a09-f62c-4894-8c82-70c375d379a9" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 541d92f4-33ce-4a50-93d8-dcfd2306eeb0" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 6dd81743-94c4-479a-b945-ffb1af763007" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a269f47b-97bc-4609-b124-d1ef5d997a91" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 33c3c0a5-e5c9-4125-9400-aafb363c683c" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 32e87471-6d21-42a7-bfd8-d5323856f94d" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: b2cc6745-0217-44fe-a48b-44e56e889c9e" installID=55h2cvl5
time="2022-09-01T00:17:57Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 09f81582-6685-4dc9-99f0-ed33565ab4f4" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: cea9116c-2b54-4caa-9776-83559d27b8f8" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 430d7750-c538-42a5-84b5-52bc77ce2d56" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 279038e4-f3c9-4700-b590-9a90f9b8d3a2" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: 5e2f40ae-3dc7-4773-a5cd-40bf9aa36c03" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 92a27a7b-14f5-455b-aa39-3c995806b83e" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0da4f66c-c6b1-453c-a8c8-dc0399b24bb9" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: f2c94beb-a222-4bad-abe1-8de5786f5e59" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="iterating over a page of 0 IAM users" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=debug msg="search for IAM instance profiles" installID=55h2cvl5
time="2022-09-01T00:17:58Z" level=info msg="error while finding resources to delete" error="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 829c3569-b2f2-4b9d-94a0-69644b690066" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=info msg=Deleted id=sg-03265ad2fae661b8c installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for matching resources by tag in ca-central-1 matching aws.Filter{\"kubernetes.io/cluster/rosa-mv9dx3-xls7g\":\"owned\"}" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="no deletions from ca-central-1, removing client" installID=55h2cvl5
time="2022-09-01T00:18:09Z" level=debug msg="search for IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="iterating over a page of 64 IAM roles" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-ConfigRecorderRole-B749E1E6: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-ConfigRecorderRole-B749E1E6 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0e8e0bea-b512-469b-a996-8722a0f7fa25" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-CWL-Add-Subscription-Filter-9D3CF73C with an explicit deny in a service control policy\n\tstatus code: 403, request id: 288456a2-0cd5-46f1-a5d2-6b4006a5dc0e" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-S4CHZ22EC1B2 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 321df940-70fc-45e7-8c56-59fe5b89e84f" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-AWS679f53fac002430cb0da5-X9UQK0CYNPPO with an explicit deny in a service control policy\n\tstatus code: 403, request id: 45bebf36-8bf9-4c78-a80f-c6a5e98b2187" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCentralEndpointDep-1H6K6CZ6AEUBO with an explicit deny in a service control policy\n\tstatus code: 403, request id: eea00ae2-1a72-43f9-9459-a1c003194137" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomCreateSSMDocument7-1JDO2BN7QTXRH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 0ef5a102-b764-4e17-999f-d820ebc1ec12" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEBSDefaultEncrypti-19EVAXFRG2BEJ with an explicit deny in a service control policy\n\tstatus code: 403, request id: 107d0ccf-94e7-41c4-96cd-450b66a84101" installID=55h2cvl5
time="2022-09-01T00:18:10Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomEc2OperationsB1799-1WASK5J6GUYHO with an explicit deny in a service control policy\n\tstatus code: 403, request id: da9bd868-8384-4072-9fb4-e6a66e94d2a1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGetDetectorIdRole6-9VGPM8U0HMV7 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 74fbf44c-d02d-4072-b038-fa456246b6a8" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomGuardDutyCreatePub-1W03UREYK3KTX with an explicit deny in a service control policy\n\tstatus code: 403, request id: 365116d6-1467-49c3-8f58-1bc005aa251f" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMCreateRoleE62B6-1AQL8IBN9938I with an explicit deny in a service control policy\n\tstatus code: 403, request id: 20f91de5-cfeb-45e0-bb46-7b66d62cc749" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomIAMPasswordPolicyC-16TPLHRY1FZ43 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 924fa288-f1b9-49b8-b549-a930f6f771ce" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsLogGroup49AC86-1D03LOLE2CARP with an explicit deny in a service control policy\n\tstatus code: 403, request id: 4beb233d-40d6-4016-872a-8757af8f98ee" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomLogsMetricFilter7F-DLA5E1PZSFHH with an explicit deny in a service control policy\n\tstatus code: 403, request id: 77951f62-e0b4-4a9b-a20c-ea40d6432e84" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieExportConfigR-1QT1WNNWPSL36 with an explicit deny in a service control policy\n\tstatus code: 403, request id: 13ad38c8-89dc-461d-9763-870eec3a6ba1" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomMacieUpdateSession-1NHBPTB4GOSM8 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a8fe199d-12fb-4141-a944-c7c5516daf25" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomResourceCleanupC59-1MSCB57N479UU with an explicit deny in a service control policy\n\tstatus code: 403, request id: b487c62f-5ac5-4fa0-b835-f70838b1d178" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomS3PutReplicationRo-FE5Q26BTAG9K with an explicit deny in a service control policy\n\tstatus code: 403, request id: 97bfcb55-ae1f-4859-9c12-03de09607f79" installID=55h2cvl5
time="2022-09-01T00:18:11Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSecurityHubRole660-1UX115B9Q68WX with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1094f6-714e-4042-9134-75f4c6d9d0df" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomSSMUpdateRoleD3D5C-AZ9GBJG6UM4F with an explicit deny in a service control policy\n\tstatus code: 403, request id: ca1db477-ee6a-4d03-8b57-52b335b2bbe6" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-CustomVpcDefaultSecurity-HC931RYMVKKC with an explicit deny in a service control policy\n\tstatus code: 403, request id: 1fc32d09-588b-4d80-ad62-748f7fb55efd" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-Mv9dx3Rosa81Ebf-DefaultBucketReplication-OIM43YBJSMGD with an explicit deny in a service control policy\n\tstatus code: 403, request id: 7d906cc2-eaaa-439b-97e0-503615ce5d43" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=debug msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-PipelineRole: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-PipelineRole with an explicit deny in a service control policy\n\tstatus code: 403, request id: ee6a5647-20b1-4880-932b-bfd70b945077" installID=55h2cvl5
time="2022-09-01T00:18:12Z" level=info msg="get tags for arn:aws:iam::646284873784:role/PBMMAccel-VPC-FlowLog-519F0B57: AccessDenied: User: arn:aws:sts::646284873784:assumed-role/ROSA-Installer-Role/1661991167715690329 is not authorized to perform: iam:GetRole on resource: role PBMMAccel-VPC-FlowLog-519F0B57 with an explicit deny in a service control policy\n\tstatus code: 403, request id: a424891e-48ab-4ad4-9150-9ef1076dcb9c" installID=55h2cvl5

The "not authorized" errors repeat, probably 50+ times.

Expected results:

These errors should not show up during install.

Additional info:

Again, this only occurs because ROSA is being installed in an AWS SEA environment - https://github.com/aws-samples/aws-secure-environment-accelerator.

Description of problem:

Git icon shown in the repository details page should be based on the git provider.

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Create a Repository with gitlab repo url
2. Navigate to the detail page.

Actual results:

The GitHub icon is displayed for the GitLab URL.

Expected results:

The GitLab icon should be displayed for the GitLab URL.

Additional info:

use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.
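
For reference, a minimal Repository object with a GitLab URL, as in step 1 above, might look like the sketch below; the API version is the one used by Pipelines-as-Code, and the names and URL are placeholders rather than values from this report.

apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: example-repo          # placeholder name
  namespace: example-project  # placeholder namespace
spec:
  url: https://gitlab.com/example-org/example-repo   # GitLab URL that should yield the GitLab icon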

This is a clone of issue OCPBUGS-501. The following is the description of the original issue:

Description of problem: 

Version-Release number of selected component (if applicable): 4.10.16

How reproducible: Always

Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field

$ oc get apiserver cluster -o yaml
spec:
  audit:
    customRules:
    - group: system:authenticated:oauth
      profile: AllRequestBodies
    - group: system:authenticated
      profile: AllRequestBodies
    profile: Default

2. Allow the kube-apiserver pods to roll out a new revision.
3. Once the kube-apiserver pods are on the new revision, execute $ oc get dc

Actual results:

Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)

Expected results: The command "oc get dc" should display the deploymentconfig without any error.

Additional info:

Description: agent.iso is created even when agent-config.yaml contains an invalid macAddress

Here is the content of agent-config.yaml
--------------------------------------------
kind: AgentConfig
metadata:
  name: sno-cluster
spec:
  rendezvousIP: 192.168.111.80
  hosts:
  - hostname: master-0
    interfaces:
    - name: eno1
      macAddress: 0000
    networkConfig:
      interfaces:
      - name: eno1
        type: ethernet
        state: up
        mac-address: 00000
        ipv4:
          enabled: true
          address:
          - ip: 192.168.111.80
            prefix-length: 23
          dhcp: false
      dns-resolver:
        config:
          server:
          - 192.168.111.1
      routes:
        config:
        - destination: 0.0.0.0/0
          next-hop-address: 192.168.111.2
          next-hop-interface: eno1
          table-id: 254
--------------------------------------------

How reproducible:

always

Repro Steps:

1) Get the latest agent-installer and build

git clone -b agent-installer https://github.com/openshift/installer.git
cd installer/
hack/build.sh

2) Create agent.iso using agent-config and install-config files.

Expected: The installer should throw error messages along the lines of: hosts.host[0].interfaces.macAddress: Invalid value: "0000": macAddress must be a valid MAC address.
And
hosts.host[0].networkConfig.interfaces.macAddress: Invalid value: "00000": mac-address must be a valid MAC address.

Actual: Able to create agent.iso image.
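
For contrast, a well-formed interface entry that validation should accept might look like the sketch below; the MAC value is illustrative only and is not taken from this report.

hosts:
- hostname: master-0
  interfaces:
  - name: eno1
    macAddress: 52:54:00:aa:bb:01   # valid colon-separated MAC address (illustrative)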

This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:

Description of problem:

When changing channels, it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3. 

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, let's say < 10s

Additional info:

This was intentional in the design: we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, or realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always` avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.
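
As a rough sketch of that workaround, a risk's matching rule of type promql can be swapped for one of type Always, keeping the notification while skipping the throttled on-cluster evaluation; the layout and values below are illustrative, not the actual graph-data entries.

risks:
- name: OpenStackNodeCreationFails
  url: https://access.redhat.com/solutions/example        # placeholder
  message: Example risk message shown to the administrator.
  matchingRules:
  - type: promql          # evaluated on-cluster, throttled to roughly one new risk per 10 minutes
    promql:
      promql: |
        cluster_infrastructure_provider{type="OpenStack"} == 1
  # Workaround sketch: replace the rule above with one that always matches,
  # so the risk is surfaced immediately without evaluation:
  # - type: Always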

This is a clone of issue OCPBUGS-10864. The following is the description of the original issue:

Description of problem:

APIServer service not selected correctly for PublicAndPrivate when external-dns isn't configured. 
Image: 4.14 Hypershift operator + OCP 4.14.0-0.nightly-2023-03-23-050449

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

    - lastTransitionTime: "2023-03-24T15:13:15Z"
      message: Cluster operators console, dns, image-registry, ingress, insights,
        kube-storage-version-migrator, monitoring, openshift-samples, service-ca are
        not available
      observedGeneration: 3
      reason: ClusterOperatorsNotAvailable
      status: "False"
      type: ClusterVersionSucceeding

services:
- service: APIServer
  servicePublishingStrategy:
    type: LoadBalancer
- service: OAuthServer
  servicePublishingStrategy:
    type: Route
- service: Konnectivity
  servicePublishingStrategy:
    type: Route
- service: Ignition
  servicePublishingStrategy:
    type: Route
- service: OVNSbDb
  servicePublishingStrategy:
    type: Route

jiezhao-mac:hypershift jiezhao$ oc get service -n clusters-jz-test | grep kube-apiserver
kube-apiserver            LoadBalancer  172.30.211.131  aa029c422933444139fb738257aedb86-9e9709e3fa1b594e.elb.us-east-2.amazonaws.com  6443:32562/TCP         34m
kube-apiserver-private        LoadBalancer  172.30.161.79  ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com  6443:32100/TCP         34m
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ cat hostedcluster.kubeconfig | grep server
  server: https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443
jiezhao-mac:hypershift jiezhao$

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
E0324 11:17:44.003589   95300 memcache.go:238] couldn't get current server API group list: Get "https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443/api?timeout=32s": dial tcp 10.0.129.24:6443: i/o timeout

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Create a PublicAndPrivate cluster without external-dns
2.access the guest cluster (it should fail)
3.

Actual results:

Unable to access the guest cluster via 'oc get node --kubeconfig=<guest cluster kubeconfig>'; some guest cluster operators are not available.

Expected results:

The cluster is up and running, the guest cluster can be accessed via 'oc get node --kubeconfig=<guest cluster kubeconfig>'

Additional info:

 

 

Derrick got an "old and new refs are equal" error on rebase; this is similar to OCPBUGS-1899, but I think it has a different root cause. In this case, when a manual rollback is performed via the bootloader, we've computed that there's an osimageurl diff between the expected and desired state, but actually the desired state is already set.

We just need to skip doing the rebase if we're already in the target state.

(The real root of this problem, again, is that the whole "current/desired config" mechanism tries to track state independently of the bootloader... if we made node state == container image, all of that would go away, and the MCO would understand that it had been booted into a previous state.)

Description of problem:

Stop option for pipelinerun is not working

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.x

How reproducible:

Always

Steps to Reproduce:

1. Create a pipeline and start it
2. From the Actions dropdown, select the Stop option

Actual results:

Pipelinerun is not getting cancelled

Expected results:

Pipelinerun should get cancelled

Additional info:

 

 

This is a clone of issue OCPBUGS-2141. The following is the description of the original issue:

Description of problem:

4.12 cluster, no PV for Prometheus; the doc link still points to 4.8

# oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded"))'
[
  {
    "lastTransitionTime": "2022-10-09T02:36:16Z",
    "message": "Prometheus is running without persistent storage which can lead to data loss during upgrades and cluster disruptions. Please refer to the official documentation to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html",
    "reason": "PrometheusDataPersistenceNotConfigured",
    "status": "False",
    "type": "Degraded"
  }
]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

always

Steps to Reproduce:

1. no PVs for prometheus, check the monitoring operator status
2.
3.

Actual results:

The doc link still points to 4.8.

Expected results:

The message should link to the latest version of the documentation.

Additional info:

slack thread: 
https://coreos.slack.com/archives/G79AW9Q7R/p1665283462123389
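
For context, the remediation the degraded message points to is configuring persistent storage for Prometheus through the cluster-monitoring-config ConfigMap; a minimal sketch follows, with the storage class and size as placeholder values.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi   # placeholder storage class
          resources:
            requests:
              storage: 40Gi           # placeholder size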

Description of problem:
Pipeline Repository (Pipeline-as-code) list never shows an Event type.

Version-Release number of selected component (if applicable):
4.9+

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines Operator and setup a Pipeline-as-code repository
  2. Trigger an event and a build

Actual results:
Pipeline Repository list shows an Event type column but no value.

Expected results:
Pipeline Repository list should show the Event type from the matching Pipeline Run.

Similar to the Pipeline Run Details page based on the label.

Additional info:
The list page packages/pipelines-plugin/src/components/repository/list-page/RepositoryRow.tsx renders obj.metadata.namespace as event type.

I believe we should show the Pipeline Run event type instead. packages/pipelines-plugin/src/components/repository/RepositoryLinkList.tsx uses

{plrLabels[RepositoryLabels[RepositoryFields.EVENT_TYPE]]}

to render it.

Also, the Pipeline Repository details page tries to render the Branch and Event type from the Repository resource. My research says these properties don't exist on the Repository resource. The code should be removed from the Repository details page.

Description of problem:

When all projects are selected, workloads list page and details page shows inconsistent HorizontalPodAutoscaler actions

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-07-25-010250

How reproducible:

Always

Steps to Reproduce:

  1. cluster admin goes to All projects deployments list page, click the kebab button of deployment/api-server in openshift-apiserver namespace
  2. goes to deployment details page /k8s/ns/openshift-apiserver/deployments/apiserver, click 'Actions' and check HorizontalPodAutoscaler related action items
  3. goes to project deployment list page /k8s/ns/openshift-apiserver/deployments, check the action items

Actual results:

  1. the HPA action is 'Add PodDisruptionBudget'
  2. the HPA actions are 'Edit HorizontalPodAutoscaler' and 'Remove HorizontalPodAutoscaler'
  3. the HPA actions are 'Edit HorizontalPodAutoscaler' and 'Remove HorizontalPodAutoscaler'

Expected results:

  1. The workloads list and details pages should have consistent HPA action items when 'All projects' is selected

Additional info:

This is a clone of issue OCPBUGS-6213. The following is the description of the original issue:

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3450

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-4252. The following is the description of the original issue:

Description of problem: When visiting the Terminal tab of a Node details page, an error is displayed instead of the terminal

Steps to Reproduce:
1. Go to the Terminal tab of a Node details page (e.g., /k8s/cluster/nodes/ip-10-0-129-13.ec2.internal/terminal)
2. Note the error alert that appears on the page instead of the terminal.

This is a clone of issue OCPBUGS-4022. The following is the description of the original issue:

Description of problem:
Unnecessary react warning:

Warning: Each child in a list should have a unique "key" prop.

Check the render method of `NavSection`. See https://reactjs.org/link/warning-keys for more information.
NavItemHref@http://localhost:9012/static/main-785e94355aeacc12c321.js:5141:88
NavSection@http://localhost:9012/static/main-785e94355aeacc12c321.js:5294:20
PluginNavItem@http://localhost:9012/static/main-785e94355aeacc12c321.js:5582:23
div
PerspectiveNav@http://localhost:9012/static/main-785e94355aeacc12c321.js:5398:134

Version-Release number of selected component (if applicable):
4.11 was fine
4.12 and 4.13 (master) shows this warning

How reproducible:
Always

Steps to Reproduce:
1. Open browser log
2. Open web console

Actual results:
React warning

Expected results:
No React warning should be logged

This is a clone of issue OCPBUGS-5559. The following is the description of the original issue:

Description of problem:

Azure VIP 168.63.129.16 needs to be added to noProxy to let a VM report back about its creation status [1]. A similar thing needs to be done for the armEndpoint of ASH, to make sure that future cluster nodes do not communicate with a Stack Hub API through the proxy

[1] https://docs.microsoft.com/en-us/azure/virtual-network/what-is-ip-address-168-63-129-16

Version-Release number of selected component (if applicable):

4.10.20

How reproducible:

Need to have a proxy server in ASH and run the installer

Steps to Reproduce:

1.
2.
3.

Actual results:

These two entries are not added automatically; they are expected to be auto-added, as they are very specific and difficult to troubleshoot

Expected results:

 

Additional info:

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2104997 against the cluster-network-operator since the fix involves changing both the operator and the installer.
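As a small, hedged illustration of what to verify once a fix or manual configuration is in place (not taken from the original report): the effective noProxy list can be read from the cluster Proxy object, and on ASH both 168.63.129.16 and the armEndpoint host should appear in it.

$ oc get proxy cluster -o jsonpath='{.status.noProxy}{"\n"}'
# prints the comma-separated noProxy list; it should include 168.63.129.16
# and the Azure Stack Hub armEndpoint host once they are configured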

This is a clone of issue OCPBUGS-3633. The following is the description of the original issue:

I think something is wrong with the alerts refactor, or perhaps my sync to 4.12.

Failed: suite=[openshift-tests], [sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

Passed 1 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success

We're not getting the passes - from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.12-micro-release-openshift-release-analysis-aggregator/1592021681235300352, the successful runs don't show any record of the test at all. We need to record successes and failures for aggregation to work right.

This is a clone of issue OCPBUGS-3458. The following is the description of the original issue:

Description of problem:

Since way back in 4.8, we have had a banner with "To request update recommendations, configure a channel that supports your version" when ClusterVersion has RetrievedUpdates=False. But that is only one of several reasons we could be RetrievedUpdates=False. Can we pivot to passing through the ClusterVersion condition message?

Version-Release number of selected component (if applicable):

4.8 and later.

How reproducible:

100%

Steps to Reproduce:

1. Launch a cluster-bot cluster like 4.11.12.
2. Set a channel with oc adm upgrade channel stable-4.11.
3. Scale down the CVO with oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator.
4. Patch in a RetrievedUpdates condition with:

$ CONDITIONS="$(oc get -o json clusterversion version | jq -c '[.status.conditions[] | if .type == "RetrievedUpdates" then .status = "False" | .message = "Testing" else . end]')"
$ oc patch --subresource status clusterversion version --type json -p "[{\"op\": \"add\", \"path\": \"/status/conditions\", \"value\": ${CONDITIONS}}]"

5. View the admin console at /settings/cluster.

Actual results:

Advice about configuring the channel (but it's already configured).

Expected results:

See the message you patched into the RetrievedUpdates condition.

Description of problem:
The latest implementation of the history pruner (pr805 [1]) increased the max upgrade history in the CVO to 100 and implemented a weight-based pruning priority strategy in case the history grows any larger. This pruning, however, is not happening, letting the history grow uncontrollably and potentially reach resource limits of etcd or Kubernetes.

Observed the following while running continuous upgrade-rollback cycles:

$ oc get clusterversion version -o json | jq '.status.history|length'
203

Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-23-223922
4.12.0-0.nightly-2022-08-23-153511

How reproducible:
1/1

Steps to Reproduce:
Same as described in bz2097067 [2], with the addition of waiting a few minutes after the first rollback to allow it to reach the 'Completed' state.

Actual results:
History grows uncontrollably

Expected results:
History should be pruned to keep max size of 100

Additional info:

[1] https://github.com/openshift/cluster-version-operator/pull/805
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c4

This is a clone of issue OCPBUGS-9357. The following is the description of the original issue:

Description of problem:

On an SNO node one of the CatalogSources gets deleted after multiple reboots.

In the initial stage we have 2 catalogsources:

$ oc get catsrc -A
NAMESPACE               NAME                  DISPLAY                     TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Intel SRIOV-FEC Operator    grpc   Red Hat     20h
openshift-marketplace   redhat-operators      Red Hat Operators Catalog   grpc   Red Hat     18h

After several node reboots, one of the CatalogSources no longer shows up:

$ oc get catsrc -A
NAMESPACE               NAME                  DISPLAY                    TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Intel SRIOV-FEC Operator   grpc   Red Hat     21h

Version-Release number of selected component (if applicable):
4.11.0-fc.3

How reproducible:
Inconsistent but reproducible

Steps to Reproduce:

1. Deploy and configure SNO node via ZTP process. Configuration sets up 2 CatalogSources in a restricted environment for redhat-operators and certified-operators

- apiVersion: operators.coreos.com/v1alpha1
  kind: CatalogSource
  metadata:
    name: certified-operators
    namespace: openshift-marketplace
  spec:
    displayName: Intel SRIOV-FEC Operator
    image: registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/olm/far-edge-sriov-fec:v4.11
    publisher: Red Hat
    sourceType: grpc
- apiVersion: operators.coreos.com/v1alpha1
  kind: CatalogSource
  metadata:
    name: redhat-operators
    namespace: openshift-marketplace
  spec:
    displayName: Red Hat Operators Catalog
    image: registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/olm/redhat-operators:v4.11
    publisher: Red Hat
    sourceType: grpc

2. Reboot the node via `sudo reboot` several times

3. Check catalogsources

Actual results:

$ oc get catsrc -A
NAMESPACE               NAME                  DISPLAY                    TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Intel SRIOV-FEC Operator   grpc   Red Hat     22h

Expected results:

All catalogsources created initially are still present.

Additional info:

Attaching must-gather.

This is a clone of issue OCPBUGS-2873. The following is the description of the original issue:

Description of problem:

Prometheus fails to scrape metrics from the storage operator after some time.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always

Steps to Reproduce:

1. Install storage operator.
2. Wait for 24h (time for the certificate to be recycled).
3.

Actual results:

Targets are down because Prometheus didn't reload the CA certificate.

Expected results:

Prometheus reloads its client TLS certificate and scrapes the target successfully.

Additional info:


This is a clone of issue OCPBUGS-2988. The following is the description of the original issue:

Description of problem:

openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd reporting certificate validation errors:

}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673      15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-041406

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with single stack IPv6 via ZTP procedure

Actual results:

Deployment times out and some of the operators aren't deployed successfully.

NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-18-041406   False       False         True       124m    APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      112m    
cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
config-operator                            4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
console                                                                                                                      
control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
dns                                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
etcd                                       4.12.0-0.nightly-2022-10-18-041406   True        False         True       121m    ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry                             4.12.0-0.nightly-2022-10-18-041406   False       True          True       104m    Available: The registry is removed...
ingress                                    4.12.0-0.nightly-2022-10-18-041406   True        True          True       111m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights                                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      118s    
kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      102m    
kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True        False         True       107m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True        False         False      117m    
machine-api                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-config                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
marketplace                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      116m    
monitoring                                                                      False       True          True       98m     deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      104m    
openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
openshift-samples                                                               False       True          False      103m    The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True        False         False      106m    
service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
storage                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m  

Expected results:

Deployment succeeds without issues.

Additional info:

I was unable to run must-gather, so I am attaching the pod logs copied from the host file system.

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-10740.

This is a clone of issue OCPBUGS-14620. The following is the description of the original issue:

Description of problem:

When installing a HyperShift cluster into ap-southeast-3 (currently only available in the production environment), the install never succeeds because the hosted KCM pods are stuck in CrashLoopBackOff

Version-Release number of selected component (if applicable):

4.12.18

How reproducible:

100%

Steps to Reproduce:

1. Install a HyperShift Cluster in ap-southeast-3 on AWS

Actual results:

kube-controller-manager-54fc4fff7d-2t55x                 1/2     CrashLoopBackOff   7 (2m49s ago)   16m
kube-controller-manager-54fc4fff7d-dxldc                 1/2     CrashLoopBackOff   7 (93s ago)     16m
kube-controller-manager-54fc4fff7d-ww4kv                 1/2     CrashLoopBackOff   7 (21s ago)     15m

With selected "important" logs:
I0606 15:16:25.711483       1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="ConfigMap" apiVersion="v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
I0606 15:16:25.711498       1 event.go:294] "Event occurred" object="kube-system/kube-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="kube-controller-manager-54fc4fff7d-ww4kv_6dbab916-b4bf-447f-bbb2-5037864e7f78 became leader"
W0606 15:16:25.741417       1 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release. Please use https://github.com/kubernetes/cloud-provider-aws
I0606 15:16:25.741763       1 aws.go:1279] Building AWS cloudprovider
F0606 15:16:25.742096       1 controllermanager.go:245] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": not a valid AWS zone (unknown region): ap-southeast-3a

Expected results:

The KCM pods are Running

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

OCPBUGS-1678 is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

Description of problem:

https://github.com/openshift/api/pull/1186 - https://issues.redhat.com/browse/CONSOLE-3069 promoted ConsolePlugin CRD to v1.

The PR introduces also a conversion webhook from v1alpha1 to v1.

In the new CRD version, the i18n field (ConsolePluginI18n) is marked as optional.
The conversion webhook does not set a valid default value ("Lazy"/"Preload") when writing the v1 object, and a v1 object that completely omits spec.i18n is accepted with no valid default value as well.

On the other hand, at garbage collection time the object will be stuck forever due to the lack of a valid value for spec.i18n.loadType.

Example,
create a v1 ConsolePlugin object:

cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
EOF

Delete it in foreground mode:
stirabos@t14s:~$ oc delete consoleplugin test472 --timeout=30s --cascade='foreground' -v 7
I1011 18:20:03.255605   31610 loader.go:372] Config loaded from file:  /home/stirabos/.kube/config
I1011 18:20:03.266567   31610 round_trippers.go:463] DELETE https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins/test472
I1011 18:20:03.266581   31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.266588   31610 round_trippers.go:473]     Accept: application/json
I1011 18:20:03.266594   31610 round_trippers.go:473]     Content-Type: application/json
I1011 18:20:03.266600   31610 round_trippers.go:473]     User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.266606   31610 round_trippers.go:473]     Authorization: Bearer <masked>
I1011 18:20:03.688569   31610 round_trippers.go:574] Response Status: 200 OK in 421 milliseconds
consoleplugin.console.openshift.io "test472" deleted
I1011 18:20:03.688911   31610 round_trippers.go:463] GET https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins?fieldSelector=metadata.name%3Dtest472
I1011 18:20:03.688919   31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.688928   31610 round_trippers.go:473]     Authorization: Bearer <masked>
I1011 18:20:03.688935   31610 round_trippers.go:473]     Accept: application/json
I1011 18:20:03.688941   31610 round_trippers.go:473]     User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.840103   31610 round_trippers.go:574] Response Status: 200 OK in 151 milliseconds
I1011 18:20:03.840825   31610 round_trippers.go:463] GET https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins?fieldSelector=metadata.name%3Dtest472&resourceVersion=175205&watch=true
I1011 18:20:03.840848   31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.840884   31610 round_trippers.go:473]     Accept: application/json
I1011 18:20:03.840907   31610 round_trippers.go:473]     User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.840928   31610 round_trippers.go:473]     Authorization: Bearer <masked>
I1011 18:20:03.972219   31610 round_trippers.go:574] Response Status: 200 OK in 131 milliseconds
error: timed out waiting for the condition on consoleplugins/test472

and in kube-controller-manager logs we see:

2022-10-11T16:25:32.192864016Z I1011 16:25:32.192788       1 garbagecollector.go:501] "Processing object" object="test472" objectUID=0cc46a01-113b-4bbe-9c7a-829a97d6867c kind="ConsolePlugin" virtual=false
2022-10-11T16:25:32.282303274Z I1011 16:25:32.282161       1 garbagecollector.go:623] remove DeleteDependents finalizer for item [console.openshift.io/v1/ConsolePlugin, namespace: , name: test472, uid: 0cc46a01-113b-4bbe-9c7a-829a97d6867c]
2022-10-11T16:25:32.304835330Z E1011 16:25:32.304730       1 garbagecollector.go:379] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"console.openshift.io/v1", Kind:"ConsolePlugin", Name:"test472", UID:"0cc46a01-113b-4bbe-9c7a-829a97d6867c", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: ConsolePlugin.console.openshift.io "test472" is invalid: spec.i18n.loadType: Unsupported value: "": supported values: "Preload", "Lazy"

Version-Release number of selected component (if applicable):

OCP 4.12.0 ec4

How reproducible:

100% 

Steps to Reproduce:

1. cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
EOF
2. oc delete consoleplugin test472 --timeout=30s --cascade='foreground' -v 7

Actual results:

2022-10-11T16:25:32.192864016Z I1011 16:25:32.192788       1 garbagecollector.go:501] "Processing object" object="test472" objectUID=0cc46a01-113b-4bbe-9c7a-829a97d6867c kind="ConsolePlugin" virtual=false
2022-10-11T16:25:32.282303274Z I1011 16:25:32.282161       1 garbagecollector.go:623] remove DeleteDependents finalizer for item [console.openshift.io/v1/ConsolePlugin, namespace: , name: test472, uid: 0cc46a01-113b-4bbe-9c7a-829a97d6867c]
2022-10-11T16:25:32.304835330Z E1011 16:25:32.304730       1 garbagecollector.go:379] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"console.openshift.io/v1", Kind:"ConsolePlugin", Name:"test472", UID:"0cc46a01-113b-4bbe-9c7a-829a97d6867c", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: ConsolePlugin.console.openshift.io "test472" is invalid: spec.i18n.loadType: Unsupported value: "": supported values: "Preload", "Lazy"

Expected results:

Object correctly deleted

Additional info:

The issue doesn't happen with --cascade='background' which is the default on the CLI client
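As another possible mitigation (a sketch only, not verified in this report): explicitly setting spec.i18n.loadType to one of the accepted values when creating the object avoids the empty value that the garbage collector later rejects, using the same apply pattern as above:

cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
  i18n:
    loadType: Preload   # explicit value; "Lazy" is the other accepted option
EOF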

This is a clone of issue OCPBUGS-266. The following is the description of the original issue:

Description of problem: I am working with a customer who uses the web console.  From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups, and they cannot add groups from the web console.  This has led to confusion over whether existing resources were in fact users or groups, and they have added users when they intended to add groups instead.  What we really need is a third column in the Project Access tab that says whether a resource is a user or a group.

 

Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well

How reproducible: Every time.  My customer is running on ROSA, but I have determined this issue to be general to OpenShift.

Steps to Reproduce:

From the oc cli, I create a group and add a user to it.

$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME                                     USERS
cluster-admins                           
dedicated-admins                         admin
techlead   admin
I create a new namespace so that I can assign a group project level access:

$ oc new-project my-namespace

$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access.  I verified the rolebinding named 'edit' is bound to a group named 'techlead'.

$ oc get rolebinding
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      15m
admin-dedicated-admins                                            ClusterRole/admin                      15m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      15m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   15m
edit                                                              ClusterRole/edit                       2m18s
system:deployers                                                  ClusterRole/system:deployer            15m
system:image-builders                                             ClusterRole/system:image-builder       15m
system:image-pullers                                              ClusterRole/system:image-puller        15m

$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:16:56Z"
  name: edit
  namespace: my-namespace
  resourceVersion: "108357"
  uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: techlead

Now, from the same Project Access tab in the web console, I added the developer with role "View".  From this web console, it is unclear whether developer and techlead are users or groups.

Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.

$ oc get rolebinding                                                                      
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      17m
admin-dedicated-admins                                            ClusterRole/admin                      17m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      17m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   17m
edit                                                              ClusterRole/edit                       4m25s
developer-view-c15b720facbc8deb     ClusterRole/view                       90s
system:deployers                                                  ClusterRole/system:deployer            17m
system:image-builders                                             ClusterRole/system:image-builder       17m
system:image-pullers                                              ClusterRole/system:image-puller        17m
[10:21:21] kechung:~ $ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: developer

So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups.  This is in essence our ask for this RFE.

 

Actual results:

Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them.  Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.

 

Expected results:

Should have the ability to add groups and differentiate between users and groups.  Ideally, we're looking at a third column for user or group.

 

Additional info:
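Until the console can add groups, the group binding can be created from the CLI with the names used above, for example:

$ oc adm policy add-role-to-group view techlead -n my-namespace
# creates a RoleBinding whose subject kind is Group, which the Project Access tab cannot currently create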

This is a clone of issue OCPBUGS-12913. The following is the description of the original issue:

Description of problem

CI is flaky because the TestRouterCompressionOperation test fails.

Version-Release number of selected component (if applicable)

I have seen these failures on 4.14 CI jobs.

How reproducible

Presently, search.ci reports the following stats for the past 14 days:

Found in 7.71% of runs (16.58% of failures) across 402 total runs and 24 jobs (46.52% failed)

GCP is most impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 44 runs, 86% failed, 37% of failures match = 32% impact

Azure and AWS are also impacted:

pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator (all) - 36 runs, 64% failed, 43% of failures match = 28% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 38 runs, 79% failed, 23% of failures match = 18% impact

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=compression+error%3A+expected&maxAge=336h&context=1&type=build-log&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.

Actual results

The test fails:

TestAll/serial/TestRouterCompressionOperation 
=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:209: compression error: expected "gzip", got "" for canary route

Expected results

CI passes, or it fails on a different test.
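For manual reproduction outside CI, a rough equivalent of what the test asserts (the route host is a placeholder; look it up with oc get routes -n openshift-ingress-canary):

$ curl -skI -H 'Accept-Encoding: gzip' https://<canary-route-host>/ | grep -i content-encoding
# the test expects "content-encoding: gzip"; in the failures above no such header is returned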

This is a clone of issue OCPBUGS-6175. The following is the description of the original issue:

Description of problem:

When the cluster is configured with a proxy, the Swift client in the image registry operator does not use the proxy to authenticate with OpenStack, so it is unable to reach the OpenStack API. This issue became evident since support was recently added to not fall back to Cinder when Swift is available [1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-1061. The following is the description of the original issue:

Description of problem:

grant monitoring-alertmanager-edit  role to user

# oc adm policy add-cluster-role-to-user cluster-monitoring-view testuser-11

# oc adm policy add-role-to-user monitoring-alertmanager-edit testuser-11 -n openshift-monitoring --role-namespace openshift-monitoring

As the monitoring-alertmanager-edit user, go to the administrator console; the "Observe - Alerting - Silences" page hangs while listing silences. Debugging in the console gave no findings.

 

Create a silence for the Watchdog alert with the monitoring-alertmanager-edit user; the silence page also hangs. Checked with the kubeadmin user, the "Observe - Alerting - Silences" page shows the Watchdog alert is silenced, but checked with the monitoring-alertmanager-edit user, the Watchdog alert is not shown as silenced.

This should be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1947005 since 4.9: there was no such issue then, but there is a similar issue with 4.9.0-0.nightly-2022-09-05-125502 now.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-08-114806

How reproducible:

always

Steps to Reproduce:

1. see the description
2.
3.

Actual results:

In the administrator console, when the monitoring-alertmanager-edit user lists or creates a silence, the "Observe - Alerting - Silences" page hangs

Expected results:

The page should load and not hang

Additional info:

 

This is a clone of issue OCPBUGS-2851. The following is the description of the original issue:

Description of problem:

The current implementation of registries.conf support is not working as expected. This bug report will outline the expectations of how we believe this should work.

Background

The containers/image project defines a configuration file called registries.conf, which controls how image pulls can be redirected to another registry. Effectively the pull request for a given registry is redirected to another registry which can satisfy the image pull request instead. The specification for the registries.conf file is located here. For tools such as podman and skopeo, this configuration file allows those tools to indicate where images should be pulled from, and the containers/image project rewrites the image reference on the fly and tries to get the image from the first location it can, preferring these "alternate locations" and then falling back to the original location if one of the alternate locations can't satisfy the image request.

An important aspect of this redirection mechanism is it allows the "host:port" and "namespace" portions of the image reference to be redirected. To be clear on the nomenclature used in the registries.conf specification, a namespace refers to zero or more slash separated sections leading up to the image name (which is called repo in the specification and has the tag or digest after it. See repo(:_tag|@digest) below) and the host[:port] refers to the domain where the image registry is being hosted.

Example:

host[:port]/namespace[/namespace…]/repo(:_tag|@digest)

For example, if we have an image called myimage@sha:1234 and the image normally resides in quay.io/foo/myimage@sha:1234, you could redirect the image pull request to myregistry.com/bar/baz/myimage@sha:1234. Note that in this example the alternate registry location is on a different host, and the namespace "path" is different too.

Use Case

In a typical development scenario, image references within an OLM catalog should always point to a production location where the image is intended to be pulled from when a catalog is published publicly. Doing this prevents publishing a catalog which contains image references to internal repositories, which would never be accessible by a customer. By using the registries.conf redirection mechanism, we can perform testing even before the images are officially published to public locations, and we can redirect the image reference from a production location to an internal repository for testing purposes. Below is a simple example of a registries.conf file that redirects image pull requests away from prodlocation.io to preprodlocation.com:

[[registry]]
 location = "prodlocation.io/xx"
 insecure = false
 blocked = false
 mirror-by-digest-only = true
 prefix = ""
 [[registry.mirror]]
  location = "preprodlocation.com/xx"
  insecure = false

Other Considerations

  • We only care about redirection of images during image pull. Image redirection on push is out of scope.
  • We would like to see as much support for the fields and TOML tables defined in the spec as possible. That being said, there are some items we don't really care about.
    • supported:
      • support multiple [[registry]] TOML tables
      • support multiple [[registry.mirror]] TOML tables for a given [[registry]] TOML table
      • if all entries of [[registry.mirror]] for a given [[registry]] TOML table do not resolve an image, the original [[registry]] TOML locations should be used as the final fallback (this is consistent with how the specification is written, but we want to make this point clear; see the specification example which describes how things should work)
      • prefix and location
        • These fields work together, so refer to the specification for how this works. If necessary, we could simplify this to only use location since we are unlikely to use the prefix option.
      • insecure
        • this should be supported for the [[registry]] and [[registry.mirror]] TOML tables so you know how to access registries. If this is not needed by oc mirror then we can forgo this field.
    • fields that require discussion:
      • we assume that digests and tags can be supplied for an image reference, but in the end digests are required for oc mirror to keep track of the image in the workspace. It's not clear if we need to support these configuration options or not:
        • mirror-by-digest-only
          • we assume this is always false since we don't need to prevent an image from being pulled if it is using a tag
        • pull-from-mirror
          • we assume this is always all since we don't need to prevent an image from being pulled if it is using a tag
    • does not need to be supported:
      • unqualified-search-registries
      • credential-helpers
      • blocked
      • aliases
  • we are not interested in supporting version 1 of registries.conf since it is deprecated

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

oc mirror -c ImageSetConfiguration.yaml --use-oci-feature --oci-feature-action mirror --oci-insecure-signature-policy --oci-registries-config registries.conf --dest-skip-tls docker://localhost:5000/example/test

Example registries.conf

[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "prod.com/abc"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/cp"
    insecure = false
[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "quay.io"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/abcd"
    insecure = false

 

Actual results:

images are not pulled from "internal" registry

Expected results:

images should be pulled from "internal" registry

Additional info:

The current implementation in oc mirror creates its own structs to approximate the ones provided by the containers/image project, but it might not be necessary to do that. Since the oc mirror project already uses containers/image as a dependency, it could leverage the FindRegistry function, which takes a image reference, loads the registries.conf information and returns the most appropriate [[registry]] reference (in the form of Registry struct) or nil if no match was found. Obviously custom processing will be necessary to do something useful with the Registry instance. Using this code is not a requirement, just a suggestion of another possible path to load the configuration.
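Outside of oc mirror, the redirection rules in a registries.conf file can also be sanity-checked with the containers tools themselves. A hedged sketch, assuming the installed skopeo honours the CONTAINERS_REGISTRIES_CONF environment variable (the image reference is a placeholder):

$ CONTAINERS_REGISTRIES_CONF=./registries.conf \
    skopeo inspect --raw docker://prod.com/abc/<repo>@sha256:<digest>
# skopeo should try the [[registry.mirror]] location first and fall back to
# prod.com/abc, matching the fallback behaviour described above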

This is a clone of issue OCPBUGS-4969. The following is the description of the original issue:

Description of problem:

A ROSA machinepool is created and the label k8s.ovn.org/egress-assignable is added during creation. The newly created nodes are not discovered as egressIP nodes and no egressIP addresses are assigned.

It was discovered that removing the k8s.ovn.org/egress-assignable label from the nodes by editing the machinepool, and subsequently reapplying the label, causes the nodes to be discovered as egressIP capable.

While it is possible to work around the issue by removing and reapplying the label, this will likely not work with node auto-scaling.

 

Version-Release number of selected component (if applicable):

4.11.18

How reproducible:

Always

Steps to Reproduce:

1. Create a machinepool and label for egressIP
$ rosa create machinepool -c brosenbe --name mp-1 --labels k8s.ovn.org/egress-assignable="" --replicas=3
I: Machine pool 'mp-1' created successfully on cluster 'brosenbe'
I: To view all machine pools, run 'rosa list machinepools -c brosenbe'


2. Wait for nodes to be instantiated
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable

Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable         brosenbe.syd.csb: Fri Dec 16 15:20:47 2022
NAME                                              STATUS   ROLES    AGE     VERSION
ip-10-0-136-123.ap-southeast-2.compute.internal   Ready    worker   7m55s   v1.24.6+5658434
ip-10-0-178-34.ap-southeast-2.compute.internal    Ready    worker   7m59s   v1.24.6+5658434
ip-10-0-192-110.ap-southeast-2.compute.internal   Ready    worker   8m      v1.24.6+5658434


3. Create egressip object
$ cat << EOF >egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-group1
spec:
  egressIPs:
  - 10.0.128.152
  - 10.0.160.152
  - 10.0.192.152
  namespaceSelector:
    matchLabels:
      env: dev
EOF


4. Apply egressip object
$ oc apply -f egressip.yaml 
egressip.k8s.ovn.org/egress-group1 created


5. Note that no IP addresses from egressip/egress-group1 have been assigned
$ oc get egressip
NAME            EGRESSIPS         ASSIGNED NODE   ASSIGNED EGRESSIPS
egress-group1   10.0.128.152
                   
$ oc get event -n default | egrep egressip | tail -1
34s         Warning   NoMatchingNodeFound         egressip/egress-group1                                      no assignable nodes for EgressIP: egress-group1, please tag at least one node with label: k8s.ovn.org/egress-assignable

$ ns=openshift-ovn-kubernetes; for pod in $(oc get pods -n $ns -l app=ovnkube-master -o name); do pod=${pod##*/}; echo $pod; oc logs -n $ns $pod -c ovnkube-master | grep 'No assignable nodes found for EgressIP' | tail -1; done
ovnkube-master-bgz84
ovnkube-master-kzgpc
ovnkube-master-pbtn9
E1216 04:21:50.578203       1 egressip.go:1567] No assignable nodes found for EgressIP: egress-group1 and requested IPs: [10.0.128.152 10.0.160.152 10.0.192.152]


6. Remove egressIP labels
$ rosa edit machinepool -c brosenbe mp-1 --replicas 3 --labels ''
I: Updated machine pool 'mp-1' on cluster 'brosenbe'


7. Wait a bit for labels to be removed...
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable

Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable          brosenbe.syd.csb: Fri Dec 16 15:51:57 2022

No resources found


8. Reapply label k8s.ovn.org/egress-assignable 
$ rosa edit machinepool -c brosenbe mp-1 --replicas 3 --labels k8s.ovn.org/egress-assignable=''
I: Updated machine pool 'mp-1' on cluster 'brosenbe'


9. Wait a while for labels to be applied
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable

Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable          brosenbe.syd.csb: Fri Dec 16 16:00:03 2022
NAME                                              STATUS   ROLES    AGE   VERSION
ip-10-0-136-123.ap-southeast-2.compute.internal   Ready    worker   47m   v1.24.6+5658434
ip-10-0-178-34.ap-southeast-2.compute.internal    Ready    worker   47m   v1.24.6+5658434
ip-10-0-192-110.ap-southeast-2.compute.internal   Ready    worker   47m   v1.24.6+5658434


10. Note that egressIP addresses have now been assigned to nodes
$ oc get egressip egress-group1
NAME            EGRESSIPS      ASSIGNED NODE                                     ASSIGNED EGRESSIPS
egress-group1   10.0.128.152   ip-10-0-167-202.ap-southeast-2.compute.internal   10.0.160.152

$ oc get egressip egress-group1 -o yaml | yq -y '.status'
items:
  - egressIP: 10.0.128.152
    node: ip-10-0-136-123.ap-southeast-2.compute.internal
  - egressIP: 10.0.192.152
    node: ip-10-0-192-110.ap-southeast-2.compute.internal
  - egressIP: 10.0.160.152
    node: ip-10-0-178-34.ap-southeast-2.compute.internal 

Actual results:

EgressIP addresses not applied to nodes with k8s.ovn.org/egress-assignable label

Expected results:

EgressIP addresses are applied to nodes with k8s.ovn.org/egress-assignable label

Additional info:

 

This is a clone of issue OCPBUGS-11773. The following is the description of the original issue:

Description of problem:

With a new S3 bucket, the hc failed with the condition:
- lastTransitionTime: "2023-04-13T14:17:11Z"
  message: 'failed to upload /.well-known/openid-configuration to the heli-hypershift-demo-oidc-2
    s3 bucket: aws returned an error: AccessControlListNotSupported'
  observedGeneration: 3
  reason: OIDCConfigurationInvalid
  status: "False"
  type: ValidOIDCConfiguration

Version-Release number of selected component (if applicable):

 

How reproducible:

1. Create the S3 bucket:
$ aws s3api create-bucket --create-bucket-configuration  LocationConstraint=us-east-2 --region=us-east-2 --bucket heli-hypershift-demo-oidc-2
{
  "Location": "http://heli-hypershift-demo-oidc-2.s3.amazonaws.com/"
}
[cloud-user@heli-rhel-8 ~]$ aws s3api delete-public-access-block --bucket heli-hypershift-demo-oidc-2

2. Install HO and create a hc on AWS us-west-2
3. The hc fails with the ValidOIDCConfiguration=False condition shown above (reason OIDCConfigurationInvalid, AccessControlListNotSupported)

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

create a hc successfully

Additional info:
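Not confirmed as the fix, but for background: buckets created after Amazon disabled ACLs by default reject ACL-carrying uploads with AccessControlListNotSupported. One hedged workaround sketch is to re-enable object ACLs on the OIDC bucket before creating the hosted cluster:

$ aws s3api put-bucket-ownership-controls \
    --bucket heli-hypershift-demo-oidc-2 \
    --ownership-controls 'Rules=[{ObjectOwnership=ObjectWriter}]'
# ObjectWriter allows uploads that set an ACL, which the error above indicates the upload does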

 

Description of problem:

OLM PSA plug-in is disabled for 4.12

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Very

Steps to Reproduce:

1. Install ODF
2. PSA Fails
3.

Actual results:

 

Expected results:

PSA Should not fail

Additional info:

Related to bug: https://bugzilla.redhat.com/show_bug.cgi?id=2217783

 

Additional context: https://redhat-internal.slack.com/archives/C3VS0LV41/p1687853580702379

 

Description of problem:

The name of the workload gets changed when the project and image stream are changed and the form is reloaded on the workload's Edit Deployment page

Version-Release number of selected component (if applicable):

4.9 and above

How reproducible:

Always

Steps to Reproduce:

1. Create a deployment workload
2. Select Edit Deployment option on workload
3. Verify initially name was same as workload name and field was not changeable.
4. Change the project to "openshift", image stream to "golang" or anything and tag to "latest"
5. Reload the form
6. Now check that the name also got changed to golang. 

Actual results:

The workload name changes when the project and image stream are changed on the Edit Deployment page.

Expected results:

The workload name should not change when the image stream is changed on the Edit Deployment page, as the name field is not editable.

Additional info:

While performing automation, I can see the error "the name of the object (imageStreamName) does not match the name on the URL (workloadName)", but when performing this in the UI, no errors are shown.

This is a clone of issue OCPBUGS-7485. The following is the description of the original issue:

Description of problem:

When creating a sample devfile app from the Samples page, the corresponding topology icon for the app is not set. This issue is not observed when creating a BuilderImage from the Samples page.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. Create a Sample Devfile App from the Samples Page
2. Go to the Topology Page and check the icon of the app created.

Actual results:

The generic Openshift logo is displayed

Expected results:

Need to show the corresponding app icon (Golang, Quarkus, etc.)

Additional info:

When creating a BuilderImage sample, the icon is set properly according to the BuilderImage used.

Current label: app.openshift.io/runtime=dotnet-basic
Change to: app.openshift.io/runtime=dotnet
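A possible manual workaround until the sample creation sets the right runtime label (a sketch; the deployment name is a placeholder):

$ oc label deployment <sample-deployment> app.openshift.io/runtime=dotnet --overwrite
# Topology derives the icon from the app.openshift.io/runtime label, so replacing
# dotnet-basic with dotnet makes the .NET icon appear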

Description of problem:

The default catalogSources in the openshift-4.12 payload are still using the 4.11 image tag

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.12 OpenShift cluster
2. Inspect the default catalogSource image tags.

Actual results:

The default catalogSources reference the 4.11 image tags.

Expected results:

The default catalogSources reference the 4.12 image tags.

Additional info:
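The tags in use can be checked directly on a cluster, for example:

$ oc -n openshift-marketplace get catalogsources \
    -o custom-columns=NAME:.metadata.name,IMAGE:.spec.image
# on an affected 4.12 cluster the IMAGE column still shows the 4.11 tag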

 

 

Description of problem:

On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Observe > Alerting pages
2. Click on an alert (or go to the rules tab then click on a rule)
3. Click on one of the underlined fields (those that have a popover help)

Actual results:

 

Expected results:

 

Additional info:

 

Hi,

Description of problem

Bare Metal IPI provisioning fails to provision the worker nodes. The metal3-machine-os-downloader InitContainer is in CrashLoopBackOff state because it cannot find the virt-* commands in the container image.

> oc -n openshift-machine-api get pods | grep -v Running
NAME                       READY   STATUS
metal3-fc66f5846-gtq9m     0/7     Init:CrashLoopBackOff
metal3-image-cache-d4qcz   0/1     Init:1/2
metal3-image-cache-djzcf   0/1     Init:1/2
metal3-image-cache-p5mwg   0/1     Init:1/2
> oc -n openshift-machine-api logs deployment/metal3 -c metal3-machine-os-downloader
[omitted]
++ LIBGUESTFS_BACKEND=direct
++ virt-filesystems -a rhcos-412.86.202207142104-0-openstack.x86_64.qcow2 -l
/usr/local/bin/get-resource.sh: line 88: virt-filesystems: command not found
++ grep boot
++ cut -f1 '-d '
+ BOOT_DISK=
++ LIBGUESTFS_BACKEND=direct
++ virt-ls -a rhcos-412.86.202207142104-0-openstack.x86_64.qcow2 -m '' /boot/loader/entries
/usr/local/bin/get-resource.sh: line 90: virt-ls: command not found
+ BOOT_ENTRIES=
+ rm -fr /shared/tmp/tmp.CnCd2E3kxN
Version-Release number of selected component (if applicable):

OpenShift 4.12.0-ec.0+

Analysis

Since https://github.com/openshift/ocp-build-data/pull/1757, the ironic-machine-os-downloader container image is built using RHEL9 repositories.

However, following upstream move of guestfs tools to a dedicated repository [1], the libguestfs packaging differs between RHEL8 and RHEL9:

  • the libguestfs-tools-c package containing most virt-* commands is now provided by the guestfs-tools package
  • the libguestfs-tools package is now provided by the virt-win-reg package which does not require the libguestfs-tools-c package anymore

Since the Dockerfile specifies only the libguestfs-tools package, the virt-* commands are not installed when using RHEL9 repositories.

A trivial fix is to update the Dockerfile to install the guestfs-tools package instead of the libguestfs-tools package.
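A quick way to confirm the packaging difference on a RHEL9 buildroot (a sketch, not from the original report):

$ dnf provides '*/virt-filesystems'
# on RHEL9 this resolves to guestfs-tools rather than libguestfs-tools,
# which is why the image needs guestfs-tools installed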

Regards,

Denis

This is a clone of issue OCPBUGS-12864. The following is the description of the original issue:

This is a clone of issue OCPBUGS-283. The following is the description of the original issue:

Description of problem:
Traffic leaves OpenShift nodes with the pod/node IP instead of the egress IP

Version-Release number of selected component (if applicable):
4.9.33
4.9.43

How reproducible:

  • every time

Steps to Reproduce:
1. Assign egress IP
2. remove `k8s.ovn.org/egress-assignable:` label from the node where the egress IP is attached
3. For some time, almost a minute, traffic leaves OpenShift nodes with the pod IP instead of the egress IP

Actual results:
Traffic leaves OpenShift nodes with the pod IP instead of the egress IP

Expected results:
Traffic should not leave with the OpenShift pod IP; it should use the egress IP

Additional info:

This is a clone of issue OCPBUGS-3358. The following is the description of the original issue:

Description of problem:
Due to changes in BUILD-407 which merged into release-4.12, we have a permafailing test `e2e-aws-csi-driver-no-refreshresource` and are unable to merge subsequent pull requests.

Version-Release number of selected component (if applicable):


How reproducible: Always

Steps to Reproduce:

1. Bring up cluster using release-4.12 or release-4.13 or master branch
2. Run `e2e-aws-csi-driver-no-refreshresource` test
3.

Actual results:
I1107 05:18:31.131666 1 mount_linux.go:174] Cannot run systemd-run, assuming non-systemd OS
I1107 05:18:31.131685 1 mount_linux.go:175] systemd-run failed with: exit status 1
I1107 05:18:31.131702 1 mount_linux.go:176] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down

Expected results:
Test should pass

Additional info:


This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:

Description of problem:
In a large cluster, the sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high-count resources.

Version-Release number of selected component (if applicable):

How reproducible:
NA

Steps to Reproduce:
NA

Actual results:
The Kube API server and OpenShift API server in one of the clusters keep restarting without a clear error. The cluster is not accessible.

Expected results:
Kube API Server and Openshift API Server should be stable.

Additional info:

User Story

As an OpenShift operator, I would like to be able to add labels with unique values to my MachineSets and nodes, while also using the cluster autoscaler's ability to balance similar node groups. Being able to specify additional labels to ignore through the ClusterAutoscaler CRD would allow me to do that.

Background

Something that has arisen during the investigation of https://bugzilla.redhat.com/show_bug.cgi?id=2001027 is the notion that each CSI driver could create its own zone topology labels, and that they do not have to be consistent with the well known kubernetes label.

It is possible, although not entirely confirmed, that a CSI driver might add these labels even when not in use (although running in the cluster).

Additionally, users may need the option to specify more labels to ignore (as illustrated in the discussion of the bug).
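As a sketch of what the resulting API could look like (the new field name is illustrative until the enhancement merges), a ClusterAutoscaler ignoring CSI-specific zone labels might read:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  balanceSimilarNodeGroups: true
  # proposed list field: labels the autoscaler ignores when comparing node groups
  # (field name illustrative)
  balancingIgnoredLabels:
    - topology.ebs.csi.aws.com/zone
    - example.com/node-pool-id
```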

Steps

  • Add a new API field for the labels to ignore
  • it should be a list
  • write some unit tests
  • update our balance node e2e test

Stakeholders

  • cloud team, qe

Definition of Done

  • field and functionality added
  • Docs
  • product docs will need an update
  • Testing
  • unit and e2e

Description

As a user, I would like to see the type of technology used by the samples on the samples view, similar to the all services view.

On the samples view:

It shows different types of samples (e.g. devfile, helm) that are all displayed as .NET, making it difficult for users to decide which .NET entry to select from the list. We need something like the all services view, which shows the type of technology at the top right of each card, so users can differentiate between the entries:

Acceptance Criteria

  1. Add a visible label on each card of the samples view, as in the all services view, showing the technology used by the sample.

Additional Details:

This is a clone of issue OCPBUGS-15512. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14969. The following is the description of the original issue:

Description of problem:

When an HCP Service LB is created, for example for an IngressController, the CAPA controller calls ModifyNetworkInterfaceAttribute. It references the default security group for the VPC in addition to the security group created for the cluster (with the right tags). Ideally, the LBs (and any other HCP components) should not be using the default VPC security group.

Version-Release number of selected component (if applicable):

All 4.12 and 4.13

How reproducible:

100%

Steps to Reproduce:

1. Create HCP
2. Wait for Ingress to come up.
3. Look in CloudTrail for ModifyNetworkInterfaceAttribute, and see default security group referenced 

Actual results:

Default security group is used

Expected results:

Default security group should not be used

Additional info:

This is problematic as we are attempting to scope our AWS permissions as tightly as possible. The goal is to only use resources that are tagged with `red-hat-managed: true` so that our IAM policies can be conditioned to only access these resources. Using the security group created for the cluster should be sufficient, and the default security group does not need to be used, so if this usage can be removed here, we can lock down our AWS policies that much more. Similar to OCPBUGS-11894.

aws-ebs-csi-driver-controller-ca ServiceAccount does not include the HCP pull-secret in its imagePullSecrets. Thus, if a HostedCluster is created with a `pullSecret` that contains creds that the management cluster pull secret does not have, the image pull fails.
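A hedged sketch of what the fix would look like on the ServiceAccount (name, namespace, and secret name are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-ebs-csi-driver-controller-sa      # illustrative name for the affected ServiceAccount
  namespace: <hosted-control-plane-namespace> # illustrative
imagePullSecrets:
  - name: pull-secret                         # the HostedCluster pull secret; name illustrative
```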

Description of problem:

Customer has noticed that object count quotas ("count/*") do not work for certain objects in ClusterResourceQuotas. For example, the following ResourceQuota works as expected:

~~~
apiVersion: v1
kind: ResourceQuota
metadata:
[..]
spec:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
status:
  hard:
    count/routes.route.openshift.io: "900"
    count/servicemonitors.monitoring.coreos.com: "100"
    pods: "100"
  used:
    count/routes.route.openshift.io: "0"
    count/servicemonitors.monitoring.coreos.com: "1"
    pods: "4"
~~~

However when using "count/servicemonitors.monitoring.coreos.com" in ClusterResourceQuotas, this does not work (note the missing "used"):

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
[..]
spec:
  quota:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
  selector:
    annotations:
      openshift.io/requester: kube:admin
status:
  namespaces:
[..]
  total:
    hard:
      count/routes.route.openshift.io: "900"
      count/servicemonitors.monitoring.coreos.com: "100"
      count/simon.krenger.ch: "100"
      pods: "100"
    used:
      count/routes.route.openshift.io: "0"
      pods: "4"
~~~

This behaviour does not only apply to "servicemonitors.monitoring.coreos.com" objects, but also to other objects, such as:

- count/kafkas.kafka.strimzi.io: '0'
- count/prometheusrules.monitoring.coreos.com: '100'
- count/servicemonitors.monitoring.coreos.com: '100'

The debug output for kube-controller-manager shows the following entries, which may or may not be related:

~~~
$ oc logs kube-controller-manager-ip-10-0-132-228.eu-west-1.compute.internal | grep "servicemonitor"
I0511 15:07:17.297620 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.297630 1 resource_quota_monitor.go:181] QuotaMonitor using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors"
I0511 15:07:17.297642 1 resource_quota_monitor.go:233] QuotaMonitor created object count evaluator for servicemonitors.monitoring.coreos.com
[..]
I0511 15:07:17.486279 1 patch_informers_openshift.go:90] Couldn't find informer for monitoring.coreos.com/v1, Resource=servicemonitors
I0511 15:07:17.486297 1 graph_builder.go:176] using a shared informer for resource "monitoring.coreos.com/v1, Resource=servicemonitors", kind "monitoring.coreos.com/v1, Kind=ServiceMonitor"
~~~

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.15

How reproducible:

Always

Steps to Reproduce:

1. On an OCP 4.12 cluster, create the following ClusterResourceQuota:

~~~
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: case-03509174
spec:
  quota: 
    hard:
      count/servicemonitors.monitoring.coreos.com: "100"
      pods: "100"
  selector:
    annotations: 
      openshift.io/requester: "kube:admin"
~~~

2. As "kubeadmin", create a new project and deploy one new ServiceMonitor, for example: 

~~~
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: simon-servicemon-2
  namespace: simon-1
spec:
  endpoints:
    - path: /metrics
      port: http
      scheme: http
  jobLabel: component
  selector:
    matchLabels:
      deployment: echoenv-1
~~~

Actual results:

The "used" field for ServiceMonitors is not populated in the ClusterResourceQuota for certain objects. It is unclear if these quotas are enforced or not

Expected results:

ClusterResourceQuota for ServiceMonitors is updated and enforced

Additional info:

* Must-gather for a cluster showing this behaviour (added debug for kube-controller-manager) is available here: https://drive.google.com/file/d/1ioEEHZQVHG46vIzDdNm6pwiTjkL9QQRE/view?usp=share_link
* Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1683876047243989

This is a clone of issue OCPBUGS-1725. The following is the description of the original issue:

Description of problem:

Cluster ingress operator creates router deployments with affinity rules when running in a cluster with non-HA infrastructure plane (InfrastructureTopology=="SingleReplica") and "NodePortService" endpoint publishing strategy. With only one worker node available, rolling update of router-default stalls.

Version-Release number of selected component (if applicable):

All

How reproducible:

Create a single worker node cluster with "NodePortService" endpoint publishing strategy and try to restart the default router. Restart will not go through.

Steps to Reproduce:

1. Create a single worker node OCP cluster with HA control plane (ControlPlaneTopology=="HighlyAvailable"/"External") and one worker node (InfrastructureTopology=="SingleReplica") using "NodePortService" endpoint publishing strategy. The operator will create "ingress-default" deployment with "podAntiAffinity" block, even though the number of nodes where ingress pods can be scheduled is only one:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  name: router-default
  namespace: openshift-ingress
  ...
spec:
  ...
  replicas: 1
  ...
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    ...
    spec:
      affinity:
        ...
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ingresscontroller.operator.openshift.io/deployment-ingresscontroller
                operator: In
                values:
                - default
              - key: ingresscontroller.operator.openshift.io/hash
                operator: In
                values:
                - 559d6c97f4
            topologyKey: kubernetes.io/hostname
...
```

2. Restart the default router

```
oc rollout restart deployment router-default -n openshift-ingress
```
 

Actual results:

Deployment restart does not complete and hangs forever:

```
oc get po -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-58d88f8bf6-cxnjk   0/1     Pending   0          2s
router-default-5bb8c8985b-kdg92   1/1     Running   0          2d23h
```

Expected results:

Deployment restart completes

Additional info:

 

This is a clone of issue OCPBUGS-3186. The following is the description of the original issue:

Description of problem:

Fails to produce a clear error message when the zones do not match the subnets in BYON.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. install-config.yaml 
 yq '.controlPlane.platform.ibmcloud.zones,.platform.ibmcloud.controlPlaneSubnets' install-config.yaml 
["ca-tor-1", "ca-tor-2", "ca-tor-3"]
- ca-tor-existing-network-1-cp-ca-tor-2
- ca-tor-existing-network-1-cp-ca-tor-3
2. openshift-install create manifests --dir byon-az-test-1

Actual results:

FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: no subnet found for ca-tor-1

Expected results:

A clearer error message should be produced, pointing out the mismatch between zones and subnets in install-config.yaml.

Additional info:

 

 

 

 

Description of problem:

When deleting a BYOH node on Platform:None, as well as in an Azure IPI cluster, the node gets reconciled correctly; however, when added back to the cluster it stays in Ready,SchedulingDisabled. When checking the WMCO logs, we can observe the following log:

{"level":"error","ts":"2022-12-14T16:14:31Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"d66a3142-d52c-43f5-8a42-214ce9c88417","error":"error configuring host with address 10.0.55.21: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation for byoh-2019: timeout waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation: timed out waiting for the condition"

And when checking the node's annotation, it is indeed missing:

$ oc get nodes byoh-2019 -o=jsonpath="{.metadata.annotations}"
{"volumes.kubernetes.io/controller-managed-attach-detach":"true","windowsmachineconfig.openshift.io/desired-version":"7.0.0-16f486a","windowsmachineconfig.openshift.io/pub-key-hash":"1df2c166b1c401180523270e9cf6bc2cd2724b9279ea65668a3b95298525a0f5","windowsmachineconfig.openshift.io/username":"wx4EBwMICL6qT+4RY8tgbx4hiRmQdHlwUsHgVGCTVY7S5gG/G5gb/Wzv0JBLhNP9\u003cwmcoMarker\u003ejlmI5ExHPYFrd2Fw6Lxe/6PKEE5/vYAhZ2n1Z2nBIoa1xN1/HEaXhqR2CuXNe7Ez\u003cwmcoMarker\u003eg2Hg+gA=\u003cwmcoMarker\u003e=ubWA"}

Tested in Azure IPI and Platform:None, in both cases the issue got reproduced.

Version-Release number of selected component (if applicable):

$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      10h
openshift-service-ca.crt               1      10h
windows-instances                      2      9h
windows-machine-config-operator-lock   0      6h24m
windows-services-7.0.0-16f486a         2      6h23m
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.4   True        False         6h48m   Cluster version is 4.12.0-rc.4

How reproducible:


Steps to Reproduce:

1. Deploy a OCP 4.11 cluster with WMCO 6.0.0
2. Add one or two byoh nodes to the cluster
3. Upgrade the cluster to OCP 4.12, and later WMCO to 7.0.0
4. Remove one of the byoh nodes using: oc delete node <byoh-node-id>
5. Wait for reconciliation to bring the node back

Actual results:

The deleted node gets re-added but stays in Ready,SchedulingDisabled, and the workloads are left in Pending state.

Expected results:

The node gets properly added to the cluster and stays in Ready.

Additional info:


Description of problem:

The user mirrored the 4.11.0 release and attempted to use it to generate the installation ISO in a completely disconnected environment.

When it was time to extract the OS image from machine-os-images, the agent-based installer ran: oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352

This does not include the --icsp-file flag, so the image reference is not mapped to the local mirror and the extraction fails.
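A hedged sketch of the invocation with the missing flag added, assuming oc's --icsp-file option is used so the release pullspec resolves to the local mirror (file paths illustrative):

```
oc adm release info --image-for=machine-os-images --insecure=true \
  --icsp-file=/path/to/image-content-source-policy.yaml \
  --registry-config=/tmp/registry-config1141450352 \
  quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
```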

 

Version-Release number of selected component (if applicable):

https://github.com/openshift/installer/releases/tag/agent-installer-v4.11.0-dev-preview-2

How reproducible:

100%

Steps to Reproduce:

1. Mirroring the images of 4.11.0 using oc adm mirror command to the local registry.
2. Created install-config.yaml with mirror config
3. Created agent-config.yaml 
4. openshift-install-sep1 agent create image --dir kni-22

 

Actual results:

INFO[0001] Start configuring static network for 3 hosts  pkg=manifests
INFO[0002] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0002] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0002] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
DEBUG   Fetching BaseIso Image...
DEBUG     Fetching Agent Manifests...
DEBUG     Reusing previously-fetched Agent Manifests
DEBUG     Fetching Install Config...
DEBUG     Reusing previously-fetched Install Config
DEBUG     Fetching Mirror Registries Config...
DEBUG     Reusing previously-fetched Mirror Registries Config
DEBUG   Generating BaseIso Image...
INFO[0004] Extracting base ISO from release payload
ERRO[0014] command 'oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352' exited with non-zero exit code 1:
error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4: Get "http://quay.io/v2/": dial tcp: lookup quay.io on 10.92.86.56:53: server misbehaving
WARN[0014] Failed to extract base ISO from release payload - check registry configuration
INFO[0014] Downloading base ISO
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/releases/rhcos-4.11/411.86.202207150124-0/x86_64/rhcos-411.86.202207150124-0-live.x86_64.iso'
ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available
FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to generate asset "BaseIso Image": failed to get base ISO image: command 'oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352' exited with non-zero exit code 1:
FATAL error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4: Get "http://quay.io/v2/": dial tcp: lookup quay.io on 10.92.86.56:53: server misbehaving
FATAL

Expected results:

Image correctly generated

Additional info:

Host OS: RHEL 8.4
NMstate version: nmstate-1.0.2-5.el8.noarch

Description of problem:

A pod sometimes doesn't work as expected when it has the same name as a previous pod, on an OVN network cluster.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-09-05-064152

How reproducible:

Always, but it may take several tries.

Steps to Reproduce:

1.Create a machineset
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
machineset.machine.openshift.io/huliu-nu96a-zn7mc-workera created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                              PHASE     TYPE   REGION    ZONE              AGE
huliu-nu96a-zn7mc-master-0        Running   AHV    Unnamed   Development-LTS   6h14m
huliu-nu96a-zn7mc-master-1        Running   AHV    Unnamed   Development-LTS   6h14m
huliu-nu96a-zn7mc-master-2        Running   AHV    Unnamed   Development-LTS   6h14m
huliu-nu96a-zn7mc-worker-5j47v    Running   AHV    Unnamed   Development-LTS   6h9m
huliu-nu96a-zn7mc-worker-thprs    Running   AHV    Unnamed   Development-LTS   6h9m
huliu-nu96a-zn7mc-workera-x54mr   Running   AHV    Unnamed   Development-LTS   6m50s
liuhuali@Lius-MacBook-Pro huali-test % oc get node                                          
NAME                              STATUS   ROLES                  AGE     VERSION
huliu-nu96a-zn7mc-master-0        Ready    control-plane,master   6h12m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-1        Ready    control-plane,master   6h12m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-2        Ready    control-plane,master   6h12m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-5j47v    Ready    worker                 6h      v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-thprs    Ready    worker                 6h      v1.25.12+26bab08
huliu-nu96a-zn7mc-workera-x54mr   Ready    worker                 3m7s    v1.25.12+26bab08 

2.Create a pod on the new node
liuhuali@Lius-MacBook-Pro huali-test % oc create -f kubelet-killer2.yaml
pod/kubelet-killer created
liuhuali@Lius-MacBook-Pro huali-test % cat kubelet-killer2.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    kubelet-killer: ""
  name: kubelet-killer
  namespace: openshift-machine-api
spec:
  containers:
  - command:
    - pkill
    - -STOP
    - kubelet
    image: quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c
    imagePullPolicy: Always
    name: kubelet-killer
    securityContext:
      privileged: true
  enableServiceLinks: true
  hostPID: true
  nodeName: huliu-nu96a-zn7mc-workera-x54mr
  restartPolicy: Never
liuhuali@Lius-MacBook-Pro huali-test % 

3.The pod worked as expected
liuhuali@Lius-MacBook-Pro huali-test % oc get node   
NAME                              STATUS     ROLES                  AGE     VERSION
huliu-nu96a-zn7mc-master-0        Ready      control-plane,master   6h13m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-1        Ready      control-plane,master   6h14m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-2        Ready      control-plane,master   6h13m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-5j47v    Ready      worker                 6h2m    v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-thprs    Ready      worker                 6h2m    v1.25.12+26bab08
huliu-nu96a-zn7mc-workera-x54mr   NotReady   worker                 4m43s   v1.25.12+26bab08
liuhuali@Lius-MacBook-Pro huali-test % oc describe pod kubelet-killer  
Name:         kubelet-killer
Namespace:    openshift-machine-api
Priority:     0
Node:         huliu-nu96a-zn7mc-workera-x54mr/10.0.132.101
Start Time:   Wed, 06 Sep 2023 15:33:43 +0800
Labels:       kubelet-killer=
Annotations:  k8s.ovn.org/pod-networks:
                {"default":{"ip_addresses":["10.130.8.7/23"],"mac_address":"0a:58:0a:82:08:07","gateway_ips":["10.130.8.1"],"ip_address":"10.130.8.7/23","...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.130.8.7"
                    ],
                    "mac": "0a:58:0a:82:08:07",
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.130.8.7"
                    ],
                    "mac": "0a:58:0a:82:08:07",
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  kubelet-killer:
    Container ID:  
    Image:         quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      pkill
      -STOP
      kubelet
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nm9vd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-nm9vd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  AddedInterface  90s   multus   Add eth0 [10.130.8.7/23] from ovn-kubernetes
  Normal  Pulling         90s   kubelet  Pulling image "quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c"
  Normal  Pulled          87s   kubelet  Successfully pulled image "quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c" in 2.310348601s (2.310355399s including waiting)
  Normal  Created         87s   kubelet  Created container kubelet-killer
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                              PHASE     TYPE   REGION    ZONE              AGE
huliu-nu96a-zn7mc-master-0        Running   AHV    Unnamed   Development-LTS   6h17m
huliu-nu96a-zn7mc-master-1        Running   AHV    Unnamed   Development-LTS   6h17m
huliu-nu96a-zn7mc-master-2        Running   AHV    Unnamed   Development-LTS   6h17m
huliu-nu96a-zn7mc-worker-5j47v    Running   AHV    Unnamed   Development-LTS   6h11m
huliu-nu96a-zn7mc-worker-thprs    Running   AHV    Unnamed   Development-LTS   6h11m
huliu-nu96a-zn7mc-workera-x54mr   Running   AHV    Unnamed   Development-LTS   9m5s
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS              RESTARTS   AGE
cluster-autoscaler-operator-854c6755f5-r9c2k          2/2     Running             0          5h41m
cluster-baremetal-operator-976487bc9-7czpk            2/2     Running             0          5h41m
control-plane-machine-set-operator-69684bcccd-c6jnf   1/1     Running             0          5h41m
kubelet-killer                                        0/1     ContainerCreating   0          98s
machine-api-controllers-7f574b69b5-w5swt              7/7     Running             0          155m
machine-api-operator-7f46db4fcc-v6w9p                 2/2     Running             0          5h41m

4.Try this once again. Delete the old machine and let it recreate a new one

liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-nu96a-zn7mc-workera-x54mr
machine.machine.openshift.io "huliu-nu96a-zn7mc-workera-x54mr" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS        RESTARTS   AGE
cluster-autoscaler-operator-854c6755f5-r9c2k          2/2     Running       0          5h42m
cluster-baremetal-operator-976487bc9-7czpk            2/2     Running       0          5h42m
control-plane-machine-set-operator-69684bcccd-c6jnf   1/1     Running       0          5h42m
kubelet-killer                                        0/1     Terminating   0          2m28s
machine-api-controllers-7f574b69b5-w5swt              7/7     Running       0          156m
machine-api-operator-7f46db4fcc-v6w9p                 2/2     Running       0          5h42m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                              PHASE          TYPE   REGION    ZONE              AGE
huliu-nu96a-zn7mc-master-0        Running        AHV    Unnamed   Development-LTS   6h18m
huliu-nu96a-zn7mc-master-1        Running        AHV    Unnamed   Development-LTS   6h18m
huliu-nu96a-zn7mc-master-2        Running        AHV    Unnamed   Development-LTS   6h18m
huliu-nu96a-zn7mc-worker-5j47v    Running        AHV    Unnamed   Development-LTS   6h12m
huliu-nu96a-zn7mc-worker-thprs    Running        AHV    Unnamed   Development-LTS   6h12m
huliu-nu96a-zn7mc-workera-t8dj2   Provisioning                                      27s
liuhuali@Lius-MacBook-Pro huali-test % oc get pod                                       
NAME                                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-854c6755f5-r9c2k          2/2     Running   0          5h44m
cluster-baremetal-operator-976487bc9-7czpk            2/2     Running   0          5h44m
control-plane-machine-set-operator-69684bcccd-c6jnf   1/1     Running   0          5h44m
machine-api-controllers-7f574b69b5-w5swt              7/7     Running   0          158m
machine-api-operator-7f46db4fcc-v6w9p                 2/2     Running   0          5h44m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                        
NAME                              PHASE     TYPE   REGION    ZONE              AGE
huliu-nu96a-zn7mc-master-0        Running   AHV    Unnamed   Development-LTS   6h27m
huliu-nu96a-zn7mc-master-1        Running   AHV    Unnamed   Development-LTS   6h27m
huliu-nu96a-zn7mc-master-2        Running   AHV    Unnamed   Development-LTS   6h27m
huliu-nu96a-zn7mc-worker-5j47v    Running   AHV    Unnamed   Development-LTS   6h21m
huliu-nu96a-zn7mc-worker-thprs    Running   AHV    Unnamed   Development-LTS   6h21m
huliu-nu96a-zn7mc-workera-t8dj2   Running   AHV    Unnamed   Development-LTS   9m46s
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                              STATUS   ROLES                  AGE     VERSION
huliu-nu96a-zn7mc-master-0        Ready    control-plane,master   6h24m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-1        Ready    control-plane,master   6h25m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-2        Ready    control-plane,master   6h24m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-5j47v    Ready    worker                 6h13m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-thprs    Ready    worker                 6h13m   v1.25.12+26bab08
huliu-nu96a-zn7mc-workera-t8dj2   Ready    worker                 6m      v1.25.12+26bab08

5.Create a pod with the same name as the previous one (here is kubelet-killer) on the new node
liuhuali@Lius-MacBook-Pro huali-test % oc create -f kubelet-killer2.yaml
pod/kubelet-killer created
liuhuali@Lius-MacBook-Pro huali-test % cat kubelet-killer2.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    kubelet-killer: ""
  name: kubelet-killer
  namespace: openshift-machine-api
spec:
  containers:
  - command:
    - pkill
    - -STOP
    - kubelet
    image: quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c
    imagePullPolicy: Always
    name: kubelet-killer
    securityContext:
      privileged: true
  enableServiceLinks: true
  hostPID: true
  nodeName: huliu-nu96a-zn7mc-workera-t8dj2
  restartPolicy: Never

6.Check the pod doesn’t work as expected.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                              PHASE     TYPE   REGION    ZONE              AGE
huliu-nu96a-zn7mc-master-0        Running   AHV    Unnamed   Development-LTS   6h35m
huliu-nu96a-zn7mc-master-1        Running   AHV    Unnamed   Development-LTS   6h35m
huliu-nu96a-zn7mc-master-2        Running   AHV    Unnamed   Development-LTS   6h35m
huliu-nu96a-zn7mc-worker-5j47v    Running   AHV    Unnamed   Development-LTS   6h29m
huliu-nu96a-zn7mc-worker-thprs    Running   AHV    Unnamed   Development-LTS   6h29m
huliu-nu96a-zn7mc-workera-t8dj2   Running   AHV    Unnamed   Development-LTS   17m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                              STATUS   ROLES                  AGE     VERSION
huliu-nu96a-zn7mc-master-0        Ready    control-plane,master   6h32m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-1        Ready    control-plane,master   6h33m   v1.25.12+26bab08
huliu-nu96a-zn7mc-master-2        Ready    control-plane,master   6h32m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-5j47v    Ready    worker                 6h21m   v1.25.12+26bab08
huliu-nu96a-zn7mc-worker-thprs    Ready    worker                 6h21m   v1.25.12+26bab08
huliu-nu96a-zn7mc-workera-t8dj2   Ready    worker                 14m     v1.25.12+26bab08
liuhuali@Lius-MacBook-Pro huali-test % oc get pod
NAME                                                  READY   STATUS              RESTARTS   AGE
cluster-autoscaler-operator-854c6755f5-r9c2k          2/2     Running             0          6h
cluster-baremetal-operator-976487bc9-7czpk            2/2     Running             0          6h
control-plane-machine-set-operator-69684bcccd-c6jnf   1/1     Running             0          6h
kubelet-killer                                        0/1     ContainerCreating   0          7m18s
machine-api-controllers-7f574b69b5-w5swt              7/7     Running             0          174m
machine-api-operator-7f46db4fcc-v6w9p                 2/2     Running             0          6h
liuhuali@Lius-MacBook-Pro huali-test % oc describe pod kubelet-killer  
Name:         kubelet-killer
Namespace:    openshift-machine-api
Priority:     0
Node:         huliu-nu96a-zn7mc-workera-t8dj2/10.0.132.67
Start Time:   Wed, 06 Sep 2023 15:46:29 +0800
Labels:       kubelet-killer=
Annotations:  openshift.io/scc: node-exporter
Status:       Pending
IP:           
IPs:          <none>
Containers:
  kubelet-killer:
    Container ID:  
    Image:         quay.io/openshifttest/base-alpine@sha256:3126e4eed4a3ebd8bf972b2453fa838200988ee07c01b2251e3ea47e4b1f245c
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      pkill
      -STOP
      kubelet
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dcq5h (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-dcq5h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age    From          Message
  ----     ------                  ----   ----          -------
  Warning  ErrorAddingLogicalPort  7m30s  controlplane  deleteLogicalPort failed for pod openshift-machine-api_kubelet-killer: cannot delete GR SNAT for pod openshift-machine-api/kubelet-killer: failed create operation for deleting SNAT rule for pod on gateway router GR_huliu-nu96a-zn7mc-workera-x54mr: unable to get NAT entries for router &{UUID: Copp:<nil> Enabled:<nil> ExternalIDs:map[] LoadBalancer:[] LoadBalancerGroup:[] Name:GR_huliu-nu96a-zn7mc-workera-x54mr Nat:[] Options:map[] Policies:[] Ports:[] StaticRoutes:[]}: failed to get router: GR_huliu-nu96a-zn7mc-workera-x54mr, error: object not found
  Warning  FailedCreatePodSandBox  5m29s  kubelet       Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-killer_openshift-machine-api_84edbe26-680b-4c50-a8a4-71ffb82b8d9c_0(c1671822d85747016e7a619891ff5981b470c268f478a761de485f3ae3a0f2ef): error adding pod openshift-machine-api_kubelet-killer to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-machine-api/kubelet-killer/84edbe26-680b-4c50-a8a4-71ffb82b8d9c:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-machine-api/kubelet-killer c1671822d85747016e7a619891ff5981b470c268f478a761de485f3ae3a0f2ef] [openshift-machine-api/kubelet-killer c1671822d85747016e7a619891ff5981b470c268f478a761de485f3ae3a0f2ef] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
'
  Warning  FailedCreatePodSandBox  3m17s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-killer_openshift-machine-api_84edbe26-680b-4c50-a8a4-71ffb82b8d9c_0(dced805c3e86acbf5a10a8b4efbc02c64ad3c9360e23885c4fe593ca198f43b0): error adding pod openshift-machine-api_kubelet-killer to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-machine-api/kubelet-killer/84edbe26-680b-4c50-a8a4-71ffb82b8d9c:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-machine-api/kubelet-killer dced805c3e86acbf5a10a8b4efbc02c64ad3c9360e23885c4fe593ca198f43b0] [openshift-machine-api/kubelet-killer dced805c3e86acbf5a10a8b4efbc02c64ad3c9360e23885c4fe593ca198f43b0] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
'
  Warning  FailedCreatePodSandBox  65s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-killer_openshift-machine-api_84edbe26-680b-4c50-a8a4-71ffb82b8d9c_0(4bbf45588909933b9c4086274a08b7cddc2e09fe47e740ee14c74523f4f21ef2): error adding pod openshift-machine-api_kubelet-killer to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-machine-api/kubelet-killer/84edbe26-680b-4c50-a8a4-71ffb82b8d9c:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-machine-api/kubelet-killer 4bbf45588909933b9c4086274a08b7cddc2e09fe47e740ee14c74523f4f21ef2] [openshift-machine-api/kubelet-killer 4bbf45588909933b9c4086274a08b7cddc2e09fe47e740ee14c74523f4f21ef2] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
'

In the Warning Events it shows “GR_huliu-nu96a-zn7mc-workera-x54mr”, but huliu-nu96a-zn7mc-workera-x54mr is the previous node; the pod was created on huliu-nu96a-zn7mc-workera-t8dj2 in Step 5.
If the new pod is created with a different name, there is no such issue.

Actual results:

The pod doesn't work as expected when it has the same name as a previous pod.

Expected results:

The pod should work as expected even when it has the same name as a previous pod.

Additional info:

The same case works as expected on an SDN network cluster.

Discussion in slack https://redhat-internal.slack.com/archives/CH76YSYSC/p1693983428736929

Description of problem:

The IBM VPC block CSI driver was rebased to v5.0.0 in this PR:
https://github.com/openshift/ibm-vpc-block-csi-driver/pull/26

However, we're missing the manifest changes from this PR in 4.12 (delayed by CI issues):
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/45

That includes some important changes:
- add csi-snapshotter sidecar and snapshotter manifests
- only deploy volumesnapshotclass if CRD exists
- set consistent imagePullPolicy in deployment manifests
- enable topology tests

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-19805. The following is the description of the original issue:

Description of problem:

While reviewing PRs in CoreDNS 1.11.0, we stumbled upon https://github.com/coredns/coredns/pull/6179, which describes a CoreDNS crash in the kubernetes plugin if you create an EndpointSlice object that contains a port without a port number.

I reproduced this myself and was able to successfully bring down all of CoreDNS so that the cluster was put into a degraded state.

We've bumped to CoreDNS 1.11.1 in 4.15, so this is a concern for < 4.15.

Version-Release number of selected component (if applicable):

Less than or equal to 4.14

How reproducible:

100%

Steps to Reproduce:

1. Create an endpointslice with a port with no port number:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc
addressType: IPv4
ports:
  - name: ""

2. Shortly after creating this object, all DNS pods continuously crash:
oc get -n openshift-dns pods
NAME                  READY   STATUS             RESTARTS     AGE
dns-default-57lmh     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-h6cvm     1/2     CrashLoopBackOff   1 (4s ago)   79m
dns-default-mn7qd     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-mxq5g     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-wdrff     1/2     CrashLoopBackOff   1 (3s ago)   79m
dns-default-zs7cd     1/2     CrashLoopBackOff   1 (3s ago)   79m

Actual results:

DNS Pods crash

Expected results:

DNS Pods should NOT crash

Additional info:

 

Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:

Description of problem:

Traffic from egress IPs was interrupted after the cluster was updated to OpenShift 4.10.46.

A customer cluster was patched; it is an OpenShift 4.10.46 cluster with SDN.

More details about the issue are available in a private comment below, since they contain customer data.

Description of problem:

The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.24 packages.  OpenShift 4.12 is based on Kubernetes 1.25.  

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.12/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.

Expected results:

Kubernetes packages are at version v0.25.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.

Description of problem:

When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart. 

The restart happens because of the following panic in the ovnkube-master pods:

2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-09-114621

How reproducible:

Steps to Reproduce:
1. Run node-density on a 120 node cluster

Actual results:

Spike observed in pod-latency graph ~22s

Expected results:

Steady pod-latency graph ~4s

Additional info:

This is a clone of issue OCPBUGS-9968. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8692. The following is the description of the original issue:

Description of problem:

In hypershift context:
Operands managed by Operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265

These operands running on the management side should honour the same affinity, tolerations, node selector and priority rules as the operator.
This could be done by looking at the operator deployment itself or at the HCP resource (a sketch follows the list of operands below).

multus-admission-controller
cloud-network-config-controller
ovnkube-master
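A hedged sketch of the kind of scheduling constraints the linked guidance describes; the label and taint keys here are illustrative and should really be derived from the operator deployment or the HostedControlPlane resource rather than hard-coded:

```yaml
spec:
  nodeSelector:
    hypershift.openshift.io/control-plane: "true"       # illustrative key/value
  tolerations:
    - key: hypershift.openshift.io/control-plane        # illustrative
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                hypershift.openshift.io/hosted-control-plane: <hcp-namespace>  # illustrative
```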

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a hypershift cluster.
2. Check affinity rules and node selector of the operands above.
3.

Actual results:

Operands are missing affinity rules and node selector.

Expected results:

Operands have the same affinity rules and node selector as the operator.

Additional info:

 

This is a clone of issue OCPBUGS-8342. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8258. The following is the description of the original issue:

Invoking 'create cluster-manifests' fails when imageContentSources is missing in install-config yaml:

$ openshift-install agent create cluster-manifests
INFO Consuming Install Config from target directory
FATAL failed to write asset (Mirror Registries Config) to disk: failed to write file: open .: is a directory

install-config.yaml:

apiVersion: v1alpha1
metadata:
  name: appliance
rendezvousIP: 192.168.122.116
hosts:
  - hostname: sno
    installerArgs: '["--save-partlabel", "agent*", "--save-partlabel", "rhcos-*"]'
    interfaces:
     - name: enp1s0
       macAddress: 52:54:00:e7:05:72
    networkConfig:
      interfaces:
        - name: enp1s0
          type: ethernet
          state: up
          mac-address: 52:54:00:e7:05:72
          ipv4:
            enabled: true
            dhcp: true 
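For reference, the stanza whose absence triggers the failure looks roughly like this in install-config.yaml (mirror host and repository names are illustrative):

```yaml
imageContentSources:
- mirrors:
  - mirror.example.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - mirror.example.com:5000/ocp4/openshift4
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
```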

This is a clone of issue OCPBUGS-14403. The following is the description of the original issue:

Description of problem:

The IngressVIP is getting attached to two nodes at once.

Version-Release number of selected component (if applicable):

4.11.39

How reproducible:

Always in customer cluster

Actual results:

The IngressVIP is getting attached to two nodes at once.

Expected results:

The IngressVIP should get attached to only one node.

Additional info:

 

Description of problem:

`[sig-arch] events should not repeat pathologically` has started failing on aws-ovn-serial jobs.

Example run https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-ovn-serial/1562685930945384448

event happened 25 times, something is wrong: ns/openshift-ovn-kubernetes service/ovn-kubernetes-master - reason/FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service openshift-ovn-kubernetes/ovn-kubernetes-master: node "ip-10-0-176-16.us-east-2.compute.internal" not found
event happened 25 times, something is wrong: ns/openshift-ovn-kubernetes service/ovnkube-db - reason/FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service openshift-ovn-kubernetes/ovnkube-db: node "ip-10-0-176-16.us-east-2.compute.internal" not found}

https://search.ci.openshift.org/?search=FailedToUpdateEndpointSlices+Error+updating+Endpoint+Slices+for+Service+&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

As a developer, I would like to remove the random Terraform provider because it is essentially unnecessary, and removing it would improve our build process.

The random Terraform provider is used in Azure & Azure Stack to create a random string. This could easily be done in Go code and passed in as a variable.

Removing an extra provider would decrease our build time and improve our build stability, which often suffers from failures due to timeouts.

The random string is used here in Azure (and similarly in Azure Stack):

https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L23-L27

One approach would be to generate the string in tfvars and pass it in as a Terraform variable.

Description of problem:

Customer reports that when trying to create an application using the "Import from Git" workflow, the "Create" button at the very bottom of the form stays inactive. You can observe the issue in the video shared via Google Drive here (timestamp 00:35): https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing

The customer can work around the issue by selecting another Import Strategy than "Builder Image" and then switching back to "Builder Image" (timestamp 00:49).

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.25

How reproducible:

Always

Steps to Reproduce:

1. Click the "+Add" button on the left menu
2. Enter a Git repository URL
3. Select "Bitbucket" as Git type
3a. If necessary, select a "Source Secret"
4. For "Import Strategy", select "Builder Image" and select one of the available images
5. In "Application" select "Create application"
6. For "Application Name" and "Name" insert any valid value

Actual results:

"Create" button at the bottom of the form is inactive and cannot be clicked.

Changing the Import Strategy to something else and then back to "Builder Image" makes the button active.

Expected results:

Button is active after filling out all the required form fields

Additional info:

* Video of the issue provided: https://drive.google.com/file/d/1GEA_TF5vV_ai9YDMZ3uzwEwYKkp_CY8r/view?usp=sharing

Description of problem:

The Console Operator has a suite of tests responsible for assuring that Console can successfully interact with Operators managed by OLM. The operator-hub.spec test references an operator no longer present in the 4.12 certified operators catalog source: https://github.com/openshift/console/blob/master/frontend/packages/operator-lifecycle-manager/integration-tests-cypress/tests/operator-hub.spec.ts#L64

OLM is unable to set the default catalog sources to the 4.12 image tag until the test is updated to reference an operator present in both the 4.11 and 4.12 images of the certified operators catalog source.


Version-Release number of selected component (if applicable):4.12


How reproducible: always


Steps to Reproduce:

1. Update the certified operators catalogSource images to the 4.12 tag
2. Attempt to run the operatorhub.spec test suite.

Actual results:

The test fails

Expected results:

The test passes

Additional info:


Tracker bug for bootimage bump in 4.12. This bug should block bugs which need a bootimage bump to fix.

Description of problem:

After IPI-installing a 3-node hub cluster and converting it to dual stack, an fd69::/125 address is seen on the bare-metal br-ex interface.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Randomly reproduced; the IP is assigned on one of the 3 master hub cluster nodes.

Steps to Reproduce:

1. IPI install 4.12.0
2. Follow the procedure to convert the cluster to IPv4/IPv6 dual stack.
3. 

Actual results:

An address from the fd69::/125 range appears on the br-ex interface.

OVN pods go into CrashLoopBackOff.
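One way to check for the address on an affected node (command illustrative):

```
ip -6 addr show dev br-ex
```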

Expected results:

The address is an internal OVN-Kubernetes IP and should not be reported as a node address.
fd69::2/125 should be present on br-ex, but make sure fd69::2 does not:

  1. show up as an address in the node Status.Addresses list at all
  2. exist in any Node object annotations

Additional info:

This is one of the issues in IPv6 that is discovered, the other issue is linked here as well.

Description of problem:

If you set a service's cluster IP to an IP with a leading zero (e.g. 192.168.0.011), ovn-k should normalise this and remove the leading zero before sending it to OVN.

I saw this on a CI run executing the k8s test here: test/e2e/network/funny_ips.go +75

You can reproduce it using that test.

Have a read of the text there:

 43 // What are funny IPs:  
 44 // The adjective is because of the curl blog that explains the history and the problem of liberal  
 45 // parsing of IP addresses and the consequences and security risks caused the lack of normalization,
 46 // mainly due to the use of different notations to abuse parsers misalignment to bypass filters.
 47 // xref: https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/   
 48 //     
 49 // Since golang 1.17, IPv4 addresses with leading zeros are rejected by the standard library.
 50 // xref: https://github.com/golang/go/issues/30999
 51 //     
 52 // Because this change on the parsers can cause that previous valid data become invalid, Kubernetes
 53 // forked the old parsers allowing leading zeros on IPv4 address to not break the compatibility.
 54 //     
 55 // Kubernetes interprets leading zeros on IPv4 addresses as decimal, users must not rely on parser
 56 // alignment to not being impacted by the associated security advisory: CVE-2021-29923 golang
 57 // standard library "net" - Improper Input Validation of octal literals in golang 1.16.2 and below
 58 // standard library "net" results in indeterminate SSRF & RFI vulnerabilities. xref:
 59 // https://nvd.nist.gov/vuln/detail/CVE-2021-29923                                                                                                     

northd is logging an error about this also:

|socket_util|ERR|172.30.0.011:7180: bad IP address "172.30.0.011" 
...
2022-08-23T14:14:21.968Z|01839|ovn_util|WARN|bad ip address or port for load balancer key 172.30.0.011:7180

 

Also, I see the error:

E0823 14:14:34.135115    3284 gateway_shared_intf.go:600] Failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip: failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip with svcVIP 172.30.0.011, svcPort 7180, protocol TCP: value "<nil>" passed to DeleteConntrack is not an IP address 

We should normalise the IPs before they are sent on to OVN. I also see a conntrack error when trying to set this bad IP.

 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. See the k8s test referenced above; a minimal example manifest is sketched below.
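For illustration, a minimal Service of the kind the funny_ips test exercises; the clusterIP and port are taken from the logs above, the clusterIP must fall inside the cluster's service network, and the selector is illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: funny-ip
spec:
  clusterIP: 172.30.0.011   # leading zero in the last octet
  selector:
    app: funny-ip           # illustrative
  ports:
    - port: 7180
      protocol: TCP
```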

Actual results:

Leading zero IP sent to OVN

Expected results:

No leading zero IP sent to OVN

Additional info:

This is a clone of issue OCPBUGS-7879. The following is the description of the original issue:

Description of problem:

When adding an app via 'Add from git repo', my repo, which works with StoneSoup, throws an error about the contents of the devfile.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Dev viewpoint
2. Click +Add
3. Choose 'Import from Git'
4. Enter 'https://github.com/utherp0/bootcampapp'

Actual results:

"Import is not possible. o.components is not iterable"

Expected results:

The Devfile works with StoneSoup

Additional info:

Devfile at https://github.com/utherp0/hacexample/blob/main/devfile.yaml

This is a clone of issue OCPBUGS-12450. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Setting up a GitHub App from the console lacks the required permissions.
Events and Permissions: https://pipelinesascode.com/docs/install/github_apps/

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:

1. Setup Github App from administrator perspective.
2. Create Repository and configure it to use the Github App method.

Actual results:
Creates the GitHub App with limited permissions.

Expected results:
The created GitHub App should contain all the required permissions and should trigger the PipelineRun successfully on Git events.

Additional info:

The console needs to update the default_events and default_permissions here; they have to match what the CLI configures.

We need to update the "See GitHub permissions" section in the UI as well.

Description of problem:

With "createFirewallRules: Enabled", after successful "create cluster" and then "destroy cluster", the created firewall-rules in the shared VPC are not deleted.

Version-Release number of selected component (if applicable):

$ ./openshift-install version
./openshift-install 4.12.0-0.nightly-2022-09-28-204419
built from commit 9eb0224926982cdd6cae53b872326292133e532d
release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. try IPI installation with "createFirewallRules: Enabled", which succeeded
2. try destroying the cluster, which succeeded
3. check firewall-rules in the shared VPC 

Actual results:

After destroying the cluster, its firewall-rules created by installer in the shared VPC are not deleted.

Expected results:

Those firewall-rules should be deleted during destroying the cluster.

Additional info:

$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                     DENY  DISABLED
ci-op-xpn-ingress-common            installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                             False
ci-op-xpn-ingress-health-checks     installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                           False
ci-op-xpn-ingress-internal-network  installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380        False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$ 
$ yq-3.3.0 r test2/install-config.yaml platform
gcp:
  projectID: openshift-qe  
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  createFirewallRules: Enabled
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$ 
$ yq-3.3.0 r test2/install-config.yaml metadata
creationTimestamp: null
name: jiwei-1013-01
$ 
$ openshift-install create cluster --dir test2
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:06AM) for the Kubernetes API at https://api.jiwei-1013-01.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.24.0+8c7c967 up
INFO Waiting up to 30m0s (until 4:20AM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:42AM) for the cluster at https://api.jiwei-1013-01.qe.gcp.devcluster.openshift.com:6443 to initialize...
INFO Checking to see if there is a route at openshift-console/console...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/fedora/test2/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jiwei-1013-01.qe.gcp.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "wWPkc-8G2Lw-xe2Vw-DgWha"
INFO Time elapsed: 39m14s  
$ 
$ openshift-install destroy cluster --dir test2
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Stopped instance jiwei-1013-01-464st-worker-b-pmg5z
INFO Stopped instance jiwei-1013-01-464st-worker-a-csg2j
INFO Stopped instance jiwei-1013-01-464st-master-1
INFO Stopped instance jiwei-1013-01-464st-master-2
INFO Stopped instance jiwei-1013-01-464st-master-0
INFO Deleted 2 recordset(s) in zone qe
INFO Deleted 3 recordset(s) in zone jiwei-1013-01-464st-private-zone
INFO Deleted DNS zone jiwei-1013-01-464st-private-zone
INFO Deleted bucket jiwei-1013-01-464st-image-registry-us-central1-ulgxgjfqxbdnrhd
INFO Deleted instance jiwei-1013-01-464st-master-0
INFO Deleted instance jiwei-1013-01-464st-worker-a-csg2j
INFO Deleted instance jiwei-1013-01-464st-master-1
INFO Deleted instance jiwei-1013-01-464st-worker-b-pmg5z
INFO Deleted instance jiwei-1013-01-464st-master-2
INFO Deleted disk jiwei-1013-01-464st-master-2
INFO Deleted disk jiwei-1013-01-464st-master-1
INFO Deleted disk jiwei-1013-01-464st-worker-b-pmg5z
INFO Deleted disk jiwei-1013-01-464st-master-0
INFO Deleted disk jiwei-1013-01-464st-worker-a-csg2j
INFO Deleted address jiwei-1013-01-464st-cluster-public-ip
INFO Deleted address jiwei-1013-01-464st-cluster-ip
INFO Deleted forwarding rule a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted forwarding rule jiwei-1013-01-464st-api
INFO Deleted forwarding rule jiwei-1013-01-464st-api-internal
INFO Deleted target pool a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted target pool jiwei-1013-01-464st-api
INFO Deleted backend service jiwei-1013-01-464st-api-internal
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-a
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-c
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-b
INFO Deleted health check jiwei-1013-01-464st-api-internal
INFO Deleted HTTP health check a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted HTTP health check jiwei-1013-01-464st-api
INFO Time elapsed: 4m18s   
$ 
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                          NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                     DENY  DISABLED
ci-op-xpn-ingress-common                      installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                             False
ci-op-xpn-ingress-health-checks               installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                           False
ci-op-xpn-ingress-internal-network            installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380        False
jiwei-1013-01-464st-api                       installer-shared-vpc  INGRESS    1000      tcp:6443                                                                                                                                                        False
jiwei-1013-01-464st-control-plane             installer-shared-vpc  INGRESS    1000      tcp:22623,tcp:10257,tcp:10259                                                                                                                                   False
jiwei-1013-01-464st-etcd                      installer-shared-vpc  INGRESS    1000      tcp:2379-2380                                                                                                                                                   False
jiwei-1013-01-464st-health-checks             installer-shared-vpc  INGRESS    1000      tcp:6080,tcp:6443,tcp:22624                                                                                                                                     False
jiwei-1013-01-464st-internal-cluster          installer-shared-vpc  INGRESS    1000      tcp:30000-32767,udp:9000-9999,udp:30000-32767,udp:4789,udp:6081,tcp:9000-9999,udp:500,udp:4500,esp,tcp:10250                                                    False
jiwei-1013-01-464st-internal-network          installer-shared-vpc  INGRESS    1000      icmp,tcp:22                                                                                                                                                     False
k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc  installer-shared-vpc  INGRESS    1000      tcp:30268                                                                                                                                                       False
k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91       installer-shared-vpc  INGRESS    1000      tcp:80,tcp:443                                                                                                                                                  False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$ 

FYI manually deleting those firewall-rules in the shared VPC does work.
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-api
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-api].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-control-plane
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-control-plane].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-etcd
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-etcd].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-health-checks
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-health-checks].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-internal-cluster
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-internal-cluster].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-internal-network
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-internal-network].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91].
$ 
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                     DENY  DISABLED
ci-op-xpn-ingress-common            installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                             False
ci-op-xpn-ingress-health-checks     installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                           False
ci-op-xpn-ingress-internal-network  installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380        False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$ 

 

 

 

 

This is a clone of issue OCPBUGS-13150. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12435. The following is the description of the original issue:

Description of problem:

If the user specifies a DNS name in an egressnetworkpolicy for which the upstream server returns a truncated DNS response, openshift-sdn does not fall back to TCP as expected but instead treats this as a failure.

Version-Release number of selected component (if applicable):

4.11 (originally reproduced on 4.9)

How reproducible:

Always

Steps to Reproduce:

1. Setup an EgressNetworkPolicy that points to a domain where a truncated response is returned while querying via UDP.
2.
3.

Actual results:

Error, DNS resolution not completed.

Expected results:

Request retried via TCP and succeeded.

Additional info:

In comments.
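A minimal sketch of the expected behavior, using github.com/miekg/dns as a stand-in (illustrative only, not the openshift-sdn resolver code): query over UDP first and, if the TC bit is set on the response, retry the same question over TCP instead of treating the truncated answer as final.

```
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

// resolveWithTCPFallback queries a name over UDP first and, if the response
// comes back truncated (TC bit set), retries the same query over TCP.
func resolveWithTCPFallback(name, server string) (*dns.Msg, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)

	udp := &dns.Client{Net: "udp"}
	resp, _, err := udp.Exchange(m, server)
	if err != nil {
		return nil, err
	}
	if resp.Truncated {
		// Fall back to TCP rather than failing on the truncated answer.
		tcp := &dns.Client{Net: "tcp"}
		resp, _, err = tcp.Exchange(m, server)
		if err != nil {
			return nil, err
		}
	}
	return resp, nil
}

func main() {
	resp, err := resolveWithTCPFallback("example.com", "8.8.8.8:53")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Answer)
}
```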

Description of problem:

When a user selects an installed operator (for example, OpenShift Elasticsearch) in OperatorHub and navigates to the installed operator page from the operator information page using the "View it here" option, a "404: Not found" message is wrongly shown, although it navigates to the installed operator in the end.

 

Version-Release number of selected components (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible:

 Always

 

Steps to Reproduce:

  1. Login to OCP web console.
  2. Install Operator, For example,OpenShift Elasticsearch Operator- production operators if missing.
  3. Go to the Operator hub and  search for OpenShift Elasticsearch Operator. (make sure Project filter sets to 'All projects')
  4. Click on OpenShift Elasticsearch Operator- production operators.
  5. Click on the link "View it here" from the installed operator section.
  6. View the behavior.

Actual results:

Wrong message "404: Not found" while the user selects an installed operator and navigates from operator hub to installed operator page.

 

Browser console log indicate as below

main-chunk-525818b154a57a9b220a.min.js:1 unhandled error: Uncaught TypeError: Cannot read properties of undefined (reading 'firstElementChild') TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/static/vendors~main-chunk-40fab65853dff2fbc413.min.js:118:125992)
    at HTMLDivElement.l (https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/static/vendors~main-chunk-40fab65853dff2fbc413.min.js:118:126387) TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
    at HTMLDivElement.l (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
window.onerror @ main-chunk-525818b154a57a9b220a.min.js:1
vendors~main-chunk-40fab65853dff2fbc413.min.js:72303 Uncaught TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
    at HTMLDivElement.l (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
c @ vendors~main-chunk-40fab65853dff2fbc413.min.js:72303
l @ vendors~main-chunk-40fab65853dff2fbc413.min.js:72303
scroll (async)
componentWillUnmount @ vendor-patternfly-core-chunk-006bb1499791fa7cfea7.min.js:38397
hs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
bs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
hs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
bs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Oc @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
t.unstable_runWithPriority @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171690
Hi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Ac @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
pc @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
(anonymous) @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
t.unstable_runWithPriority @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171690
Hi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Vi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
qi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
De @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Yt @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
main-chunk-525818b154a57a9b220a.min.js:1          GET https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/clusterserviceversions/elasticsearch-operator.5.5.0 404 (Not Found)
  

Expected results:

Installed operator details should show without any error when the user selects an installed operator and navigates from operator hub to installed operator page.

 

Additional info:

Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers

Attached screen share for the same issue InstalledOperatorNavigation404.mp4

Description of problem:

The machine-config-daemon-update-rpmostree-via-container service fails to deploy the commit

sh-4.4# journalctl -u machine-config-daemon-update-rpmostree-via-container.service | tail
Oct 12 11:45:56 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: Checking out tree 845113b...done
Oct 12 11:45:56 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: Checking out tree 845113b...done
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: error: No enabled repositories
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: error: No enabled repositories
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: error: Failed to deploy commit: ExitStatus(unix_wait_status(256))
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: error: Failed to deploy commit: ExitStatus(unix_wait_status(256))
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2022949]: time="2022-10-12T11:45:57Z" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-ea744a45645d9c8d7a79182a78525a0b9f65b13e2e997f55bf80f626dcc0e945.scope: no such file or directory"
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 1min 9.080s CPU time 

full service log is attached

Version-Release number of selected component (if applicable):

4.12

Steps to Reproduce:

1. setup SNO cluster upi-on-baremetal with 4.11.8
2. upgrade it to 4.12.0-0.nightly-2022-10-05-053337

Actual results:

The machine-config-daemon-update-rpmostree-via-container service fails to deploy the commit due to a "No enabled repositories" error

Expected results:

The machine-config-daemon-update-rpmostree-via-container service can deploy the new commit successfully

Additional info:

no proxy configured
sh-4.4# cat /etc/mco/proxy.env
# Proxy environment variables will be populated in this file. Properly
# url encoded passwords with special characters will use '%<HEX><HEX>'.
# Systemd requires that any % used in a password be represented as
# %% in a unit file since % is a prefix for macros; this restriction does not
# apply for environment files. Templates that need the proxy set should use
# 'EnvironmentFile=/etc/mco/proxy.env'.

This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:

Description of problem:

Customer running a cluster with following config:
4.10.23
AWS/IPI
OVNKubernetes

Observed that in namespace with networkpolicy rules enabled, and a policy for allow-from-same namespace, pods will have different behaviors when calling service IP's hosted in that same namespace.

Example:
Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>
Deployment2 with 1 pod hosting a service and route exists in same namespace
Pod A will unexpectedly stop being able to call service IP of deployment2; Pod B will never lose access to calling service IP of deployment2.

Pod A remains able to call out through br-ex interface, tag the ROUTE address, and reach deployment2 pod via haproxy (this never breaks)

Pod A remains able to reach the local gateway on the node

Host node for Pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.

Issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace.

I suspect that the issue has to do with the networkpolicy rules failing to stay updated on the pod object, and the pod needs to be 'refreshed' (a label addition or other update) to force it to 'remember' that it is allowed to call peers within the namespace.

Additional relevant data:
- pods are affected throughout the cluster; no specific project/service/deployment/application
- pods ride on different nodes all the time (no one node affected)
- pods with fail condition are on same node with other pods without issue
- multiple namespaces see this problem
- all namespaces are using similar networkpolicy isolation and an allow-from-same-namespace ruleset (which matches our documentation on syntax); a minimal example of that ruleset is sketched below.
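For reference, a minimal sketch of what is meant by an allow-from-same-namespace ruleset, built with the upstream API types (the namespace name is a placeholder; this is the generic pattern, not the customer's actual policy):

```
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	np := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "allow-from-same-namespace", Namespace: "example"},
		Spec: networkingv1.NetworkPolicySpec{
			// Empty podSelector: the policy applies to every pod in the namespace.
			PodSelector: metav1.LabelSelector{},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					// Empty peer podSelector: allow traffic from any pod in the same namespace.
					PodSelector: &metav1.LabelSelector{},
				}},
			}},
		},
	}
	out, _ := yaml.Marshal(np)
	fmt.Println(string(out))
}
```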



Version-Release number of selected component (if applicable):

4.10.23

How reproducible:

every time --> unclear what the trigger is that causes this; pods will be functional and several hours/days later, will stop being able to talk to peer services.

Steps to Reproduce:

1. deploy pod with at least two replicas in a namespace with allow-from same network policy
2. deploy a different service and route example httpd instance in same namespace
3. observe that one of the two pods may fail to reach service IP after some time
4. apply annotation to pod and it is immediately able to reach services again.

Actual results:

pods intermittently fail to reach internal service addresses, but are able to be interacted with otherwise, and can reach upstream/external addresses including routes on cluster. 

Expected results:

pods should not lose access to service network peers. 

Additional info:

see next comments for relevant uploads/sosreports and inspects.

This is a clone of issue OCPBUGS-4973. The following is the description of the original issue:

Description of problem:

Config OAuth with htpasswd in the hostedcluster doesn't work as expected.

Version-Release number of selected component (if applicable):

 

How reproducible:

enable OAuth htpasswd in hostedcluster

Steps to Reproduce:

1. create passwd file for user init by htpasswd
```
htpasswd -cbB .passwd helitest helitest

oc create secret generic testuser --from-file=htpasswd=.passwd -n clusters
```

2. edit hostedcluster.yaml
```
spec:
  configuration:
    oauth:
      identityProviders:
      - htpasswd:
          fileData:
            name: testuser
        mappingMethod: claim
        name: htpasswd
        type: HTPasswd
```
3. oc login hostedcluster apiserver

$ oc login https://ac0be21b169ff4399b6a2044388c38cf-5789e1b174d7424b.elb.us-east-2.amazonaws.com:6443 --username=testuser --password=testuser
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): y


Login failed (401 Unauthorized) 

Actual results:

oc login with error : "Login failed (401 Unauthorized) "

Expected results:

oc login successfully.

Additional info:

# check configmap of oauth 
$ oc get cm -n clusters-demo-02 oauth-openshift -oyaml
...
    oauthConfig:
      alwaysShowProviderSelection: false
      assetPublicURL: ""
      grantConfig:
        method: deny
        serviceAccountMethod: prompt
      identityProviders: []
      loginURL: https://ac0be21b169ff4399b6a2044388c38cf-5789e1b174d7424b.elb.us-east-2.amazonaws.com:6443
      
---> seems `identityProviders` is not synced correctly ? 

This is a clone of issue OCPBUGS-3164. The following is the description of the original issue:

During the first bootstrap boot we need crio and kubelet on the disk, so we start the release-image-pivot systemd task. However, it's not blocking bootkube, so these two run in parallel.

release-image-pivot restarts the node to apply the new OS image, which may leave bootkube in an inconsistent state. This task should run before bootkube.

Description of problem:

Seeing intermittently during cluster installs

Network operator stuck in Progressing with 

network                       4.12.0-0.nightly-2022-10-25-210451   True        True          False      117m    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)


MG: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.5450303633101217331/

iptables-save on master-2 node - http://shell.lab.bos.redhat.com/~anusaxen/iptables-save


pod events
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               129m                  default-scheduler  Successfully assigned openshift-network-diagnostics/network-check-target-gnld6 to qe-anurag114e-9xkz4-master-2.c.openshift-qe.internal
  Warning  FailedMount             128m (x7 over 129m)   kubelet            MountVolume.SetUp failed for volume "kube-api-access-kfg5s" : [object "openshift-network-diagnostics"/"kube-root-ca.crt" not registered, object "openshift-network-diagnostics"/"openshift-service-ca.crt" not registered]
  Warning  NetworkNotReady         128m (x18 over 129m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
  Warning  ErrorAddingLogicalPort  127m (x2 over 127m)   controlplane       addLogicalPort failed for openshift-network-diagnostics/network-check-target-gnld6: unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "qe-anurag114e-9xkz4-master-2.c.openshift-qe.internal"
  Normal   AddedInterface          127m                  multus             Add eth0 [10.130.0.3/23] from ovn-kubernetes
  Warning  ProbeError              9m (x16 over 71m)     kubelet            Readiness probe error: Get "http://10.130.0.3:8080/": dial tcp 10.130.0.3:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)
body:
  Warning  ProbeError  4m (x717 over 126m)  kubelet  Readiness probe error: Get "http://10.130.0.3:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
body:




Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

rare

Steps to Reproduce:

1.Install OCP with OVNKubernetes with HO enabled

defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork: []

2.
3.

Actual results:

Installation stuck due to network-check-target issue 

Expected results:

Installation should succeed

Additional info:

Will add additional logs

 

 

 

 

Description of problem:

To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When attempting to load ISO to the remote server, the InsertMedia request fails with `Base.1.5.PropertyMissing`. The system is Mt.Jade Server / GIGABYTE G242-P36. BMC is provided by Megarac.

Version-Release number of selected component (if applicable):

OCP 4.12

How reproducible:

Always

Steps to Reproduce:

1. Create a BMH against such server
2. Create InfraEnv and attempt provisioning

Actual results:

Image provisioning failed: Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://192.168.53.149/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.5.PropertyMissing: The property TransferProtocolType is a required property and must be included in the request. Extended information: [{'@odata.type': '#Message.v1_0_8.Message', 'Message': 'The property TransferProtocolType is a required property and must be included in the request.', 'MessageArgs': ['TransferProtocolType'], 'MessageId': 'Base.1.5.PropertyMissing', 'RelatedProperties': ['#/TransferProtocolType'], 'Resolution': 'Ensure that the property is in the request body and has a valid value and resubmit the request if the operation failed.', 'Severity': 'Warning'}].

Expected results:

Image provisioning to work

Additional info:

The following patch attempted to fix the problem: https://opendev.org/openstack/sushy/commit/ecf1bcc80bd14a1836d015c3dbdb4fd88f2bbd75

but the response code checked by the logic in the patch above is `Base.1.5.ActionParameterMissing`, which doesn't quite address the response code I'm getting, which is Base.1.5.PropertyMissing

 

 

 

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3316. The following is the description of the original issue:

Description of problem:

Branch name in repository pipelineruns list view should match the actual github branch name.

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

Always

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns by push or pull request event on the github 

Actual results:

Branch name contains a "refs-heads-" prefix in front of the actual branch name, e.g. "refs-heads-cicd-demo" (cicd-demo is the branch name)

Expected results:

Branch name should be the actual GitHub branch name; just `cicd-demo` should be shown in the branch column.

 

Additional info:
Ref: https://coreos.slack.com/archives/CHG0KRB7G/p1667564311865459
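A minimal Go sketch of the expected normalisation (the `branchFromRef` helper is hypothetical, not the console's code): strip the ref prefix so only the branch name is displayed.

```
package main

import (
	"fmt"
	"strings"
)

// branchFromRef derives the display branch name from a Git ref, so that
// "refs/heads/cicd-demo" (or an already-flattened "refs-heads-cicd-demo")
// is shown simply as "cicd-demo".
func branchFromRef(ref string) string {
	ref = strings.TrimPrefix(ref, "refs/heads/")
	ref = strings.TrimPrefix(ref, "refs-heads-")
	return ref
}

func main() {
	fmt.Println(branchFromRef("refs/heads/cicd-demo")) // cicd-demo
	fmt.Println(branchFromRef("refs-heads-cicd-demo")) // cicd-demo
}
```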

Description of problem:

When a user tries to run `oc debug`, they end up getting errors about pod security labels:

 Ensure the target namespace has the appropriate security level set or consider creating a dedicated privileged namespace using:
	"oc create ns <namespace> -o yaml | oc label -f - security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged".
Original error:
pods "ip-10-0-129-209ec2internal-debug" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
command failed, 3 retries left

This happens since https://docs.openshift.com/container-platform/4.11/authentication/understanding-and-managing-pod-security-admission.html

Fixing it requires the user running something like

oc create ns fips-check -o yaml | \
  oc label -f - \
  security.openshift.io/scc.podSecurityLabelSync=false \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged
Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Try to run `oc debug node/....` in a new namespace

Actual results:

Error message

Expected results:

oc debug works without the user having to perform additional steps. If namespace is omitted, perhaps oc debug could create a temporary one with the correct pod security labels?

Additional info:

Description of problem:

Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

In the image registry Pod we can see the following error message:

time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..]
time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]

Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698

Version-Release number of selected component (if applicable):

Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21:

$  oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.11.21
Kubernetes Version: v1.24.6+5658434

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform cluster 4.11.21 on AWS
2. Confirm registry storage is on AWS S3
3. Create a new build including a 10GB file using the following command: `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -`
4. Wait for some time for the build to run

Actual results:

Pushing the new build fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

Expected results:

Push of large container image layers succeeds

Additional info:

Description of problem:

The samples operator needs to update its imagestreams to use the Jenkins 4.12 release.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
When all Helm chart repos are disabled, the Helm navigation item is disabled.

To re-enable the Helm charts again, the user can search for HCRs or PHCRs (HelmChartRepositories or ProjectHelmChartRepositories), but the action menu doesn't work if no other Helm chart repo is enabled.

Version-Release number of selected component (if applicable):
Only 4.12 (4.11 is fine)

How reproducible:
Always

Steps to Reproduce:
1. Switch to developer perspective
2. Navigate to Helm > Repos > Edit the default repo and disable it
3. Helm Navigation should disappear and the content area maybe switch to 404, that's fine.
4. Navigate to Search and select HelmChartRepository as resource
5. Click on the action menu (kebab icon) to edit the HCR

Actual results:
The action menu is not shown

Expected results:
The action menu should be shown so that the user can edit or delete the HCR.

Additional info:

Description of problem: After I run the golang script for OCP-53608, I find the created ingress-controller couldn't be deleted.

Version-Release number of selected component (if applicable): 

4.12.0-0.nightly-2022-08-15-150248

How reproducible: Run the script and try to delete the custom ingress-controller

Steps to Reproduce:
1.

% oc get clusterversion

NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS

version   4.12.0-0.nightly-2022-08-15-150248   True        False         43m     Cluster version is 4.12.0-0.nightly-2022-08-15-150248

shudi@Shudis-MacBook-Pro openshift-tests-private %

2. Run the script

shudi@Shudis-MacBook-Pro openshift-tests-private % ./bin/extended-platform-tests run all --dry-run | grep 53608 | ./bin/extended-platform-tests run -f -

...

---------------------------------------------------------

Received interrupt.  Running AfterSuite...

^C again to terminate immediately

Aug 18 10:35:51.087: INFO: Running AfterSuite actions on all nodes

Aug 18 10:35:51.088: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready

STEP: Destroying namespace "e2e-test-router-tunning-77627" for this suite.

Aug 18 10:35:54.654: INFO: Running AfterSuite actions on node 1

 

failed: (15m4s) 2022-08-18T02:35:54 "[sig-network-edge] Network_Edge should Author:shudili-Low-53608-Negative Test of Expose a Configurable Reload Interval in HAproxy [Suite:openshift/conformance/parallel]"

 

Failing tests:

 

[sig-network-edge] Network_Edge should Author:shudili-Low-53608-Negative Test of Expose a Configurable Reload Interval in HAproxy [Suite:openshift/conformance/parallel]

 

error: 1 fail, 0 pass, 0 skip (15m4s)

shudi@Shudis-MacBook-Pro openshift-tests-private % 

3.  show the ingress-controllers

shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller

NAME       AGE

default    113m

ocp53608   42m

shudi@Shudis-MacBook-Pro openshift-tests-private %

 

4. Try to delete the ingress-controller ocp53608. After the message "ingresscontroller.operator.openshift.io "ocp53608" deleted" appears, the command hangs for a long time until the error message appears.

shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator delete ingresscontroller ocp53608

ingresscontroller.operator.openshift.io "ocp53608" deleted

error: An error occurred while waiting for the object to be deleted: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Unable to connect to the server: dial tcp 35.194.1.60:6443: i/o timeout

shudi@Shudis-MacBook-Pro openshift-tests-private %

 

5. After the "ingresscontroller.operator.openshift.io "ocp53608" deleted" message appears, show the ingress-controllers; ocp53608 isn't deleted

shudi@Shudis-MacBook-Pro golang % oc -n openshift-ingress-operator get ingresscontroller

NAME       AGE

default    3h

ocp53608   109m

shudi@Shudis-MacBook-Pro golang %

 

6. After the error message (error: An error occurred while waiting for the object to be deleted) appears, try to show the ingresscontroller

shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller

E0818 12:21:57.272967    4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)

E0818 12:21:57.273379    4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)

E0818 12:21:57.274306    4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)

Unable to connect to the server: dial tcp 35.194.1.60:6443: i/o timeout

shudi@Shudis-MacBook-Pro openshift-tests-private %

 

Actual results: ingress-controller ocp53608 is still there after executing the oc delete command

Expected results:

ingress-controller ocp53608 will be deleted soon after executing the oc delete command

Additional info:

Description of problem:

Backport perf metrics to older version for better visibility into ovn-k performance

This is a clone of issue OCPBUGS-17428. The following is the description of the original issue:

Description of problem:

If the workload has an image and it is from outside the Red Hat domain, it is collected by the workload-info gatherer.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Prepare a workload (existing or new) with an image from an external domain (e.g. docker, quay.io,...).
2. Run Insights Operator
3. Check the config/workload_info.json file in the archive.

Actual results:

{
  "pods": 132,
  "imageCount": 72,
  "images": {
    "sha256:3cfb3379dbce10c1088bc8bf2429e72984db656ecee57c359c288f23580a3ab2": {
       "layerIDs": [
         "sha256:13897c84ca5715a68feafcce9acf779f35806f42d1fcd37e8a2a5706c075252d",
         "sha256:64607cc74f9cbe0e12f167547df0cf661de5a8b1fb4ebe930a43b9f621ca457f",
         "sha256:09bec785a242f9440bd843580731fc7f0a7d72932ef6bc6190708ce1548a2ec0",
         "sha256:87307f0f97c765b52904ba48264829295ee167025bc72ab1df4d2ce2ed9f7f6c",
         "sha256:403753cee738023fc093b5ad6bd95a3e93cf84b05838f0936e943fc0e8b5140d",
         "sha256:1bd85e07834910cad1fda7967cfd86ada69377ba0baf875683e7739d8de6d1b0"
       ],
       "firstCommand": "icTsn2s_EIax",
       "firstArg": "2v1NneeWoS_9"
    },
    [...]

Expected results:

{
  "pods": 132,
  "imageCount": 72,
  "images": {
    "sha256:47b6bd1c661fa24528cc8fad8cc11b403c68980a22978d4eb1b7b989785967dd": {
      "layerIDs": [
        "sha256:13897c84ca5715a68feafcce9acf779f35806f42d1fcd37e8a2a5706c075252d",
        "sha256:64607cc74f9cbe0e12f167547df0cf661de5a8b1fb4ebe930a43b9f621ca457f",
        "sha256:09bec785a242f9440bd843580731fc7f0a7d72932ef6bc6190708ce1548a2ec0",
        "sha256:87307f0f97c765b52904ba48264829295ee167025bc72ab1df4d2ce2ed9f7f6c",
        "sha256:403753cee738023fc093b5ad6bd95a3e93cf84b05838f0936e943fc0e8b5140d",
        "sha256:1bd85e07834910cad1fda7967cfd86ada69377ba0baf875683e7739d8de6d1b0"
      ],
      "firstCommand": "icTsn2s_EIax",
      "firstArg": "2v1NneeWoS_9",
      "repository":"2W0Xq9hxQzho" <---
    }
    [...]

Additional info:
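For illustration only, a rough Go sketch of what adding an (obfuscated) repository field could look like: extract the repository part of a pullspec and record a hashed token when it is not hosted on a Red Hat registry. The registry allow-list, hashing scheme, and function name are all assumptions, not the gatherer's actual implementation.

```
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// anonymizeRepository returns an obfuscated token for an image repository
// that is not hosted under an assumed set of Red Hat registries.
func anonymizeRepository(pullSpec string) string {
	repo := pullSpec
	if i := strings.LastIndex(repo, "@"); i >= 0 {
		repo = repo[:i] // drop digest
	}
	if i := strings.LastIndex(repo, ":"); i > strings.LastIndex(repo, "/") {
		repo = repo[:i] // drop tag
	}
	// Assumed allow-list of Red Hat registries; the real list may differ.
	for _, trusted := range []string{"registry.redhat.io/", "registry.access.redhat.com/"} {
		if strings.HasPrefix(repo, trusted) {
			return repo
		}
	}
	sum := sha256.Sum256([]byte(repo))
	return base64.RawURLEncoding.EncodeToString(sum[:])[:12]
}

func main() {
	fmt.Println(anonymizeRepository("docker.io/library/nginx:1.25"))        // hashed token
	fmt.Println(anonymizeRepository("registry.redhat.io/ubi9/ubi:latest")) // reported as-is
}
```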

 

This is a clone of issue OCPBUGS-11450. The following is the description of the original issue:

Description of problem:

When CNO is managed by Hypershift, its deployment has the "hypershift.openshift.io/release-image" template metadata annotation. The annotation's value is used to track progress of cluster control plane version upgrades. But the multus-admission-controller created and managed by CNO does not have that annotation, so service providers are not able to track its version upgrades.

The proposed solution is for CNO to propagate its "hypershift.openshift.io/release-image" annotation down to the multus-admission-controller deployment. For that, CNO needs to have "get" access to its own deployment manifest to be able to read the deployment template metadata annotations.

Hypershift needs code change to assign CNO "get" permission on the CNO deployment object.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Create OCP cluster using Hypershift
2.Check deployment template metadata annotations on multus-admission-controller

Actual results:

No "hypershift.openshift.io/release-image" deployment template metadata annotation exists 

Expected results:

"hypershift.openshift.io/release-image" annotation must be present

Additional info:
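A minimal client-go sketch of the propagation described above (illustrative only; the namespace and deployment names are assumptions, and the real CNO applies the annotation through its rendered manifests rather than a direct update):

```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const releaseImageAnnotation = "hypershift.openshift.io/release-image"

// propagateReleaseImageAnnotation copies the release-image template annotation
// from the CNO deployment onto the multus-admission-controller deployment.
func propagateReleaseImageAnnotation(ctx context.Context, client kubernetes.Interface, namespace string) error {
	cno, err := client.AppsV1().Deployments(namespace).Get(ctx, "cluster-network-operator", metav1.GetOptions{})
	if err != nil {
		return err
	}
	value, ok := cno.Spec.Template.Annotations[releaseImageAnnotation]
	if !ok {
		return fmt.Errorf("annotation %s not found on CNO deployment", releaseImageAnnotation)
	}
	multus, err := client.AppsV1().Deployments(namespace).Get(ctx, "multus-admission-controller", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if multus.Spec.Template.Annotations == nil {
		multus.Spec.Template.Annotations = map[string]string{}
	}
	multus.Spec.Template.Annotations[releaseImageAnnotation] = value
	_, err = client.AppsV1().Deployments(namespace).Update(ctx, multus, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := propagateReleaseImageAnnotation(context.Background(), client, "openshift-network-operator"); err != nil {
		panic(err)
	}
}
```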

 

This is a clone of issue OCPBUGS-14127. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14125. The following is the description of the original issue:

Description of problem:

Since registry.centos.org is closed, tests relying on this registry in e2e-agnostic-ovn-cmd job are failing.

Version-Release number of selected component (if applicable):

all

How reproducible:

Trigger e2e-agnostic-ovn-cmd job

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

vSphere 4.12 CI jobs are failing with:
admission webhook "validation.csi.vsphere.vmware.com" denied the request: AllowVolumeExpansion can not be set to true on the in-tree vSphere StorageClass

https://search.ci.openshift.org/?search=can+not+be+set+to+true+on+the+in-tree+vSphere+StorageClass&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

 

Version-Release number of selected component (if applicable):

4.12 nigthlies

How reproducible:

consistently in CI

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

This appears to have started failing in the past 36 hours.

Description of problem:

For Hardware Backed Management Ports (e.g. Virtual functions), the Egress IP Health Check Feature will error out with:
"unable to start health checking server: no mgmt ip"

Version-Release number of selected component (if applicable):

OVN-Kubernetes 4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Load OVN-Kubernetes 4.12.0 in MLX BlueField 2
2. If in NIC mode:
https://github.com/ovn-org/ovn-kubernetes/pull/3160
https://github.com/ovn-org/ovn-kubernetes/pull/3251
Patches are needed.
3. If in DPU mode then those above patches are optional.
4. Set OVNKUBE_NODE_MGMT_PORT_NETDEV environment variable to point to the Virtual Function.

Actual results:

Error in ovnkube-node:
"unable to start health checking server: no mgmt ip".
The ovnkube-node container will crash. Egress IP Health Check should be compatible with VFs as management port.

Expected results:

No Error.

Additional info:

A simple workaround is to not return an error:
go-controller/pkg/node/node.go
@@ -660,7 +660,8 @@ func (n *OvnNode) startEgressIPHealthCheckingServer(wg *sync.WaitGroup, mgmtPort
                        return fmt.Errorf("failed start health checking server due to unsettled IPv6: %w", err)
                }
        } else {
-               return fmt.Errorf("unable to start health checking server: no mgmt ip")
+               klog.Infof("Unable to start Egress IP health checking server: no mgmt ip")
+               return nil
        }

We rely on the user providing accurate information about the MAC addresses in the agent-config, because at the point we read it we haven't seen the hosts yet. However, if the user gets this wrong then chaos may ensue.

Once inventory is available, we should validate that the user has not:

  • Specified MAC addresses that belong to two different agents in the same host config; nor
  • Specified MAC addresses that belong to the same agent in two different host configs

and fail the install if they have.
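A minimal sketch of such a validation pass, assuming a simplified host-config shape and an inventory map of MAC address to agent ID (the types and field names here are illustrative, not the assisted-installer API):

```
package main

import "fmt"

// hostConfig is a simplified stand-in for an agent-config host entry:
// the MAC addresses the user claims belong to one host.
type hostConfig struct {
	Name string
	MACs []string
}

// validateMACMapping checks the user-supplied host configs against the
// inventory (MAC -> agent ID) once agents have reported in. It fails if a
// single host config mixes MACs from different agents, or if one agent's
// MACs are spread across two host configs.
func validateMACMapping(hosts []hostConfig, inventory map[string]string) error {
	agentToHost := map[string]string{}
	for _, h := range hosts {
		seenAgent := ""
		for _, mac := range h.MACs {
			agent, ok := inventory[mac]
			if !ok {
				continue // MAC not seen in inventory yet; nothing to validate
			}
			if seenAgent != "" && agent != seenAgent {
				return fmt.Errorf("host %q lists MACs from two different agents (%s, %s)", h.Name, seenAgent, agent)
			}
			seenAgent = agent
			if prev, ok := agentToHost[agent]; ok && prev != h.Name {
				return fmt.Errorf("agent %s is referenced by two host configs (%s, %s)", agent, prev, h.Name)
			}
			agentToHost[agent] = h.Name
		}
	}
	return nil
}

func main() {
	hosts := []hostConfig{{Name: "master-0", MACs: []string{"52:54:00:aa:aa:01", "52:54:00:aa:aa:02"}}}
	inventory := map[string]string{"52:54:00:aa:aa:01": "agent-1", "52:54:00:aa:aa:02": "agent-2"}
	fmt.Println(validateMACMapping(hosts, inventory)) // fails: one host config mixes two agents
}
```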

This is a clone of issue OCPBUGS-10794. The following is the description of the original issue:

Description of problem:

Our telemetry contains only vCenter version ("7.0.3") and not the exact build number. We need the build number to know what exact vCenter build user has and what bugs are fixed there (e.g. https://issues.redhat.com/browse/OCPBUGS-5817).
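For context, a minimal govmomi sketch showing where the exact build number lives in the vSphere API: ServiceContent.About carries both Version ("7.0.3") and Build, so telemetry could report "version (build)". The URL and credentials are placeholders; this is not the telemetry code itself.

```
package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
)

func main() {
	ctx := context.Background()
	// Placeholder vCenter endpoint and credentials.
	u, err := url.Parse("https://user:password@vcenter.example.com/sdk")
	if err != nil {
		panic(err)
	}
	client, err := govmomi.NewClient(ctx, u, true /* insecure */)
	if err != nil {
		panic(err)
	}
	about := client.ServiceContent.About
	fmt.Printf("vCenter %s (build %s)\n", about.Version, about.Build)
}
```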

 

This is a clone of issue OCPBUGS-2532. The following is the description of the original issue:

Description of problem:

Upgrades from OCP 4.11.9 to the latest OCP 4.12 nightly builds, including 4.12.0-ec.4, will fail. When the upgrade fails, there are typically two operators that never get upgraded (all others do upgrade to the targeted 4.12.x release):

dns                                        4.11.9                                     True        True          False      11h     DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."...
machine-config                             4.11.9                                     True        False         False      14h

The dns.operator details state it is waiting for a 4/5 pods to become available:
# oc describe dns.operator/default
...
Status:
  Cluster Domain:  cluster.local
  Cluster IP:      172.30.0.10
  Conditions:
    Last Transition Time:  2022-10-18T03:21:44Z
    Message:               Enough DNS pods are available, and the DNS service has a cluster IP address.
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2022-10-18T03:21:44Z
    Message:               Have 4 available DNS pods, want 5.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing

The mcp reports everything is good:
# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-87fd457ffdaf49d75e62b532c22a9f1d   True      False      False      3              3                   3                     0                      14h
worker   rendered-worker-7fc68009b1facf8724cd952cb08435ff   True      False      False      2              2                   2                     0                      14h

We have performed a large number of the same upgrades, using the same configuration, and while there are times the upgrade succeeds, a large number of the results do fail. This seems to be a timing issue.

As a current workaround, if we were to recycle the control plane nodes, the upgrade will complete successfully. 

A must-gather log is attached for review.

Version-Release number of selected component (if applicable):

Tested upgrading to all the following releases:
4.12.0-ec.4
4.12.0-0.nightly-s390x-2022-10-10-005931
4.12.0-0.nightly-s390x-2022-10-15-144437

How reproducible:

Moderate to Consistently 

Steps to Reproduce:

1. Start with a working OCP 4.11.9 Cluster.
2. Perform an upgrade to latest OCP 4.12.x nightly build.
3. Monitor the upgrade status:
   # oc get clusterversion
   —> will state % complete and waiting on dns - which never finishes.
   # oc get co
   —> the dns and machine-config operators will remain at 4.11.9
4. Upgrade will never complete. 

Actual results:

Upgrade will never complete.

Expected results:

Upgrade to the targeted release succeeds.

Additional info:

This upgrade issue occurs for both Connected and Disconnected Clusters.

 

This is a clone of issue OCPBUGS-3283. The following is the description of the original issue:

Description of problem:

We discovered that we are shipping unnecessary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989 .

This RBAC was only used in 4.2 and 4.3 for

  • making the switch from configMaps to leases in leader election

and we should remove it.

Version-Release number of selected component (if applicable):

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

According to https://issues.redhat.com/browse/OCPBUGS-705 (thanks to Junyun for sharing the test env/result for the install part), we need the fix in vsphere-problem-detector: it currently reports the following privileges as missing when using a pre-existing folder and/or resource pool with ReadOnly permission:

1. vcenter cluster set ReadOnly permission: 
I0902 10:07:50.324782       1 vsphere_check.go:244] CheckComputeClusterPermissions:jima-permission-q84s8-worker-86gd4 failed: missing privileges for compute cluster workloads: Resource.AssignVMToPool, VApp.AssignResourcePool, VApp.Import, VirtualMachine.Config.AddNewDisk


2. datacenter set ReadOnly permission:
I0902 08:09:19.462001       1 vsphere_check.go:225] CheckAccountPermissions failed: missing privileges for datacenter OCP-DC: Resource.AssignVMToPool, VApp.Import, VirtualMachine.Config.AddExistingDisk, VirtualMachine.Config.AddNewDisk, VirtualMachine.Config.AddRemoveDevice, VirtualMachine.Config.AdvancedConfig, VirtualMachine.Config.Annotation, VirtualMachine.Config.CPUCount, VirtualMachine.Config.DiskExtend, VirtualMachine.Config.DiskLease, VirtualMachine.Config.EditDevice, VirtualMachine.Config.Memory, VirtualMachine.Config.RemoveDisk, VirtualMachine.Config.Rename, VirtualMachine.Config.ResetGuestInfo, VirtualMachine.Config.Resource, VirtualMachine.Config.Settings, VirtualMachine.Config.UpgradeVirtualHardware, VirtualMachine.Interact.GuestControl, VirtualMachine.Interact.PowerOff, VirtualMachine.Interact.PowerOn, VirtualMachine.Interact.Reset, VirtualMachine.Inventory.Create, VirtualMachine.Inventory.CreateFromExisting, VirtualMachine.Inventory.Delete, VirtualMachine.Provisioning.Clone, VirtualMachine.Provisioning.DeployTemplate, VirtualMachine.Provisioning.MarkAsTemplate, Folder.Create, Folder.Delete 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-02-194931

How reproducible:

Always 

Steps to Reproduce:

See Description of problem

Actual results:

The vsphere-problem-detector operator reports missing privileges when using a pre-existing folder and/or resource pool with ReadOnly permission

Expected results:

The vsphere-problem-detector operator should not report missing privileges in that case.

Additional info:

 

These two tests are permafailing on webhook errors related to the CRD:

[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should show the Provisioning Network as 'Disabled' [Suite:openshift/conformance/serial]

[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should [apigroup:config.openshift.io] show the Provisioning Network as 'Disabled' [Suite:openshift/conformance/serial]

[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should allow setting the ProvisioningNetwork to 'Managed' with valid settings [Suite:openshift/conformance/serial]

[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should [apigroup:config.openshift.io] allow setting the ProvisioningNetwork to 'Managed' with valid settings [Suite:openshift/conformance/serial]

job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-virtualmedia=all

Example run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-virtualmedia/1567416810377056256

Sippy links:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=%5Bsig-installer%5D%5BFeature%3Abaremetal%5D%5BSerial%5D%20A%20baremetal%20deployment%20without%20a%20provisioning%20network%20should%20allow%20setting%20the%20ProvisioningNetwork%20to%20%27Managed%27%20with%20valid%20settings%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D

https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=%5Bsig-installer%5D%5BFeature%3Abaremetal%5D%5BSerial%5D%20A%20baremetal%20deployment%20without%20a%20provisioning%20network%20should%20show%20the%20Provisioning%20Network%20as%20%27Disabled%27%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D

Description of problem:

The API Explorer page layout is incorrect,  please check the attachment for more details

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-15-150248

How reproducible:

Always

Steps to Reproduce:
1. Login OCP, Go to Home -> API Explorer page

2. Check if there is an extra blank line between the dropdown filter and the list 

Actual results:

There is an extra blank line between the dropdown filter and the list 

Expected results:

Use the right PatternFly package and remove the extra blank line

Additional info:

104.0.5112.79 (Official Build) (64-bit)

This is a clone of issue OCPBUGS-13752. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13535. The following is the description of the original issue:

Description of problem:

The AdditionalTrustBundle field in install-config.yaml can be used to add additional certs, however these certs are only propagated to the final image when the ImageContentSources field is also set for mirroring. If mirroring is not set then the additional certs will be on the bootstrap but not the final image.

This can cause a problem when user has set up a proxy and wants to add additional certs as described here https://docs.openshift.com/container-platform/4.12/networking/configuring-a-custom-pki.html#installation-configure-proxy_configuring-a-custom-pki

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

- IPI cluster on GCP 
- Replaced a master node following the procedure described in "Replacing an unhealthy etcd member whose machine is not running or whose node is not ready": https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html
- Provisioning the new machine using machine-api doesn't add the newly provisioned virtual machine to the instance group that is used for the endpoints of the API-INT load balancer at the cloud provider

Version-Release number of selected component (if applicable):

 

How reproducible:

- Create a new master machine
$ cat new-master0-machine.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
  finalizers:
  - machine.machine.openshift.io
  generation: 3
  labels:
    machine.openshift.io/cluster-api-cluster: case-03507533-rfctg
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
    machine.openshift.io/instance-type: n2-standard-4
    machine.openshift.io/region: europe-southwest1
    machine.openshift.io/zone: europe-southwest1-a
  name: case-03507533-rfctg-master-0-v2
  namespace: openshift-machine-api
spec:
  lifecycleHooks:
    preDrain:
    - name: EtcdQuorumOperator
      owner: clusteroperator/etcd
  metadata: {}
  providerSpec:
    value:
      apiVersion: machine.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: projects/rhcos-cloud/global/images/rhcos-412-86-202303211731-0-gcp-x86-64
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n2-standard-8
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: case-03507533-rfctg-network
        subnetwork: case-03507533-rfctg-master-subnet
      projectID: cee-gcp-emea
      region: europe-southwest1
      serviceAccounts:
      - email: xxxxxxxxx
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - case-03507533-rfctg-master
      targetPools:
      - case-03507533-rfctg-api
      userDataSecret:
        name: master-user-data
      zone: europe-southwest1-a


$ oc apply -f new-master0-machine.yaml
machine.machine.openshift.io/case-03507533-rfctg-master-0-v2 created

$ oc get machine -n openshift-machine-api
NAME                              PHASE     TYPE            REGION              ZONE                  AGE
case-03507533-rfctg-master-0-v2   Running   n2-standard-8   europe-southwest1   europe-southwest1-a   9m39s

$ gcloud compute backend-services describe case-03507533-rfctg-api-internal
backends:
- balancingMode: CONNECTION
  group: https://www.googleapis.com/compute/v1/projects/cee-gcp-emea/zones/europe-southwest1-a/instanceGroups/case-03507533-rfctg-master-europe-southwest1-a
- balancingMode: CONNECTION
  group: https://www.googleapis.com/compute/v1/projects/cee-gcp-emea/zones/europe-southwest1-b/instanceGroups/case-03507533-rfctg-master-europe-southwest1-b
- balancingMode: CONNECTION
  group: https://www.googleapis.com/compute/v1/projects/cee-gcp-emea/zones/europe-southwest1-c/instanceGroups/case-03507533-rfctg-master-europe-southwest1-c

 $ gcloud compute instance-groups list
NAME                                            LOCATION             SCOPE  NETWORK                      MANAGED  INSTANCES
case-03507533-rfctg-master-europe-southwest1-b  europe-southwest1-b  zone   case-03507533-rfctg-network  No       1
case-03507533-rfctg-master-europe-southwest1-a  europe-southwest1-a  zone   case-03507533-rfctg-network  No       0
case-03507533-rfctg-master-europe-southwest1-c  europe-southwest1-c  zone   case-03507533-rfctg-network  No       1

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

It can be worked around by manually adding the instance to the instance group:

$ gcloud compute instance-groups unmanaged add-instances case-03507533-rfctg-master-europe-southwest1-a --zone=europe-southwest1-a --instances=case-03507533-rfctg-master-0-v2
Updated [https://www.googleapis.com/compute/v1/projects/cee-gcp-emea/zones/europe-southwest1-a/instanceGroups/case-03507533-rfctg-master-europe-southwest1-a].

$ gcloud compute instance-groups list
NAME                                            LOCATION             SCOPE  NETWORK                      MANAGED  INSTANCES
case-03507533-rfctg-master-europe-southwest1-b  europe-southwest1-b  zone   case-03507533-rfctg-network  No       1
case-03507533-rfctg-master-europe-southwest1-a  europe-southwest1-a  zone   case-03507533-rfctg-network  No       1
case-03507533-rfctg-master-europe-southwest1-c  europe-southwest1-c  zone   case-03507533-rfctg-network  No       1
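
For illustration only, a minimal Go sketch of the equivalent API call the machine controller would need to make to register a replacement master in its unmanaged instance group (uses google.golang.org/api/compute/v1; project, zone, group, and instance names are taken from the workaround above and are placeholders):

```
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Uses Application Default Credentials, like the gcloud workaround above.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	project := "cee-gcp-emea"
	zone := "europe-southwest1-a"
	group := "case-03507533-rfctg-master-europe-southwest1-a"
	instanceURL := fmt.Sprintf(
		"https://www.googleapis.com/compute/v1/projects/%s/zones/%s/instances/%s",
		project, zone, "case-03507533-rfctg-master-0-v2")

	// Equivalent of: gcloud compute instance-groups unmanaged add-instances ...
	req := &compute.InstanceGroupsAddInstancesRequest{
		Instances: []*compute.InstanceReference{{Instance: instanceURL}},
	}
	op, err := svc.InstanceGroups.AddInstances(project, zone, group, req).Context(ctx).Do()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("add-instances operation: %s (%s)", op.Name, op.Status)
}
```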

This is a clone of issue OCPBUGS-2727. The following is the description of the original issue:

Description of problem:

CVO recently introduced a new precondition, RecommendedUpdate [1]. When we request an upgrade to a version which is not in the available updates, the precondition returns UnknownUpdate and blocks the upgrade.

# oc get clusterversion/version -ojson | jq -r '.status.availableUpdates'
null

# oc get clusterversion/version -ojson | jq -r '.status.conditions[]|select(.type == "ReleaseAccepted")'
{
  "lastTransitionTime": "2022-10-20T08:16:59Z",
  "message": "Preconditions failed for payload loaded version=\"4.12.0-0.nightly-multi-2022-10-18-153953\" image=\"quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695\": Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"UnknownUpdate\": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.0-0.nightly-multi-2022-10-18-091108 to 4.12.0-0.nightly-multi-2022-10-18-153953 is unknown.",
  "reason": "PreconditionChecks",
  "status": "False",
  "type": "ReleaseAccepted"
}


[1]https://github.com/openshift/cluster-version-operator/pull/841/
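
A minimal sketch (not the actual CVO code; assuming github.com/openshift/api/config/v1) of the kind of check this precondition performs, which is why a missing availableUpdates list yields UnknownUpdate rather than a pass:

```
package sketch

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// recommendedUpdateCheck is a hypothetical stand-in for the precondition logic.
func recommendedUpdateCheck(cv *configv1.ClusterVersion, targetVersion string) error {
	// If available updates were never retrieved, the recommendation is unknown.
	for _, c := range cv.Status.Conditions {
		if c.Type == configv1.RetrievedUpdates && c.Status != configv1.ConditionTrue {
			return fmt.Errorf("UnknownUpdate: the recommended status of updating to %s is unknown", targetVersion)
		}
	}
	for _, u := range cv.Status.AvailableUpdates {
		if u.Version == targetVersion {
			return nil // the target is a recommended update
		}
	}
	return fmt.Errorf("NotRecommended: %s is not in availableUpdates", targetVersion)
}
```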

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-multi-2022-10-18-091108

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.12 cluster
2. Upgrade to a version which is not in the available update
# oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway.
Requesting update to release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695

Actual results:

CVO precondition check fails and blocks upgrade

Expected results:

Upgrade proceeds

Additional info:

 

This is a clone of issue OCPBUGS-4350. The following is the description of the original issue:

Steps to reproduce:
Release: 4.13.0-0.nightly-2022-11-30-183109 (latest 4.12 nightly as well)
Create a HyperShift cluster on AWS and wait until it has completed rolling out
Upgrade the HostedCluster by updating its release image to a newer one
Observe the 'network' clusteroperator resource in the guest cluster as well as the 'version' clusterversion resource in the guest cluster.
When the clusteroperator resource reports the upgraded release and the clusterversion resource reports the new release as applied, take a look at the ovnkube-master statefulset in the control plane namespace of the management cluster. It is still not finished rolling out.

Expected: that the network clusteroperator reports the new version only when all components have finished rolling out.
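
A minimal sketch (hypothetical helper, standard k8s.io/api types) of the kind of rollout check the version reporting would need to wait on before declaring the new release applied:

```
package sketch

import appsv1 "k8s.io/api/apps/v1"

// ovnKubeMasterRolledOut is a hypothetical helper: true only when the
// ovnkube-master StatefulSet has fully converged on its latest spec.
func ovnKubeMasterRolledOut(sts *appsv1.StatefulSet) bool {
	if sts.Spec.Replicas == nil {
		return false
	}
	want := *sts.Spec.Replicas
	return sts.Status.ObservedGeneration == sts.Generation &&
		sts.Status.UpdatedReplicas == want &&
		sts.Status.ReadyReplicas == want &&
		sts.Status.CurrentRevision == sts.Status.UpdateRevision
}
```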

This is a clone of issue OCPBUGS-6799. The following is the description of the original issue:

Description of problem:
The Pipelines -> Repositories list view in Dev Console does not show the running pipeline as the last PipelineRun in the table.

Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408

Description of problem:
In a completely disconnected cluster, the dev catalog takes too long to load

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A completely disconnected cluster
2. On the Add page, go to the All services page
3.

Actual results:
Takes too long to load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-1941.

Description of problem:

The error message from "opm alpha render-veneer semver" is not correct: "semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}" is meaningless and should not be printed.

Version-Release number of selected component (if applicable):

zhaoxia@xzha-mac operator-framework-olm % opm version
Version: version.Version{OpmVersion:"2149aebcc", GitCommit:"2149aebcc71367e6fba8f5416374917dae1e6a1c", BuildDate:"2022-09-08T04:31:47Z", GoOs:"darwin", GoArch:"amd64"}

How reproducible:

always

Steps to Reproduce:

1. create file
zhaoxia@xzha-mac OCP-53915 % cat catalog-semver-veneer-1.yaml
Schema: olm.semver
Candidate:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha20220829
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha20220830
Stable:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta
Fast:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta

2. run "opm alpha render-veneer semver" 
zhaoxia@xzha-mac operator-framework-olm % opm alpha render-veneer semver catalog-semver-veneer-1.yaml
2022/09/08 12:35:05 semver &{%!q(*os.file=&{{{0 0 0} 3 {0} <nil> 0 1 true true true} catalog-semver-veneer-1.yaml <nil> false false false})}: semver-render: unable to post-process bundle info: encountered bundle versions which differ only by build metadata, which cannot be ordered: [bundle version "1.0.1-alpha" cannot be compared to "1.0.1-alpha", bundle version "1.0.1-alpha+20220829" cannot be compared to "1.0.1-alpha"] 

3.

Actual results:

"semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}" is meaningless, should not be printed.

Expected results:

no error message "semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}"
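
For illustration, a minimal sketch of the underlying formatting mistake: logging a value that wraps an *os.File with a %q-style verb produces garbage like the output above, while logging the file name does not (paths and messages are placeholders, not the actual opm code):

```
package main

import (
	"errors"
	"log"
	"os"
)

func main() {
	f, err := os.Open("catalog-semver-veneer-1.yaml")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	renderErr := errors.New("semver-render: unable to post-process bundle info")

	// Bad: %q cannot meaningfully format an *os.File, so it prints
	// "&{%!q(*os.file=...)}" noise similar to the output above.
	log.Printf("semver %q: %v", f, renderErr)

	// Better: report the file by name.
	log.Printf("semver %q: %v", f.Name(), renderErr)
}
```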

Additional info:

 

This is a clone of issue OCPBUGS-1748. The following is the description of the original issue:

Description of problem:

PipelineRun templates are currently fetched from `openshift-pipelines` namespace. It has to be fetched from `openshift` namespace.

Version-Release number of selected component (if applicable):
4.11 and 1.8.1 OSP

Align with operator changes https://issues.redhat.com/browse/SRVKP-2413 in 1.8.1, UI has to update the code to fetch pipelinerun templates from openshift namespace.

This is a clone of issue OCPBUGS-11750. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11046. The following is the description of the original issue:

Description of problem:

The following test is permafailing in Prow CI:
[tuningcni] sysctl allowlist update [It] should start a pod with custom sysctl only after adding sysctl to allowlist

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-ovn-periodic/1640987392103944192


[tuningcni]
/go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:26
  sysctl allowlist update
  /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:141
    should start a pod with custom sysctl only after adding sysctl to allowlist
    /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156
> Enter [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855
< Exit [BeforeEach] [tuningcni] - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/pkg/execute/ginkgo.go:9 @ 03/29/23 10:08:49.855 (0s)
> Enter [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.855
< Exit [BeforeEach] sysctl allowlist update - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:144 @ 03/29/23 10:08:49.896 (41ms)
> Enter [It] should start a pod with custom sysctl only after adding sysctl to allowlist - /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:156 @ 03/29/23 10:08:49.896
[FAILED] Unexpected error:
    <*errors.errorString | 0xc00044eec0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
In [It] at: /go/src/github.com/openshift-kni/cnf-features-deploy/cnf-tests/testsuites/e2esuite/security/tuning.go:186 @ 03/29/23 10:09:53.377

Version-Release number of selected component (if applicable):

master (4.14)

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test fails

Expected results:

Test passes

Additional info:

PR https://github.com/openshift-kni/cnf-features-deploy/pull/1445 adds some useful information to the reported archive.

This is a clone of issue OCPBUGS-3713. The following is the description of the original issue:

This is essentially the same issue as in OCPBUGS-3668, where we found the username must be fully qualified (e.g. "rbost@vsphere.local", not just "rbost").

Description of problem:

Found during 1.25 rebase work, test hit this panic in two runs of 4.12-e2e-vsphere-ovn-upi-serial:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-upi-serial/1567239801269129216

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-upi-serial/1567066819087306752

Full error for reference:

```github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:107 +0x96
panic({0x766b520, 0xc183570})
    runtime/panic.go:838 +0x207
k8s.io/kubernetes/test/e2e/network.glob..func15.4()
    k8s.io/kubernetes@v1.24.0/test/e2e/network/ingressclass.go:97 +0x284
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x300000002?)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113 +0xb1
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc002466e40?)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:64 +0x125
github.com/onsi/ginkgo/internal/leafnodes.(*ItNode).Run(0x7f72ca69cfff?)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/it_node.go:26 +0x7b
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc003305b30, 0xc00066b208?, {0x8faff00, 0xc00045edc0})
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:215 +0x28a
github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc003305b30, {0x8faff00, 0xc00045edc0})
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:138 +0xe7
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc002480280, 0xc003305b30)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:200 +0xe8
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc002480280)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:170 +0x1a5
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc002480280)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:66 +0xc5
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc0004762d0, {0x8fb0260, 0xc002ba2690}, {0x0, 0x0}, {0xc002bb8600, 0x1, 0x1}, {0x8ff18e0, 0xc00045edc0}, ...)
    github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/suite/suite.go:62 +0x4b2
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc0024b28c0, {0xc000311420, 0xc58c8b0?, 0x4f19d80?})
    github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:448 +0x32
github.com/openshift/origin/test/extended/util.WithCleanup(0xc002527bb8)
    github.com/openshift/origin/test/extended/util/test.go:168 +0xad
main.newRunTestCommand.func1(0xc0024cc780?, {0xc000311420, 0x1, 0x1})
    github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:448 +0x325
github.com/spf13/cobra.(*Command).execute(0xc0024cc780, {0xc0003113a0, 0x1, 0x1})
    github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3fb80)
    github.com/spf13/cobra@v1.4.0/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
    github.com/spf13/cobra@v1.4.0/command.go:902
main.main.func1(0xc000de1700?)
    github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:94 +0x8a
main.main()
    github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:95 +0x476

fail [runtime/panic.go:220]: Test Panicked: runtime error: invalid memory address or nil pointer dereference
Ginkgo exit error 1: exit with code 1
```

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

Test hit panic

Expected results:

No panic

Additional info:

 

While backporting the node healthz server (#1570) to 4.12, a number of functions related to checking stale OVS ports (checkForStaleOVSInternalPorts, checkForStaleOVSRepresentorInterfaces, checkForStaleOVSInterfaces) were moved to pkg/node/openflow_manager.go, while their related tests were left in pkg/node/healthcheck_test.go. In 4.13, we have everything under pkg/network-controller-manager. To keep consistency, let's move these to pkg/node/node.go and pkg/node/node_test.go.

Description of problem:

Egress IP is not being assigned to the primary interface of the node as per the hostsubnet definition. The issue is observed on an OpenShift cluster hosted in a disconnected AWS environment. The following steps were performed on the AWS side:

- Disconnected VPC was created and installation of Openshift was done as per documentation.
- An Elastic IP could not be used as it is a disconnected environment. The customer identified a free IP from the same subnet as the node and modified the node's interface to add a secondary IP.

It seems the cloud.network.openshift.io/egress-ipconfig annotation is needed on the node to attach the IP to the primary interface, but it is missing. In the SDN pod log on the same node I could see it complaining about 'an incomplete annotation "cloud.network.openshift.io/egress-ipconfig"'. Will share more details in the comments.

Version-Release number of selected component (if applicable):

Openshift 4.10.28

How reproducible:

Always

Steps to Reproduce:

1. Create a disconnected environment on AWS
2. Find a free IP from the subnet where a worker node is hosted and add it as a secondary IP to that node's NIC.
3. Configure hostsubnet and netnamespace on Openshift cluster

Actual results:

- Egress IP is not being attached to the primary interface of the node for which the hostsubnet has been configured

Expected results:

- Egress IP should get configured without any issue.

Additional info:


Description of problem:

when provisioningNetwork is changed from Disabled to Managed/Unmanaged, the ironic-proxy daemonset is not removed

This causes the metal3 pod to be stuck in pending, since both pods are trying to use port 6385 on the host:

0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports

Version-Release number of selected component (if applicable):

4.12rc.4

How reproducible:

Every time for me

Steps to Reproduce:

1. On a multinode cluster, change the provisioningNetwork from Disabled to Unmanaged (I didn't try Managed)
2.
3.

Actual results:

0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports

Expected results:

I believe the ironic-proxy daemonset should be deleted when the provisioningNetwork is set to Managed/Unmanaged

Additional info:

If I manually delete the ironic-proxy Daemonset, the controller does not re-create it.
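
A minimal sketch (hypothetical reconcile helper, using client-go; not the actual cluster-baremetal-operator code) of the behaviour described in the expected results: when the provisioning network is no longer Disabled, the operator should remove the ironic-proxy DaemonSet instead of leaving it to collide with metal3 on port 6385:

```
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureIronicProxy is a hypothetical helper illustrating the expected behaviour.
func ensureIronicProxy(ctx context.Context, client kubernetes.Interface, provisioningNetwork string) error {
	const ns, name = "openshift-machine-api", "ironic-proxy"

	if provisioningNetwork == "Disabled" {
		// ironic-proxy is only needed when there is no provisioning network;
		// creating/updating it is out of scope for this sketch.
		return nil
	}

	// Managed/Unmanaged: the DaemonSet must go away so metal3 can bind port 6385.
	err := client.AppsV1().DaemonSets(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
```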

Currently, the AWS actuator has a static list of instance types embedded in it. This means that as new instance types are added, we have to continually update this list.

Ideally, we could fetch this information from the AWS API as we do in GCP.

DoD:

  • Investigate availability of instance memory and CPU capacity as an API on AWS
  • Determine if we can use this for the autoscaling scale from zero annotations
  • If possible, implement the change.
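
A minimal sketch (assuming aws-sdk-go v1; not the actuator's actual code) of fetching vCPU and memory capacity from the EC2 API instead of a static table, as suggested above:

```
package sketch

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// lookupInstanceType is a hypothetical helper returning vCPUs and memory (MiB)
// for a single instance type, e.g. for scale-from-zero annotations.
func lookupInstanceType(instanceType string) (int64, int64, error) {
	sess, err := session.NewSession(aws.NewConfig().WithRegion("us-east-1"))
	if err != nil {
		return 0, 0, err
	}
	out, err := ec2.New(sess).DescribeInstanceTypes(&ec2.DescribeInstanceTypesInput{
		InstanceTypes: []*string{aws.String(instanceType)},
	})
	if err != nil {
		return 0, 0, err
	}
	if len(out.InstanceTypes) == 0 {
		return 0, 0, fmt.Errorf("instance type %q not found", instanceType)
	}
	info := out.InstanceTypes[0]
	return aws.Int64Value(info.VCpuInfo.DefaultVCpus), aws.Int64Value(info.MemoryInfo.SizeInMiB), nil
}
```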

This is a clone of issue OCPBUGS-2144. The following is the description of the original issue:

Description of problem:

Azure IPI now creates boot images using the image gallery API; it creates two image definition resources, one for hyperVGeneration V1 and one for V2. For an arm64 cluster, the architecture in the hyperVGeneration V1 image definition is x64, but it should be Arm64.

Version-Release number of selected component (if applicable):

./openshift-install version
./openshift-install 4.12.0-0.nightly-arm64-2022-10-07-204251
built from commit 7b739cde1e0239c77fabf7622e15025d32fc272c
release image registry.ci.openshift.org/ocp-arm64/release-arm64@sha256:d2569be4ba276d6474aea016536afbad1ce2e827b3c71ab47010617a537a8b11
release architecture arm64

How reproducible:

always

Steps to Reproduce:

1.Create arm cluster using latest arm64 nightly build 
2.Check image definition created for hyperVGeneration V1

Actual results:

The architecture field is x64.
###
$ az sig image-definition show --gallery-name ${gallery_name} --gallery-image-definition lwanazarm1008-rc8wh --resource-group ${rg} | jq -r ".architecture"
x64
The image version under this image definition is for aarch64.
###
$ az sig image-version show --gallery-name gallery_lwanazarm1008_rc8wh --gallery-image-definition lwanazarm1008-rc8wh --resource-group lwanazarm1008-rc8wh-rg --gallery-image-version 412.86.20220922 | jq -r ".storageProfile.osDiskImage.source"
{  "uri": "https://clustermuygq.blob.core.windows.net/vhd/rhcosmuygq.vhd"}
$ az storage blob show --container-name vhd --name rhcosmuygq.vhd --account-name clustermuygq --account-key $account_key | jq -r ".metadata"
{  "Source_uri": "https://rhcos.blob.core.windows.net/imagebucket/rhcos-412.86.202209220538-0-azure.aarch64.vhd"}

Expected results:

Although no VMs with HypergenV1 can be provisioned, the architecture field should be Arm64 even for hyperGenerationV1 image definitions

Additional info:

1. The architecture in the hyperVGeneration V2 image definition is Arm64, and the installer uses V2 by default for the arm64 vm_type, so the installation didn't fail by default. But we still need to make the architecture consistent in V1.

2. We need to set the architecture field for both V1 and V2; currently we only set the architecture in the V2 image definition resource.
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L100-L128 

This bug is a backport clone of [Bugzilla Bug 2100429](https://bugzilla.redhat.com/show_bug.cgi?id=2100429). The following is the description of the original bug:

Description of problem:
[apiserver-auth] The default restricted SCC's allowed volumes don't include "ephemeral", causing a deployment with Generic Ephemeral Volumes to be stuck at Pending

Version-Release number of selected component (if applicable):
Cluster version is 4.11.0-0.nightly-2022-06-22-190830
$ oc version
Client Version: 4.11.0-0.nightly-2022-05-11-054135
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-06-22-190830
Kubernetes Version: v1.24.0+284d62a

How reproducible:
Always

Steps to Reproduce:

1. Set up a AWS OCP cluster with 4.11 nightly
2. Create a deployment with Generic Ephemeral Volumes
3. Waiting for the deployment ready and check the volume could write and read data

Test data:
wangpenghao@MacBook-Pro ~ cat temp.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-dep
  template:
    metadata:
      labels:
        app: my-dep
    spec:
      containers:
      - image: >-
          quay.io/openshifttest/hello-openshift@sha256:b1aabe8c8272f750ce757b6c4263a2712796297511e0c6df79144ee188933623
        name: my-container
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /mnt/storage
          name: inline-volume
      volumes:
      - name: inline-volume
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                workloadName: my-dep
            spec:
              accessModes:
              - ReadWriteOnce
              storageClassName: gp3-csi
              resources:
                requests:
                  storage: 1Gi

wangpenghao@MacBook-Pro ~ oc apply -f temp.yaml
Warning: would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "my-dep-mcxx803w" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "my-dep-mcxx803w" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "my-dep-mcxx803w" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "my-dep-mcxx803w" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/my-dep created

wangpenghao@MacBook-Pro ~ oc get deploy
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
my-dep   0/1     0            0           7s

wangpenghao@MacBook-Pro ~ oc get event
LAST SEEN   TYPE      REASON              OBJECT                         MESSAGE
5s          Warning   FailedCreate        replicaset/my-dep-6bd958d877   Error creating: pods "my-dep-6bd958d877-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "ephemeral": ephemeral volumes are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
16s         Normal    ScalingReplicaSet   deployment/my-dep              Scaled up replica set my-dep-6bd958d877 to 1

Actual results:
In Step 3: The deployment is stuck at Pending because it is unable to validate against any security context constraint

Expected results:
In Step 3: The deployment should become ready with the default restricted SCC; the default restricted SCC should allow:
volumes:
- ephemeral

Additional info:

Generic ephemeral volumes are the safer option of these two - it just creates/deletes PVCs on behalf of users. And most users can already create PVCs.

The ephemeral volume type is not in the scc.volumes list definition:
https://docs.openshift.com/container-platform/4.10/authentication/managing-security-context-constraints.html#authorization-cont[…]ing-internal-oauth

So currently, if customers want to use the ephemeral volume type, they have to use an SCC with:
volumes:
- '*'
e.g. scc/privileged

Discuss record: https://coreos.slack.com/archives/CB48XQ4KZ/p1655465586780419

Generic Ephemeral Volumes docs:
https://kubernetes.io/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/#generic-ephemeral-volumes
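
For illustration, a minimal sketch (hypothetical helper, not the apiserver's actual SCC admission code) of why the restricted SCC rejects the pod: the pod's ephemeral volume maps to the "ephemeral" volume source, which is only accepted if the SCC's volumes list contains it (or '*'):

```
package sketch

import corev1 "k8s.io/api/core/v1"

// volumeSourceName maps a pod volume to the SCC volume-source name it needs.
// Only the sources relevant to this bug are shown.
func volumeSourceName(v corev1.Volume) string {
	switch {
	case v.Ephemeral != nil:
		return "ephemeral"
	case v.PersistentVolumeClaim != nil:
		return "persistentVolumeClaim"
	default:
		return "other"
	}
}

// allowedBySCC reports whether every pod volume is covered by the SCC's allow-list.
func allowedBySCC(pod *corev1.Pod, allowedVolumes []string) bool {
	allowed := map[string]bool{}
	for _, a := range allowedVolumes {
		allowed[a] = true
	}
	for _, v := range pod.Spec.Volumes {
		if !allowed["*"] && !allowed[volumeSourceName(v)] {
			return false
		}
	}
	return true
}
```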

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

This is a clone of issue OCPBUGS-13598. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6013. The following is the description of the original issue:

Description of problem:

When utilizing the OSD "Edit Cluster Ingress" feature to change the default application router from public to private or vice versa, the external AWS load balancer is removed and replaced by the cloud-ingress-operator.

When this happens, the external load balancer health checks never receive a successful check from the backend nodes, and all nodes are marked out-of-service.

Cluster operators depending on *.apps.CLUSTERNAME.devshift.org begin to fail, initially with DNS errors, which is expected, but then with EOF messages attempting to get the routes associated with their health checks, eg: 

OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.chcollin-mjtj.cvgo.s1.devshift.org/healthz": EOF

This always degrades the authentication, console and ingress (via ingress-canary) operators.

Logs from the `ovnkube-node-*` pods for the instance show OVN properly updating the port for the endpoint health check to the new port in use by the AWS LB.

The endpointSlices for the endpoint are updated/replaced, but with no change in config as far as I can tell.  They're just recreated.

The service backending the router-default pods has the proper HealthCheckNodePort configuration, matching the new AWS LB.

Curling the service via the CLUSTER_IP:NODE_PORT_HEALTH_CHECK/healthz results in a connection time out.

Curling the local health check for HAPROXY within the router-default pod via `localhost:1936/healthz` results in an OK response as expected.

After rolling the router-default pods manually with `oc rollout restart deployment router-default -n openshift-ingress`, or just deleting the pods, the cluster ends up healing, with the AWS LB seeing the backend infra nodes in service again, and cluster operators depending on the *apps.CLUSTERNAME.devshift.org domain healing on their own as well.

I'm unsure if this should go to network-ovn or network-multis (or some other component), so I'm starting here.  Please redirect me if necessary.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Login to the OCM console for the cluster (eg: https://qaprodauth.console.redhat.com/openshift for staging)
2. From the network tab, select "Edit Cluster Ingress"
3. Check or uncheck the "Make Router Private" box for the default application router - it does not matter which way you're swapping.

Actual results:

Ingress to the default router begins to fail for the *.apps routes; never becomes available

Expected results:

Ingress would fail for ~15 minutes as things are reconfigured, and then become available again.

Additional info:

Two must-gathers are available via Google drive https://drive.google.com/drive/u/1/folders/1oIkNOSY0R9Mvo-BZ1Pa3W3iDDfF_726F and shared with Red Hat employees, from a test cluster I created. The first is from before the change, and the second is from after the change. This is on a brand new cluster, so logs should be clean-ish.

Description of problem:

Clusters created with platform 'vsphere' in the install-config end up as type 'BareMetal' in the infrastructure CR.

Version-Release number of selected component (if applicable):

4.12.3

How reproducible:

100%

Steps to Reproduce:

1. Create a cluster through the agent installer with platform: vsphere in the install-config
2. oc get infrastructure cluster -o jsonpath='{.status.platform}' 

Actual results:

BareMetal

Expected results:

VSphere

Additional info:

The platform type is not being case converted ("vsphere" -> "VSphere") when constructing the AgentClusterInstall CR. When read by the assisted-service client, the platform reads as unknown and therefore the platform field is left blank when the Cluster object is created in the assisted API. Presumably that results in the correct default platform for the topology: None for SNO, BareMetal for everything else, but never VSphere. Since the platform VIPs are passed through a non-platform-specific API in assisted, everything worked but the resulting cluster would have the BareMetal platform.
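
A minimal sketch (hypothetical helper, using github.com/openshift/api/config/v1 constants) of the case normalization that is missing when the AgentClusterInstall CR is constructed:

```
package sketch

import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
)

// normalizePlatformType is a hypothetical helper converting install-config
// platform names ("vsphere", "baremetal", "none") to their canonical form.
func normalizePlatformType(p string) (configv1.PlatformType, error) {
	switch strings.ToLower(p) {
	case "vsphere":
		return configv1.VSpherePlatformType, nil // "VSphere"
	case "baremetal":
		return configv1.BareMetalPlatformType, nil // "BareMetal"
	case "none":
		return configv1.NonePlatformType, nil // "None"
	default:
		return "", fmt.Errorf("unknown platform type %q", p)
	}
}
```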

Description of problem:

While addressing the flakiness of a test in the IO tests, we found some issues in the cluster_version_matches condition for the conditional gatherer.

Firstly, the character limit should be increased: 32 characters does not cover every possible release version, as some exceed that limit.
Furthermore, there is an error in the schema

https://github.com/openshift/insights-operator/blob/master/pkg/gatherers/conditional/gathering_rule.schema.json#L101

There is no "name" field there; it should be "version".

How reproducible:

Sometimes

Steps to Reproduce:

1. Spin a cluster from a PR
2. If the version exceeds 32 characters, we get the following in the pod logs: 'Could not get version from string: "<"'
 

Actual results:

'Could not get version from string: "<"'

Expected results:

Metadata should contain an "invalid range" error.

Additional info:

However, since there's the possibility for versions to exceed 32 characters, we shouldn't expect an error in this situation. Therefore, there might be more than one issue.
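
For illustration, a minimal sketch (assuming the github.com/Masterminds/semver/v3 library for parsing; not the gatherer's actual code) showing that nightly version strings easily exceed 32 characters yet still parse and match a range, so the schema limit, not the version itself, is the problem:

```
package main

import (
	"fmt"
	"log"

	"github.com/Masterminds/semver/v3"
)

func main() {
	// 34 characters: longer than the schema's 32-character limit.
	raw := "4.12.0-0.nightly-2022-08-15-150248"
	fmt.Println(len(raw))

	v, err := semver.NewVersion(raw)
	if err != nil {
		log.Fatal(err)
	}
	c, err := semver.NewConstraint(">=4.12.0-0")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(c.Check(v)) // true: the long version string is perfectly valid
}
```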

Description of problem:

The Insights Operator configuration was not properly deserialized from the YAML configuration, so the cluster transfer interval was not updated correctly.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

There's still the following message in the logs:
cluster_transfer.go:78] checking the availability of cluster transfer. Next check is in 24h0m0s

Expected results:

The cluster transfer interval check should be 12h
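
A minimal sketch (hypothetical field names; using gopkg.in/yaml.v3) of the kind of deserialization involved: the interval only changes if the YAML field actually unmarshals into the config struct and is parsed as a duration.

```
package main

import (
	"fmt"
	"log"
	"time"

	"gopkg.in/yaml.v3"
)

// controllerConfig uses hypothetical field names for illustration only.
type controllerConfig struct {
	ClusterTransfer struct {
		Interval string `yaml:"interval"`
	} `yaml:"clusterTransfer"`
}

func main() {
	raw := []byte("clusterTransfer:\n  interval: 12h\n")

	var cfg controllerConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}

	// If the YAML tag or nesting is wrong, Interval stays "" and the
	// controller keeps using its compiled-in default (the 24h seen in the logs).
	interval, err := time.ParseDuration(cfg.ClusterTransfer.Interval)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("next cluster transfer check in", interval)
}
```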

Additional info:

 

This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:

Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".

We shouldn't block operator installation if permission issues prevent dynamic plugin installation.

This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a pared-down permission set called "dedicated-admin".

See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD

We do not have a well-defined method to find all of these just yet; identifying one would be a good first step.

Currently, for summary logs, if there is a kube-api issue the controller will not upload logs, but it should, since it has a file to read them from.

This is a clone of issue OCPBUGS-6777. The following is the description of the original issue:

Description of problem:

"create manifests" without an existing "install-config.yaml" missing 4 YAML files in "<install dir>/openshift" which leads to "create cluster" failure

Version-Release number of selected component (if applicable):

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. "create manifests"
2. "create cluster" 

Actual results:

1. After "create manifests", in "<install dir>/openshift", there're 4 YAML files missing, including "99_cloud-creds-secret.yaml", "99_kubeadmin-password-secret.yaml", "99_role-cloud-creds-secret-reader.yaml", and "openshift-install-manifests.yaml", comparing with "create manifests" with an existing "install-config.yaml".
2. The installation failed without any worker nodes due to error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api".

Expected results:

1. "create manifests" without an existing "install-config.yaml" should generate the same set of YAML files as "create manifests" with an existing "install-config.yaml".
2. Then the subsequent "create cluster" should succeed.

Additional info:

The working scenario: "create manifests" with an existing "install-config.yaml"

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
$ 
$ mkdir test30
$ cp install-config.yaml test30
$ yq-3.3.0 r test30/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ yq-3.3.0 r test30/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0130a
$ ./openshift-install create manifests --dir test30
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" 
INFO Consuming Install Config from target directory 
WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated 
INFO Manifests created in: test30/manifests and test30/openshift 
$ 
$ tree test30
test30
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml  
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 31 files
$ 

The problem scenario: "create manifests" without an existing "install-config.yaml", and then "create cluster"

$ ./openshift-install create manifests --dir test31
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-0130b
? Pull Secret [? for help] *******
INFO Manifests created in: test31/manifests and test31/openshift
$ 
$ tree test31
test31
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    └── 99_openshift-machineconfig_99-worker-ssh.yaml

2 directories, 27 files
$ 
$ ./openshift-install create cluster --dir test31
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:17PM) for the Kubernetes API at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.25.2+7dab57f up
INFO Waiting up to 30m0s (until 4:28PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:59PM) for the cluster at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_ResourceNotFound::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found
ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 7 of 7 credentials requests are failing to sync.
INFO Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 7 credentials requests provisioned, 7 reporting errors.
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted 
ERROR Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
ERROR Cluster operator control-plane-machine-set Degraded is True with NoReadyMachines: No ready control plane machines found
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist
ERROR NodeCADaemonAvailable: The daemon set node-ca has available replicas
ERROR ImagePrunerAvailable: Pruner CronJob has been created
INFO Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed
ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DNSReady=False (NoZones: The record isn't present in any zones.)
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
INFO ).
INFO Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-c68b5786c-prk7x" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Pod "router-default-c68b5786c-ssrv7" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DNSReady=False (NoZones: The record isn't present in any zones.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host  
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107 because minimum worker replica count (2) not yet met: current running replicas 0, waiting for [jiwei-0130b-25fcm-worker-a-j6t42 jiwei-0130b-25fcm-worker-b-dpw9b jiwei-0130b-25fcm-worker-c-9cdms]
ERROR Cluster operator machine-api Available is False with Initializing: Operator is initializing
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
ERROR Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring, storage are not available
$ export KUBECONFIG=test31/auth/kubeconfig 
$ ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          74m     Unable to apply 4.13.0-0.nightly-2023-01-27-165107: some cluster operators are not available
$ ./oc get nodes
NAME                                                 STATUS   ROLES                  AGE   VERSION
jiwei-0130b-25fcm-master-0.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-1.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
jiwei-0130b-25fcm-master-2.c.openshift-qe.internal   Ready    control-plane,master   69m   v1.25.2+7dab57f
$ ./oc get machines -n openshift-machine-api
NAME                               PHASE   TYPE   REGION   ZONE   AGE
jiwei-0130b-25fcm-master-0                                        73m
jiwei-0130b-25fcm-master-1                                        73m
jiwei-0130b-25fcm-master-2                                        73m
jiwei-0130b-25fcm-worker-a-j6t42                                  65m
jiwei-0130b-25fcm-worker-b-dpw9b                                  65m
jiwei-0130b-25fcm-worker-c-9cdms                                  65m
$ ./oc get controlplanemachinesets -n openshift-machine-api
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3                           3             Active   74m
$ 

Please see the attached ".openshift_install.log", install-config.yaml snippet, and more "oc" commands outputs.

 

 

 

 

 

Description of problem: The product name for Azure Red Hat OpenShift was incorrect in Customer Case Management (CCM). As a result, the console included this incorrect product name in order for the support case link to correctly route. https://issues.redhat.com/browse/CPCCM-9926 fixed the incorrect product name, so now the support case link for Azure needs to be updated to reflect the correct product name.

This is a clone of issue OCPBUGS-11458. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6947. The following is the description of the original issue:

This is a long standing issue where gcp ovn for some reason sees dramatically more disruption to ingress during upgrades than other clouds. It can best be seen in the "ingress" graphs in charts such as: https://lookerstudio.google.com/s/v6xhLCTHHDY

Notice image-registry-new (which is ingress backed), ingress-to-console new, and ingress-to-oauth new, all of which take an average of 40s as of the time of this writing. For comparison, Azure is normally <10, and AWS <4.

You will also note the load-balancer new backend shows similar high disruption, but after conversations with network edge we now know the code paths for these two are very different, thus we're filing this as a separate bug. The SLB bug is https://issues.redhat.com/browse/OCPBUGS-6796. The two may prove to be same cause in future, as they do appear similar, but not identical even in terms of when the problems occur.

Some example prow jobs are easy to find, as the disruption is there on average. Note that we do not typically fail a test on these as the disruption monitoring stack is built to try to pin where we're at now, and this is a long standing issue.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144

This job was near successful but got 45s of disruption to image-registry-new. The disruption observed can always be seen in artifacts such as: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20230201-120923.json

Expanding the first "Intervals - spyglass" chart on the main prowjob page, you can see when the disruption occurred and what else was going on in the cluster at that time.

This shows we're not getting a continuous 40+s of disruption, rather a few batches.

The ingress services all go down roughly together; the service load balancer pattern looks a little different, hence the separate bug mentioned above.

For more examples just visit https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade%22%7D%5D%7D&sortField=timestamp&sort=desc, it will happen nearly every time.

When examining what else was going on when this happens, we see some clear patterns of nodes being updated.

This is a clone of issue OCPBUGS-20181. The following is the description of the original issue:

Description of problem:

unit test failures rates are high https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oc-master-unit

TestNewAppRunAll/emptyDir_volumes is failing

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1557/pull-ci-openshift-oc-master-unit/1710206848667226112

Version-Release number of selected component (if applicable):

 

How reproducible:

Run local or in CI and see that unit test job is failing

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

A cluster hit a panic in the etcd operator during bootstrap:
I0829 14:46:02.736582 1 controller_manager.go:54] StaticPodStateController controller terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e940ab]

goroutine 2701 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x29374c0, 0xc00217d920}, 0xc0021fb110)
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:135 +0x34b
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x7f
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2ac
Version-Release number of selected component (if applicable):

 

How reproducible:

Pulled up a 4.12 cluster and hit panic during bootstrap

Steps to Reproduce:

1.
2.
3.

Actual results:

panic as above

Expected results:

no panic

Additional info:

 

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.

2.

3.

 

Actual results:

 

Expected results:

 

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure 
  2. customer issue / SD
  3. internal RedHat testing failure

 

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

 

If it is a CI failure:

 

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

 

If it is a customer / SD issue:

 

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
  • Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, labels with “sbr-untriaged”
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

Description of problem:

4.12 tech-preview jobs are suffering:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=event+happened.*no+matches+for+kind.*InsightsDataGather&maxAge=48h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview-serial (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-techpreview-serial (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview-serial (all) - 10 runs, 100% failed, 100% of failures match = 100% impact

with runs like this failing:

: [sig-arch] events should not repeat pathologically expand_less	0s
{  1 events happened too frequently

event happened 138 times, something is wrong: ns/default namespace/default - reason/Unable to find REST mapping for %s/%s: %w InsightsDataGather.config.openshift.io%!(EXTRA string=v1, *meta.NoKindMatchError=no matches for kind "InsightsDataGather" in version "config.openshift.io/v1")}

based on events like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview/1597393851226525696/artifacts/e2e-aws-sdn-techpreview/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "default" and (.message | contains("InsightsDataGather")))'
{
  "apiVersion": "v1",
  "count": 145,
  "eventTime": null,
  "firstTimestamp": "2022-11-29T01:32:16Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Namespace",
    "name": "default",
    "namespace": "default"
  },
  "kind": "Event",
  "lastTimestamp": "2022-11-29T02:19:36Z",
  "message": "InsightsDataGather.config.openshift.io%!(EXTRA string=v1, *meta.NoKindMatchError=no matches for kind \"InsightsDataGather\" in version \"config.openshift.io/v1\")",
  "metadata": {
    "creationTimestamp": "2022-11-29T01:32:16Z",
    "name": "default.172bea26177786ae",
    "namespace": "default",
    "resourceVersion": "237357",
    "uid": "187cf3a0-cf4b-4cd1-ae72-51b5d77b7e73"
  },
  "reason": "Unable to find REST mapping for %s/%s: %w",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "run-resourcewatch-config-observer-controller-configobservercontroller"
  },
  "type": "Warning"
}

Version-Release number of selected component (if applicable):

4.12 tech-preview jobs are impacted.

How reproducible:

100% for some job flavors, per the search CI output above.

Steps to Reproduce:

1. Look at test results for any of the impacted job flavors.

Actual results:

Lots of NoKindMatchError events for v1 InsightsDataGather (it's only v1alpha1).

Expected results:

Passing test-cases.

Additional info:

The problematic REST-mapping client was removed from 4.13/dev as part of origin#27596.
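
The literal "%s/%s: %w" left in the event reason suggests a format string was handed to the event recorder without being rendered. A minimal Go sketch of the intended pattern using client-go's event recorder (illustrative only, not the actual run-resourcewatch code; the "NoRESTMapping" reason is a placeholder):

package eventsketch

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// recordMappingWarning renders the message before handing it to the recorder
// (Eventf with matching arguments works too), and uses %v rather than %w,
// which only fmt.Errorf understands. Illustrative sketch, not the real code.
func recordMappingWarning(rec record.EventRecorder, obj runtime.Object, gv, kind string, err error) {
    rec.Event(obj, corev1.EventTypeWarning, "NoRESTMapping",
        fmt.Sprintf("unable to find REST mapping for %s/%s: %v", gv, kind, err))
}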

This is a clone of issue OCPBUGS-5542. The following is the description of the original issue:

Description of problem:
The project list orders projects by name and is smart enough to keep a "numerical order" like:

  1. test-1
  2. test-2
  3. test-11

The more prominent project dropdown is not so smart and shows just a simple "ascii ordered" list:

  1. test-1
  2. test-11
  3. test-2
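
The difference comes down to plain lexicographic comparison versus a comparison that treats embedded digit runs numerically. A minimal Go sketch of such a natural-order comparator (illustrative only; the console's actual implementation lives in its frontend code):

package main

import (
    "fmt"
    "sort"
    "strconv"
    "strings"
)

// naturalLess compares two strings so that embedded digit runs are ordered
// numerically, e.g. "test-2" < "test-11". Simplified comparator for illustration.
func naturalLess(a, b string) bool {
    for a != "" && b != "" {
        aDigits, bDigits := leadingDigits(a), leadingDigits(b)
        if aDigits != "" && bDigits != "" {
            an, _ := strconv.Atoi(aDigits)
            bn, _ := strconv.Atoi(bDigits)
            if an != bn {
                return an < bn
            }
            a, b = a[len(aDigits):], b[len(bDigits):]
            continue
        }
        if a[0] != b[0] {
            return a[0] < b[0]
        }
        a, b = a[1:], b[1:]
    }
    return len(a) < len(b)
}

// leadingDigits returns the run of ASCII digits at the start of s, if any.
func leadingDigits(s string) string {
    i := 0
    for i < len(s) && s[i] >= '0' && s[i] <= '9' {
        i++
    }
    return s[:i]
}

func main() {
    projects := []string{"test-1", "test-11", "test-2"}
    sort.Slice(projects, func(i, j int) bool { return naturalLess(projects[i], projects[j]) })
    fmt.Println(strings.Join(projects, ", ")) // test-1, test-2, test-11
}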

Version-Release number of selected component (if applicable):
4.8-4.13 (master)

How reproducible:
Always

Steps to Reproduce:
1. Create some new projects called test-1, test-11, test-2
2. Check the project list page (in admin perspective)
3. Check the project dropdown (in dev perspective)

Actual results:
Order is

  1. test-1
  2. test-11
  3. test-2

Expected results:
Order should be

  1. test-1
  2. test-2
  3. test-11

Additional info:
none

Description of problem:

 

During OCP multinode spoke cluster creation, agent provisioning is stuck on "configuring" because the machineConfig service is crashing on the node.
After a restart, the service still fails with 

Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 

Version-Release number of selected component (if applicable):

Podman 4.0.2 + 

How reproducible:

sometimes

Steps to Reproduce:

1. deploy multinode spoke (ipxe + boot order )
2.
3.

Actual results:

4 agents in done state and 1 is in "configuring"

 

Expected results:

all agents are in "done" state

Additional info:

issue mentioned in https://github.com/containers/podman/issues/14003

 

Fix: https://github.com/containers/storage/issues/1136

 

 

 

This is a clone of issue OCPBUGS-186. The following is the description of the original issue:

Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.

Actual results:
Status text is overlapped by the task status bar

Expected results:
Status text breaks to a newline or gets shortened by "..."

This is a clone of issue OCPBUGS-2479. The following is the description of the original issue:

Description of problem:

Right border radius is 0 for the pipeline visualization wrapper in dark mode but looks fine in light mode

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Switch the theme to dark mode
2. Create a pipeline and navigate to the Pipeline details page

Actual results:

Right border radius is 0, see the screenshots

Expected results:

Right border radius should be same as left border radius.

Additional info:

 

This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:

Description of problem:
Currently we are running the VMWare CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be using a known weak cipher, 3DES. We are attempting to upgrade or modify the operator to customize the available ciphers. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image, and we were trying to steer away from performing a custom install from scratch. Looking for any suggestions for mitigating the weak cipher in the kube-rbac-proxy under the VMware CSI Operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-13052.

This is a clone of issue OCPBUGS-5733. The following is the description of the original issue:

Description of problem:

Description of parameters are not shown in pipelinerun description page

Version-Release number of selected component (if applicable):

Openshift Pipelines 1.9.0
OCP 4.12

How reproducible:

Always

Steps to Reproduce:

1. Create pipeline with parameters and add description to the params
2. Start the pipeline and navigate to created pipelinerun
3. Select the Parameters tab and check the description of the params

Actual results:

Description field of the params is empty

Expected results:

Description of the params should be present

Additional info:

 

This is a clone of issue OCPBUGS-3414. The following is the description of the original issue:

Description of problem:

The current implementation of the new OCI FBC feature omits the creation of the ImageContentSourcePolicy and CatalogSource resources.

 

Description of problem:

The default catalogSources are not being ran in restricted mode.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

1. Create an 4.12 openshift cluster
2. Check the securityContextConfig for the default catalogSources

Actual results:

$ k get catsrc  -n openshift-marketplace -o yaml | grep securityContextConfig
    securityContextConfig: legacy
    securityContextConfig: legacy
    securityContextConfig: legacy
    securityContextConfig: legacy

Expected results:

$ k get catsrc  -n openshift-marketplace -o yaml | grep securityContextConfig
      securityContextConfig: restricted
      securityContextConfig: restricted
      securityContextConfig: restricted
      securityContextConfig: restricted

Additional info:

 

 

 

 

This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:

Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:

  "scheduling": {
    "automaticRestart": false,
    "onHostMaintenance": "MIGRATE",
    "preemptible": false,
    "provisioningModel": "STANDARD"
  },

From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we document:

If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".

But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:

$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api

But that's not where the 4.9 and earlier code is located:

$ git branch -a | grep origin/release
  remotes/origin/release-4.10
  remotes/origin/release-4.11
  remotes/origin/release-4.12
  remotes/origin/release-4.13

Hunting for 4.9 code:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
  gcp-machine-controllers                        https://github.com/openshift/cluster-api-provider-gcp                       c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
  gcp-pd-csi-driver                              https://github.com/openshift/gcp-pd-csi-driver                              48d49f7f9ef96a7a42a789e3304ead53f266f475
  gcp-pd-csi-driver-operator                     https://github.com/openshift/gcp-pd-csi-driver-operator                     d8a891de5ae9cf552d7d012ebe61c2abd395386e

So looking there:

$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9  | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json:        "automaticRestart": {

Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
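
A minimal sketch of the plumbing the provider would need, using the Scheduling type from google.golang.org/api/compute/v1 (the *bool parameter standing in for the machine spec's restart preference is an assumption; the real openshift/api field names differ):

package gcpsketch

import compute "google.golang.org/api/compute/v1"

// buildScheduling floats an optional automatic-restart preference into the
// instance request instead of silently dropping it. Sketch only.
func buildScheduling(preemptible bool, automaticRestart *bool) *compute.Scheduling {
    s := &compute.Scheduling{
        Preemptible:       preemptible,
        OnHostMaintenance: "MIGRATE",
    }
    if automaticRestart != nil {
        // GCP defaults this to true when unset; only send an explicit value
        // when the machine spec actually carries one.
        s.AutomaticRestart = automaticRestart
    }
    return s
}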

Description of problem:

If we use a macvlan with the configuration...
spec:
  config: '{ "cniVersion": "0.3.1", "name": "ran-bh-macvlan-test", "plugins": [ {"type": "macvlan","master": "vlan306", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "2001:1b74:480:603d:0304:0403:000:0000-2001:1b74:480:603d:0304:0403:0000:0004/64","gateway": "2001:1b74:480:603d::1" } } ]}'

there is an error creating the pod:

  Warning  FailedCreatePodSandBox  17s (x3 over 55s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test31_test-ecoloma-01_a593bd0a-83e7-4d31-857e-0c31491e849e_0(5cf36bd99ffa532fd34735e68caecfbc69d820ba6cb04e348c9f9f168498022f): error adding pod test-ecoloma-01_test31 to CNI network "multus-cni-network": [test-ecoloma-01/test31:ran-bh-macvlan-test]: error adding container to network "ran-bh-macvlan-test": Error at storage engine: OverlappingRangeIPReservation.whereabouts.cni.cncf.io "2001-1b74-480-603d-304-403--" is invalid: metadata.name: Invalid value: "2001-1b74-480-603d-304-403--": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
  
  
If we change the start IP address to 2001:1b74:480:603d:0304:0403:000:0001, it works OK.
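
The reservation name in the error is derived from the range's start address, so this looks like a naming issue rather than a range issue. A rough Go sketch of a naive address-to-name conversion (illustrative only, not the whereabouts code) shows why the all-zero tail produces a name ending in "-":

package main

import (
    "fmt"
    "net"
    "strings"
)

// naiveName mimics turning an IP into a Kubernetes object name by lowercasing
// the canonical form and swapping ':' for '-'. Illustrative only; the real
// reservation naming lives in whereabouts.
func naiveName(ip string) string {
    return strings.ToLower(strings.ReplaceAll(net.ParseIP(ip).String(), ":", "-"))
}

func main() {
    // The trailing zero hextets are compressed to "::", so the name ends in "-".
    fmt.Println(naiveName("2001:1b74:480:603d:0304:0403:000:0000")) // 2001-1b74-480-603d-304-403--
    // A non-zero last hextet keeps the name ending alphanumeric.
    fmt.Println(naiveName("2001:1b74:480:603d:0304:0403:000:0001")) // 2001-1b74-480-603d-304-403-0-1
}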

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always reproducible

Steps to Reproduce:

1. See description of problem.

Actual results:

Unable to create pod

Expected results:

IP range should be valid and pod should get created

Additional info:

 

Description of problem:

Lately some of the latency tests have been failing, and it turned out that the failure was because the maximum latency found was greater than anticipated. The tests in this suite do not care about the measured latency, because we do not aim to tune the systems in the first place. Their only goal is to run the latency tools with different values of environment variables and validate that the tools actually run with these parameters.

Increase the maximum latency threshold so that the test doesn't fail in case of a high latency result.

gomega truncating:
When the tests fail, we want a clear, full log of the output of the executed latency tool. The default format.MaxLength is 4000, but it isn't always sufficient to display the full string representation, especially when there is a failure.

Disable this limitation by setting the format.MaxLength to 0, to help with better troubleshooting of the failure.

For more info: https://onsi.github.io/gomega/#adjusting-output
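
A minimal sketch of the gomega change described above, assuming the test suite imports the gomega format package:

package latencysketch

import (
    "github.com/onsi/gomega/format"
)

func init() {
    // gomega truncates long object dumps at format.MaxLength (default 4000);
    // setting it to 0 disables truncation so the full latency-tool output is
    // visible when an assertion fails.
    format.MaxLength = 0
}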

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:
Tests fail when running dev-console tests locally.

Version-Release number of selected component (if applicable):
At least on 4.11 and 4.12

How reproducible:
Always

Steps to Reproduce:
1. Start cypress: yarn run test-cypress-dev-console
2. Run add-page

Actual results:
Fails

Expected results:
Should pass

Additional info:

Description of problem:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.

Description of problem:

On MicroShift, the Route API is served by kube-apiserver as a CRD. Reusing the same defaulting implementation as vanilla OpenShift through a patch to kube-apiserver is expected to resolve OCPBUGS-4189 but have no detectable effect on OCP.

Additional info:

This patch will be inert on OCP, but is implemented in openshift/kubernetes because MicroShift ingests kube-apiserver through its build-time dependency on openshift/kubernetes.

`aws-ebs-csi-driver-operator` runs in the mgmt cluster and does not need to be configured with the guest cluster proxy

hypershift proxy conformance test currently fails because the `storage` CO never becomes `Available`

https://k8s-testgrid.appspot.com/redhat-hypershift#4.12-conformance-aws-ovn-proxy

This ticket is linked with

https://issues.redhat.com/browse/SDA-8177
https://issues.redhat.com/browse/SDA-8178

As a summary, a base domain for a hosted cluster may already contain the "cluster-name".

But it seems that Hypershift also encodes it during some reconciliation step:

https://github.com/openshift/hypershift/blob/main/support/globalconfig/dns.go#L20

Then when using a DNS base domain like:

"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

we will have A records like:

"*.apps.lponce-prod-01.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

The expected behaviour would be that given a DNS base domain:

"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

The resulting wildcard for Ingress would be:

"*.apps.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"

Note that trying to configure a specific IngressSpec for a hosted cluster didn't work for our case, as the wildcard records are not created.
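
A rough Go sketch of the guard being asked for (purely illustrative; hypershift's support/globalconfig/dns.go builds the domain differently, and the Contains check here is only an assumption about the desired behavior):

package dnssketch

import (
    "fmt"
    "strings"
)

// ingressDomain builds the hosted cluster ingress domain, skipping the
// cluster-name component when the base domain already embeds it, so the
// wildcard becomes "*.apps.<baseDomain>" instead of repeating the name.
// Illustrative only.
func ingressDomain(clusterName, baseDomain string) string {
    if strings.Contains("."+baseDomain+".", "."+clusterName+".") {
        return fmt.Sprintf("apps.%s", baseDomain)
    }
    return fmt.Sprintf("apps.%s.%s", clusterName, baseDomain)
}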

Description of the problem:

Noticed there were no thread IDs in the assisted-installer logs when debugging a 240-node cluster deployment with MCE (slack thread), making it difficult to debug.

How reproducible: 100%

 

Steps to reproduce:

1. Create cluster using assisted service and start the install 

2. Look at the assisted-installer logs 

Actual results:

Logs look like

time="2022-07-14T16:17:31Z" level=info msg="Start complete installation step, with params success: true, error info: " 

Expected results: The thread ID would also be printed, so we can understand which thread each message came from


Setting setReportCaller to true will also help
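
A minimal sketch of both suggestions, assuming the installer logs through logrus (the goroutine-ID helper is a common debugging hack, since Go does not expose goroutine IDs through a supported API):

package logsketch

import (
    "bytes"
    "runtime"
    "strconv"

    "github.com/sirupsen/logrus"
)

// goroutineID extracts the current goroutine's ID from the stack header
// ("goroutine 42 [running]: ..."). A debugging aid only.
func goroutineID() uint64 {
    buf := make([]byte, 64)
    buf = buf[:runtime.Stack(buf, false)]
    buf = bytes.TrimPrefix(buf, []byte("goroutine "))
    buf = buf[:bytes.IndexByte(buf, ' ')]
    id, _ := strconv.ParseUint(string(buf), 10, 64)
    return id
}

func init() {
    // Include the calling file/function in every entry, as suggested above.
    logrus.SetReportCaller(true)
}

// logStep tags each message with the goroutine it came from.
func logStep(msg string) {
    logrus.WithField("goroutine", goroutineID()).Info(msg)
}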

Description of problem:

The storageclass "thin-csi" is created by vsphere-CSI-Driver-Operator; after deleting it manually, it should be re-created immediately.

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

Always

Steps to Reproduce:

1. Check storageclass in running cluster, thin-csi is present:
$ oc get sc 
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate              false                  41m
thin-csi         csi.vsphere.vmware.com         Delete          WaitForFirstConsumer   true                   38m
2. Delete thin-csi storageclass:
$ oc delete sc thin-csi
storageclass.storage.k8s.io "thin-csi" deleted
3. Check storageclass again, thin-csi is not present:
$ oc get sc
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate           false                  50m
4. Check vmware-vsphere-csi-driver-operator log:
......
I0909 03:47:42.172866       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1662695014\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1662695014\" (2022-09-09 02:43:34 +0000 UTC to 2023-09-09 02:43:34 +0000 UTC (now=2022-09-09 03:47:42.172853123 +0000 UTC))"I0909 03:49:38.294962       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295468       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295765       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF

5. Only first time creating in vmware-vsphere-csi-driver-operator log:
$ oc -n openshift-cluster-csi-drivers logs vmware-vsphere-csi-driver-operator-7cc6d44b5c-c8czw | grep -i "storageclass"I0909 03:46:31.865926   1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"9e0c3e2d-d403-40a1-bf69-191d7aec202b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'StorageClassCreated' Created StorageClass.storage.k8s.io/thin-csi because it was missing 

Actual results:

The storageclass "thin-csi" could not be re-created after deleting

Expected results:

The storageclass "thin-csi" should be re-created after deleting

Additional info:

 

When installing an OCP cluster with the worker nodes' VM type specified as high performance, some of the configuration settings of those VMs do not match the configuration settings a high-performance VM should have.

Specific configurations that do not match are described in subtasks.

 

Default configuration settings of high performance VMs:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index?extIdCarryOver=true&sc_cid=701f2000001Css5AAC#Configuring_High_Performance_Virtual_Machines_Templates_and_Pools

When installing an OCP cluster with the worker nodes' VM type specified as high performance, manual and automatic migration is enabled on those VMs.
However, high-performance worker VMs are created with the engine's default values, so only manual migration should be enabled.

Default configuration settings of high performance VMs:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index?extIdCarryOver=true&sc_cid=701f2000001Css5AAC#Configuring_High_Performance_Virtual_Machines_Templates_and_Pools

How reproducible: 100%

How to reproduce:

1. Create install-config.yaml with a vmType field and set it to high performance, i.e.:

apiVersion: v1
baseDomain: basedomain.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    ovirt:
      affinityGroupsNames: []
      vmType: high_performance
  replicas: 2
...

2. Run installation

./openshift-install create cluster --dir=resources --log-level=debug

3. Check worker VM's configuration in the RHV webconsole.

Expected:
Only manual migration (under Host) should be enabled.

Actual:
Manual and automatic migration is enabled.

Description of problem:

InstanceMetadataTags are not supported in AWS C2S region(us-iso-x)

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. OCP4.11 IPI Installation on AWS C2S regions
2. 
3. 

Actual results:

 

Expected results:

 

Additional info:

Actual Error: 

"Error launching resource Instance. Unsupported Operation Specifying InstanceMetadataTags is not yet supported"

There is a related fix on upstream:

resource/aws_instance: Handle regions where instance metadata tags are unsupported
https://github.com/hashicorp/terraform-provider-aws/pull/26631

Description of problem:

i18n translation missing in "Remove component node from application" modal

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Navigate to dev console and create a workload under an Application group
2. On the Toplogy remove the workload from the Application group
3. See the i18n error in the console

Actual results:

Missing i18n key "Remove component node from application" in namespace "topology" and language "en." in console

Expected results:

No i18n error should be shown in the console.

Additional info:

 

This is a clone of issue OCPBUGS-11371. The following is the description of the original issue:

Description of problem:

oc-mirror fails to complete a heads-only mirror, complaining about devworkspace-operator

Version-Release number of selected component (if applicable):

# oc-mirror version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.12.0-202302280915.p0.g3d51740.assembly.stream-3d51740", GitCommit:"3d517407dcbc46ededd7323c7e8f6d6a45efc649", GitTreeState:"clean", BuildDate:"2023-03-01T00:20:53Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

Attempt a headsonly mirroring for registry.redhat.io/redhat/redhat-operator-index:v4.10

Steps to Reproduce:

1. Imageset currently:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
storageConfig:
  registry:
    imageURL: myregistry.mydomain:5000/redhat-operators
    skipTLS: false
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
2.$ oc mirror --config=./imageset-config.yml docker://otherregistry.mydomain:5000/redhat-operators

Checking push permissions for otherregistry.mydomain:5000
Found: oc-mirror-workspace/src/publish
Found: oc-mirror-workspace/src/v2
Found: oc-mirror-workspace/src/charts
Found: oc-mirror-workspace/src/release-signatures
WARN[0026] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format. 

The rendered catalog is invalid.

Run "oc-mirror list operators --catalog CATALOG-NAME --package PACKAGE-NAME" for more information.  

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"  

Actual results:

error: error generating diff: channel fast: head "devworkspace-operator.v0.19.1-0.1679521112.p" not reachable from bundle "devworkspace-operator.v0.19.1"

Expected results:

For the catalog to be mirrored.

The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://

Description of problem:

egressip healthcheck through GRPC on dualstack cluster only uses v6 address when it trying to re-connect to egressIP node

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-04-081353

How reproducible:

 

Steps to Reproduce:

1. on dualstack OVN cluster, label one node to be egressip assignable
2. check leader ovnkube-master pod's log for egressip health check messages
3. set iptable to drop tcp port 9107 on the egress node, check leader ovnkube-master pod's log again

$  oc -n openshift-ovn-kubernetes logs ovnkube-master-s8gl4  -c ovnkube-master | grep health
I1004 17:10:13.752545       1 egressip_healthcheck.go:168] Connected to master-01.jechen-1004d.qe.devcluster.openshift.com (10.129.0.2:9107)
I1004 17:10:13.754308       1 egressip_healthcheck.go:168] Connected to master-00.jechen-1004d.qe.devcluster.openshift.com (10.128.0.2:9107)
I1004 17:10:13.757856       1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 17:10:13.760742       1 egressip_healthcheck.go:168] Connected to worker-02.jechen-1004d.qe.devcluster.openshift.com (10.131.0.2:9107)
I1004 17:10:13.763491       1 egressip_healthcheck.go:168] Connected to master-02.jechen-1004d.qe.devcluster.openshift.com (10.130.0.2:9107)
I1004 17:10:13.766653       1 egressip_healthcheck.go:168] Connected to worker-01.jechen-1004d.qe.devcluster.openshift.com (10.128.2.2:9107)
I1004 17:10:18.749573       1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 17:10:18.749624       1 egressip_healthcheck.go:177] Closing connection with worker-01.jechen-1004d.qe.devcluster.openshift.com (10.128.2.2:9107)
I1004 17:10:18.749635       1 egressip_healthcheck.go:177] Closing connection with master-01.jechen-1004d.qe.devcluster.openshift.com (10.129.0.2:9107)
I1004 17:10:18.749645       1 egressip_healthcheck.go:177] Closing connection with master-00.jechen-1004d.qe.devcluster.openshift.com (10.128.0.2:9107)
I1004 17:10:18.749654       1 egressip_healthcheck.go:177] Closing connection with worker-02.jechen-1004d.qe.devcluster.openshift.com (10.131.0.2:9107)
I1004 17:10:18.749663       1 egressip_healthcheck.go:177] Closing connection with master-02.jechen-1004d.qe.devcluster.openshift.com (10.130.0.2:9107)
I1004 18:21:13.753154       1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:21:19.749592       1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:21:24.750727       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:29.750396       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:34.749900       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:39.750830       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:44.750599       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:49.750640       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:54.749998       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:59.750512       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:04.749911       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:09.750500       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:14.750400       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:19.750448       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:24.749497       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:29.750366       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
I1004 18:24:03.020413       1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:24:09.750273       1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:24:14.749580       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:19.750138       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:24.750291       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:29.750526       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:34.750725       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:39.750496       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:44.750182       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:49.750172       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:54.749791       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:59.749548       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:04.750806       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:09.750666       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:14.750602       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:19.750717       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
I1004 18:28:58.561054       1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:29:04.749940       1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:29:09.749710       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:29:14.749689       1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
 

Actual results:

Only the v6 mgmt IP address is used when trying to reconnect

Expected results:

Both the v4 and v6 addresses should be used when trying to reconnect
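
A minimal sketch of what "try both families" could look like for the health-check client (illustrative only; the real client is a gRPC dialer in ovn-kubernetes' egressip_healthcheck.go):

package healthchecksketch

import (
    "context"
    "fmt"
    "net"
    "time"
)

// dialAny tries every management IP of a node (v4 and v6 on dual-stack)
// instead of pinning the health-check connection to a single address family.
func dialAny(ctx context.Context, mgmtIPs []string, port int) (net.Conn, error) {
    var lastErr error
    for _, ip := range mgmtIPs {
        d := net.Dialer{Timeout: 5 * time.Second}
        conn, err := d.DialContext(ctx, "tcp", net.JoinHostPort(ip, fmt.Sprint(port)))
        if err == nil {
            return conn, nil
        }
        lastErr = err
    }
    return nil, fmt.Errorf("could not connect on any address %v: %w", mgmtIPs, lastErr)
}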

Additional info:

 

 

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

When services are deleted, the services controller should also remove the service from its top-level cache to avoid growing forever.

While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only.

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387

This is the location where the removed service is not deleted from alreadyApplied:
https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269

It should do the similar changes depicted here (currently merged upstream):
https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324
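
The upstream change linked above amounts to dropping the service's key from the top-level cache on deletion. A minimal sketch of that pattern (field and key names are illustrative, not the exact ovn-kubernetes ones):

package servicessketch

import "sync"

// Controller is a pared-down stand-in for the services controller; only the
// fields needed to show the cache cleanup are included.
type Controller struct {
    alreadyAppliedLock sync.Mutex
    alreadyApplied     map[string]struct{}
}

// onServiceDelete removes the deleted Service's entry from the top-level
// cache so the map cannot grow without bound.
func (c *Controller) onServiceDelete(namespace, name string) {
    key := namespace + "/" + name
    c.alreadyAppliedLock.Lock()
    delete(c.alreadyApplied, key)
    c.alreadyAppliedLock.Unlock()
}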

 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. create service -- use unique name
2. remove service
3. notice how alreadyApplied grows and never gets smaller
4. repeat

Actual results:

^^

Expected results:

alreadyApplied should not grow forever

Additional info:

 

Description of problem:

The ClusterOperator status gets updated when the conditions are re-ordered. There doesn't seem to be any change to the conditions except the reordering.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

kubectl get clusteroperator monitoring -oyaml --watch

Actual results:

status:   
  conditions: 
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "True"
    type: Upgradeable

Expected results:

I would have expected no update, since nothing changed.

status:   
  conditions:   
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
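
One way to make the status write insensitive to ordering is to sort conditions by type before deciding whether anything changed. A minimal sketch, assuming the operator compares the full condition slice (not the monitoring operator's actual status code):

package statussketch

import (
    "reflect"
    "sort"

    configv1 "github.com/openshift/api/config/v1"
)

// needsUpdate reports whether the ClusterOperator conditions actually changed,
// ignoring pure re-ordering.
func needsUpdate(current, desired []configv1.ClusterOperatorStatusCondition) bool {
    return !reflect.DeepEqual(sorted(current), sorted(desired))
}

// sorted returns a copy of the conditions ordered by type, so the comparison
// does not mutate the caller's slices.
func sorted(conds []configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
    out := append([]configv1.ClusterOperatorStatusCondition(nil), conds...)
    sort.Slice(out, func(i, j int) bool { return out[i].Type < out[j].Type })
    return out
}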
 

Additional info:

 

This is a clone of issue OCPBUGS-11536. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11434. The following is the description of the original issue:

Description of problem:

node-exporter profiling shows that ~16% of CPU time is spent fetching details about btrfs mounts. The RHEL kernel doesn't have btrfs, so it's safe to disable this collector.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

I'd disabled Telemetry for the bulk of the CI fleet in OTA-740. But that led to many failures for:

[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

We should extend the checks for Telemetry enablement to include telemeterClient.enabled in the monitoring-specific ConfigMap, as well as the previously-checked pull-secret token.

This is a clone of issue OCPBUGS-13170. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12971. The following is the description of the original issue:

Description of problem:

hybrid-overlay on Windows nodes ignores existing non-persistent routes when creating the new vNIC. This becomes an issue especially on AWS where non-persistent routes are used for the local metadata server which is in turn used by the kubelet.
How reproducible:

Always on Windows Server 2022 instances on AWS

Steps to Reproduce:

1. Bring up an hybrid-ovn OCP cluster on AWS
2. Add a Windows Server 2022 Machine or BYOH instance

Actual results:

The Windows Server 2022 node goes to NotReady

Expected results:

The Windows Server 2022 node goes to Ready

Additional info:

While this problem shows up only on AWS, it can be an issue on other providers if customers are depending on non-persistent routes on Windows nodes in general.

This is a clone of issue OCPBUGS-3096. The following is the description of the original issue:

While the installer binary is statically linked, the terraform binaries shipped with it are dynamically linked.

This could give issues when running the installer on Linux and depending on the GLIBC version the specific Linux distribution has installed. It becomes a risk when switching the base image of the builders from ubi8 to ubi9 and trying to run the installer in cs8 or rhel8.

For example, building the installer on cs9 and trying to run it in a cs8 distribution leads to:

time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/root/test/terraform/plugins"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)"
time="2022-10-31T14:31:47+01:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failure applying terraform for \"cluster\" stage: failed to create cluster: failed doing terraform init: exit status 1\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)\n"

How reproducible:

Always

Steps to Reproduce:

1. Build the installer on cs9
2. Run the installer on cs8 until the terraform binaries are started
3. Look at the terraform binary with ldd or file; you can see it is not a statically linked binary, and the error above might occur depending on the glibc version you are running on

Actual results:

 

Expected results:

The terraform and provider binaries have to be statically linked, just as the installer is.

Additional info:

This comes from a build of OKD/SCOS that is happening outside of Prow on a cs9-based builder image.

One can use the Dockerfile at images/installer/Dockerfile.ci and replace the builder image with one like https://github.com/okd-project/images/blob/main/okd-builder.Dockerfile

This is a clone of issue OCPBUGS-3499. The following is the description of the original issue:

Description of problem:

On clusters serving the Route API via a CRD (i.e. MicroShift), Route validation does not perform the same checks as on OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF

route.route.openshift.io/hello-microshift serverside-applied

$ oc get route hello-microshift -o yaml

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: ""
EOF

Actual results:

route.route.openshift.io/hello-microshift serverside-applied

Expected results:

The Route "hello-microshift" is invalid: spec.wildcardPolicy: Invalid value: "": field is immutable 

Additional info:

This change will be inert on OCP, which already has the correct behavior.
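
A pared-down Go illustration of the kind of update validation involved (not the actual openshift/kubernetes patch): reject changes to an immutable field such as spec.wildcardPolicy when the Route is served as a CRD.

package routesketch

import (
    routev1 "github.com/openshift/api/route/v1"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateRouteUpdate rejects updates that change the immutable
// spec.wildcardPolicy field, mirroring the error OCP's built-in Route API
// returns. Illustrative sketch only.
func validateRouteUpdate(oldRoute, newRoute *routev1.Route) field.ErrorList {
    var errs field.ErrorList
    if newRoute.Spec.WildcardPolicy != oldRoute.Spec.WildcardPolicy {
        errs = append(errs, field.Invalid(
            field.NewPath("spec", "wildcardPolicy"),
            newRoute.Spec.WildcardPolicy,
            "field is immutable"))
    }
    return errs
}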

 

Description of problem:

When the Insights operator is marked as disabled, the "Available" operator condition is updated every 2 minutes. This is not desired and gives the impression that the operator is restarted every 2 minutes.
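
A minimal sketch of the usual way to avoid bumping lastTransitionTime when nothing changed, using apimachinery's condition helper (illustrative; the insights-operator's status controller works with ClusterOperator conditions rather than metav1.Condition):

package insightssketch

import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setAvailable only changes lastTransitionTime when the condition's status
// actually flips; meta.SetStatusCondition preserves the old transition time
// if Status is unchanged.
func setAvailable(conditions *[]metav1.Condition, available bool, reason, message string) {
    status := metav1.ConditionFalse
    if available {
        status = metav1.ConditionTrue
    }
    meta.SetStatusCondition(conditions, metav1.Condition{
        Type:    "Available",
        Status:  status,
        Reason:  reason,
        Message: message,
    })
}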

Version-Release number of selected component (if applicable):

 

How reproducible:

No extra steps needed, just watch "oc get co insights --watch"

Steps to Reproduce:

1.
2.
3.

Actual results:

available condition transition time updated every 2 min

Expected results:

available condition is updated only when its status changed

Additional info:

 

As an OpenShift user, I want ClusterCSIDriver.Spec.LogLevel to affect the vSphere CSI driver logs, so I can capture the logs with all details and send them to Red Hat for investigation.

As an OpenShift developer, I want ClusterCSIDriver.Spec.LogLevel to affect the vSphere CSI driver logs, so I can debug the driver with all logs.

Exit criteria:

  • When ClusterCSIDriver.Spec.LogLevel is set to Debug or higher, vSphere CSI driver logs include DEBUG messages like:

2022-08-05T11:54:10.808Z DEBUG commonco/utils.go:102 Container Orchestrator init params: {InternalFeatureStatesConfigInfo:{...} ServiceMode:controller}

Description of problem:

seeing test failure due to panic in cvo here:

Undiagnosed panic detected in pod expand_less
              0s

                {  pods/openshift-cluster-version_cluster-version-operator-96cf55b5-rffgt_cluster-version-operator_previous.log.gz:E0915 18:38:42.763315       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
pods/openshift-cluster-version_cluster-version-operator-96cf55b5-rffgt_cluster-version-operator_previous.log.gz:E0915 18:38:42.763418       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}

full error from logs:

/E0915 18:38:42.763315       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 187 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1934980?, 0x2bc6240})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4d2604?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1934980, 0x2bc6240})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).calculateNext(0xc0015c6000, 0xc001df2000)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:716 +0x14d
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:575 +0x2a9
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10000000000?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001df2000?, {0x1e44e60, 0xc002739f50}, 0x1, 0xc00058e0c0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x60?, 0x0?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0015c6000?, {0x1e5eb30?, 0xc0000cacc0?}, 0x10?, {0x0?, 0x0?}, {0x0?, 0x0?})
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:556 +0x145
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:387 +0x83
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:385 +0x4af
E0915 18:38:42.763418       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) 

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Currently unsure; hit this in a test run, but it shouldn't ever panic.

Steps to Reproduce:

1.
2.
3.

Actual results:

panic in cvo pod

Expected results:

no panic in cvo pod

Additional info:

 

This is a clone of issue OCPBUGS-1761. The following is the description of the original issue:

Description of problem:

When we configure a MC using an osImage that cannot be pulled, the machine config daemon pod spams logs saying that the node is set to the "Degraded" state, but the node is not actually set to "Degraded".

Only after a long time, like 20 minutes or half an hour, does the node eventually become degraded.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-26-111919

How reproducible:

Always

Steps to Reproduce:

1. Create a MC using an osImage that cannot be pulled

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2022-09-27T12:48:13Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: not-pullable-image-tc54054-w75j1k67
  resourceVersion: "374500"
  uid: 7f828fbc-8da3-4f16-89e2-34e39ff830b3
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files: []
    systemd:
      units: []
  osImageURL: quay.io/openshifttest/tc54054fakeimage:latest


2. Check the logs in the machine config daemon pod, you can see this message being spammed, saying that the daemon is marking the node with "Degraded" status.

E0927 14:31:22.858546    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:34:10.698564    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:36:58.557340    1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found


Actual results:

The node is not marked as degraded as it should be. Only after a long time, 20 minutes or so, does the node become degraded.

Expected results:

When the podman pull command fails and the machine config daemon sets the node state as "Degraded", the node should actually be marked as "Degraded".

Additional info:

 

 

Description of problem:

Image registry pods panic while deploying OCP in me-central-1 AWS region

Version-Release number of selected component (if applicable):

4.11.2

How reproducible:

Deploy OCP in AWS me-central-1 region

Steps to Reproduce:

Deploy OCP in AWS me-central-1 region 

Actual results:

panic: Invalid region provided: me-central-1

Expected results:

Image registry pods should come up with no errors

Additional info:

 

This is a clone of issue OCPBUGS-18764. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6513. The following is the description of the original issue:

Description of problem:

Using the web console on the RH Developer Sandbox, created the most basic Knative Service (KSVC) using the suggested default image, i.e. openshift/hello-openshift.

Then tried to change the displayed icon using the web UI and an error about Probes was displayed. See attached images.

The error has no relevance to the item changed.

Version-Release number of selected component (if applicable):

whatever the RH sandbox uses, this value is not displayed to users

How reproducible:

very

Steps to Reproduce:

Using the web console on the RH Developer Sandbox, created the most basic Knative Service (KSVC) using the default image openshift/hello-openshift.

Then used the web UI to edit the KSVC sample to change the icon from an OpenShift logo to, for instance, a 3scale logo.

When saving from this form an error was reported: admission webhook 'validation webhook.serving.knative.dev' denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].readiness.Probe




Actual results:

 

Expected results:

Either a failure message related to changing the icon, or the icon change to take effect

Additional info:

KSVC details as provided by the web console.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample
  namespace: agroom-dev
spec:
  template:
    spec:
      containers:
        - image: openshift/hello-openshift

Sprig is a dependency of cno which is in turn a dependency of multiple projects while the old sprig has a vulnerability.

This is a clone of issue OCPBUGS-1125. The following is the description of the original issue:

(originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=1983200)

test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1413625606435770368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1415075413717159936

some brief triaging from Thomas Jungblut on:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.11/1568747321334697984

it seems the last guard pod doesn't come up, etcd operator installs this properly and the revision installer also does not spout any errors. It just doesn't progress to the latest revision. At first glance doesn't look like an issue with etcd itself, but needs to be taken a closer look at for sure.

This is a clone of issue OCPBUGS-10298. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2153. The following is the description of the original issue:

When ProjectID is not set, TenantID might be ignored in MAPO.

Context: When setting additional networks in Machine templates, networks can be identified by the means of a filter. The network filter has both TenantID and ProjectID as fields. TenantID was ignored.

Steps to reproduce:
Create a Machine or a MachineSet with a template containing a Network filter that sets a TenantID.

```
networks:
  - filter:
      id: 'the-network-id'
      tenantId: '123-123-123'
```

One cheap way of testing this could be to pass a valid network ID and set a bogus tenantID. If the machine gets associated with the network, then tenantID has been ignored and the bug is present. If instead MAPO errors, then in means that it has taken tenantID into consideration.

This is a clone of issue OCPBUGS-15606. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15497. The following is the description of the original issue:

I am using a BuildConfig with git source and the Docker strategy. The git repo contains a large zip file via LFS and that zip file is not getting downloaded. Instead just the ascii metadata is getting downloaded. I've created a simple reproducer (https://github.com/selrahal/buildconfig-git-lfs) on my personal github. If you clone the repo

git clone git@github.com:selrahal/buildconfig-git-lfs.git

and apply the bc.yaml file with

oc apply -f bc.yaml

Then start the build with

oc start-build test-git-lfs

You will see the build fails at the unzip step in the docker file

STEP 3/7: RUN unzip migrationtoolkit-mta-cli-5.3.0-offline.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.

I've attached the full build logs to this issue.

Description of problem:

On Pod definitions gathering, Operator should obfuscate particular environment variables (HTTP_PROXY and HTTPS_PROXY) from containers by default.

Pods from the control plane can have those variables injected from the cluster-wide proxy, and they may contain values such as "user:password@http://6.6.6.6:1234".
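
A minimal sketch of the intended behavior, assuming the operator walks each container's env list before writing the pod definition into the archive; the function name is illustrative and the `<obfuscated>` placeholder matches the expected results shown below:

~~~
import corev1 "k8s.io/api/core/v1"

// obfuscateProxyEnv replaces the values of proxy-related environment
// variables in place so that credentials embedded in them never reach
// the gathered archive.
func obfuscateProxyEnv(env []corev1.EnvVar) {
	for i := range env {
		switch env[i].Name {
		case "HTTP_PROXY", "HTTPS_PROXY":
			if env[i].Value != "" {
				env[i].Value = "<obfuscated>"
			}
		}
	}
}
~~~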

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. In order to change deployments, scale down:
  * cluster-version-operator
  * cluster-monitoring-operator
  * prometheus-operator
2. Introduce a new environment variable on the alertmanager-main StatefulSet with either or both of HTTP_PROXY and HTTPS_PROXY. Any non-empty value will do.
3. Run the insights-operator to gather the pod definitions.
4. Check in the archive (usually config/pod/openshift-monitoring/alertmanager-main-0.json) that the target environment variable value(s) are obfuscated.

Actual results:

...
"spec": {
    ...
    "containers": {
        ...
        "env": [
            {
                "name": "HTTP_PROXY",
                "value": "jdow:1qa2wd@http://8.8.8.8:8080"
            }
        ]
    }
}
...

Expected results:

...
"spec": {
    ...
    "containers": {
        ...
        "env": [
            {
                "name": "HTTP_PROXY",
                "value": "<obfuscated>"
            }
        ]
    }
}
...

Additional info:

 

Description of problem:

According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always, if the Service Usage API is disabled in the GCP project.

Steps to Reproduce:

1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project. 

Actual results:

The installation would fail finally, without any worker machines launched.

Expected results:

Installation should succeed, or the OCP doc should be updated.

Additional info:

Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. 
FYI if enabling the API, and without changing anything else, the installation could succeed. 

Description of problem:

When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
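
A minimal sketch of the behavior this report asks for, assuming the allocator walks candidate IPs in order: instead of testing every address inside an excluded CIDR, jump straight past it. The names are illustrative, not the actual Whereabouts code:

~~~
import "net"

// lastInExclude returns the highest address of an excluded CIDR
// (assumes ex.IP and ex.Mask have equal length, as net.ParseCIDR produces).
func lastInExclude(ex *net.IPNet) net.IP {
	last := make(net.IP, len(ex.IP))
	for i := range ex.IP {
		last[i] = ex.IP[i] | ^ex.Mask[i]
	}
	return last
}

// skipExcluded advances the candidate past any exclude range that contains it,
// so large excludes are skipped in one step rather than iterated address by address.
func skipExcluded(candidate net.IP, excludes []*net.IPNet) net.IP {
	for _, ex := range excludes {
		if ex.Contains(candidate) {
			next := lastInExclude(ex)
			for i := len(next) - 1; i >= 0; i-- { // increment by one
				next[i]++
				if next[i] != 0 {
					break
				}
			}
			return next
		}
	}
	return candidate
}
~~~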

Version-Release number of selected component (if applicable):

 

How reproducible:

with big exclude ranges, 100%

Steps to Reproduce:

1. create network-attachment-definition with a large range:

$ cat <<EOF| oc apply -f -       
apiVersion: k8s.cni.cncf.io/v1                                            
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
         "type": "whereabouts",
         "range": "fd43:01f1:3daa:0baa::/64",
         "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
         "log_file": "/tmp/whereabouts.log",
         "log_level" : "debug"
      }
    }
EOF
2. create a pod with the network attached:

$ cat <<EOF|oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF

3. check pod status, event log and whereabouts logs after a while: 

$ oc get pods
NAME                        READY   STATUS              RESTARTS   AGE
pod-with-exclude-range      0/1     ContainerCreating   0          2m23s

$ oc get events
<...>
6m39s       Normal    Scheduled                                    pod/pod-with-exclude-range                   Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s       Normal    AddedInterface                               pod/pod-with-exclude-range                   Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s       Warning   FailedCreatePodSandBox                       pod/pod-with-exclude-range                   Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

$ oc debug node/<worker-node> -- tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff

Actual results:

Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Expected results:

additional network gets attached to the pod

Additional info:

 

Just like kube-proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
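
A minimal sketch of such a per-node health endpoint, assuming a plain HTTP listener; the `/healthz` path and port 10256 mirror what kube-proxy exposes, while the response body here is illustrative:

~~~
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	// Cloud load balancer health checks probe this endpoint to decide
	// whether the node can receive traffic for Cluster-policy services.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintf(w, "lastUpdated: %s", time.Now().Format(time.RFC3339))
	})
	log.Fatal(http.ListenAndServe("0.0.0.0:10256", mux))
}
~~~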

Description of problem:

Bootstrap fails in SNO installation

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Test this in a libvirt env. The agent-config and install-config are attached.
2. Use attached agent-config and install-config to create image
3. Install SNO:
virt-install --connect qemu:///system -n control-0 -r 33000 --vcpus 8 --cdrom ./agent.iso --disk pool=installer,size=120 --boot uefi,hd,cdrom --os-variant=rhel8.5 --network network=default,mac=52:54:00:aa:aa:aa --wait=-1 --check mac_in_use=off
4. There is following error in bootkube.service log:
-- Logs begin at Fri 2022-09-30 08:58:21 UTC, end at Fri 2022-09-30 09:19:40 UTC. --
Sep 30 09:00:51 test.metalkube.org systemd[1]: Starting Bootkube - bootstrap in place post reboot...
Sep 30 09:00:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Running bootkube bootstrap-in-place post reboot
Sep 30 09:00:52 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:00:57 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:02 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:07 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:12 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: error: error executing jsonpath "{.items[0].status.conditions[?(@.type==\"Ready\")].status}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]:         template was:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]:                 {.items[0].status.conditions[?(@.type=="Ready")].status}
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]:         object given to jsonpath engine was:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]:                 map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: error: error executing jsonpath "{.items[0].status.conditions[?(@.type==\"Ready\")].status}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]:         template was:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]:                 {.items[0].status.conditions[?(@.type=="Ready")].status}
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]:         object given to jsonpath engine was:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]:                 map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:02:21 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:02:52 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

GCP XPN is in tech preview. There are two features which are affected:
1. selecting a DNS zone from a different project should only be allowed if tech preview is enabled in the install config. (Using a DNS zone from a different project will fail to install due to outstanding work in the cluster ingress operator). 
2. GCP XPN passes through the installer host service account for control plane nodes. This should only happen if XPN (networkProjectID) is enabled. It should not happen during normal installs.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

For install config fields:
1. Specify a project ID for a DNS zone without featureSet: TechPreviewNoUpgrade
2. Run openshift-install create manifests
====
For service accounts:
1. perform normal (not XPN) install
2. Check service account on control plane VM

 

Actual results:

For install config fields: you can specify a project ID without an error
For service accounts: the control plane VM will have the same service account that was used for install

Expected results:

For install config fields: the installer should complain that tech preview is not enabled
For service accounts: the control plane VM should have a new service account, created during install

Additional info:

 

Description of problem:

hypershift pull secret update failed on 4.12

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Currently, when installing OpenShift on OpenStack, the cluster name length is limited to 14 characters.
The customer wants to know if it is possible to override this validation when installing OpenShift on OpenStack and create a cluster name longer than 14 characters.

Version: OCP 4.8.5 UPI Disconnected
Environment: OpenStack 16

Issue:
User reports that they are getting an error for an OCP cluster on OpenStack UPI where the name of the cluster is > 14 characters.

Error events :
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

 

Actual results:

Users are getting the error "cluster name is too long" when the cluster name contains more than 14 characters for OCP on OpenStack

Expected results:

The 14-character limit should be changed for the OCP cluster name on OpenStack

Additional info:

 

Description of problem:

When adding new nodes to the existing cluster, the newly allocated node subnet can overlap with an existing node's subnet.

Version-Release number of selected component (if applicable):

openshift 4.10.30

How reproducible:

It's quite hard to reproduce, but there is a possibility it can happen at any time.

Steps to Reproduce:

1. Create a OVN dual-stack cluster
2. add nodes to the existing cluster
3. check the allocated node subnet 

Actual results:

Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.

Expected results:

There should be no duplicated node-subnet or ovn-k8s-mp0 IP.

Additional info:

Additional info can be found in case 03329155 and the attached must-gather (comment #1).

% omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857'
2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020       1 master.go:1422] Did not find any logical switches with other-config
2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local
2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105       1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local
2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local
2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local
2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156       1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local
2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157       1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local
2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local
2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258       1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local
2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local
2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314       1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local
2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329       1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361       1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

1. The debugging endpoint /debug/pprof is exposed over the unauthenticated port 10251.
2. This debugging endpoint can potentially leak sensitive information.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-15848. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15773. The following is the description of the original issue:

Description of problem:

The Upgrade Helm Release tab in the OpenShift GUI Developer console is not refreshing with updated values.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Add below Helm chart repository from CLI

~~~
apiVersion: helm.openshift.io/v1beta1
kind: HelmChartRepository
metadata:
  name: prometheus-community
spec:
  connectionConfig:
    url: 'https://prometheus-community.github.io/helm-charts'
  name: prometheus-community
~~~
2. Goto GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From dropdown of chart version select 22.3.0 --> Install

3. You will see the image tag as v0.63.0
~~~
    image:
      digest: ''
      pullPolicy: IfNotPresent
      repository: quay.io/prometheus-operator/prometheus-config-reloader
      tag: v0.63.0
~~~ 
4. Once that is installed Goto Helm --> Helm Releases --> Prometheus --> Upgrade --> From dropdown of chart version select 22.4.0 --> the page does not refresh with new value of the tag.

~~~
    image:
      digest: ''
      pullPolicy: IfNotPresent
      repository: quay.io/prometheus-operator/prometheus-config-reloader
      tag: v0.63.0
~~~

NOTE: With the same steps before installing the Helm chart, when we select different versions the values are updated correctly.
Go to the GUI and select Developer console --> +Add --> Developer Catalog --> Helm Chart --> Select Prometheus Helm chart --> Install Helm chart --> From the dropdown of chart versions select 22.3.0 --> Now select a different chart version like 22.7.0 or 22.4.0

Actual results:

The YAML view of the Upgrade Helm Release tab shows the values of the older chart version.

Expected results:

The YAML view of the Upgrade Helm Release tab should contain the latest values as per the selected chart version.

Additional info:

 

This is a clone of issue OCPBUGS-16158. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15060. The following is the description of the original issue:

Description of problem:

When clicking on "Duplicate RoleBinding" in the OpenShift Container Platform Web Console, users are taken to a form where they can review the duplicated RoleBinding.

When the RoleBinding has a ServiceAccount as a subject, clicking "Create" leads to the following error:

An error occurred
Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup".

The root cause seems to be that the field "subjects[0].apiGroup" is set to "rbac.authorization.k8s.io" even for "kind: ServiceAccount" subjects. For "kind: ServiceAccount" subjects, this field is not necessary but the "namespace" field should be set instead.

The functionality works as expected for User and Group subjects.
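
For reference, a sketch of how the two subject shapes differ using the upstream RBAC types (names are placeholders): a ServiceAccount subject carries a namespace and no apiGroup, while User/Group subjects carry the RBAC apiGroup:

~~~
import rbacv1 "k8s.io/api/rbac/v1"

// ServiceAccount subject: no APIGroup, Namespace is required.
var saSubject = rbacv1.Subject{
	Kind:      "ServiceAccount",
	Name:      "my-serviceaccount",
	Namespace: "my-namespace",
}

// User subject: APIGroup set to rbac.authorization.k8s.io, no Namespace.
var userSubject = rbacv1.Subject{
	APIGroup: "rbac.authorization.k8s.io",
	Kind:     "User",
	Name:     "alice",
}
~~~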

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.19

How reproducible:

Always

Steps to Reproduce:

1. In the OpenShift Container Platform Web Console, click on "User Management" => "Role Bindings"
2. Search for a RoleBinding that has a "ServiceAccount" as the subject. On the far right, click on the dots and choose "Duplicate RoleBinding"
3. Review the fields, set a new name for the duplicated RoleBinding, click "Create"

Actual results:

Duplicating fails with the following error message being shown:

An error occurred
Error "Unsupported value: "rbac.authorization.k8s.io": supported values: """ for field "subjects[0].apiGroup".

Expected results:

RoleBinding is duplicated without an error message

Additional info:

Reproduced with OpenShift Container Platform 4.12.18 and 4.12.19

Description of problem:

A duplicate "Getting started" notification is shown on the Search page

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-26-111919

How reproducible:

Always

Steps to Reproduce:

1. Login to OCP as a normal user, change to the Developer perspective, and create a new project
2. Delete the project (switch to the Administrator perspective, go to the Home -> Projects page)
3. Switch to the Developer perspective, go to the Search page, and check the "Getting Started" notification

Actual results:

Two notifications are shown on the page

Expected results:

Only one should exist

Additional info:

 

This is a clone of issue OCPBUGS-10213. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:

Description of problem:

RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861) but aws-sdk-go needs to be bumped to recognize those regions

Version-Release number of selected component (if applicable):

master/4.14

How reproducible:

always

Steps to Reproduce:

1. openshift-install create install-config
2. Try to select ap-south-2 as a region
3.

Actual results:

New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.

Expected results:

Installer supports and displays the new regions in the Survey

Additional info:

See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23

 

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-7529.

This is a clone of issue OCPBUGS-9985. The following is the description of the original issue:

Description of problem:

DNS Local endpoint preference is not working for TCP DNS requests for Openshift SDN.

Reference code: https://github.com/openshift/sdn/blob/b58a257b896d774e0a092612be250fb9414af5ca/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L999-L1012

This is where the DNS request is short-circuited to the local DNS endpoint if it exists. This is important because DNS local preference protects against another outstanding bug, in which daemonset pods go stale for a few seconds upon node shutdown (see https://issues.redhat.com/browse/OCPNODE-549 for the graceful node shutdown fix). This appears to be contributing to DNS issues in our internal CI clusters. https://lookerstudio.google.com/reporting/3a9d4e62-620a-47b9-a724-a5ebefc06658/page/MQwFD?s=kPTlddLa2AQ shows large amounts of "dns_tcp_lookup" failures, which I attribute to this bug.

UDP DNS local preference is working fine in Openshift SDN. Both UDP and TCP local preference work fine in OVN. It's just TCP DNS Local preference that is not working Openshift SDN.
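
A minimal sketch of the behavior being described, assuming the proxy knows which DNS endpoints are local to the node; the preference should short-circuit to a local endpoint for TCP just as it does for UDP. Types and names here are illustrative, not the SDN proxier code:

~~~
// Endpoint is an illustrative stand-in for a proxied DNS endpoint.
type Endpoint struct {
	IP      string
	IsLocal bool
}

// pickDNSEndpoints prefers a node-local DNS endpoint regardless of protocol,
// falling back to the full endpoint list only when no local pod exists.
func pickDNSEndpoints(protocol string, endpoints []Endpoint) []Endpoint {
	_ = protocol // the preference should apply to both "udp" and "tcp"
	for _, ep := range endpoints {
		if ep.IsLocal {
			return []Endpoint{ep}
		}
	}
	return endpoints
}
~~~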

Version-Release number of selected component (if applicable):

4.13, 4.12, 4.11

How reproducible:

100%

Steps to Reproduce:

1. oc debug -n openshift-dns
2. dig +short +tcp +vc +noall +answer CH TXT hostname.bind
# Retry multiple times, and you should always get the same local DNS pod.

Actual results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-dnbsp"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-gzlhm"

Expected results:

[gspence@gspence origin]$ oc debug -n openshift-dns
Starting pod/image-debug ...
Pod IP: 10.128.2.10
If you don't see a command prompt, try pressing enter.
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8"
sh-4.4# dig +short +tcp +vc +noall +answer CH TXT hostname.bind
"dns-default-glgr8" 

Additional info:

https://issues.redhat.com/browse/OCPBUGS-488 is the previous bug I opened for UDP DNS local preference not working.

iptables-save from a 4.13 vanilla cluster bot AWS,SDN: https://drive.google.com/file/d/1jY8_f64nDWi5SYT45lFMthE0vhioYIfe/view?usp=sharing 

This is a clone of issue OCPBUGS-2281. The following is the description of the original issue:

Description of problem:

E2E test cases for the knative and pipeline packages have been disabled in CI due to respective operator installation issues.
Tests have to be re-enabled after a new operator version is available or the issue is resolved.

References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-6661. The following is the description of the original issue: 

Description of problem:

The CRL list is capped at 1MB due to the configmap max size. If multiple public CRLs are needed for the ingress controller, the CRL PEM file will be over 1MB.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create CRL configmap with the following distribution points: 

         Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
         Subject: SOME SIGNED CERT
            X509v3 CRL Distribution Points:
                Full Name:
                  URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl

# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
604K    DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem


I still need to find more intermediate CRLs to grow this.

Actual results:

2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}

Expected results:

First, be able to create a configmap where only the data counts toward the 1MB max (see additional info below for more details); second, provide some way to compress or otherwise allow a CRL list larger than 1MB.

Additional info:

Using only this CRL, which is only 600K, still causes the issue, and it could be due to the `last-applied-configuration` annotation on the configmap. This is added since we do an apply operation (update) on the configmap. I am not sure if this counts towards the 1MB max.

https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 

Not sure if we could just replace the configmap.   

 

This is a clone of issue OCPBUGS-14635. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13140. The following is the description of the original issue:

Description of problem:

According to the Red Hat documentation https://docs.openshift.com/container-platform/4.12/networking/ovn_kubernetes_network_provider/configuring-egress-ips-ovn.html, the maximum number of IP aliases per node is 10 - "Per node, the maximum number of IP aliases, both IPv4 and IPv6, is 10.".

Looking at the code base, the number of allowed IPs is calculated as:
Capacity = defaultGCPPrivateIPCapacity (which is set to 10) + cloudPrivateIPsCount (that is, the number of available IPs from the range) - currentIPv4Usage (the number of assigned IPv4 addresses) - currentIPv6Usage (the number of assigned IPv6 addresses)
https://github.com/openshift/cloud-network-config-controller/blob/master/pkg/cloudprovider/gcp.go#L18-L22
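
Written out, that calculation amounts to the following (a sketch of the formula as described above, not the controller's actual function):

~~~
// capacity is the number of additional egress IPs the node can still take:
// a fixed per-node default of 10, plus the unassigned IPs from the range,
// minus the IPv4 and IPv6 addresses already assigned.
func capacity(cloudPrivateIPsCount, currentIPv4Usage, currentIPv6Usage int) int {
	const defaultGCPPrivateIPCapacity = 10
	return defaultGCPPrivateIPCapacity + cloudPrivateIPsCount - currentIPv4Usage - currentIPv6Usage
}
~~~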

Speaking to GCP, they support up to 100 alias IP ranges (not IPs) per vNIC.

Can Red Hat confirm
1) If there is a limitation of 10 from OCP and why?
2) If there isn't a limit, what is the maximum number of egress IPs that could be supported per node?

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Case:  03487893
It is one of the most highlighted bug from our customer.

 

Description of problem:

cloud-network-config-controller pod crashloops in proxy deployments as it tries to reach Openstack keystone API directly (not through the proxy) and there is no connectivity.

NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS          AGE
openshift-cloud-network-config-controller          cloud-network-config-controller-c4867b748-vlq9h              0/1     CrashLoopBackOff   158 (2m10s ago)   13h

$ oc -n openshift-cloud-network-config-controller logs -p cloud-network-config-controller-c4867b748-vlq9h
W0927 05:48:18.678947       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0927 05:48:18.680269       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0927 05:48:26.754377       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0927 05:48:26.755413       1 openstack.go:121] Custom CA bundle found at location '/kube-cloud-config/ca-bundle.pem' - reading certificate information
F0927 05:48:28.233519       1 main.go:101] Error building cloud provider client, err: Get "https://10.46.44.10:13000/": dial tcp 10.46.44.10:13000: connect: no route to host
goroutine 51 [running]:
k8s.io/klog/v2.stacks(0x1)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x37696c0, 0x3, 0x0, 0xc000636000, 0x1, {0x2cbcbd8?, 0x1?}, 0xc000438400?, 0x0)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x37696c0, 0x237798a?, 0x0, {0x0, 0x0}, 0x7fff81041af7?, {0x23a20d0, 0x2d}, {0xc00052c050, 0x1, ...})
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:1516
main.main.func1({0x26e5638, 0xc00016c040})
        /go/src/github.com/openshift/cloud-network-config-controller/cmd/cloud-network-config-controller/main.go:101 +0x26d
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x11bgoroutine 1 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00052bb60?, {0x26cee20, 0xc000581740}, 0x1, 0xc00052bb60)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x135
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00016c080?, 0x60db88400, 0x0, 0x20?, 0x7fea470ec108?)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew(0xc0000a8120, {0x26e5638?, 0xc00016c040?})
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:268 +0xd0
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0000a8120, {0x26e5638, 0xc00025fcc0})
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:212 +0x12f
k8s.io/client-go/tools/leaderelection.RunOrDie({0x26e5638, 0xc00025fcc0}, {{0x26e7430, 0xc00062afa0}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc00065e630, 0xc000634810, 0x0}, ...})
        /go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94
main.main()
        /go/src/github.com/openshift/cloud-network-config-controller/cmd/cloud-network-config-controller/main.go:86 +0x450

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-26-050728

How reproducible:

Always

Steps to Reproduce:

1. Install OCP with proxy

Actual results:

Bootstrap failure and pod crashloop

Expected results:

Successful installation

Additional info:

Please find the must-gather here.

Description of problem:

Pods are being terminated on Kubelet restart if they consume any device.

In the case of CNV these Pods are carrying VMs, and the assumption is that Kubelet will not terminate them in this case.

Version-Release number of selected component (if applicable):

4.14 / 4.13.z / 4.12.z

How reproducible:

This should be reproducible with any device plugin, as far as my understanding goes.

Steps to Reproduce:

1. Create Pod requesting device plugin
2. Restart Kubelet
3.

Actual results:

Admission error -> Pod terminates

Expected results:

No error -> Existing & Running Pods will continue running after Kubelet restart

Additional info:

The culprit seems to be https://github.com/kubernetes/kubernetes/pull/116376

We should deprecate and eventually remove react-helmet as a shared plugin dependency. This dependency is small, and plugins can bring their own version if needed.

This requires updating our webpack plugin to allow dependency fallbacks when a shared dependency is not present.

cc Vojtech Szocs 

 

AC:

  • Update docs in the GitHub pages to state that we are deprecating the react-helmet as a shared plugin dependency

This is a clone of issue OCPBUGS-6053. The following is the description of the original issue:

Description of problem:

When a ClusterVersion's `status.availableUpdates` has a value of `null` and `Upgradeable=False`, a run time error occurs on the Cluster Settings page as the UpdatesGraph component expects `status.availableUpdates` to have a non-empty value.

Steps to Reproduce:

1.  Add the following overrides to ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version)

spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true    
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true
2.  Visit /settings/cluster and note the run-time error (see attached screenshot) 

Actual results:

An error occurs.

Expected results:

The contents of the Cluster Settings page render.

Description of problem:

Deployed a hypershift cluster with a recent multi-arch build.
The storage cluster operator has become available but is showing the warning message below:


PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/attacher_role.yaml" (string): clusterroles.rbac.authorization.k8s.io "ibm-powervs-block-external-attacher-role" is forbidden: user "system:serviceaccount:openshift-cluster-csi-drivers:powervs-block-csi-driver-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:openshift-cluster-csi-drivers" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: {APIGroups:["csi.storage.k8s.io"], Resources:["csinodeinfos"], Verbs:["get" "list" "watch"]}
PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/attacher_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "ibm-powervs-block-external-attacher-role" not found

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.Deploy 4.12.0-0.nightly-multi-2022-09-01-220105 nightly build

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3432. The following is the description of the original issue:

Description of problem:

E2E test cases for the knative and pipeline packages have been disabled in CI due to respective operator installation issues.
Tests have to be re-enabled after a new operator version is available or the issue is resolved.

References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Installed a 4.12 IPv6 single-stack disconnected cluster; an etcd member is in an abnormal status:

  1. oc get co|grep etcd
    etcd 4.12.0-0.nightly-2022-10-23-204408 False True True 15h EtcdMembersAvailable: 1 of 2 members are available, openshift-qe-057.arm.eng.rdu2.redhat.com is unhealthy

E1026 03:35:58.409977 1 etcdmemberscontroller.go:73] Unhealthy etcd member found: openshift-qe-057.arm.eng.rdu2.redhat.com, took=, err=create client failure: failed to make etcd client for endpoints https://[26xx:52:0:1eb:3xx3:5xx:fxxe:7550]:2379: context deadline exceeded

How reproducible:
Not always

Steps to Reproduce:
As description
Actual results:
As title
Expected results:
etcd co status is normal

This is a clone of issue OCPBUGS-2260. The following is the description of the original issue:

TRT-594 investigates failed CI upgrade runs due to the alert KubePodNotReady firing. The case was a pod getting skipped over for scheduling across two successive master node updates/restarts. The case was determined valid, so the ask is to make the monitoring aware that master nodes are restarting and that scheduling may be delayed. Presuming we don't want to change the existing tolerance for the non-master-node restart cases, could we suppress the alert during those restarts and fall back to a second alert with increased tolerances only during those restarts, if we have metrics indicating we are restarting? Or something similar, if there are better ways to handle it.

The scenario is:

  • A master node (1) is out of service during upgrade
  • A pod (A) is created but can not be scheduled due to anti-affinity rules as the other nodes already host a pod of that definition
  • A second pod (B) from the same definition is created after the first
  • Pod (A) attempts scheduling but fails as the master (1) node is still updating
  • Master (1) node completes updating
  • Pod (B) attempts scheduling and succeeds
  • Next Master (2) node begins updating
  • Pod (A) can not be scheduled on the next attempt(s) as the active master nodes already have pods placed and the next master (2) node is unavailable
  • Master (2) node completes updating
  • Pod (A) is scheduled

This is a clone of issue OCPBUGS-16171. The following is the description of the original issue:

Description of problem:

tls-server-name is missing when using the oc client, causing "x509: certificate signed by unknown authority". Here is the related issue: https://github.com/openshift/oc/issues/1457.

Please use this slack thread for any reference. 
https://redhat-internal.slack.com/archives/CKJR6200N/p1689170150899359

 

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

A kubeconfig might look like this:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJ.......<redacted>....
    server: https://celonis.teleport.sh:443
    tls-server-name: kube-teleport-proxy-alpn.celonis.teleport.sh
  name: celonis.teleport.sh
contexts:
- context:
    cluster: celonis.teleport.sh
    user: celonis.teleport.sh-<cluster_name>
  name: celonis.teleport.sh-<cluster_name>
current-context: celonis.teleport.sh-<cluster_name>

When running oc commands against the default namespace these commands succeed:

$ oc get pods
No resources found in default namespace.

When running the oc project <namespace> command, this modifies the context and sets the namespace. When this operation occurs the tls-server-name attribute is not obeyed, which then causes an x509 error:

$ oc project <namespace>
Now using project "<namespace>" on server "https://celonis.teleport.sh:443".

$ oc get pods --loglevel 8
I0613 12:41:05.038351  122971 loader.go:373] Config loaded from file:  /home/kwoodson/.kube/config
I0613 12:41:05.046716  122971 round_trippers.go:463] GET https://celonis.teleport.sh:443/api/v1/namespaces/<namespace>/pods?limit=500
I0613 12:41:05.046732  122971 round_trippers.go:469] Request Headers:
I0613 12:41:05.046741  122971 round_trippers.go:473]     User-Agent: oc/4.13.0 (linux/amd64) kubernetes/92b1a3d
I0613 12:41:05.046746  122971 round_trippers.go:473]     Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json
I0613 12:41:05.121700  122971 round_trippers.go:574] Response Status:  in 74 milliseconds
I0613 12:41:05.121716  122971 round_trippers.go:577] Response Headers:
I0613 12:41:05.121787  122971 helpers.go:264] Connection error: Get https://celonis.teleport.sh:443/api/v1/namespaces/<namespace>/pods?limit=500: x509: certificate signed by unknown authority
Unable to connect to the server: x509: certificate signed by unknown authority

If we review the kubeconfig, we can see a new context was added as well as a new server:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <redacted>
    server: https://celonis.teleport.sh:443
  name: celonis-teleport-sh:443
- cluster:
    certificate-authority-data: <redacted>
    server: https://celonis.teleport.sh:443
    tls-server-name: kube-teleport-proxy-alpn.celonis.teleport.sh
  name: celonis.teleport.sh
contexts:
- context:
    cluster: celonis.teleport.sh
    user: celonis.teleport.sh-<namespace>
  name: celonis.teleport.sh-<namespace>
- context:
    cluster: celonis-teleport-sh:443
    namespace: <namespace>
    user: kwoodson/celonis-teleport-sh:443
  name: <namespace>/celonis-teleport-sh:443/kwoodson
current-context: <namespace>/celonis-teleport-sh:443/kwoodson
As we can see the tls-server-name attribute in this context was not copied to the new context. If I copy this attribute from the old cluster to the new cluster this works. 

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

 

The pipeline run nodes used to show a focus border when they were in focus but no longer do.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. Load the pipeline runs
  2. Use the tab key to move between nodes

Actual results:

There is no indication of which node has the focus

Expected results:

There should be a focus border indicating the current focus node.

Reproducibility (Always/Intermittent/Only Once):

always

Build Details:

4.12

Workaround:

Additional info:

Previously:

Currently:

This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:

The vSphere CSI cloud.conf lists the single datacenter from the platform workspace config, but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918) there may be more than one datacenter.

This issue is resulting in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:

0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found  

The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.

$ oc get cm vsphere-csi-config -o yaml  -n openshift-cluster-csi-drivers | grep datacenters
    datacenters = "IBMCloud" 

 

This is a clone of issue OCPBUGS-5734. The following is the description of the original issue:

Description of problem:

In https://issues.redhat.com/browse/OCPBUGSM-46450, the VIP was added to noProxy for StackCloud but it should also be added for all national clouds.

Version-Release number of selected component (if applicable):

4.10.20

How reproducible:

always

Steps to Reproduce:

1. Set up a proxy
2. Deploy a cluster in a national cloud using the proxy
3.

Actual results:

Installation fails

Expected results:

 

Additional info:

The inconsistency was discovered when testing the cluster-network-operator changes https://issues.redhat.com/browse/OCPBUGS-5559

Description of problem:

The current KafkaSink description in ODC is `Kafka Sink is Addressable, it receives events and send them to a Kafka topic.` and it should be `A KafkaSink takes a CloudEvent, and sends it to an Apache Kafka Topic. Events can be specified in either Structured or Binary mode.`, as provided by the Serverless team.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Install Serverless operator
2. Create CR for knativeKafka in knative-eventing ns
3. go to dev perspective -> add -> event sink
4. Check the description of kafka sink

Actual results:

 

Expected results:

Update the description to as provided by serverless team

Additional info:

 

Description of problem:

Jenkins and Jenkins Agent Base image versions needs to be updated to use the latest images to mitigate known CVEs in plugins and Jenkins versions.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The health_statuses_insights metric is showing disabled rules in "total". In the other fields, it shows the correct amount.
In the code linked below, we can see that the "Disabled" rules are only skipped during the value assignment of TotalRisk.

https://github.com/openshift/insights-operator/blob/master/pkg/insights/insightsreport/insightsreport.go#L268

How reproducible:

Always

Steps to Reproduce:

1. Upload a fake archive to trigger health checks (for example with rule CVE_2020_8555_kubernetes)
2. Disable one of the rules through https://console.redhat.com/api/insights-results-aggregator/v1/clusters/{cluster.id}/rules/{rule}/error_key/{error_key}/disable
3. Create support secret and set endpoint="https://httpstat.us/200"
4. restart insights operator
5. wait for alerts to trigger
6. Check health_statuses_insights metrics. 

rule:

ccx_rules_ocp.external.rules.ocp_version_end_of_life.report

error_key:

OCP4X_BEYOND_EOL
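For step 6, one way to read the metric is to query Prometheus directly from the monitoring stack; a minimal sketch, assuming the usual openshift-monitoring pod and container names:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s 'http://localhost:9090/api/v1/query?query=health_statuses_insights'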

 

Actual results:

"moderate" health_statuses_insights shows 2 triggers
"total" shows 3. Therefore, it is accounting for the deactivated rule.

Expected results:

"moderate" health_statuses_insights shows 2 triggers
"total" health_statuses_insights shows 2 triggers (doesn't account for deactivated rule)

Additional info:

If there is any issue in triggering these events, you may contact me and I can help with the steps.

 

Description of problem:

failed to run command in pod with network-tools script pod-run-netns-command locally

Version-Release number of selected component (if applicable):

Client Version: 4.12.0-0.nightly-2022-07-25-055755
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-09-28-204419
Kubernetes Version: v1.24.0+8c7c967

How reproducible:

100%

Steps to Reproduce:

1.configure KUBECONFIG
[cloud-user@preserved-qiowang debug-scripts]$ export | grep kube
declare -x KUBECONFIG="/var/tmp/kubeconfig412"
[cloud-user@preserved-qiowang debug-scripts]$ oc get nodes
NAME                                                         STATUS   ROLES                  AGE     VERSION
qiowang-09291-chllb-master-0.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-master-1.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-master-2.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-a-2zq28.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-b-226ft.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-c-wq52c.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967

2. clone the openshift/network-tools repo to local

3. create project test, create pod hello-world
[cloud-user@preserved-qiowang debug-scripts]$ oc project
Using project "test" on server "https://api.qiowang-09291.qe.gcp.devcluster.openshift.com:6443".
[cloud-user@preserved-qiowang debug-scripts]$ oc get pods
NAME                READY   STATUS    RESTARTS   AGE
hello-world-j9v9g   1/1     Running   0          68s
hello-world-rrwjf   1/1     Running   0          68s

4. run ping command in the pod hello-world-j9v9g with script pod-run-netns-command locally
[cloud-user@preserved-qiowang debug-scripts]$ ./network-tools pod-run-netns-command test hello-world-j9v9g ping 8.8.8.8 -c 5
ERROR: Command returned non-zero exit code, check output or logs.

Actual results:

failed to run command in pod hello-world-j9v9g with script pod-run-netns-command locally

Expected results:

can run ping 8.8.8.8 -c 5 in pod hello-world-j9v9g with script pod-run-netns-command locally

Additional info:

 

This is a clone of issue OCPBUGS-5891. The following is the description of the original issue:

Description of problem:

When used in heads-only mode, oc-mirror does not record the operator bundle's minimum version if a target name is set.

The recorded values ensure that bundles that still exist in the catalog are included as part of the generated catalog and that the associated images are not pruned. Without that record, bundles are pruned when no minimum version is set in the imageset configuration even though the bundles still exist in the source catalog.

Version-Release number of selected component (if applicable):

Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202212011938.p0.g8bf1402.assembly.stream-8bf1402", GitCommit:"8bf14023aa018e12425e29993e6f53f0ab07e6ab", GitTreeState:"clean", BuildDate:"2022-12-01T19:56:31Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

Using the advanced cluster management package as an example.

1. Find the latest bundle for acm in the release-2.6 channel with oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232 --package advanced-cluster-management
2. Create an image set configuration to mirror an operator from an older catalog version

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232
      targetName: test
      targetTag: test
      packages:
        - name: advanced-cluster-management
          channels:
            - name: release-2.6


3. Run oc-mirror --config config-with-operators.yaml file://
4. Check the bundle minimum version on the metadata using oc-mirror describe mirror_seq1_000000.tar under the operators field; the advanced-cluster-management entry should show the version found in Step 1.
5. Create another ImageSetConfiguration for a later version of the catalog
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test 
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
      targetName: test
      targetTag: test
      packages:
        - name: advanced-cluster-management
          channels:
            - name: release-2.6
6. Check the bundle minimum version on the metadata using oc-mirror describe mirror_seq2_000000.tar under the operators field.

Actual results:

The catalog entry in the metadata shows packages as null.

Expected results:

It should have the advanced-cluster-management package with the original minimum version or an updated minimum version if the original bundle was pruned.
 

 

This is a clone of issue OCPBUGS-5129. The following is the description of the original issue:

Description of problem:

I attempted to install a BM SNO with the agent based installer.
In the install_config, I disabled all supported capabilities except marketplace. Install_config snippet: 

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace

The system installed fine but the capabilities config was not passed down to the cluster. 

clusterversion: 
status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - CSISnapshot
      - Console
      - Insights
      - Storage
      - baremetal
      - marketplace
      - openshift-samples

oc -n kube-system get configmap cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: ptp.lab.eng.bos.redhat.com
    bootstrapInPlace:
      installationDisk: /dev/disk/by-id/wwn-0x62cea7f04d10350026c6f2ec315557a0
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 0
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 1
    metadata:
      creationTimestamp: null
      name: cnfde8
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.16.231.0/24
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    publish: External
    pullSecret: ""





Version-Release number of selected component (if applicable):

4.12.0-rc.5

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with agent based installer as described above
2.
3.

Actual results:

Capabilities that were disabled in the install-config were installed anyway.

Expected results:

Capabilities disabled in the install-config are not installed.

Additional info:

 

Description of problem:

This bug is a copy of https://bugzilla.redhat.com/show_bug.cgi?id=2137616 as fix needs to go on OCP side.
For the must-gather and attached screenshots please refer to the Bugzilla.
Add Capacity button does not exist after upgrade OCP version [OCP4.11->OCP4.12]

Version-Release number of selected component (if applicable):

ODF Version:4.11.3-3
OCP Version: 4.12.0-0.nightly-2022-10-24-103753
Provider: AWS

How reproducible:

 

Steps to Reproduce:

1. Install ODF 4.11 + OCP 4.11
2. Upgrade OCP 4.11 to OCP 4.12
3. Log in to the OpenShift Web Console.
4. Click Operators → Installed Operators.
5. Click OpenShift Data Foundation Operator.
6. Click the Storage Systems tab.
7. Click the Action Menu (⋮) on the far right of the storage system name to extend the options menu.
"Add Capacity" button does not exist on menu.
*Attached Screenshot 

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4490. The following is the description of the original issue:

Description of problem:

When hypershift HostedCluster has endpointAccess: Private, the csi-snapshot-controller is in CrashLoopBackoff because the guest APIServer url in the admin-kubeconfig isn't reachable in Private mode.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4656. The following is the description of the original issue:

Description of problem:

`/etc/hostname` may exist, but be empty. `vsphere-hostname` service should check that the file is not empty instead of just that it exists.

OKD's machine-os-content starting from F37 has an empty /etc/hostname file, which breaks joining workers in vsphere IPI
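A minimal sketch of the kind of check described above (illustrative shell only, not the actual service code): an empty /etc/hostname should be treated the same as a missing one.

if [ -s /etc/hostname ]; then
    # file exists and is non-empty: a static hostname is present
    echo "static hostname present: $(cat /etc/hostname)"
else
    # missing or empty: fall back to the vmtoolsd-provided hostname
    echo "no usable /etc/hostname, deferring to vmtoolsd"
fi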

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Install OKD w/ workers on vsphere
2.
3.

Actual results:


Workers get hostname resolved using NM

Expected results:


Workers get hostname resolved using vmtoolsd

Additional info:


Description of problem:
A project viewer is able to see a 'Create Pod Disruption Budget' button on the Pods list page, although the creation will ultimately fail due to insufficient permissions. The console should not show a 'Create Pod Disruption Budget' button for a project viewer. Other resource list pages don't have this issue.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-09-16-212009

How reproducible:
Always

Steps to Reproduce:
1. normal user has a project and workloads

  1. oc get all -n yapei1-project
    NAME READY STATUS RESTARTS AGE
    pod/example-787f749bb-czkms 1/1 Running 0 79s
    pod/example-787f749bb-m7wxt 1/1 Running 0 79s
    pod/example-787f749bb-mw8jv 1/1 Running 0 79s

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/example 3/3 3 3 79s

NAME DESIRED CURRENT READY AGE
replicaset.apps/example-787f749bb 3 3 3 79s

2. grant another user with view access to user project 'yapei1-project'

  1. oc adm policy add-role-to-user view uiauto1 -n yapei1-project
    clusterrole.rbac.authorization.k8s.io/view added: "uiauto1"

3. login with user 'uiauto1' and check the permissions on Pods list page

Actual results:
3. project viewer 'uiauto1' can see the pods list successfully; at the same time the console also shows a 'Create Pod Disruption Budget' button, while the creation will ultimately fail if the project viewer tries to create one

Expected results:
3. console should not show 'Create Pod Disruption Budget' button for a project viewer

Additional info:
For comparison: we don't show a resource creation button ('Create xxx') on other workload list pages for a project viewer, such as the Deployments and DeploymentConfigs lists.
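For reference, the missing permission can be confirmed from the CLI with the standard authorization check (user and namespace names match the example above):

$ oc auth can-i create poddisruptionbudgets --as=uiauto1 -n yapei1-project
# a project viewer is expected to get "no" here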

This is a clone of issue OCPBUGS-7893. The following is the description of the original issue:

Description of problem:
The TaskRun duration diagram on the "Metrics" tab of a pipeline is set to only show 4 TaskRuns in the legend regardless of the number of TaskRuns on the diagram.

 

 

Expected results:

All TaskRuns should be displayed in the legend.

Description of problem:

When you migrate a HostedCluster, the AWSEndpointService from the old management cluster conflicts with the one on the new management cluster. The AWSPrivateLink controller does not perform any validation when this happens; this validation is needed to make the Disaster Recovery HostedCluster migration work. The issue shows up when the nodes of the HostedCluster cannot join the new management cluster because the AWSEndpointServiceName is still pointing to the old one.

Version-Release number of selected component (if applicable):

4.12
4.13
4.14

How reproducible:

Follow the migration procedure from the upstream documentation and the nodes in the destination HostedCluster will remain in NotReady state.

Steps to Reproduce:

1. Setup a management cluster with the 4.12-13-14/main version of the HyperShift operator.
2. Run the in-place node DR Migrate E2E test from this PR https://github.com/openshift/hypershift/pull/2138:
bin/test-e2e \
  -test.v \
  -test.timeout=2h10m \
  -test.run=TestInPlaceUpgradeNodePool \
  --e2e.aws-credentials-file=$HOME/.aws/credentials \
  --e2e.aws-region=us-west-1 \
  --e2e.aws-zones=us-west-1a \
  --e2e.pull-secret-file=$HOME/.pull-secret \
  --e2e.base-domain=www.mydomain.com \
  --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.skip-api-budget \
  --e2e.aws-endpoint-access=PublicAndPrivate

Actual results:

The nodes stay in NotReady state

Expected results:

The nodes should join the migrated HostedCluster

Additional info:

 

This is a clone of issue OCPBUGS-11187. The following is the description of the original issue:

Description of problem:

EgressIP was NOT migrated to the correct worker after deleting the machine it was assigned to in a GCP XPN cluster.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-29-235439

How reproducible:

Always

Steps to Reproduce:

1. Set up GCP XPN cluster.
2. Scale two new worker nodes
% oc scale --replicas=2 machineset huirwang-0331a-m4mws-worker-c -n openshift-machine-api        
machineset.machine.openshift.io/huirwang-0331a-m4mws-worker-c scaled

3. Wait for the two new worker nodes to be ready.
 % oc get machineset -n openshift-machine-api
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
huirwang-0331a-m4mws-worker-a   1         1         1       1           86m
huirwang-0331a-m4mws-worker-b   1         1         1       1           86m
huirwang-0331a-m4mws-worker-c   2         2         2       2           86m
huirwang-0331a-m4mws-worker-f   0         0                             86m
% oc get nodes
NAME                                                          STATUS   ROLES                  AGE     VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   82m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 71m     v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 8m22s   v1.26.2+dc93b13
4. Label one new worker node as an egress node
 % oc label node huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal labeled

5. Create the egressIP object
oc get egressIP
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100
6. Label the second new worker node as an egress node
% oc label node huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal k8s.ovn.org/egress-assignable="" 
node/huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal labeled
7. Delete the assigned egress node
% oc delete machines.machine.openshift.io huirwang-0331a-m4mws-worker-c-rhbkr  -n openshift-machine-api
machine.machine.openshift.io "huirwang-0331a-m4mws-worker-c-rhbkr" deleted
 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE   VERSION
huirwang-0331a-m4mws-master-0.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-1.c.openshift-qe.internal         Ready    control-plane,master   86m   v1.26.2+dc93b13
huirwang-0331a-m4mws-master-2.c.openshift-qe.internal         Ready    control-plane,master   87m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-a-hfqsn.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-b-vbqf2.c.openshift-qe.internal   Ready    worker                 76m   v1.26.2+dc93b13
huirwang-0331a-m4mws-worker-c-wnm4r.c.openshift-qe.internal   Ready    worker                 13m   v1.26.2+dc93b13
29468 W0331 02:48:34.917391       1 egressip_healthcheck.go:162] Could not connect to huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal (10.129.4.2:9107): context deadline exceeded
29469 W0331 02:48:34.917417       1 default_network_controller.go:903] Node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal is not ready, deleting it from egress assignment
29470 I0331 02:48:34.917590       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:Logical_Switch_Port Row:map[options:{GoMap:map[router-port:rtoe-GR_huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6efd3c58-9458-44a2-a43b-e70e669efa72}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
29471 E0331 02:48:34.920766       1 egressip.go:993] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not reachable, will attempt rebalancing
29472 E0331 02:48:34.920789       1 egressip.go:997] Allocator error: EgressIP: egressip-1 assigned to node: huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal which is not ready, will attempt rebalancing
29473 I0331 02:48:34.920808       1 egressip.go:1212] Deleting pod egress IP status: {huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal 10.0.32.100} for EgressIP: egressip-1

Actual results:

The egressIP was not migrated to the correct worker
 oc get egressIP      
NAME         EGRESSIPS     ASSIGNED NODE                                                 ASSIGNED EGRESSIPS
egressip-1   10.0.32.100   huirwang-0331a-m4mws-worker-c-rhbkr.c.openshift-qe.internal   10.0.32.100

Expected results:

The egressIP should be migrated to the correct worker from the deleted node.

Additional info:


 A related slack thread: here

The error:

 which: no kustomize in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin:/go/bin)
+ curl -L --retry 5 https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.3.0/kustomize_v4.3.0_linux_amd64.tar.gz
+ tar -zx -C /usr/bin/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1523    0  1523    0     0  27196      0 --:--:-- --:--:-- --:--:-- 26719
Warning: Problem : HTTP error. Will retry in 300 seconds. 5 retries left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
gzip: stdin: not in gzip format
tar: Child died with signal 13
tar: Error is not recoverable: exiting now 

Source: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_assisted-service/4260/pull-ci-openshift-assisted-service-release-ocm-2.6-e2e-ai-operator-ztp-disconnected/1561941429180174336

A related job search: https://search.ci.openshift.org/?search=gzip%3A+stdin%3A+not+in+gzip+format&maxAge=336h&context=1&type=junit&name=assisted&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
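A more defensive variant of the failing step would be to fail on HTTP errors and verify the archive before extracting, instead of piping curl straight into tar; a sketch (not the actual CI script):

curl -sSfL --retry 5 -o kustomize.tar.gz \
  "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.3.0/kustomize_v4.3.0_linux_amd64.tar.gz"
tar -tzf kustomize.tar.gz > /dev/null && tar -zxf kustomize.tar.gz -C /usr/bin/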

This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:

Description of problem:

When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message:
the service account does not have any API secret sa=testx-ns/testx-sa
This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303 - which was resolved in 4.11.0 - however, the new element now, is that the presence of a secret in the namespace  is causing the issue.
The name of the secret seems significant - suggesting something somewhere is depending on the order that secrets are listed in. For example, If the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an opaque secret present in the namespace, the operator install fails as described. I aught to be allowed to have an opaque secret present in the namespace where I am installing the operator.

Version-Release number of selected component (if applicable):

4.11.0 & 4.11.1

How reproducible:

100% reproducible

Steps to Reproduce:

1. Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml

Actual results:

 

Expected results:

 

Additional info:

API YAML:

cat api-secret-issue.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
    kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns
---
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:
- kind: ServiceAccount
  name: testx-sa
  namespace: testx-ns
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace

Description of problem:

To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Already merged in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/660

Description of problem:
When the user runs:

openshift-install agent create image --dir cluster-manifests

But if the manifests are either not in cluster-manifests or are missing, the error generated by the tool leads users to believe that they are missing some tool dependency:

ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available

Version-Release number of selected component (if applicable):4.11.0

How reproducible: 100%

Steps to Reproduce:
1. rm -fr /tmp/cluster-manifests && mkdir /tmp/cluster-manifests
2.openshift-install agent create image --dir cluster-manifests

Actual results:
ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available

Expected results:
Error: Missing manifests in the specified cluster manifest directory: "/tmp/cluster-manifests"

Additional info:

This is a clone of issue OCPBUGS-7719. The following is the description of the original issue:

Description of problem:

An update from 4.13.0-ec.2 to 4.13.0-ec.3 stuck on:

$ oc get clusteroperator machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.13.0-ec.2   True        True          True       30h     Unable to apply 4.13.0-ec.3: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 105, ready 105, updated: 105, unavailable: 0)]

The worker MachineConfigPool status included:

      type: NodeDegraded
    - lastTransitionTime: "2023-02-16T14:29:21Z"
      message: 'Failed to render configuration for pool worker: Ignoring MC 99-worker-generated-containerruntime
        generated by older version 8276d9c1f574481043d3661a1ace1f36cd8c3b62 (my version:
        c06601510c0917a48912cc2dda095d8414cc5182)'

Version-Release number of selected component (if applicable):

4.13.0-ec.3. The behavior was apparently introduced as part of OCPBUGS-6018, which has been backported, so the following update targets are expected to be vulnerable: 4.10.52+, 4.11.26+, 4.12.2+, and 4.13.0-ec.3.

How reproducible:

100%, when updating into a vulnerable release, if you happen to have leaked MachineConfig.

Steps to Reproduce:

1. 4.12.0-ec.1 dropped cleanUpDuplicatedMC. Run a later release, like 4.13.0-ec.2.
2. Create more than one KubeletConfig or ContainerRuntimeConfig targeting the worker pool (or any pool other than master). The number of clusters who have had redundant configuration objects like this is expected to be small.
3. (Optionally?) delete the extra KubeletConfig and ContainerRuntimeConfig.
4. Update to 4.13.0-ec.3.

Actual results:

Update sticks on the machine-config ClusterOperator, as described above.

Expected results:

Update completes without issues.

This bug is a backport clone of [Bugzilla Bug 2113973](https://bugzilla.redhat.com/show_bug.cgi?id=2113973). The following is the description of the original bug:

If we define a custom scc like this:

allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: false
allowPrivilegedContainer: false
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: []
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: MCP Vault Unsealer
    meta.helm.sh/release-name: vault
    meta.helm.sh/release-namespace: mcp-vault
  creationTimestamp: "2022-07-25T11:09:53Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-unsealer
    app.kubernetes.io/version: 3.7.0
    helm.sh/chart: vault-unsealer-3.7.1
  name: vault-unsealer
  resourceVersion: "1793493"
  uid: 6b6d88be-03c0-476d-8602-2e94e4ecfcb5
priority: null
readOnlyRootFilesystem: true
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:mcp-vault:vault-unsealer
volumes:
- configMap
- hostPath
- secret

we can see that the pod originally has this scc:

oc get pod machine-config-operator-7f57686f5c-g895k -o yaml | grep scc
openshift.io/scc: hostmount-anyuid

After applying the new SCC ( even if we set a higher priority ) the pod is showing after restart:

oc get pod machine-config-operator-7f57686f5c-jg2jv -o yaml | grep scc
openshift.io/scc: vault-unsealer

This is a clone of issue OCPBUGS-10844. The following is the description of the original issue:

Description of problem:

When modifying a secret in the Management Console that has a binary file included (such as a keystore), the keystore will get corrupted after the modification and therefore impact application functionality (as the keystore cannot be read).

$ openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365
$ cat cert.pem key.pem > file.crt.txt
$ openssl pkcs12 -export -in file.crt.txt -out mykeystore.pkcs12 -name myAlias -noiter -nomaciter
$ oc create secret generic keystore --from-file=mykeystore.pkcs12 --from-file=cert.pem --from-file=key.pem -n project-300

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: project-300
spec:
  containers:
  - name: mypod
    image: quay.io/rhn_support_sreber/curl:latest
    volumeMounts:
    - name: foo
      mountPath: "/keystore"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: keystore
      optional: true

# Getting the md5sum from the file on the local Laptop to compare with what is available in the pod
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

sh-5.2# ls -al /keystore/..data/
total 16
drwxr-xr-x. 2 root root  100 Mar 24 11:19 .
drwxrwxrwt. 3 root root  140 Mar 24 11:19 ..
-rw-r--r--. 1 root root 1992 Mar 24 11:19 cert.pem
-rw-r--r--. 1 root root 3414 Mar 24 11:19 key.pem
-rw-r--r--. 1 root root 4380 Mar 24 11:19 mykeystore.pkcs12

sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  /keystore/..data/mykeystore.pkcs12
sh-5.2#

Edit cert.pem in secret using the Management Console

$ oc delete pod mypod -n project-300

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  namespace: project-300
spec:
  containers:
  - name: mypod
    image: quay.io/rhn_support_sreber/curl:latest
    volumeMounts:
    - name: foo
      mountPath: "/keystore"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: keystore
      optional: true

sh-5.2# ls -al /keystore/..data/
total 20
drwxr-xr-x. 2 root root   100 Mar 24 12:52 .
drwxrwxrwt. 3 root root   140 Mar 24 12:52 ..
-rw-r--r--. 1 root root  1992 Mar 24 12:52 cert.pem
-rw-r--r--. 1 root root  3414 Mar 24 12:52 key.pem
-rw-r--r--. 1 root root 10782 Mar 24 12:52 mykeystore.pkcs12

sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12
sh-5.2#      

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-03-23-204038   True        False         91m     Cluster version is 4.13.0-0.nightly-2023-03-23-204038

The modification was done in the Management Console by selecting the secret and then using: Actions -> Edit Secrets -> modifying the value of cert.pem and submitting via the Save button

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.13.0-0.nightly-2023-03-23-204038 and 4.12.6

How reproducible:

Always

Steps to Reproduce:

1. See above the details steps

Actual results:

# md5sum on the Laptop for the file
$ md5sum mykeystore.pkcs12
c189536854e59ab444720efaaa76a34a  mykeystore.pkcs12

# md5sum of the file in the pod after the modification in the Management Console
sh-5.2# md5sum /keystore/..data/mykeystore.pkcs12
56f04fa8059471896ed5a3c54ade707c  /keystore/..data/mykeystore.pkcs12

The file got corrupted and is not usable anymore. The binary file should not be modified if no change was made to its value when editing the secret in the Management Console.

Expected results:

The binary file should not be modified if no change was made to its value when editing the secret in the Management Console.

Additional info:

A similar problem was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1879638 but that was when the binary file was uploaded. Possibly the secret edit functionality is also missing binary file support.
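As a side note, the corruption can also be confirmed without mounting the secret, by decoding it directly (names match the reproduction steps above):

$ oc get secret keystore -n project-300 -o jsonpath='{.data.mykeystore\.pkcs12}' | base64 -d | md5sum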

This is a clone of issue OCPBUGS-13084. The following is the description of the original issue:

Description of problem:
The customer wanted to restrict access to the vCenter API, and the originating traffic needs to use a configured EgressIP. This is working fine for the machine API, but the vSphere CSI driver controller uses the host network and hence the configured EgressIP isn't used.

Is it possible to disable this (use of host networking) for the CSI controller?

slack thread: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1683135077822559
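A quick way to see which of the CSI driver pods run on the host network (and therefore bypass the EgressIP), purely for illustration:

$ oc -n openshift-cluster-csi-drivers get pods \
    -o custom-columns=NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork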

This is a clone of issue OCPBUGS-5466. The following is the description of the original issue:

Description of problem:

It is possible to change some of the fields in default catalogSource specs and the Marketplace Operator will not revert the changes 

Version-Release number of selected component (if applicable):

4.13.0 and back

How reproducible:

Always

Steps to Reproduce:

1. Create a 4.13.0 OpenShift cluster
2. Set the redhat-operator catalogSource.spec.grpcPodConfig.SecurityContextConfig field to `legacy`.

Actual results:

The field remains set to `legacy` mode.

Expected results:

The field is reverted to `restricted` mode.

Additional info:
This code needs to be updated to account for new fields in the catalogSource spec.

 

 

 

This is a clone of issue OCPBUGS-8702. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8523. The following is the description of the original issue:

Description of problem:

Due to an rpm-ostree regression (OKD-63), the MCO was copying /var/lib/kubelet/config.json into /run/ostree/auth.json on FCOS and SCOS. This breaks the Assisted Installer flow, which starts with a Live ISO and doesn't have /var/lib/kubelet/config.json

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Event sources are not shown in topology

Version-Release number of selected component (if applicable):

Have verified it on 4.12.0-0.nightly-2022-09-20-095559

How reproducible:

 

Steps to Reproduce:

1. Install Serverless operator
2. Create CR for knative-serving and knative-eventing respectively
3. Create/select a ns -> go to dev console -> add -> event souce
4. Create any event source

 

 

Actual results:

Can't see the created resource (Event source) in topology

Expected results:

Should be able to see the created resource in topology

Additional info:

 

Description of problem:

Recently during an audit on a user's cluster, it was discovered that
OLM's certificate generation functionality has a few minor shortcomings.

  1. The generated CA and server cert do not include a common name,
    which causes some tooling to have trouble tracing the cert chain.
  2. The generated CA and server cert include unnecessary key usages,
    which means those certificates can be used for more than their
    intended purposes.

How reproducible: Always

Joe Lanford could you please double check what I've put below? QE is asking for a bug ticket for this fix (makes sense as it helps them verify everything is correct and gives us traceability)

Steps to Reproduce:

oc get secret -n openshift-operator-lifecycle-manager packageserver-service-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text

Actual results:

- Common Name not present in certificate data
- X509v3 extensions include:

        X509v3 Key Usage: critical
            Digital Signature, Certificate Sign
        X509v3 Extended Key Usage:
            TLS Web Client Authentication, TLS Web Server Authentication

Expected results:

- Common Name must be present in certificate
- X509v3 extensions should NOT include Digital Signature under Key Usage
- X509v3 extensions should NOT include Extended Key Usage (other than *TLS Web Server Authentication*)

Description of problem:

`create a project` link is enabled for users who do not have permission to create a project. This issue surfaces itself in the developer sandbox.

Version-Release number of selected component (if applicable):

4.11.5

How reproducible:

 

Steps to Reproduce:

1. log into dev sandbox, or a cluster where the user does not have permission to create a project
2. go directly to URL /topology/all-namespaces

Actual results:

`create a project` link is enabled. Upon clicking the link and submitting the form, the project fails to create; as expected.

Expected results:

`create a project` link should only be available to users with the correct permissions.

Additional info:

The project list pages are not directly available to the user in the UI through the project selector. The user must go directly to the URL.

It's possible to encounter this situation when a user logs in with multiple accounts and returns to a previous url.
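For reference, whether the logged-in user may self-provision projects can be checked from the CLI; a minimal sketch using the standard projectrequests resource (run as the affected user):

$ oc auth can-i create projectrequests
# a sandbox user without self-provisioner rights is expected to get "no"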

 

This is a clone of issue OCPBUGS-4654. The following is the description of the original issue:

Description of problem:

When we introduced aarch64 support for IPI on Azure, we changed the Installer from using managed images (no architecture support) to using Image Galleries (architecture support). This means that the place where the Installer looks for rhcos bootimages has changed from "/resourceGroups/$rg_name/providers/Microsoft.Compute/images/$cluster_id" to "/resourceGroups/$rg_name/providers/Microsoft.Compute/galleries/gallery_$cluster_id/images/$cluster_id/versions/$rhcos_version".
This has been properly handled in the IPI workflow, with changes to the terraform configs [1]. However, our ARM template for UPI installs [2] still uploads images via Managed Images and therefore breaks workflows provisioning compute nodes with MAO.

[1] https://github.com/openshift/installer/pull/6304
[2] https://github.com/openshift/installer/blob/release-4.12/upi/azure/02_storage.json

Version-Release number of selected component (if applicable):

4.13 and 4.12

How reproducible:

always

Steps to Reproduce:

Any workflow that provisions compute nodes with MAO. For example, in the UPI deploy with ARM templates:
1. Execute 06_workers.json template with compute.replicas: 0 in the install-config, then run the oc scale command to "activate" MAO provision (`oc scale --replicas=1 machineset $machineset_name -n openshift-machine-api`)
2. Skip 06_workers.json but set compute.replicas: 3 in the install-config. MAO will provision nodes as part of the cluster deploy.

Actual results:

Error Message:           failed to reconcile machine 
"maxu-upi2-gc7n8-worker-eastus3-68gdx": failed to create vm 
maxu-upi2-gc7n8-worker-eastus3-68gdx: failure sending request for 
machine maxu-upi2-gc7n8-worker-eastus3-68gdx: cannot create vm: 
compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: 
StatusCode=404 -- Original Error: Code="GalleryImageNotFound" 
Message=""The gallery image 
/subscriptions/53b8f551-.../resourceGroups/maxu-upi2-gc7n8-rg/providers/Microsoft.Compute/galleries/gallery_maxu_upi2_gc7n8/images/maxu-upi2-gc7n8-gen2/versions/412.86.20220930
 is not available in eastus region. Please contact image owner to 
replicate to this region, or change your requested region."" 
Target="imageReference"

But the image can be found at:
/subscriptions/53b8f551-.../resourceGroups/maxu-upi2-gc7n8-rg/providers/Microsoft.Compute/images/maxu-upi2-gc7n8-gen2

Expected results:

No errors and the bootimage is loaded from the Image Gallery.

Additional info:

02_storage.json template will have to be rewritten to use Image Gallery instead of Managed Images.

This is a clone of issue OCPBUGS-6049. The following is the description of the original issue:

Description of problem:

We show the UpdateInProgress component (the progress bars) when the cluster update status is Failing, UpdatingAndFailing, or Updating.  The inclusion of the Failing case results in a bug where the progress bars can display when an update is not occurring (see attached screenshot).  

Steps to Reproduce:

1.  Add the following overrides to ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version)

spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true    
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true
2.  Wait for ClusterVersion changes to roll out.
3.  Visit /settings/cluster and note the progress bars are present and displaying 100% but the cluster is not updating

Actual results:

Progress bars are displaying when not updating.

Expected results:

Progress bars should not display when the cluster is not updating.

Description of problem:

Possibly a regression introduced by OCPBUGS-7898, but a 4.12.14 cluster with None infrastructure submitted the following Insights for the cloud-controller-manager ClusterOperator:

2023-05-05T00:08:07Z Upgradeable=False AsExpected:

Version-Release number of selected component (if applicable):

4.12.14

How reproducible:

Unclear.

Steps to Reproduce:

1. Run a 4.12.14 cluster, for some unclear subset of cluster configuration.
2. $ oc get -o json clusteroperator cloud-controller-manager | jq '.status.conditions[] | select(.type == "Upgradeable")'

Actual results:

False with AsExpected and an empty message.

Expected results:

True with AsExpected, or False with a different reason and a message.

This is a clone of issue OCPBUGS-3883. The following is the description of the original issue:

While doing a PerfScale test, we noticed that the ovnkube pods are not being spread out evenly among the available workers. Instead they are all stacking on a few until they fill up the available allocatable EBS volumes (25 in the case of the m5 instances that we see here).

An example from partway through our 80 hosted cluster test when there were ~30 hosted clusters created/in progress

There are 24 workers available:

```

$ for i in `oc get nodes -l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!=,node-role.kubernetes.io/workload!= | egrep -v "NAME" | awk '{ print $1 }'`;    do  echo $i `oc describe node $i | grep -v openshift | grep ovnkube -c`; done
ip-10-0-129-227.us-west-2.compute.internal 0
ip-10-0-136-22.us-west-2.compute.internal 25
ip-10-0-136-29.us-west-2.compute.internal 0
ip-10-0-147-248.us-west-2.compute.internal 0
ip-10-0-150-147.us-west-2.compute.internal 0
ip-10-0-154-207.us-west-2.compute.internal 0
ip-10-0-156-0.us-west-2.compute.internal 0
ip-10-0-157-1.us-west-2.compute.internal 4
ip-10-0-160-253.us-west-2.compute.internal 0
ip-10-0-161-30.us-west-2.compute.internal 0
ip-10-0-164-98.us-west-2.compute.internal 0
ip-10-0-168-245.us-west-2.compute.internal 0
ip-10-0-170-103.us-west-2.compute.internal 0
ip-10-0-188-169.us-west-2.compute.internal 25
ip-10-0-188-194.us-west-2.compute.internal 0
ip-10-0-191-51.us-west-2.compute.internal 5
ip-10-0-192-10.us-west-2.compute.internal 0
ip-10-0-193-200.us-west-2.compute.internal 0
ip-10-0-193-27.us-west-2.compute.internal 7
ip-10-0-199-1.us-west-2.compute.internal 0
ip-10-0-203-161.us-west-2.compute.internal 0
ip-10-0-204-40.us-west-2.compute.internal 23
ip-10-0-220-164.us-west-2.compute.internal 0
ip-10-0-222-59.us-west-2.compute.internal 0

```

This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster

This is a clone of issue OCPBUGS-4491. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7729. The following is the description of the original issue:

Description of problem:

Etcd's liveness probe should be removed.

Version-Release number of selected component (if applicable):

4.11

Additional info:

When the master hosts hit high CPU load, this can cause a cascading restart loop for etcd and kube-apiserver due to the etcd liveness probes failing. Because of this loop, load on the masters stays high since the API servers and controllers keep restarting over and over again.

There is no reason for etcd to have a liveness probe; we removed this probe in 3.11 due to issues like this.
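For reference, whether the etcd container currently defines a liveness probe can be inspected with something like the following (the label selector is illustrative):

$ oc -n openshift-etcd get pods -l app=etcd \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[?(@.name=="etcd")].livenessProbe}{"\n"}{end}'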

Description of problem:

oc-mirror shouldn't clean out the operator versions that are not referenced in the channel anymore

The customer has the following ImageSetConfiguration. They are running oc-mirror GitVersion: 4.11.0-2022082035.p0.g3c1c80c.assembly.stream-3c1c80c.

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
      targetName: bit-redhat-operator-catalog-platform-essentials-index
      packages:
        - name: elasticsearch-operator
          channels:
            - name: stable
          minVersion: 5.4.2
        - name: cluster-logging
          channels:
            - name: stable
          minVersion: 5.4.2

This works at first and it makes 5.4.2 available in their internal catalog. However after some time the version 5.4.2 disappears out of their catalog and we get the following error while syncing:

ERRO[0108] Operator elasticsearch-operator was not found, please check name, minVersion, maxVersion, and channels in the config file.
ERRO[0108] Operator cluster-logging was not found, please check name, minVersion, maxVersion, and channels in the config file.

The issue is that the original configured version 5.4.2 is not anymore in the catalog, which we can verify by querying the catalog:

$ oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10 --package elasticsearch-operator --channel stable
WARN[0022] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format.
VERSIONS
5.5.0

$ oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10 --package elasticsearch-operator --channel stable-5.4
WARN[0019] DEPRECATION NOTICE:
Sqlite-based catalogs and their related subcommands are deprecated. Support for
them will be removed in a future release. Please migrate your catalog workflows
to the new file-based catalog format.
VERSIONS
5.4.4

So,
a) the version 5.4.2 completely disappeared
b) the stable channel now starts with 5.5.0

oc-mirror would clean out the versions that are not referenced anymore, and thus we would assume that 5.4.2 would be cleaned from the mirror, which they definitely do not want to happen since they still have that version on clusters in their environment.

It is quite tedious to keep editing the image-set.yaml for all the versions that disappear out of the catalog

Version-Release number of selected component (if applicable):

oc-mirror GitVersion: 4.11.0-2022082035.p0.g3c1c80c.assembly.stream-3c1c80c

How reproducible:

100%

Steps to Reproduce:

1. Create an ImageSetConfiguration to mirror a particular operator
2. Mirror the operator to mirror registry using oc-mirror
3. The specified version of the operator disappears from the catalog after a few days when there are changes in the channel, and you start getting the mentioned error on sync.

Actual results:

Operator disappears from the catalog

Expected results:

The mentioned version of the operator to be available in mirror registry even after it's not referenced in the channel

Additional info:


When we create an HCP, the Root CA in the HCP namespaces has the certificate and key named as

- ca.key
- ca.crt

But cert-manager expects them to be named as

- tls.key
- tls.cert

Done criteria: The Root CA should have the certificate and key named as the cert manager expects.
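For illustration, the key names on the secret can be checked directly (the secret and namespace names here are placeholders for the actual HCP namespace):

$ oc -n <hcp-namespace> get secret root-ca -o jsonpath='{.data}' | jq 'keys'
# shows whether the secret exposes ca.* or tls.* entries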

This is a clone of issue OCPBUGS-15787. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15427. The following is the description of the original issue:

Description of problem:

As a cluster-admin, users can see pipelines section while using the `import from git` feature in the developer mode from web console.

However if users logged in as a normal user or a project admin, they are not able to see the pipelines section.

Version-Release number of selected component (if applicable):

Tested in OCP v4.12.18 and v4.12.20 

How reproducible:

Always

Steps to Reproduce:

Prerequisite- Install Red Hat OpenShift pipelines operator
1. Login as a kube-admin user from web console
2. Go to Developer View
3. Click on +Add
4. Under Git Repository, open page -> Import from git
5. Enter Git Repo URL (example git url- https://github.com/spring-projects/spring-petclinic)
6. Check if there are 3 section : General , Pipelines , Advance options
7. Then Login as a project admin user
8. Perform all the steps again from step 2 to step 6

Actual results:

Pipelines section is not visible when logged in as a project admin. Only General and Advance options sections are visible in import from git.
However Pipeline section is visible as a cluster-admin.

Expected results:

Pipelines section should be visible when logged in as a project admin, along with General and Advance options sections in import from git.

Additional info:

I checked by creating a separate rolebinding and clusterrolebindings to assign access for pipeline resources like below :
~~~
$ oc create clusterrole pipelinerole1 --verb=create,get,list,patch,delete --resource=tektonpipelines,openshiftpipelinesascodes
$ oc create clusterrole pipelinerole2 --verb=create,get,list,patch,delete --resource=repositories,pipelineruns,pipelines
$ oc adm policy add-cluster-role-to-user pipelinerole1 user1
$ oc adm policy add-role-to-user pipelinerole2 user1
~~~
However, even after assigning these rolebindings/clusterrolebindings to the users, the users are not able to see the Pipelines section.

There is a capacity limit on egressIPs for each cloud provider; for example, on GCP the limit is 10.

If the number of egressIPs added to a hostsubnet exceeds the capacity limit, it is expected that a message is emitted to the event log, which can be seen through "oc get event".

 

On GCP with the SDN plugin, egressCIDRs were configured on one worker node and 12 netnamespaces were configured, each with 1 egressIP, so the total number of egressIPs for the hostsubnet exceeded its capacity limit of 10. No event was seen to indicate that the number of egressIPs for the hostsubnet had exceeded the limit.
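For completeness, the number of egressIPs that actually landed on the node's hostsubnet can be checked directly (the node name is a placeholder):

$ oc get hostsubnet <worker-node-name> -o jsonpath='{.egressIPs}'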

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-08-02-014045   True        False         160m    Cluster version is 4.11.0-0.nightly-2022-08-02-014045

 

See attachment for more details.

 

This is a clone of issue OCPBUGS-11054. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11038. The following is the description of the original issue:

Description of problem:

Backport support starting in 4.12.z to a new GCP region europe-west12

Version-Release number of selected component (if applicable):

4.12.z and 4.13.z

How reproducible:

Always

Steps to Reproduce:

1. Use openshift-install to deploy OCP in europe-west12

Actual results:

europe-west12 is not available as a supported region in the user survey

Expected results:

europe-west12 to be available as a supported region in the user survey

Additional info:

 

This is a clone of issue OCPBUGS-1327. The following is the description of the original issue:

See this comment for some updated information

Description of problem:
During IPI installation on IBM Cloud (x86_64), some of the worker machines have been seen to have no network connectivity during their initial bootup. Investigations were performed with IBM Cloud VPC to attempt to identify the issue, but by all appearances the virtualization layer is working.

Unfortunately due to this issue, no network traffic, no access to these worker machines is available to help identify the issue (Ignition is stuck without network traffic), so no SSH or console login is available to collect logs, or perform any testing on these machines.

The only content available is the console output, showing ignition is stuck due to the network issue.

Version-Release number of selected component (if applicable):
4.12.0

How reproducible:
About 60%

Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Wait for the worker machines to be provisioned, causing IPI to fail waiting on machine-api operator
3. Check console of worker machines failing to report in to cluster (in this case 2 of 3 failed)

Actual results:
IPI creation failed waiting on machine-api operator to complete all worker node deployment

Expected results:
Successful IPI creation on IBM Cloud

Additional info:
As stated, investigation was performed by IBM Cloud VPC, but no further investigation could be performed since no access to these worker machines is available. Any further details that could be provided to help identify the issue would be helpful.

This appears to have become more prominent recently as well, causing concern for IBM Cloud's IPI GA support on the 4.12 release.

The only solution to restore network connectivity is rebooting the machine, which loses ignition bring up (I assume it must be triggered manually now), and in the case of IPI, isn't a great mitigation.

Description of problem:

The customer is no longer able to provision new baremetal nodes in 4.10.35 using the same rootDeviceHints that worked in 4.10.10.
The customer uses HP DL360 Gen10 servers with external SAN storage that is seen by the system as a multipath device. The latest IPA versions implement changes to avoid wiping shared disks, and this seems to affect what we should provide as rootDeviceHints.
They used to put /dev/sda as rootDeviceHints. In 4.10.35 this no longer makes IPA write the image to the disk, because it sees the disk as part of a multipath device. We tried using the multipath device on top, /dev/dm-0: the system is then able to write the image to the disk, but then it gets stuck when it tries to issue a partprobe command. Rebooting the system to boot from the disk does not seem to help complete the provisioning. No workaround so far.
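For reference, the relevant part of a BareMetalHost using such a hint looks roughly like this (a sketch; the host name is illustrative and unrelated spec fields are omitted):

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: openshift-machine-api
spec:
  online: true
  rootDeviceHints:
    deviceName: /dev/sda   # the hint that worked in 4.10.10; /dev/dm-0 was also tried for the multipath device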

 

Version-Release number of selected component (if applicable):

 

How reproducible:

by trying to provisioning a baremetal node with a multipath device.

Steps to Reproduce:

1. Create a new BMH using a multipath device as rootDeviceHints
2.
3.

Actual results:

The node does not get provisioned

Expected results:

the node gets provisioned correctly

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2115265](https://bugzilla.redhat.com/show_bug.cgi?id=2115265). The following is the description of the original bug:

Description of problem:
Starting with https://github.com/openshift/console/pull/11866 the action (kebab icon) menu button on the right side in the search is changed from a `ResourceKebab` to a `LazyActionMenu` for `HelmChartRepositories`.

We use this implementation in other places as well, and maybe also in other table rows?

When the search shows multiple tables and the user opens the menu at the end of one table, the dropdown options is shown below the "Add from navigation" or "Remove from navigation" button of the next table.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Navigate to the search
2. Open the resource selector and search for HelmChartRepo, select HelmChartRepositories and ProjectHelmChartRepositories
3. The HelmChartRepositories (HCR) list should have at least one repo. Click on the action menu of the first table.

Actual results:
The new menu is shown partly behind the button "Add from navigation" or "Remove from navigation"

Some buttons are not clickable.

Expected results:
The menu should be shown above the "Add from navigation" or "Remove from navigation" button

All buttons should be clickable.

Additional info:

Description of problem:

A freshly installed 4.12 cluster should have the stable-4.12 channel by default.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-02-154321

How reproducible:

100%

Steps to Reproduce:

install 4.12 cluster

Actual results:

oc get clusterversion/version -ojson | jq .spec.channel
"stable-4.11"

Expected results:

oc get clusterversion/version -ojson | jq .spec.channel
"stable-4.12"

Additional info:
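As a workaround sketch (separate from fixing the default), the channel can be set explicitly after installation:

oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.12"}}'
oc get clusterversion/version -ojson | jq .spec.channel
"stable-4.12"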

 

Description of problem:

The pod count for a maxUnavailable value of 2 or more is displayed in the singular ("pod" instead of "pods")

Version-Release number of selected component (if applicable):

4.12.0-ec.2

How reproducible:

 

Steps to Reproduce:

1. Create a Deployment
2. Add a PDB to the Deployment and set the maxUnavailable to 2
3. Go to the Deployment details page

Actual results:

The UI shows "Max unavailable 6 of 3 pod"

Expected results:

It should show "Max unavailable 6 of 3 pods"

Additional info:
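A minimal sketch of the PodDisruptionBudget used in step 2; the names and selector label are illustrative and the selector must match the Deployment's pod labels:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # illustrative
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app: example           # illustrative; must match the Deployment's pod template labels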

 

Description of problem:

$ oc adm must-gather -- gather_ingress_node_firewall
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dec5a08681e11eedcd31f075941b74f777b9187f0e711a498a212f9d96adb2f
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 0ef60b50-4378-431d-8ca2-faa5af098274
ClusterVersion: Stable at "4.12.0-0.nightly-2022-09-26-111919"
ClusterOperators:
    clusteroperator/insights is not available (Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
) because Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed[must-gather      ] OUT namespace/openshift-must-gather-fr7kc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xx2fh created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dec5a08681e11eedcd31f075941b74f777b9187f0e711a498a212f9d96adb2f created
[must-gather-xvfj4] POD 2022-09-28T16:57:00.887445531Z /bin/bash: /usr/bin/gather_ingress_node_firewall: Permission denied
[must-gather-xvfj4] OUT waiting for gather to complete
[must-gather-xvfj4] OUT downloading gather output
[must-gather-xvfj4] OUT receiving incremental file list
[must-gather-xvfj4] OUT ./
[must-gather-xvfj4] OUT 
[must-gather-xvfj4] OUT sent 27 bytes  received 40 bytes  26.80 bytes/sec
[must-gather-xvfj4] OUT total size is 0  speedup is 0.00
[must-gather      ] OUT namespace/openshift-must-gather-fr7kc deleted
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xx2fh deleted
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 0ef60b50-4378-431d-8ca2-faa5af098274
ClusterVersion: Stable at "4.12.0-0.nightly-2022-09-26-111919"
ClusterOperators:
    clusteroperator/insights is not available (Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
) because Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Run oc adm must-gather -- gather_ingress_node_firewall (as shown in the output above)
2.
3.

Actual results:

The gather pod fails with "/bin/bash: /usr/bin/gather_ingress_node_firewall: Permission denied" and no data is collected (total size is 0).

Expected results:

The ingress node firewall data is gathered without a permission error.

Additional info:

 

I saw the following while trying to debug an "unexpectedly found multiple equivalent ACLs" error.

Add a generic networkpolicy:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-same-namespace
  namespace: nbc9-demo-project
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  policyTypes:
  - Ingress

$ kubectl get pod ovnkube-master-pk89w -o jsonpath='{range .spec.containers[]} {@.image}'
quay.io/openshift/okd-content@sha256:79ee71e045a7b224a132f6c75b4220ec35b9a06049061a6bd9ca9fc976c412e5

[root@dev-nkjpp-master-2 ~]# ovnkube -v
I0609 17:33:34.930787 58 ovs.go:93] Maximum command line arguments set to: 191102
Version: 0.3.0
Git commit: 7bf36eea28fe66365d0dfdf8c39e3311ea14d19b
Git branch: release-4.10
Go version: go1.16.6
Build date: 2022-05-27
OS/Arch: linux amd64

The policy then fails to apply and is retried; when the networkpolicy is deleted, the ovnkube-master pod segfaults:

I0609 17:00:26.653710 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:00:26.656858 1 ovn.go:753] Failed to create network policy nbc9-demo-project/allow-same-namespace, error: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [

{UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df310 Name:0xc0010df320 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df330}

{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df390 Name:0xc0010df3d0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df3e0}

]
I0609 17:00:51.437895 1 policy_retry.go:46] Network Policy Retry: nbc9-demo-project/allow-same-namespace retry network policy setup
I0609 17:00:51.437935 1 policy_retry.go:63] Network Policy Retry: Creating new policy for nbc9-demo-project/allow-same-namespace
I0609 17:00:51.437941 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
I0609 17:00:51.438174 1 policy_retry.go:65] Network Policy Retry create failed for nbc9-demo-project/allow-same-namespace, will try again later: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [

{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc002215e00 Name:0xc002215e70 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc002215e80}

{UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}

]
I0609 17:01:02.679219 1 policy.go:1174] Deleting network policy allow-same-namespace in namespace nbc9-demo-project

E0609 17:01:02.679407 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c19c80, 0x2e9a810)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1c19c80, 0x2e9a810)
/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a021d5]

goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1c19c80, 0x2e9a810)
/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65

Please let me know if any further information is required. I have a must-gather for this cluster but the file attachment tool in bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB)

Description of problem:

In the Konnectivity SOCKS proxy, the current default is to proxy cloud endpoint traffic: https://github.com/openshift/hypershift/blob/main/konnectivity-socks5-proxy/main.go#L61

Because of this, after this change: https://github.com/openshift/hypershift/commit/0c52476957f5658cfd156656938ae1d08784b202

the oauth server's behavior changed and it began to proxy IAM traffic instead of sending it directly. This causes a regression in Satellite environments running with an HTTP_PROXY server. The original network traffic path needs to be restored.

Version-Release number of selected component (if applicable):

4.13 4.12

How reproducible:

100%

Steps to Reproduce:

1. Setup HTTP_PROXY IBM Cloud Satellite environment
2. In the oauth-server pod run a curl against iam (curl -v https://iam.cloud.ibm.com)
3. It will log it is using proxy

Actual results:

It is using the proxy.

Expected results:

It should send traffic directly (as it does in 4.11 and 4.10)

Additional info:

 

Description of problem:

The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected. 

Version-Release number of selected component (if applicable):

4.10.z, may exist on other OCP versions as well. 

How reproducible:

always

Steps to Reproduce:

1. Create a KubeletConfig on the node:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi


2. Check node allocatable storage with command: oc describe node |grep -C 5 ephemeral-storage

Actual results:

The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)  

Expected results:

The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default) 
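For concreteness, a worked example with an illustrative capacity (not taken from this cluster): with capacity.ephemeral-storage = 100Gi and the KubeletConfig above,

Allocatable ephemeral-storage = 100Gi (capacity) - 10Gi (systemReserved) - 10Gi (default 10% eviction threshold) = 80Gi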

Additional info:

The root cause might be that the process argument '--system-reserved=cpu=500m,memory=500Mi' overrides the setting in /etc/kubernetes/kubelet.conf; one example:

root        6824       1 27 Sep30 ?        1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s


 

Description of problem:

opm serve fails with message:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

(The easiest reproducer involves serving an empty catalog)

1. mkdir /tmp/catalog

2. Use a Dockerfile /tmp/catalog.Dockerfile based on the 4.12 docs (https://access.redhat.com/documentation/en-us/openshift_container_platform/4.12/html-single/operators/index#olm-creating-fb-catalog-image_olm-managing-custom-catalogs):
# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM registry.redhat.io/openshift4/ose-operator-registry:v4.12

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD catalog /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

3. Build the image: `cd /tmp/ && docker build -f catalog.Dockerfile .`

4. Execute an instance of the container in docker/podman: `docker run --name cat-run [image-file]`

5. Observe the error

Using a dockerfile generated from opm (`opm generate dockerfile [dir]`) works, but includes precache and cachedir options to opm.
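For comparison, a sketch of the cache-related lines that an opm-generated Dockerfile adds; the exact flags are from recent opm releases and should be verified against the real `opm generate dockerfile` output rather than taken as authoritative:

# pre-populate the serve cache at build time, then point serve at it
RUN ["/bin/opm", "serve", "/configs", "--cache-dir=/tmp/cache", "--cache-only"]
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs", "--cache-dir=/tmp/cache"]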

 

Actual results:

Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root

Expected results:

opm generates the cache in the default /tmp/cache location and serves without error

Additional info:

 

 

This is a clone of issue OCPBUGS-7837. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

There is a bug where creating OLM subscription manifests early in the installation process results in those OLM operators not being installed.

This is because the OLM installation Jobs fail when they are tried early in the installation process, and OLM does not retry those jobs sufficiently and eventually gives up on them.

This should be solved starting OCP 4.12, but until then, we should solve this using Assisted.

A way to solve this is to delay the installation of OLM operators to only occur after the cluster is up and healthy. 

This can be done by creating the subscriptions with "installPlanApproval" set to "Manual" instead of "Automatic". Then, once the cluster is up and healthy, the assisted-controller should approve the InstallPlans that OLM creates for the operators. This triggers the installation, which is more likely to succeed since the cluster is up and healthy at that point (see the sketch below).
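A minimal sketch of what such a Subscription would look like; the operator name, channel, and namespaces are illustrative, not taken from a specific assisted-installer manifest:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator              # illustrative
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual         # OLM creates the InstallPlan but does not run it automatically

Once the cluster is healthy, approving the generated InstallPlan starts the installation, for example:

oc -n openshift-operators patch installplan <install-plan-name> --type merge -p '{"spec":{"approved":true}}'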

Description of problem:

Installer fails due to a Neutron policy error when creating OpenStack servers for OCP worker nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"

Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-10-14-023020

How reproducible:

Always

Steps to Reproduce:

1. Install 4.10 with provider networks (on the primary or secondary interface)

Actual results:

Installation failure:
4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out

Expected results:

Successful installation

Additional info:

Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.

 

This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:

Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted.

We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were impacted by a similar problem; in this case it was because of a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES.

This caused existing worker nodes to go into a NotReady state after the upgrade, and additionally new nodes did not join the cluster as their kubelet would become impacted.

For clusterB, the conditions under which the value ends up empty are better understood.

However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if no value exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or kubelet should be able to continue running.

Additional info:
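A minimal sketch of the kind of safe-default guard the first ask describes, assuming the sizing values are sourced from /etc/node-sizing.env before kubelet starts; the guard and the 1Gi default are hypothetical, not existing MCO behavior:

# hypothetical guard: fall back to a default when SYSTEM_RESERVED_ES is empty
source /etc/node-sizing.env
if [ -z "${SYSTEM_RESERVED_ES}" ]; then
  SYSTEM_RESERVED_ES="1Gi"   # same value used in the manual recovery described above
fi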

 

This is a clone of issue OCPBUGS-17365. The following is the description of the original issue:

When we update a Secret referenced in the BareMetalHost, an immediate reconcile of the corresponding BMH is not triggered. In most states we requeue each CR after a timeout, so we should eventually see the changes.

In the case of BMC Secrets, this has been broken since the fix for OCPBUGS-1080 in 4.12.

This is a clone of issue OCPBUGS-4758. The following is the description of the original issue:

Description of problem:

See: https://issues.redhat.com/browse/CPSYN-143

tldr:  Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible w/ running restricted.

1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did)
2) we updated the opm binary so catalog images using a newer opm binary don't have to run privileged
3) we added a field to catalogsource that allows you to choose whether to run the pod privileged(legacy mode) or restricted.  The default is restricted.  We made that the default so that users running their own catalogs in their own NSes (which would be default PSA enforcing) would be able to be successful w/o needing their NS upgraded to privileged.

Unfortunately this means:
1) legacy catalog images (i.e. those using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) legacy catalog images cannot be run in the openshift-marketplace NS since that NS does not allow privileged pods.  This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that NS to be in the global catalog).

Before 4.12 ships we need to:
1) remove the PSA restricted label on the openshift-marketplace NS
2) change the catalogsource securityContextConfig default to "legacy", not "restricted".

This gives catalog authors another release to update to using a newer opm binary that can run restricted, or get their NSes explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work)

In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most NSes.
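For reference, a minimal sketch of how a CatalogSource opts into legacy mode explicitly via the field described above; the catalog name and image are illustrative:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-catalog                           # illustrative
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/my-catalog:latest   # illustrative image
  grpcPodConfig:
    securityContextConfig: legacy            # or "restricted" for catalogs built with a newer opm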


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

The name of the "Role" column on the Compute -> Nodes page should be updated to "Roles" to match the name in the CLI.

Compared with other resources, the column title should match the name used in the CLI.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-15-150248

How reproducible:

Always

Steps to Reproduce:
1. Log in to OCP with the CLI and use the command below to get node information

     $ oc get nodes
2. Go to the Compute -> Nodes page and check the column name "Role"
3.

Actual results:

The CLI returns the information shown below, where the title of the column is "ROLES"

NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-145-18.us-east-2.compute.internal    Ready    worker   9h    v1.24.0+4f0dd4d
ip-10-0-145-203.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-163-205.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-169-118.us-east-2.compute.internal   Ready    worker   9h    v1.24.0+4f0dd4d
ip-10-0-198-234.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-212-34.us-east-2.compute.internal    Ready    worker   9h    v1.24.0+4f0dd4d

But in the UI, the column is named "Role", which is incorrect. (Attached)

Expected results:

The column title "Role" should be updated to "Roles"

Additional info:

This is a clone of issue OCPBUGS-14336. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1829. The following is the description of the original issue:

Description of problem:

The link from a Service to its OpenShift Route breaks because of the hardcoded targetPort value. If the targetPort is changed, the Route still points to the old port value since it is hardcoded.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install the latest available version of Openshift Pipelines
2. Create the pipeline and triggerbinding using the attached files
3. Add trigger to the created pipeline from devconsole UI, select the above created triggerbinding while adding trigger
4. Trigger an event using the curl command curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' <route> and make sure that the pipelinerun gets started
5. Update the targetPort in the svc from 8080 to 8000
6. Again use the above curl command to trigger one more event

Actual results:

The curl command returns an error

Expected results:

The curl command should be successful and the pipelinerun should get started successfully

Additional info:

Error:
curl -X POST -d '{ "url": "https://www.github.com/VeereshAradhya/cli" }' -H 'Content-Type: application/json' http://el-event-listener-3o9zcv-test-devconsole.apps.ve412psi.psi.ospqa.com
(HTML response from the OpenShift router's default error page; inline CSS and markup omitted)

Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

Possible reasons you are seeing this page:
- The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
- The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
- Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.

Note:

The above scenario works fine if we create the triggers using YAML files instead of the devconsole UI.
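For illustration, a sketch of a Route whose port refers to the Service port by name rather than a hardcoded number, which is one way to keep the Route valid when the Service's targetPort changes; the port name http-listener is illustrative:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: el-event-listener
spec:
  to:
    kind: Service
    name: el-event-listener
  port:
    targetPort: http-listener   # named service port instead of a hardcoded 8080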

This is a clone of issue OCPBUGS-19894. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17391. The following is the description of the original issue:

The pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job started failing recently because the ovnkube-master daemonset would not finish rolling out after 360s.

Looking at the must-gather, which is taken a few minutes after the test failure, you can see that the daemonset is still not ready, so I believe that increasing the timeout is not the answer.

some debug info:

 

static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A 
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
openshift-cluster-csi-drivers aws-ebs-csi-driver-node 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-cluster-node-tuning-operator tuned 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-dns dns-default 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-dns node-resolver 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-image-registry node-ca 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-ingress-canary ingress-canary 3 3 3 3 3 kubernetes.io/os=linux 8h
openshift-machine-api machine-api-termination-handler 0 0 0 0 0 kubernetes.io/os=linux,machine.openshift.io/interruptible-instance= 8h
openshift-machine-config-operator machine-config-daemon 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-machine-config-operator machine-config-server 3 3 3 3 3 node-role.kubernetes.io/master= 8h
openshift-monitoring node-exporter 6 6 6 6 6 kubernetes.io/os=linux 8h
openshift-multus multus 6 6 6 6 6 kubernetes.io/os=linux 9h
openshift-multus multus-additional-cni-plugins 6 6 6 6 6 kubernetes.io/os=linux 9h
openshift-multus network-metrics-daemon 6 6 6 6 6 kubernetes.io/os=linux 9h
openshift-network-diagnostics network-check-target 6 6 6 6 6 beta.kubernetes.io/os=linux 9h
openshift-ovn-kubernetes ovnkube-master 3 3 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= 9h
openshift-ovn-kubernetes ovnkube-node 6 6 6 6 6 beta.kubernetes.io/os=linux 9h
Name: ovnkube-master
Selector: app=ovnkube-master
Node-Selector: beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=
Labels: networkoperator.openshift.io/generates-operator-status=stand-alone
Annotations: deprecated.daemonset.template.generation: 3
kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components.
networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
networkoperator.openshift.io/hybrid-overlay-status: disabled
networkoperator.openshift.io/ip-family-mode: single-stack
release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=ovnkube-master
component=network
kubernetes.io/os=linux
openshift.io/component=network
ovn-db-pod=true
type=infra
Annotations: networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
networkoperator.openshift.io/hybrid-overlay-status: disabled
networkoperator.openshift.io/ip-family-mode: single-stack
target.workload.openshift.io/management:
{"effect": "PreferredDuringScheduling"}
Service Account: ovn-kubernetes-controller

 

It seems there is one pod that is not coming up all the way, and that pod has two containers not ready (sbdb and nbdb). Logs from those containers are below:

 

static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready'
northd:
Ready: True
nbdb:
Ready: False
kube-rbac-proxy:
Ready: True
sbdb:
Ready: False
ovnkube-master:
Ready: True
ovn-dbchecker:
Ready: True
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb
2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]]
2023-08-04T13:08:49.127562165Z + trap quit TERM INT
2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.127637926Z + transport=ssl
2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:49.127690638Z + db=sb
2023-08-04T13:08:49.127690638Z + db_port=9642
2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db
2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108
2023-08-04T13:08:49.128237768Z ++ case "$1" in
2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108
2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.128819438Z ++ date -Iseconds
2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2'
2023-08-04T13:08:49.130170893Z + initialize=false
2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]]
2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:49.130406657Z + wait 9
2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl -db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_sb_ovsdb
2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92
2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting...
2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting...
2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected
2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected
2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5
2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log)
2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log)
2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up.
2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff
2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644
2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501
2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644
2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb 
2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]]
2023-08-04T13:08:48.779743434Z + trap quit TERM INT
2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.779825516Z + transport=ssl
2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:48.779825516Z + db=nb
2023-08-04T13:08:48.779825516Z + db_port=9641
2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db
2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108
2023-08-04T13:08:48.780167142Z ++ case "$1" in
2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108
2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:48.780518588Z ++ date -Iseconds
2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108
2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108'
2023-08-04T13:08:48.781753021Z + initialize=false
2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]]
2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:48.781936684Z + wait 9
2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl -db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' run_nb_ovsdb
2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503
2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting...
2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting...
2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected
2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected
2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6
2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log)
2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up.
2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382
2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643
2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb
2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643
2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165

This seems to happen very frequently now, but was not happening before around July 21st.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration?buildId=1684628739427667968

 

Description of problem:

IPI installation failed with master nodes being NotReady and CCM error "alicloud: unable to split instanceid and region from providerID".

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-05-053337

How reproducible:

Always

Steps to Reproduce:

1. try IPI installation on alibabacloud, with credentialsMode being "Manual"
2.
3.

Actual results:

Installation failed.

Expected results:

Installation should succeed.

Additional info:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          34m     Unable to apply 4.12.0-0.nightly-2022-10-05-053337: an unknown error has occurred: MultipleErrors
$ 
$ oc get nodes
NAME                           STATUS     ROLES                  AGE   VERSION
jiwei-1012-02-9jkj4-master-0   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-1   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-2   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
$ 

CCM logs:
E1012 03:46:45.223137       1 node_controller.go:147] node-controller "msg"="fail to find ecs" "error"="cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID="  "providerId"="alicloud://"
E1012 03:46:45.223174       1 controller.go:317] controller/node-controller "msg"="Reconciler error" "error"="find ecs: cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID=" "name"="jiwei-1012-02-9jkj4-master-0" "namespace"="" 

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145768/ (Finished: FAILURE)
10-12 10:55:15.987  ./openshift-install 4.12.0-0.nightly-2022-10-05-053337
10-12 10:55:15.987  built from commit 84aa8222b622dee71185a45f1e0ba038232b114a
10-12 10:55:15.987  release image registry.ci.openshift.org/ocp/release@sha256:41fe173061b00caebb16e2fd11bac19980d569cd933fdb4fab8351cdda14d58e
10-12 10:55:15.987  release architecture amd64

FYI the installation could succeed with 4.12.0-0.nightly-2022-09-28-204419:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145756/ (Finished: SUCCESS)
10-12 09:59:19.914  ./openshift-install 4.12.0-0.nightly-2022-09-28-204419
10-12 09:59:19.914  built from commit 9eb0224926982cdd6cae53b872326292133e532d
10-12 09:59:19.914  release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
10-12 09:59:19.914  release architecture amd64

 

 

This is a clone of issue OCPBUGS-5346. The following is the description of the original issue:

Description of problem:

The vSphere status health item is misleading.

More info: https://coreos.slack.com/archives/CUPJTHQ5P/p1672829660214369

 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Have OCP 4.12 on vSphere
2. On the Cluster Dashboard (landing page), check the vSphere Status Health (static plugin)
3.

Actual results:

The icon shows progress, but nothing is progressing when the modal dialog is open.

Expected results:

No misleading message or icon is rendered.

Additional info:

Since the problem detector is not a reliable source and modifying the HealthItem in the OCP Console is too complex a task at this point in the release, non-misleading text is good enough.

Because the agent ISO is ephemeral, it is probably safe to allow a user to log in to it with a password. If the network configuration is broken, a user may have no other way to debug it other than to log in through the console, which is currently not possible.

The best password to set would be the kubeadmin password used for the OpenShift GUI, since we'll have generated that already.

We must take care to test that this does not result in the installed nodes on disk allowing login with a password.

This is a clone of issue OCPBUGS-7419. The following is the description of the original issue:

Description of problem:

When installing SNO, it takes CVO about 2 minutes from the time all cluster operators are available (progressing=false and degraded=false) to set the clusterversion status to Available=true.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Install SNO with bootstrap-in-place (https://github.com/eranco74/bootstrap-in-place-poc)
2. Monitor the cluster operators status and the clusterversion
3.

Actual results:

All cluster operators are available about 2 minutes before CVO sets the cluster version to Available=true.

Expected results:

expected it to sync faster

Additional info:

Attached: must-gather logs and the full audit log.

This is a clone of issue OCPBUGS-2891. The following is the description of the original issue:

Deprovisioning can fail with the error:

level=warning msg=unrecognized elastic load balancing resource type listener arn=arn:aws:elasticloadbalancing:us-west-2:460538899914:listener/net/a9ac9f1b3019c4d1299e7ededc92b42b/a6f0655da877ddd4/45e05ee69d99bab0

 

Further background is available in this write up:

https://docs.google.com/document/d/1TsTqIVwHDmjuDjG7v06w_5AAbXSisaDX-UfUI9-GVJo/edit#

 

Incident channel:

incident-aws-leaking-tags-for-deleted-resources

 

This is a clone of issue OCPBUGS-5165. The following is the description of the original issue:

Currently, Dev Sandbox clusters send the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.

To verify: open Dev Sandbox, open the browser console, and inspect window.SERVER_FLAGS.telemetry.

Description of problem:

"opm alpha render-veneer semver" raise error when no "Candidate" in config yaml

Version-Release number of selected component (if applicable):

zhaoxia@xzha-mac semver % opm version
Version: version.Version{OpmVersion:"11644a543", GitCommit:"11644a5433442c33698d2eee8d3f865b0d9386c0", BuildDate:"2022-08-29T08:16:54Z", GoOs:"darwin", GoArch:"amd64"}

How reproducible:

always

Steps to Reproduce:

1. prepare catalog-semver-veneer-wrong.yaml 
zhaoxia@xzha-mac semver % cat catalog-semver-veneer-wrong.yaml 
Schema: olm.semver
GenerateMajorChannels: false
GenerateMinorChannels: true
Stable:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.2
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0
Fast:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0 

2. run "opm alpha render-veneer semver"
zhaoxia@xzha-mac semver % opm alpha render-veneer semver catalog-semver-veneer-wrong.yaml
2022/08/29 16:48:56 semver "catalog-semver-veneer-wrong.yaml": semver-render: no bundles specified or no bundles could be rendered

3.

Actual results:

error "no bundles specified or no bundles could be rendered" is raised.

Expected results:

no error

Additional info:
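As a workaround sketch, the same veneer with an explicit Candidate channel added (the error above is only hit when Candidate is absent); the bundle images are reused from the reproducer and the Candidate membership is illustrative:

Schema: olm.semver
GenerateMajorChannels: false
GenerateMinorChannels: true
Candidate:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
Stable:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.2
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0
Fast:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0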

 

Description of problem:

Not all rules are removed from iptables after disabling MultiNetworkPolicy.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Configure sriov (nodepolicy + sriovnetwork)
2. Configure 2 pods
3. Enable MultiNetworkPolicy
4. Apply ~20 rules for pod1 with a spec like the following (a complete example manifest is sketched after this list):
 spec:
  podSelector:
    matchLabels:
      pod: pod1
  policyTypes:
  - Ingress
  ingress: []
5. Disable MultiNetworkPolicy
6. Send a ping from pod2 to pod1
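A sketch of one complete MultiNetworkPolicy manifest matching the spec in step 4 and the net-attach-def seen in the iptables comments below; the policy name is illustrative (~20 similar policies were applied):

apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: deny-by-default          # illustrative
  namespace: ns1
  annotations:
    k8s.v1.cni.cncf.io/policy-for: ns1/sriovnetwork2
spec:
  podSelector:
    matchLabels:
      pod: pod1
  policyTypes:
  - Ingress
  ingress: []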

Actual results:

Traffic is still blocked

Expected results:

Traffic should be passed

Additional info:

Before disabling multiNetworkPolicy:
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default net-attach-def:ns1/sriovnetwork2" -j MULTI-0-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default24 net-attach-def:ns1/sriovnetwork2" -j MULTI-1-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default17 net-attach-def:ns1/sriovnetwork2" -j MULTI-2-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default15 net-attach-def:ns1/sriovnetwork2" -j MULTI-3-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default14 net-attach-def:ns1/sriovnetwork2" -j MULTI-4-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default7 net-attach-def:ns1/sriovnetwork2" -j MULTI-5-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default5 net-attach-def:ns1/sriovnetwork2" -j MULTI-6-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default20 net-attach-def:ns1/sriovnetwork2" -j MULTI-7-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default19 net-attach-def:ns1/sriovnetwork2" -j MULTI-8-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default11 net-attach-def:ns1/sriovnetwork2" -j MULTI-9-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default10 net-attach-def:ns1/sriovnetwork2" -j MULTI-10-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default9 net-attach-def:ns1/sriovnetwork2" -j MULTI-11-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default6 net-attach-def:ns1/sriovnetwork2" -j MULTI-12-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default3 net-attach-def:ns1/sriovnetwork2" -j MULTI-13-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default16 net-attach-def:ns1/sriovnetwork2" -j MULTI-14-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default13 net-attach-def:ns1/sriovnetwork2" -j MULTI-15-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default2 net-attach-def:ns1/sriovnetwork2" -j MULTI-16-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default22 net-attach-def:ns1/sriovnetwork2" -j MULTI-17-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default21 net-attach-def:ns1/sriovnetwork2" -j MULTI-18-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default18 net-attach-def:ns1/sriovnetwork2" -j MULTI-19-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default12 net-attach-def:ns1/sriovnetwork2" -j MULTI-20-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default8 net-attach-def:ns1/sriovnetwork2" -j MULTI-21-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default4 net-attach-def:ns1/sriovnetwork2" -j MULTI-22-INGRESS
-A MULTI-0-INGRESS -j DROP
-A MULTI-1-INGRESS -j DROP
-A MULTI-2-INGRESS -j DROP
-A MULTI-3-INGRESS -j DROP
-A MULTI-4-INGRESS -j DROP
-A MULTI-5-INGRESS -j DROP
-A MULTI-6-INGRESS -j DROP
-A MULTI-7-INGRESS -j DROP
-A MULTI-8-INGRESS -j DROP
-A MULTI-9-INGRESS -j DROP
-A MULTI-10-INGRESS -j DROP
-A MULTI-11-INGRESS -j DROP
-A MULTI-12-INGRESS -j DROP
-A MULTI-13-INGRESS -j DROP
-A MULTI-14-INGRESS -j DROP
-A MULTI-15-INGRESS -j DROP
-A MULTI-16-INGRESS -j DROP
-A MULTI-17-INGRESS -j DROP
-A MULTI-18-INGRESS -j DROP
-A MULTI-19-INGRESS -j DROP
-A MULTI-20-INGRESS -j DROP
-A MULTI-21-INGRESS -j DROP
-A MULTI-22-INGRESS -j DROP
=============================================================
After disabling multiNetworkPolicy:
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default5 net-attach-def:ns1/sriovnetwork2" -j MULTI-0-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default24 net-attach-def:ns1/sriovnetwork2" -j MULTI-1-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default17 net-attach-def:ns1/sriovnetwork2" -j MULTI-2-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default15 net-attach-def:ns1/sriovnetwork2" -j MULTI-3-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default7 net-attach-def:ns1/sriovnetwork2" -j MULTI-4-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default3 net-attach-def:ns1/sriovnetwork2" -j MULTI-5-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default20 net-attach-def:ns1/sriovnetwork2" -j MULTI-6-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default19 net-attach-def:ns1/sriovnetwork2" -j MULTI-7-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default9 net-attach-def:ns1/sriovnetwork2" -j MULTI-8-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default6 net-attach-def:ns1/sriovnetwork2" -j MULTI-9-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default16 net-attach-def:ns1/sriovnetwork2" -j MULTI-10-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default2 net-attach-def:ns1/sriovnetwork2" -j MULTI-11-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default22 net-attach-def:ns1/sriovnetwork2" -j MULTI-12-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default21 net-attach-def:ns1/sriovnetwork2" -j MULTI-13-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default18 net-attach-def:ns1/sriovnetwork2" -j MULTI-14-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default8 net-attach-def:ns1/sriovnetwork2" -j MULTI-15-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default4 net-attach-def:ns1/sriovnetwork2" -j MULTI-16-INGRESS
-A MULTI-0-INGRESS -j DROP
-A MULTI-1-INGRESS -j DROP
-A MULTI-2-INGRESS -j DROP
-A MULTI-3-INGRESS -j DROP
-A MULTI-4-INGRESS -j DROP
-A MULTI-5-INGRESS -j DROP
-A MULTI-6-INGRESS -j DROP
-A MULTI-7-INGRESS -j DROP
-A MULTI-8-INGRESS -j DROP
-A MULTI-9-INGRESS -j DROP
-A MULTI-10-INGRESS -j DROP
-A MULTI-11-INGRESS -j DROP
-A MULTI-12-INGRESS -j DROP
-A MULTI-13-INGRESS -j DROP
-A MULTI-14-INGRESS -j DROP
-A MULTI-15-INGRESS -j DROP
-A MULTI-16-INGRESS -j DROP

 

This is a clone of issue OCPBUGS-15476. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15282. The following is the description of the original issue:

Description of problem:

When upgrading a 4.11.43 cluster to 4.12.21, the Cluster Version Operator is stuck waiting for the Network Operator to update:

$ omc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.43   True        True          14m     Working towards 4.12.21: 672 of 831 done (80% complete), waiting on network

CVO pod log states:

2023-06-16T12:07:22.596127142Z I0616 12:07:22.596023       1 metrics.go:490] ClusterOperator network is not setting the 'operator' version

Indeed the NO version is empty:

$ omc get co network -o json|jq '.status.versions'
null

However, its status is Available, not Progressing, and not Degraded:

$ omc get co network
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
network             True        False         False      19m
   
Network operator pod log states:

2023-06-16T12:08:56.542287546Z I0616 12:08:56.542271       1 connectivity_check_controller.go:138] ConnectivityCheckController is waiting for transition to desired version (4.12.21) to be completed.
2023-06-16T12:04:40.584407589Z I0616 12:04:40.584349       1 ovn_kubernetes.go:1437] OVN-Kubernetes master and node already at release version 4.12.21; no changes required

The Network Operator pod, however, has the version correctly:
$ omc get pods -n openshift-network-operator -o jsonpath='{.items[].spec.containers[0].env[?(@.name=="RELEASE_VERSION")]}'|jq
{
  "name": "RELEASE_VERSION",
  "value": "4.12.21"
}

Restarts of the related pods had no effect.  I have trace logs of the Network Operator available.  It looked like it might be related to https://github.com/openshift/cluster-network-operator/pull/1818 but that looks to be code introduced in 4.14.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

I have not reproduced.

Steps to Reproduce:

1.  Cluster version began at stable 4.10.56
2.  Upgraded to 4.11.43 successfully
3.  Upgraded to 4.12.21 and is stuck. 

Actual results:

CVO stuck waiting on the Network Operator to complete; the Network Operator never reports its operator version.

Expected results:

NO to update its version so the CVO can continue.

Additional info:

Bare Metal IPI cluster with OVN Networking.

This is a clone of issue OCPBUGS-17160. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14049. The following is the description of the original issue:

Description of problem:

After all cluster operators have reconciled following the password rotation, we can still see authentication failures in Keystone (see the attached screenshot of the Splunk query).

Version-Release number of selected component (if applicable):

Environment:
- OpenShift 4.12.10 on OpenStack 16
- The cluster is managed via RHACM, but password rotation shall be done via "regular"  OpenShift means.

How reproducible:

Rotated the OpenStack credentials according to the documentation [1]

[1] https://docs.openshift.com/container-platform/4.12/authentication/managing_cloud_provider_credentials/cco-mode-passthrough.html#manually-rotating-cloud-creds_cco-mode-passthrough 

Additional info:

- We can't trace back where these authentication failures come from; they do disappear after a cluster upgrade (when nodes are rebooted and all pods are restarted), which indicates that some component is still using the old credentials (a check sketch follows below).
- The relevant technical integration points _seem_ to be working, though (LBaaS, CSI, Machine API, Swift).
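A minimal check sketch, assuming the default passthrough-mode secret names on OpenStack (root secret kube-system/openstack-credentials and the machine-api copy openshift-machine-api/openstack-cloud-credentials; adjust the names for your cluster). Differing hashes would point at a copy that was never re-synced, while identical hashes suggest a pod that has not re-read the secret:

# Credentials requests that the cloud-credential operator syncs into component secrets
$ oc get credentialsrequests -n openshift-cloud-credential-operator

# Compare the root clouds.yaml with a component copy
$ oc get secret openstack-credentials -n kube-system -o jsonpath='{.data.clouds\.yaml}' | base64 -d | sha256sum
$ oc get secret openstack-cloud-credentials -n openshift-machine-api -o jsonpath='{.data.clouds\.yaml}' | base64 -d | sha256sum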

What is the business impact? Please also provide timeframe information.

- We cannot rely on Splunk monitoring for authentication issues since it is currently constantly showing authentication errors.
- We cannot be entirely sure that everything works as expected since we do not know which component is not using the new credentials.

 

This is a clone of issue OCPBUGS-12267. The following is the description of the original issue:

Description of problem:

When using the k8sResourcePrefix x-descriptor with custom resource kinds, the form-view dropdown currently doesn't accept the initial user selection, requiring the user to make their selection twice. Also, if the configuration panel contains multiple custom resource dropdowns, each previous dropdown selection on the panel is cleared each time the user configures another custom resource dropdown, requiring the user to reconfigure each previous selection.

Here's an example of my configuration:

specDescriptors:
          - displayName: Collection
            path: collection
            x-descriptors:
              - >-
                urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Collection
          - displayName: Endpoints
            path: 'mapping[0].endpoints[0].name'
            x-descriptors:
              - >-
                urn:alm:descriptor:io.kubernetes:abc.zzz.com:v1beta1:Endpoint
          - displayName: Requested Credential Secret
            path: 'mapping[0].endpoints[0].credentialName'
            x-descriptors:
              - 'urn:alm:descriptor:io.kubernetes:Secret'
          - displayName: Namespaces
            path: 'mapping[0].namespace'
            x-descriptors:
              - 'urn:alm:descriptor:io.kubernetes:Namespace'
With this configuration, when a user wants to select a Collection or Endpoint from the form-view dropdown, the user is forced to make their selection twice before the selection is accepted. Also, if the user configures the Collection dropdown and then decides to configure the Endpoint dropdown, once the Endpoint selection is made, the Collection dropdown is cleared.

Version-Release number of selected component (if applicable):

4.8

How reproducible:

Always

Steps to Reproduce:

1. Create a new project: 
  oc new-project descriptor-test
2. Create the resources in this gist: 
  oc create -f https://gist.github.com/TheRealJon/99aa89c4af87c4b68cd92a544cd7c08e/raw/a633ad172ff071232620913d16ebe929430fd77a/reproducer.yaml
3. In the admin console, go to the installed operators page in project 'descriptor-test'
4. Select Mock Operator from the list
5. Select "Create instance" in the Mock Resource provided API card
6. Scroll to the field-1
7. Select 'example-1' from the dropdown

Actual results:

Selection is not retained on the first click.

Expected results:

The selection should be retained on the first click.

Additional info:

In addition to this behavior, if a form has multiple k8sResourcePrefix dropdown fields, they all get cleared when attempting to select an item from one of them.

Description of problem:
When the user selects Serverless as an import strategy and tries to import a Devfile, the import fails because of an invalid Deployment.

I could reproduce this already in 4.11, but it's even more prominent in 4.12, where the console automatically selects the resource type Serverless when the Serverless operator is installed.

Version-Release number of selected component (if applicable):
Works on 4.10
Failed on 4.11 and 4.12 master

How reproducible:
Always

Steps to Reproduce:
1. Install and set up the Serverless operator
2. Switch to the dev perspective, navigate to Add > Import from Git
3. Enter a non-Devfile git URL like https://github.com/jerolimov/nodeinfo
4. On 4.11 select resource type Serverless (on 4.12 this should be selected automatically)
5. Update the git URL to a repo with a Devfile like https://github.com/nodeshift-starters/devfile-sample
6. Press Create

Actual results:
Import fails with error:

Error "Invalid value: "": name part must be non-empty" for field "spec.template.labels".

Expected results:
Devfile should be imported

Additional info:

Description of problem:

Pods on the container network cannot access host-network pods on another node, which caused some operators to become DEGRADED

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-23-204408   False       True          True       63m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jhou.arm.eng.rdu2.redhat.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
baremetal                                  4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
cloud-controller-manager                   4.12.0-0.nightly-2022-10-23-204408   True        False         False      68m     
cloud-credential                           4.12.0-0.nightly-2022-10-23-204408   True        False         False      78m     
cluster-autoscaler                         4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
config-operator                            4.12.0-0.nightly-2022-10-23-204408   True        False         False      63m     
console                                    4.12.0-0.nightly-2022-10-23-204408   False       False         False      30m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.jhou.arm.eng.rdu2.redhat.com): Get "https://console-openshift-console.apps.jhou.arm.eng.rdu2.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
control-plane-machine-set                  4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
dns                                        4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
etcd                                       4.12.0-0.nightly-2022-10-23-204408   False       True          True       13m     EtcdMembersAvailable: 1 of 2 members are available, openshift-qe-048.arm.eng.rdu2.redhat.com is unhealthy
image-registry                             4.12.0-0.nightly-2022-10-23-204408   True        False         False      39m     
ingress                                    4.12.0-0.nightly-2022-10-23-204408   True        False         True       47m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights                                   4.12.0-0.nightly-2022-10-23-204408   True        False         False      56m     
kube-apiserver                             4.12.0-0.nightly-2022-10-23-204408   True        False         False      50m     
kube-controller-manager                    4.12.0-0.nightly-2022-10-23-204408   True        False         True       60m     GarbageCollectorDegraded: error querying alerts: client_error: client error: 403
kube-scheduler                             4.12.0-0.nightly-2022-10-23-204408   True        False         False      54m     
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-23-204408   True        False         False      63m     
machine-api                                4.12.0-0.nightly-2022-10-23-204408   True        False         False      51m     
machine-approver                           4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
machine-config                             4.12.0-0.nightly-2022-10-23-204408   True        False         False      29m     
marketplace                                4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
monitoring                                 4.12.0-0.nightly-2022-10-23-204408   True        False         False      38m     
network                                    4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
node-tuning                                4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
openshift-apiserver                        4.12.0-0.nightly-2022-10-23-204408   True        False         False      30m     
openshift-controller-manager               4.12.0-0.nightly-2022-10-23-204408   True        False         False      56m     
openshift-samples                          4.12.0-0.nightly-2022-10-23-204408   True        False         False      43m     
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-23-204408   True        False         False      62m     
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-23-204408   True        False         False      43m     
service-ca                                 4.12.0-0.nightly-2022-10-23-204408   True        False         False      63m     
storage                                    4.12.0-0.nightly-2022-10-23-204408   True        False         False      63m


$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS      AGE   IP                                  NODE                                       NOMINATED NODE   READINESS GATES
router-default-58f6498646-gf6ns   1/1     Running   1 (79m ago)   93m   2620:52:0:1eb:3673:5aff:fe9e:5abc   openshift-qe-049.arm.eng.rdu2.redhat.com   <none>           <none>
router-default-58f6498646-qjtbk   1/1     Running   1 (79m ago)   93m   2620:52:0:1eb:3673:5aff:fe9e:593c   openshift-qe-052.arm.eng.rdu2.redhat.com   <none>           <none>


$ oc get pod -n openshift-network-diagnostics -o wide
NAME                                    READY   STATUS    RESTARTS   AGE    IP              NODE                                       NOMINATED NODE   READINESS GATES
network-check-source-5f967d78bc-cfwz4   1/1     Running   0          103m   fd01:0:0:3::9   openshift-qe-052.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-52krv              1/1     Running   0          91m    fd01:0:0:4::3   openshift-qe-049.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-56q9q              1/1     Running   0          91m    fd01:0:0:3::5   openshift-qe-052.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-ggqsf              1/1     Running   0          103m   fd01:0:0:2::4   openshift-qe-048.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-xfrq4              1/1     Running   0          103m   fd01:0:0:1::3   openshift-qe-047.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-zrglr              1/1     Running   0          73m    fd01:0:0:6::4   openshift-qe-051.arm.eng.rdu2.redhat.com   <none>           <none>
network-check-target-zwb4t              1/1     Running   0          91m    fd01:0:0:5::5   openshift-qe-053.arm.eng.rdu2.redhat.com   <none>           <none>

#### Failed to access the ingress pods from a container-network pod on openshift-qe-053.arm.eng.rdu2.redhat.com

$ oc rsh -n openshift-network-diagnostics network-check-target-zwb4t
sh-4.4$ curl https://[2620:52:0:1eb:3673:5aff:fe9e:5abc]:443 -k -I
^C
sh-4.4$ curl https://[2620:52:0:1eb:3673:5aff:fe9e:593c]:443 -k -I
^C

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-23-204408

How reproducible:

always

Steps to Reproduce:

1. Deploy an IPv6 single-stack disconnected cluster
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Added a script to collect PodNetworkConnectivityChecks to be able to view the overall status of pod network connectivity.

The current must-gather collects the contents of `openshift-network-diagnostics` but does not collect the PodNetworkConnectivityCheck resources.
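The data the script collects can also be gathered by hand; a minimal sketch, assuming the resources live in the namespace named above:

$ oc get podnetworkconnectivitycheck -n openshift-network-diagnostics -o yaml > podnetworkconnectivitychecks.yaml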

Version-Release number of selected component (if applicable):

4.12, 4.11, 4.10

This is a clone of issue OCPBUGS-12919. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7836. The following is the description of the original issue:

Description of problem:

The MCDaemon has a codepath for "pivot", used in older versions and later referenced in solutions articles to initiate a direct pivot to an ostree version, mostly when things fail.

As of 4.12 this codepath should no longer work because we switched to the new-format OS image, so we should fully deprecate it.

This is likely where it fails:
https://github.com/openshift/machine-config-operator/blob/ecc6bf3dc21eb33baf56692ba7d54f9a3b9be1d1/pkg/daemon/rpm-ostree.go#L248

Version-Release number of selected component (if applicable):

4.12+

How reproducible:

Not sure but should be 100%

Steps to Reproduce:

1. Follow https://access.redhat.com/solutions/5598401
2.
3.

Actual results:

fails

Expected results:

The MCD should tell you that pivot is deprecated

Additional info:

 

This is a clone of issue OCPBUGS-9464. The following is the description of the original issue:

Description of problem:

The mTLS connection is not working when using an intermediate CA apart from the root CA, both with a CRL defined.
The intermediate CA certificate had a published CDP which pointed to a CRL issued by the root CA.

The config map in the openshift-ingress namespace contains the CRL as issued by the root CA. The CRL issued by the intermediate CA is not present, since that CDP is in the user certificate and therefore not in the bundle.

When attempting to connect using a user certificate issued by the intermediate CA, it fails with an "unknown CA" error.

When attempting to connect using a user certificate issued by the root CA, the connection is successful.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:
Always

Steps to Reproduce:

1. Configure a root CA and an intermediate CA, each with a CRL
2. Sign a client certificate with the intermediate CA
3. Configure mTLS in openshift-ingress (see the sketch after these steps)
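A minimal sketch of step 3, assuming the client CA bundle (root CA, intermediate CA, and their CRLs) is in a local file named client-ca-bundle.pem; the config map name is illustrative:

# The clientCA config map is expected to use the key ca-bundle.pem
$ oc create configmap router-ca-certs-default --from-file=ca-bundle.pem=client-ca-bundle.pem -n openshift-config

# Require client certificates on the default ingress controller
$ oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
    -p '{"spec":{"clientTLS":{"clientCertificatePolicy":"Required","clientCA":{"name":"router-ca-certs-default"}}}}'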

Actual results:

When attempting to connect using a user certificate issued by the intermediate CA, it fails with an "unknown CA" error.
When attempting to connect using a user certificate issued by the root CA, the connection is successful.

Expected results:

Be able to connect with a client certificate signed by the intermediate CA

Additional info:

Description of problem:

OCPBUGS-3499 and OCPBUGS-3501 both require a more recent version of openshift/library-go containing the shared validation and host-assignment logic.

This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:

Description of problem:

When editing any pipeline in the OpenShift console, the correct content is not loaded (the information shown is the initial "create" content).

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel ->  Actions -> Edit Pipeline -> YAML view 

Actual results:

displayed content is incorrect.

Expected results:

Get the content of the current pipeline, not the "pipeline create" content.

Additional info:

If you cancel or save in the "Pipeline builder" interface after "Edit Pipeline", the expected content is shown.
~
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel -> Actions -> Edit Pipeline -> YAML view: resource content is displayed normally
~

This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag due to either a missing or an invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

Consistently, in 4.12 nightlies only (CI builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

job=pull-ci-openshift-origin-master-e2e-gcp-builds=all

This test has started permafailing on e2e-gcp-builds:

[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]

The error in the test says

Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8"
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container

This is a clone of issue OCPBUGS-3123. The following is the description of the original issue:

Description of problem:

Support for tech preview API extensions was introduced in https://github.com/openshift/installer/pull/6336 and https://github.com/openshift/api/pull/1274. In the case of https://github.com/openshift/api/pull/1278, config/v1/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml was introduced, which seems to result in both 0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml and 0000_10_config-operator_01_infrastructure-Default.crd.yaml being rendered by the bootstrap. As a result, both CRDs are created during bootstrap; however, one of them (in this case the tech preview CRD) fails to be created.

We may need to modify the render command to be aware of feature gates when rendering manifests during bootstrap. Also, I'm open to hearing other views on how this might work.

Version-Release number of selected component (if applicable):

https://github.com/openshift/cluster-config-operator/pull/269 built and running on 4.12-ec5 

How reproducible:

consistently

Steps to Reproduce:

1. bump the version of OpenShift API to one including a tech preview version of the infrastructure CRD
2. install openshift with the infrastructure manifest modified to incorporate tech preview fields
3. those fields will not be populated upon installation

Also, checking the logs from bootkube will show both being installed, but one of them fails.

Actual results:

 

Expected results:

 

Additional info:

Excerpts from bootkube log
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-Default.crd.yaml


Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Created "0000_10_config-operator_01_infrastructure-Default.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n
Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Skipped "0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n  as it already exists

 

 

 

This is a clone of issue OCPBUGS-4955. The following is the description of the original issue:

Description of problem:

Customer needs the "IfNotPresent" imagePullPolicy set for bundle unpacker pods whose images are referenced by digest. Currently, the policy is set to "Always" no matter what.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install an operator via a bundle that references its image by digest
2. Check the bundle unpacker pod (see the check sketch after these steps)
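A hedged check for step 2; the unpack pods are created as Job pods, and the namespace below is an assumption (use the namespace where the bundle unpack job actually runs):

$ oc get pods -n openshift-marketplace -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].imagePullPolicy}{"\n"}{end}'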

Actual results:

Image pull policy will be set to "Always"

Expected results:

Image pull policy will be set to "IfNotPresent" when pulling via digest

Additional info:

 

Description of problem

CI is flaky because of test failures such as the following:

[sig-arch] events should not repeat pathologically
{  2 events happened too frequently

event happened 21 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:14Z To: 17:47:15Z result=reject 
event happened 22 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:15Z To: 17:47:16Z result=reject }

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/901/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial/1638557665338593280. Search.ci has many more similar failures.

Version-Release number of selected component (if applicable):

I have seen this in 4.12, 4.13, and 4.14 CI jobs.

How reproducible:

Presently, search.ci shows the following stats for the past two days:

Found in 0.25% of runs (1.49% of failures) across 44431 total runs and 4957 jobs (16.76% failed) in 321ms

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=event+happened+%5Cd%2B+times%2C+something+is+wrong%3A+.*macAddress+annotation+not+found+for+node&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Actual results

CI fails.

Expected results

CI passes, or fails on some other test failure.

Additional info:

In the search.ci results, the failures all appear to be in jobs with "serial" or "etcd-scaling" in the names. The failing jobs include AWS, Azure, and GCP, and no other platforms. I only checked the past 2 days because search.ci failed to load with a longer time horizon.

This bug is a backport clone of [Bugzilla Bug 2092811](https://bugzilla.redhat.com/show_bug.cgi?id=2092811). The following is the description of the original bug:

+++ This bug was initially created as a clone of Bug #1926943 +++

The customer is facing this issue:

I0530 05:19:11.481797 1 vsphere_check.go:220] CheckDefaultDatastore failed: defaultDatastore "FI-HML-DC2-CONT-1" in vSphere configuration: datastore FI-HML-DC2-CONT-1: datastore name is too long: escaped volume path "var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts\\x5bFI\\x2dHML\\x2dDC2\\x2dCONT\\x2d1\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-fi\\x2dhmy\\x2dsas\\x2dprod\\x2dnp868\\x2d\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000
x2d000000000000.vmdk" must be under 255 characters, got 255

Looks like the bug has resurfaced.

Description of problem:

Since coreos-installer writes to stdout, its logs are not available for us.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Git icon shown in the repository details page should be based on the git provider.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

 

Steps to Reproduce:

1. Create a Repository with a GitLab repo URL
2. Trigger a PLR (PipelineRun) for the repository
3. Navigate to the PLR details page

Actual results:

The GitHub icon is displayed for the GitLab URL, and the URL is not correct

Expected results:

The GitLab icon should be displayed for the GitLab URL, and the repository URL should be correct

Additional info:

use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.

This is a clone of issue OCPBUGS-855. The following is the description of the original issue:

Description of problem:

When setting allowedRegistries as in the example below, the openshift-samples operator is degraded:

oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index


oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h   
baremetal                                  4.10.21   True        False         False      450d    
cloud-controller-manager                   4.10.21   True        False         False      94d     
cloud-credential                           4.10.21   True        False         False      624d    
cluster-autoscaler                         4.10.21   True        False         False      624d    
config-operator                            4.10.21   True        False         False      624d    
console                                    4.10.21   True        False         False      42d     
csi-snapshot-controller                    4.10.21   True        False         False      31d     
dns                                        4.10.21   True        False         False      217d    
etcd                                       4.10.21   True        False         False      624d    
image-registry                             4.10.21   True        False         False      94d     
ingress                                    4.10.21   True        False         False      94d     
insights                                   4.10.21   True        False         False      104s    
kube-apiserver                             4.10.21   True        False         False      624d    
kube-controller-manager                    4.10.21   True        False         False      624d    
kube-scheduler                             4.10.21   True        False         False      624d    
kube-storage-version-migrator              4.10.21   True        False         False      31d     
machine-api                                4.10.21   True        False         False      624d    
machine-approver                           4.10.21   True        False         False      624d    
machine-config                             4.10.21   True        False         False      17d     
marketplace                                4.10.21   True        False         False      258d    
monitoring                                 4.10.21   True        False         False      161d    
network                                    4.10.21   True        False         False      624d    
node-tuning                                4.10.21   True        False         False      31d     
openshift-apiserver                        4.10.21   True        False         False      42d     
openshift-controller-manager               4.10.21   True        False         False      22d     
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d    
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d    
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d     
service-ca                                 4.10.21   True        False         False      624d    
storage                                    4.10.21   True        False         False      113d  


After applying the fix as described in https://access.redhat.com/solutions/6547281 it is resolved:
oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'

But according to the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2027745) this should be fixed in 4.10.3, yet the issue still occurs in our 4.10.21 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5287. The following is the description of the original issue:

Description of problem:

See https://issues.redhat.com/browse/THREESCALE-9015.  A problem with the Red Hat Integration - 3scale - Managed Application Services operator prevents it from installing correctly, which results in the failure of operator-install-single-namespace.spec.ts integration test.

This is a clone of issue OCPBUGS-15722. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14665. The following is the description of the original issue:

Description of problem:

In Helm charts we define a values.schema.json file - a JSON schema for all the possible values the user can set in a chart. This schema needs to follow the JSON Schema standard. The standard includes something called $ref - a reference to either a local or a remote definition (a minimal example follows below). If we use a schema with remote references in OCP, it causes various troubles. Different OCP versions give different results, and even on the same OCP version you can get different results depending on how tightly locked down the cluster networking is.
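A minimal illustration of such a schema with a remote reference (the URL and property are hypothetical, not the actual chart's schema):

$ cat > values.schema.json <<'EOF'
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "image": {
      "$ref": "https://example.com/schemas/backstage-image.json"
    }
  }
}
EOF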

Prerequisites (if any, like setup, operators/versions):

Tried in Developer Sandbox, OpenShift Local, Baremetal Public Cluster in Operate First, OCP provisioned through clusterbot. It behaves differently in each instance. Individual cases are described below.

Steps to Reproduce

1. Go to the "Helm" tab in Developer Perspective
2. Click "Create" in top right and select "Repository"
3. Use the following ProjectHelmChartRepository resource and click "Create" (this repo contains a single chart, and that chart has a values.schema.json with the content linked below):

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: https://raw.githubusercontent.com/tumido/helm-backstage/reproducer

4. Go back to the "Helm" tab in the Developer Perspective
5. Click "Create" in top right and select "Helm Release"
6. In filters section of the catalog in the "Chart repositories" select "Reproducer"
7. Click on the single tile available (Backstage)
8. Click "Install Helm Chart"
9. Either you are greeted with various error screens, or you see the "YAML view" tab (this tab selection is not the default and is presumably remembered only for the user session)
10. Select "Form view"

Actual results:

Various error screens, depending on the OCP version and network restrictions. I've attached screen captures of how it behaves in different settings.

Expected results:

Either render the form view (resolving remote references) or make it obvious that remote references are not supported. Optionally fall back to the "YAML view", noting that the user doesn't have the full schema available but the chart is still deployable.

Reproducibility (Always/Intermittent/Only Once):

Depends on the environment
Always in OpenShift Local, Developer Sandbox, cluster bot clusters

Build Details:

Workaround:

1. Select any other chart to install, click "Install Helm Chart"
2. Change the view to "YAML view"
3. Go back to the Helm catalog without actually deploying anything
4. Select the faulty chart and click "Install Helm Chart"
5. Proceed with installation

Additional info:

Console should be using the v1 version of the ConsolePlugin model rather than the old v1alpha1 (a hedged v1 example is sketched at the end of this note).

CONSOLE-3077 was updating this version, but did not make the cut for the 4.12 release. Based on discussion with Samuel Padgett we should be backporting to 4.12.

 

The risk should be minimal since we are only updating the model itself + validation + Readme
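For illustration, a hedged sketch of a plugin declared against the v1 model (the name, namespace, and port are made up; the exact field layout should be verified against the v1 CRD):

$ oc apply -f - <<'EOF'
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: example-plugin
spec:
  displayName: Example Plugin
  backend:
    type: Service
    service:
      name: example-plugin
      namespace: example-plugin-ns
      port: 9443
      basePath: /
EOF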

Description of problem:

In a 4.11 cluster with only openshift-samples enabled, the optional COs console and insights (introduced as capabilities in 4.12) are installed. While upgrading to 4.12, CVO considers them to be explicitly disabled and skips reconciling them, so these COs are not upgraded to 4.12.

Installed COs cannot be disabled, so CVO is supposed to implicitly enable them.


$ oc get clusterversion -oyaml
{
  "apiVersion": "config.openshift.io/v1",
     "kind": "ClusterVersion",
     "metadata": {
         "creationTimestamp": "2022-09-30T05:02:31Z",
         "generation": 3,
         "name": "version",
         "resourceVersion": "134808",
         "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb"
     },
     "spec": {
         "capabilities": {
             "additionalEnabledCapabilities": [
                 "openshift-samples"
             ],
             "baselineCapabilitySet": "None"
         },
         "channel": "stable-4.11",
         "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34",
         "desiredUpdate": {
             "force": true,
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": ""
         }
     },
     "status": {
         "availableUpdates": null,
         "capabilities": {
             "enabledCapabilities": [
                 "openshift-samples"
             ],
             "knownCapabilities": [
                 "Console",
                 "Insights",
                 "Storage",
                 "baremetal",
                 "marketplace",
                 "openshift-samples"
             ]
         },
         "conditions": [
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel",
                 "reason": "VersionNotFound",
                 "status": "False",
                 "type": "RetrievedUpdates"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Capabilities match configured spec",
                 "reason": "AsExpected",
                 "status": "False",
                 "type": "ImplicitlyEnabledCapabilities"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"",
                 "reason": "PayloadLoaded",
                 "status": "True",
                 "type": "ReleaseAccepted"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:23:18Z",
                 "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "True",
                 "type": "Available"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:05:42Z",
                 "status": "False",
                 "type": "Failing"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:41:53Z",
                 "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "False",
                 "type": "Progressing"
             }
         ],
         "desired": {
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": "4.12.0-0.nightly-2022-09-28-204419"
         },
         "history": [
             {
                 "completionTime": "2022-09-30T07:41:53Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
                 "startedTime": "2022-09-30T06:42:01Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.12.0-0.nightly-2022-09-28-204419"
             },
             {
                 "completionTime": "2022-09-30T05:23:18Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631",
                 "startedTime": "2022-09-30T05:02:33Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.11.0-0.nightly-2022-09-29-191451"
             }
         ],
         "observedGeneration": 3,
         "versionHash": "CSCJ2fxM_2o="
     }
 }

$ oc get co
 NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      93m     
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h56m   
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m   
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
console                                    4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h45m   
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      117m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m   
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      151m    
insights                                   4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h48m   
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      91m     
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h44m   
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h55m   
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      116m    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-28-204419

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.11 cluster with only openshift-samples enabled
2. Upgrade to 4.12
3.

Actual results:

The optional COs console and insights (introduced as capabilities in 4.12) are not upgraded to 4.12

Expected results:

All the installed COs get upgraded

Additional info:

 

This is a clone of issue OCPBUGS-3314. The following is the description of the original issue:

Description of problem:

triggers[].gitlab.secretReference[1] disappears when a BuildConfig is edited in the ‘Form view’

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Always

Steps to Reproduce:

1. Configure triggers[].gitlab.secretReference[1] as below 

~~~
spec:
 .. 
  triggers:
    - type: ConfigChange
    - type: GitLab
      gitlab:
        secretReference:
          name: m24s40-githook
~~~
2. Open the BuildConfig in ‘Edit BuildConfig’ with the ‘Form’ view:
 - BuildConfigs -> Actions -> Edit BuildConfig

3. Click ‘YAML view’ on top. 

Actual results:

The 'secretReference' configured earlier has disappeared. You can click the [Reload] button, which will bring the configuration back.

Expected results:

The 'secretReference' configured in the BuildConfig does not disappear.

Additional info:


[1]https://docs.openshift.com/container-platform/4.10/rest_api/workloads_apis/buildconfig-build-openshift-io-v1.html#spec-triggers-gitlab-secretreference

 

This is a clone of issue OCPBUGS-6018. The following is the description of the original issue:

This is a public clone of OCPBUGS-3821

The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:

  1. the containerruntimeconfigcontroller creates a new containerruntimeconfig due to the update
  2. the template controller finishes re-creating the base configs
  3. the kubeletconfig errors long enough and doesn't finish until after 2

This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig MC, which at best means a double reboot for one node, and at worst blocks the update and breaks maxUnavailable nodes per pool.

When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).

The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."

Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719

Description of problem:

go test -mod=vendor -test.v -race  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops
# github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.test]
pkg/libovsdbops/acl_test.go:98:15: undefined: FindACLs
pkg/libovsdbops/acl_test.go:105:15: undefined: UpdateACLsOps
FAIL    github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [build failed]
FAIL

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

When using an install-config with missing VIP values set in the baremetal-platform section, we attempt to get defaults for them by doing a DNS lookup on the cluster domain name. If this lookup fails, we set the error message from DNS as the default value, resulting in a very confusing error message:

[platform.baremetal.apiVIPs: Invalid value: []string{"DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host"}: ip <nil> is invalid, platform.baremetal.apiVIPs: Invalid value: "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host": "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host" is not a valid IP, platform.baremetal.apiVIPs: Invalid value: "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host": IP expected to be in one of the machine networks: 192.168.122.0/23]

This has been the case since the inception of baremetal IPI, but it has gotten considerably worse in 4.12 due to the VIP fields changing from a single string to a list.

If the user doesn't supply a value and we can't generate a sensible default, we should report that the error is that they didn't supply a value, not that they supplied an invalid value that they did not in fact supply:

[platform.baremetal.apiVIPs: Required value: must specify at least one VIP for the API, platform.baremetal.apiVIPs: Required value: must specify VIP for API, when VIP for ingress is set]
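A rough sketch of the requested behaviour, using an illustrative helper name rather than the installer's actual defaulting code: when the user supplies no VIPs and the DNS-based default cannot be derived, return a plain Required error instead of feeding the DNS lookup error text into the Invalid-value path.

~~~
package main

import (
	"fmt"
	"net"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// defaultAPIVIPs is a hypothetical helper: it keeps user-supplied VIPs as-is,
// otherwise tries the DNS-based default, and reports a Required error (not an
// Invalid error carrying the DNS failure message) when no default can be found.
func defaultAPIVIPs(clusterDomain string, userVIPs []string) ([]string, field.ErrorList) {
	fldPath := field.NewPath("platform", "baremetal", "apiVIPs")
	if len(userVIPs) > 0 {
		return userVIPs, nil
	}
	ips, err := net.LookupHost("api." + clusterDomain)
	if err != nil || len(ips) == 0 {
		return nil, field.ErrorList{
			field.Required(fldPath, "must specify at least one VIP for the API"),
		}
	}
	return ips, nil
}

func main() {
	_, errs := defaultAPIVIPs("test-cluster.test-domain", nil)
	fmt.Println(errs.ToAggregate())
}
~~~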

Description of problem:
Installed and uninstalled some Helm charts, and now have an issue with the Helm releases list in all our releases. The issue is already fixed in 4.13.

The frontend tries to load /api/helm/releases?ns=christoph and the backend crashes with the error below.

Tl;dr:

It crashes here in the helm lib: https://github.com/openshift/console/blob/release-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go#L66

The missing out-of-bounds check was added on master: https://github.com/openshift/console/blob/master/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go#L66

As part of the helm bump https://github.com/openshift/console/pull/12246
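For reference, a minimal sketch of the shape of that fix (the real function lives in helm.sh/helm/v3/pkg/storage/driver/util.go; names here are illustrative): the gzip magic-byte comparison has to be skipped when the decoded payload is shorter than three bytes, which is what produces the "slice bounds out of range [:3] with capacity 0" panic below.

~~~
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
)

var magicGzip = []byte{0x1f, 0x8b, 0x08}

// decodeReleaseData mirrors the shape of Helm's decodeRelease: base64-decode
// the stored payload, then gunzip it only when the gzip magic bytes are
// present. The len(b) > 3 guard is the missing bounds check; without it a
// zero-length payload panics.
func decodeReleaseData(data string) ([]byte, error) {
	b, err := base64.StdEncoding.DecodeString(data)
	if err != nil {
		return nil, err
	}
	if len(b) > 3 && bytes.Equal(b[:3], magicGzip) {
		r, err := gzip.NewReader(bytes.NewReader(b))
		if err != nil {
			return nil, err
		}
		defer r.Close()
		return io.ReadAll(r)
	}
	return b, nil
}

func main() {
	out, err := decodeReleaseData("") // empty payload no longer panics
	fmt.Printf("%d bytes, err=%v\n", len(out), err)
}
~~~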

2023/02/15 13:09:09 http: panic serving [::1]:43264: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3291 [running]:                                                                                             
net/http.(*conn).serve.func1()                                                                                                                                                                                                              
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf                                                             
panic({0x2f8d700, 0xc0004dfaa0})                                                                                      
        /usr/lib/golang/src/runtime/panic.go:890 +0x262                                                               
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc000776930?})                  
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000b2ff80, 0xc0004bbe60)                                                                                                                                                              
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc0005fb800)                                                                  
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)                
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc00086d180}, 0x7fea2c6e5900?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc00086d180?}, 0x7fea56daf108?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c     
net/http.HandlerFunc.ServeHTTP(0xc0009b8170?, {0x351ae00?, 0xc00086d180?}, 0xc000c5b9f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f 
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc00086d180}, 0xc000248800)       
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc00086d180}, 0x7fea2c5c8248?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc0009ed667?, {0x351ae00?, 0xc00086d180?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc001048120?}, {0x351ae00, 0xc00086d180}, 0xc000248800)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0007580a0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:43256: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3290 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273440})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0004dc8a0?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de8e88, 0xc0011cb400)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc00068d800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b60b60}, 0x7fea2c47e700?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b60b60?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0003d72b0?, {0x351ae00?, 0xc000b60b60?}, 0xc000bcd9f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b60b60}, 0xc000cabd00)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b60b60}, 0x7fea2c6d9838?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000344f47?, {0x351ae00?, 0xc000b60b60?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc001048180?}, {0x351ae00, 0xc000b60b60}, 0xc000cabd00)                                                                                                                                                  
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c                                                                                                                                                                                  
net/http.(*conn).serve(0xc000758000, {0x351cca0, 0xc000145740})                                                                                                                                                                             
        /usr/lib/golang/src/net/http/server.go:1991 +0x607                                                                                                                                                                                  
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:42956: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3261 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273740})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0005f6000?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc00094a570, 0xc0003d79e0)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc00068d800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b48a80}, 0x7fea2c403300?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b48a80?}, 0x7fea56dafa68?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0011cbb60?, {0x351ae00?, 0xc000b48a80?}, 0xc000ff59f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b48a80}, 0xc0002a3c00)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b48a80}, 0x7fea2c478e18?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc00084bfc7?, {0x351ae00?, 0xc000b48a80?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc000c3f890?}, {0x351ae00, 0xc000b48a80}, 0xc0002a3c00)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0008a9f40, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:42954: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3247 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273a88})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0005f78f0?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de9560, 0xc0009b8c00)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc0005fb800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b60ee0}, 0x7fea2effb100?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b60ee0?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0002a91d0?, {0x351ae00?, 0xc000b60ee0?}, 0xc000c319f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b60ee0}, 0xc000cab000)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b60ee0}, 0x7fea2eff84e8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000df4be7?, {0x351ae00?, 0xc000b60ee0?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc000d2d320?}, {0x351ae00, 0xc000b60ee0}, 0xc000cab000)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0002688c0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:55334: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3328 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273dd0})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc000d0b020?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de98a8, 0xc0001cb670)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc000dad800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b610a0}, 0x7fea2effb100?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b610a0?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc000430260?, {0x351ae00?, 0xc000b610a0?}, 0xc000e469f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b610a0}, 0xc000537900)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b610a0}, 0x7fea2c6da648?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000df53f7?, {0x351ae00?, 0xc000b610a0?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc0005f7a10?}, {0x351ae00, 0xc000b610a0}, 0xc000537900)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc000c203c0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db

Version-Release number of selected component (if applicable):
4.8-4.12: the Helm release list is not shown.
4.13 works fine.

How reproducible:
Always with the Helm release Secret included at the end of this report.

Steps to Reproduce:
Unable to reproduce this manually again.

But you can apply the Secret at the end to any namespace and test it with that on 4.8-4.12.

Actual results:
Crash

Expected results:
No crash

Additional info:

Secret to reproduce this issue:

kind: Secret
apiVersion: v1
metadata: 
  name: sh.helm.release.v1.dotnet.v1
  labels: 
    name: dotnet
    owner: helm
    status: deployed
    version: '1'
data: 
  release: >-
    H4sIAAAAAAAC/+S9a3ObTNIw/Ff06v74OgkgKxu5aj8YYiEUiUTI4rTZ2mIGDEjD4REgGe2T//7UzAAChGzLcZLr3r2qrorFYejpc/d0z/y7H1qB07/p21EaOmn/qu+HD1H/5t/9B3+bpP+ynRhFuWP3b/ocww3eMdw79vqeG9xcj25Y7v3H4XA0ZJkh9/8z7A3D9K/6yHrNW7aDnJQ8T34kcOvHqR+F/Zu+FCaphVAPRkGMH+pf9ZPUSrMEA11+56ofRqmDL30PjSjb9t7Ld/c9K457ftIDmY9sP3T/v9591Nv5zr6Xeg692kORm1z1tll48z38HkaQXOgB+IHio/fu3UOEULTHd+UodXqpZ6W9HH/iM/l44IRpb+8j1Ns6cbRNe9/7d9utFFiu8y1D6Hu/Z4V273u/usJbcPP14eF7v5eFqY9qsPhJNcn3va8hdLrvXdHP+3hA+mXg9KwsjQIr9aGFUN7bRgg5di/K0vf9H1d96FnbFNM0cFLLtlIL/92m+87ZJhTjzHvmPXtCh9vexEFBj4zVS6MCLjw5SoUK5ciHFn4n6V/1N06+j7Z20r/5R3+L5xs4+HLx0X9e9a3YV6sP77j+Vd8KwygtBrj5N4X9X9kW9W/6XprGyc2HD66fehl4D6PgQxQ7YeL5D+k7z0HBO/J08qH4Z+sgx0qc5IMd7UMUWfaHrWN7VvqOfv8dmWjXtfepe+j/+HHVRxHc9G/CDKGrfuoEMbIIl/2jQl918YP89f5u+T59xLikOO47gySVRBRIwnBpao/I0GU026DDUhsebHGcAIEfPSxiEwzUXBKGXxV14Ro6v5dEdJDEKWtpjxtLG4bSkl+BnOcsTR1IEyUyl7xvaygxBT4BnH2YCXxua9cfBTfeGTm9JonT9WyQ/k0SUWZwj6wprlwpUHa2OES2MMwMjUWSf5tJE3YkCUxqBqMEiKOB4MZfwUBB+DuGvnAdbcRCn78zdT4BA5Sa2pCRJnYMxL0LA3UPBlNGEqZjGE6nQBuHpsqzQNz7kjjOTOHWX2qsZ3LqwtYek0UwXlvMKDB9ybW1IWNpe9cWPVTOlc5b3gGdT0xdQTOf/wYCGbXmHMOcXwOO3QNRZczlvoQxJt9f8gNLe0wkcYokccza4ig1dCU2uHECJhsXknmqG0kcsbZw/YXSSM1MQou//73/46qDuP/yHBQ72+R9GqM2fRVkBigzl7e+KY4YEKjMLBh6QFv5ksBi+v7NyfmNqZmerT0yTV4gz7kzZHpgoiKYD0MgjnxD22cgGKfmasSZ+jS3NEwPdiSE6d9mSx6BYOHOdHYkuHhsxjVFNbC0IZKE6QYMlMzUFxkQx76pPR4k/zZ90JkvlqgmYDk8WMJobYnj3P4cuc4gcWcbOTL0KVPCgp/F1y1tuAYTddOYVygjIKprWxzl98fxsxpssenfZgvO82C4yBY6v1cDNYcc2gGf8LoHJ7eZNVB9U59mmMYwH8YgJ/M8WNoo++rzf3Pys2N8kiZjlvJnEx8Ybiw7syBljUDNMbymPs8s7dMOaOPM0GxkCqzv3BfzRlM8Fw9yq2zFqbkdoLW5PMIIBjwCoRxZmsnMArSbDaYsCJUYaKuPkljI0W2Bf224IfA8QQ/IqYmpyQwYTOeGNkVgMi/54xxOiIxSfPAxCOTExnxQpzHmkaXkzp7GbQxCmTG04drsmPs9GYMv+LSYC4Er+nKev1GK8Xlffl+ntKDPjlkwWbhfntE7X5a3mRqME1tTD+V4hdxgWn6kcFZyUcj2kDE0WPJoIbeEv8/ILTFSMAoffPd9bgWoUzdrhvbIYl4xQjUG4iIztaFnBI+I6oTYgyLSavxZ6KHhDopqBjkvNsMF4TN7fffF4lBmfo7cBRlL+Qy4YWBp8AvQVMbQFM8W7z4K/q1LaFfQo1PWKh1yTeYrCXxS8A15XxJ4Qq/udx89I1ATmBPe+CSJwxgECgLhwpXpnA5QVNdf3cglenDCs/bn6Isk3LrSRNmR6wL5xtbShptSJo/0ovp6FhTvCkPyHJEBAtutK4kE/o/S5KwNojRdNedZ0YXaj8jU7+p8UOHe1pW9rS8ygm/h1lfE0dri1Jzam5Xf5K8TePe2LkcrTl3DQGWOcPON6zU81GTxwnFbOkoS+AO29wa3KunIODqPx55Y+oLSQLTjih7CrWvr0/jst0M1t4j8VrDmpmZQvHfwNgzUoK2vT+cj70CoIGei3Fm6VMCN4WpcP/sNgxtltqhe23dKDP2WbilwDQdKbugKMgebNh7uC/wU/CjvbH26NlWZgcGYMTV7WKNLAINRevyNYUxjECw+SndUp6wGSm5q41QVx2HFm0JT/jr4i/Im+abqAVXxzLzQXQX8T+kPi/o89/ZkigyNXRkazGxdRqsA24DxgfK8eoDiuLA5dfhr987pa13eG5rcgkVNILc60qrGt3DAe5jfztGrC15w/juFrxS5eLyzPBRiP/Dx3tTk3NQXDbjgRE3AWEbGYIrqfP6snHWOyZ/wFsWjmtnr6MtRTxpddEYgNNo4I8/bEz6RBI8BLPKAtj/3bkz0lsBnWP8R3/iz+zRcwq07W5bzvPWVu9HqfqOudFZeLdTpSlX5h9V4+m25UT+rglSDC8sCheE8fmQG+3K2zi9gMPo/2N+QJnsX6urOFldN/dqtGwrdpB4qOa1dK+WMjDFRdpo2ToHQBccUQW7EwkBGMO+0Pye4sSfT2ORsBMPSvt3ibwyhuPoo3ck7EJixyciRoQ1Ds/V+t+2nZj+w4hdYflNU90AcDWfBeA/FRxwtMNjaryZTbOVzW0SY8ggEY59EDxjqYLy3VBPBUI4Bd/1RmhiPQlBqnxJi1oO3cWrqimeKY8a4Jxx5oWeHrXRh0Q+llaaS1/luwzPfuwB7b1gahFtqkYKjF4I9ZiCiNY6QAHedwfp8D5H7IDD17zFQbEjChkaFm6w5zhBz397Up4yFuSLkc+xNw1CJTX1+ziN5AUXtKuPSIqmh83EVJKwjMi2Yj7j5Ii4dmEYAKwQsssXxxtAVjzpBzzorjYCZOHAFmhtC0f1ucnT428pi0eX0dBhkNTO0aWJq2LFWH7uEFwxUBk6w80dY0JU2pwZQ8jelcvJAMNzZImXziq3E0hEf7U1teLBEFBBHLR8xMEChyal5gy2EW1fb14zXJGkqLGFKDLS0jv9WV4DFPeo0rqMySVA3QP7sNmo9f+sXYvFRCi9yKv3mt/nRydhYVLXHzQrjQ8Djy3tTm2MnJoXio2eLqwwOeGTkwzXgcBCCMhwQaAfeLoMXqe6ETI740dvKjo79sVCuhdjMT4xzpZKwMqUq6aiUq2BSKp2n1NCVtXXXUhNP8+WBJCIGyg5unggYShU0URDQ+QQ7bSXPt4MaykOnMLw2WPkqHJ0jgv9D9OVp9Y0ySydBV0WjplNwer/hPNbU3JdOA6dgWuycJYYRMQvi6I5jgHVPvjkDf3zGQSHOtGdpw7rRazkI/MDUWk5AIaNPmY/Cofva1lk1epDEHSzxOZl6ILBRN//xOxgqh9Mx8P9M05E+BvAtPUC+WZe
dBf5+6cjY4jg3OZVZiaPcFloOcUUbbEaUncGNkvI9bK5scbQG3L6VFDjiveX4VSary9mpwZqbehF46E3a1Pmkwzk8M37DDD/Ou4KizoCs4rfE0k0EAvUAWT4H3BR1wHzyTIO3X+bcVvADEXEm5s2BjM25by5P+Xt+L70G754pYh6QD9i9MoIOfiF2QInfai4d3+xw3O/yroD9aX2jZrZ/yq+mNuR+Bl78/pflifv2Gr5BIDRFYoNP+aW697OwKuF0B17MH81v2cEosTUW3WsjFoTK4av7NP886WrW/KQuHVTTqx6c8ImlyZ4toh3w2X19nFM9h20dgW9h6Ep0GoBV+G6Oixqub1Yf42JeC80dmKipuWJ3tjZkYN6lJ19Otzbe34rfnImSG6uGbbs0wD7ylu4xMBg37HUHnNfdibYmf8FnfYcLQj//Z3mK2MLA0mzZ0G9P7cul8chr+Eh/PV0qnL7Q5+kO58gKdpJuHSs4F6L/pnioHT9S/2nVsQhUD/Gb4/3VYsrSX5bIIvFoa+v8AnCPsXFMuCaAkz3wOXLtyZR9WVJlG2Wpc0lCJZzubF1Bbc3cxjgMRiyOsp7E+JiO9RfGNAOqSLqEWUYwNOMqnf0KWCH2io/LMx4MbGSPsVe+oONMlP2ZKCXHUkCWzfRpaOrKgi7XX79ASxR0C5WkS/vZ4hF3tqjmQENZkUR6KpKtj8mY+jS1tGGhLY/WzNJwZCqzMFDHpmgjtTH+sOLiF34nBqGMjIGamyt1Y3LqqvFdxO+wN+Esn0ppn+ITTOaZxanD2tLR1tQxTPu00uZkrJeO07Bo3GUp9+5xTVE92GKFt8+LlUw9kYB4T7UIgt+YusxUONnI/IIjli+g1mz1qnk9//23n7PBjT9Ti2sSS15fXjm5d9/MZJHvPs9Pa+O3zKOB//oSXOPbX33+0+y49Ec0eHcERPUrlttZ8Br4O73A0uM4s/yeONudD51nsrX1ZfOqFGPxF0ua17J29gTtO5YOj7gu56CS5Ytq2bfKtq2fh6e7XKTb/tSzORdmseh7HV6cXJVD/fOqv7NQ5pBqPFJPQcryojB1HtPP/rYsj3NCCyDH7t+k28zBP3flHeLnLYmfd2+5J7WHNwNSbYivJbEF8Y2qqq9/1c8SR6F1fPLxiQcLJc6Pq36UpXFGShs3fmj3b2iZ5fFbV/04S7ylA7dOSsH5gS8hVL901d86DxU4MNo67yhIWyeJsi3EU6fPJam1TbP42zZaOzDt3/StOMYgbv3u6sSytNDZOSiKne2HhPPf1T7jPPZ/XBVVrHgSterJb1v8QupTvFfIJRO/6gdRFqbfrNTr3/RryyLlotcHPPHaAP3/+Z/eccCeG/U8Z+vgb9fI5IS78TYKqp+P6dYSojC1/NDZVijwQz89vYr8nRM6SfJtGwEHA5zCeBnBjUNoE0fbtEAQqarEvxtVlOTOVfHcJ+YTQ8BPIxih/k3/XvjWv+qn1tZ10m/VI5gxt45l+43v4pHE4qsFeqqBjwBsHYLnpH/DdlCZ+LgNrFOWrkNQgpwiWqVqCRi3D5h4TjkOPL1kO0nqh4TAwm3HK60v+mHiwGzr3Nmuc+9sg+LVbxHyYd6/6SuO7W8xJ5JK22OhavUk9s5t1yGTLpTxfR5jlAsoS1JnK2HU7iKUBc4c81SFBHotqYTGRRGwUCm8X3fOduvbTnWbyPhRtAtAsLT3iSlIKQjQcwISMpSLRr5yMDgPAe3O/+rf+tZEYeDnaDfj4gPgrlPIyZGpsd4sGOVmPtrAYBzYArOX81H1XrWY01xn9LGpaATW5ULNOnKd/eniElHrS+mjJEx3RhAjY7DoXITCbo0xmMZwQtxdArdSVPzBnGcscVUGkKQyVZrYHlaptvjJLYLTXelaSP7+NFH+3Dy6FsROFt7OzG2c+HCg5KSq2N+7UjAk1br6cv/k+11zLioHd6Z/2ZxnPn9XVsNifIECZ7CsjikSKjNf6oZpwiTn8GHjoL6bvvWFR1JphJ+TQpmB2NTn0rkxC95REOTk3NJ5khwi7yLFMwnsY0aaoJ295AcGqY5WdmVF84wrTe3ZuZxcbyT12nMtEiC/godL90CaTBEQxwwO1TH9GhXaP8mvx8rKC3hWmPqAG2HeyDq//5xsFt+ccUoMORzGruic7kafVwIv0GqeV/CaPo1/H6+pyyWVtYmlK5KtSUX1vRzb4ih/gr/Owk8qAX8X/Bs7tmllot/9bseifMVfHVVNvw3vvGeLLpGD2WY4VgV+X8EgmjHmJTCQXDiZ7qxAXdsCv7H0KXXzw00mud0wPpzVt1OySGrqHqOIKL9knlo+PdiTaQwC6M+wbQjVBAhT+yxe6fs49F/DAO1pGobI2+cSBklUYhjQin9nyQ8sXYks7brsuJhY+oLyIdXjLORWrqHPqXxNpjswWLhmMMqbHRyX82pHheJTeqbgGxJ+ER0AuCml2Sv0x5N6cTLPrXJeoppPB3P3hcUsdRo0Fgqep/lFtv/nZOJCH6Br7s+Of57uddhZyKlV50yjCvZl+DrhCSMY7YCoesD/ORwoolqmMrH/Fz+BC9fS5y6WH2pTG1XAQzDAsjNFlM9UD3KYPliupGfp+/C0/1bot/r3/izfqKfzp37UsSjmUOiM6Wkl9gvwsYie4reL9U+5mHSR3QlGPrUJr7E78t7U5NgMEOapgSWqOcXR0c+2OZQAgffNJe1aKPUS4IbrUtfawrV7r43wuzEIzB0MWJriXUcuibUm84+zfLQBnHxoFf2tAcfsjFqB04wWfwVgME1nh0Um+yOq9ybzssMK25608PfT4wKcxwBt/0t0IO3++LO83JL/CAxgId9FvFl0SxrByoUT9WCJ6uYZO1SHF4FQTQv73iULiSRM7wAnb029uL+c2m+kc5vdLK/Sszyo41kSppmtPSYSYn4K5yuMR5JSpZ0A1BbuXVuXkSnwOxA8DukCMl38/XbPuFNG2RlcimCxEDz9A3qk0fkg/Jm4XW3wJunWdWFB45nPy2Awxfcz7LcBjRZDfPX5yJ4oe3iIdjNO2RmDeWupVt6B5ahW4CcVhbNY5zA7WbjmZlxNv4xHtFBXr+sO+6Hw8w6zgXrAc51pRUGyTuBMSzhhPoxskU1e4V8jEBoX+I48kIJxDoPx8GfxroRTZGoEH6RADMcPpva4eRnOT7taTXG0hvmIMXR5C/NRDGi8n5IlkyXbKkYZbUzNjEGwSvG3LX26AwGLQLhI7WCcWxqJi9OGvs9fFVMeix4vi12O+Yqfi11EGKiI4HHZKOJ0sS0F4iKT7tgdDLAdHRJbVi1bix5jT/jDV//TrqOLltpITt6BQEZwohR/m7GJ50vHqDqNZ3q9A4ZtFI0/gdcLc0FVbqlbj7w0/un0P8plLRyzYzlWNySvRX2yTWM3AHEUkPyXLrX7STpzT40ek/wpHTjN8XcwDz/93Em+KAahgkxOzV8T792Haopljc6L35mkGLiEg8S55VLfK3IZdb64TP/XaPhmeqghr2+jj9bE/9R5pvg7sDSbhQEdY8axHgiwj8LWZbOhd2D+6TU5oq
LMRl2VOuU35YCxPNCxCL9U5T409z2YIoPuZPFWOD2Y+pSzNKJXsH4agAFppAlng+rbO2nAs0bwGEPOIz55tSStz49/L1kCN3yNnhdHuT2ZX5SDMfRpbugb/yd1ejV/LJukMeHt+fZlOObozgjUT7mr4fU1dpOUZ/z+nBb2z3we+1TYT8jMnPeAz39bsLdlSd4vidsuWAc4E1tPd4B7RIZ2/Qx8z+WlFVSWr5kkN2NUuUublMbIERioiUnj7AJP6uac3/kiXdVcr6o1vvwFcjJEphYu4K5dKI42dD0THY744HcWLT5/u7wUKat6LR+8MMdyoTwc7RaRidX9eE5wYmjX7j0jL0p8/Np4aYjsib2DpPwV7qg84tioKGfUFml17UU5llet2Z2Rpaqc9/J330KWL8/JEhpCztvZZKcpTLPjHC6B/0U83ZDxRnnsz+Prcnl/zr5OQaDQkshq7UZBMGBjUPhtRSf+L8XTC8t+/3fgT8S2BmEZdY1AjQzdjMFAomthIsogp65tHfMgjiGHyBiQOR6etyuX5vDqMpsi56K8/y/J5z2Hy5mpb3CsnmB7S9Yhib2RLoD5Rba3seZ6UtJ7Sa7z7fJ1F/gtLXjJTnfeDgaIcZb8ulUC/dvw2PzuX57XPv8hPD1Xbv/TOu51tQAvt6+08V3N7An2iwqf+U7m2+XpNCYmMXKA5wvarQb+ZXh+GU5e8Ny53P2T7z9B5+AxtgO13mB8YY756Me+mM879YKZm5pK8pqSeALTRTlQGhc3d7kzuFFK8p5lc6fP7klNBInNPzGztbFrNegeZlyzsXKmecjQHhlLKHL4Z++znoH92/Go3VgZmdo4sUX3Vfm3dmP5XyJPrY03VqPZHssLjjt/6fptE6/hvEW769QSVQ9MlKiIL9Zn72tjskaun6Xl5TkRskZyGW08GE49Z/mWMkQa/C+hxdbWpggGQ0TjMlTGfztbHB+swRzHzyRvM6Mbp6QtnO7K3UNnmprZAcoBNyR59pluejBAtJZRq8vgK/KlgZrZGzMHHPOXWAujGwi8LZ4x75sBCrGfPdP5nSnU12Gkc/dx3J8a+u1Oqq3vmEvWA+I+tTh1iOkAJip+x7P06WGmoWym3aXleEAb72fa+NCkEYtAaMYwGGUA01VgOUOfxiT+1OevoN9TzeoXrZU8VSf8nA4rW+WKuEa9JrFuME4sPca+B8mXPLnHWI5p/oiey2HCy+b3MzmK19QdP4enO7KOSOJp9UDqEqhP1LJrw2YutagvbeUs6jhGpsDHwOe9xn5zwlvVi9GNJqpNsf54vdjjruR3WmNP4Stw2Yk7W/z0VvWHr8lz/5ZamBfJV7FzKwiVw0ty10/Bc64m5omaCw5wjyzQVBkMSMzwx21Oe/OTX7GGZIujnNQCDKoNN3avqFcpcXf4477Q5LhxyJvXUgxIq6tncqvdcXOR1/g1HRvA/XEfsgOmt+c3skYJxJFnTuY7unYzIpvUNfyQ8FU82LWpzx+R4YWmbGDOU3iWjQ3lDrQ+hdah4PnPaK8GZ2kKtpMezHkPx8RwsCA5C/xMfVNDmP+c34n9P4NT9/ZkvrO5UW5xjztDw7zN7zBNpAHxMRlTK2LmJ/y+0w3nGhv1tPh/Wp4acYYmL10ve9rHs0XPswX+YIkjFogLUs9q4jkG40QSx2sYjA4w5/eGPiUt5IY23JA8uKgWrfS8ZGqPqSTSEydgp09wgZ2l9ezf7EDNYYA2f6wPcmAiGE5jfK/wRfKyT6HY7iWT7lAKxfHansybtcH0vvuF1Ko2T0Gw9LkLRDWwBd4H4jiz8k2B0/ZY9HnM02W9iSmUa/i1+p5zcn+6iViD/8r7+F+yJcBV/8FHzWNwlLvbz/O794F93OVBym+z+426ku48BETGLQ70+LJYNtmgWHItWmiKEoRg4ZbtooRE4p2r5cOviqrYX5opKmaGRVH4FEtiMySSxFEm3ZGUyD1NiVx/Efz5WhrzualPEdTVGHIIi+GXRaAmYACL9ouT+yXrp4a+oeHMZEPKe4o2KNr2xI0P5rL4Rs6TA2+kstT3NjbBoCrZz0ytKKcT5dzUSOrZa5fi0jCBtJXuyVjiOJPG83xZazeSRBLCkzDL1D0Gm1lDIwfzhJYuM6QFbaweytaB8pAfilN5BzSW4NrQ565B2NZEEIdNrfKlSq1OVAx/WXb9UQh41xCnHuTcL4Cb154vS5z5DS1Nl9IaTC7QNykdi61KuGdL9vgsKaVR829L5V5Rp5qiTh9UdTonBxUVBz3MdPVA58uib0tFXTHq8n4zlpXlbTrTilJvn90bunzA6tj8zGxWd+P7FWt/W22a3zM11rO0/Wh6p8qLFZqTed1GXzRtxEqi7IGCxrboxTA/lp0b4caF4vhQtOrgMQ/0erGhemOjezZ1liyh5Uwvrom3FM8ilRGrPBhKvE0kcZQX9F+TMLqWrsBqpmwlMpY8dwzDUWLeR18MkXcdbZyC25jyCtk9ivJRdbiFz5/Ccxu7hnj7haoZIk9DSZwOpQnvwYFMQ3qBSe2yXEJQtMXqUVVU+UHZoLmi3WaCG62lfO4WJfrlZv0Ul0UrBAhUrKIYkJclNSs8D9JuUOOd2PRpW5TNjbH5wM8Xz1B+rFphMA5/nwzUWw9+VgYO1n+4DFRtQCF+3yv5YXQ6f8Wl+/09MQbdCW7UPOxFdaku5SNqV1AGB4oHxEd3JvDYttDWOm6c0eX2MqWH+YHwBU2hTKqSZxIyf3WxLPBDQ2MToWi77zwwpzpwpNle0nmgTClzuhy1ZY7apgKGfSyY2uPOzslhZ540UUjalyyniapncK5LZXBVtrwQWTqWLVV0ood9YduHiC5JbV1miO3VhmEpE4ZYO0xGw7qnOuQIy0NFl0qvDRQPhgoLx3wOuBgZA4XSYDJFZqDmdVccDOysbZtnGLfPHT6i3bpAVw82Lb0j/EDa27QxQ+aPXZ/i0Dz7eEAYCzl1Q3VWZ0k61k9fHgSeurVL/pN01wrfJntXEdXA0NXEFvD98cYUEbmH9cqyGeq6D7fxJyyTsyWT4nfxv7X/3Qfhtvrdcb/1P/9JCPeu0XAFMX/v3ddut9GYD3EZi9blwi873XaD/ySNiRtb8E55oF7pq4waOhLrXY1p7oWpL71f1i5UzDuFxbdgzqaGNiT7dVoTJQXCxjWC8cHUaBljIQfURi4Leol87UAxm+CX7AdaO/TwOZwaBF71gG34g8CviBxRH7f080rbtoOBGmCfkhxmSPy49LSVj2WiN1jSqy3XKsd2Kq5soyq+p8/TqlWlWNqrntWbrVclnme64pkD+eE8/rH+P8F9FZ6QvdpxmKYtShocD9Ipdpc0BCqP7ZLcsqUU82oR3jRae5op8oJPtQWRZWOi5HYhQytOZUrfifpdWN/KyOj2Nco2D+KbwEDlLOzvc8QnwvqSqUI2n5+b2pij+hQdsO0vxqB7q2q3ZB5n+KoWKj/uHBwLiEOEdT/l0es2vlgYIDIfhUOMNCZbgxTb0FD9a2pDgpciPglr2zR8aizxc4ixJqpfyGNuaoT/U1NUOWx/oKjms
8E8t7CfQHamU9ZgMEUzTb2mMlOXreMhl0WbjCdNZI8c5nmy1HHtLoKVCws7DHJ+dwzjCxwe45+TbXUKe1TRp5jnzuZOl5ukyR7re7d+ABrls2pLEvcoU7Tlnshyc5yztHvA/l0dN/fzbL6kOpTwWWGnzS4dfBY/vA+DcWZwbibdYX9fLmz6JxdU6YbKxnfxbnlYaiOWftIm+6XtOX+YFMHb+LglCp1D6QNQvMBQxWOtKA2J30jthV/yxPF5WuZ3G5nLYTjz+TUc4HtRXPhtXmv7kcpelDkB4nNyj3UYqMyTuZV6Gcsj9auOB1Qqh1o728EgKblTmh+3dflz/No4TC842ho4UP2Z1tqepbZNwMlcirYz7ANj3jF1L4YD5UB9Eh77MG09c/YQPMoH57ZnoH5ErfRrY+oyhnNT4sOsxW54ruVZMF/X82yuvZRuRX5pUh3BdSxFe7neKWMaHFMft/MSWc8W5ehYDt38RrEV0JAc3NzOO5U7NE/mJ7S86GBAiodzS8fHHFDHQYcNO6DRGK2S6+D0+crPpHyc1eTi/DYlxziL5AUKXdTM7RFa7et29iI+owcXqoV/drJlSFMfvAgvGJaaT0V0xG3aplMz3lTcwofu3Oblue04sH/ctRUMGEjYr2ht3XAbnclhxBiGrnEehLNLSL+wNYXyEsxHNd+PLejJHg9nJf54y66cOSy1sC0s5NxaDrQ4T4jSyFVPtnc4cxDtZ2YkCXYLt7DNd418GPUJurdFqHTkLyybrNNAL/z1Bn7ZYwxk6yQXl1a+eCOXqpTtQKd+dmhSXq4thRuimtkBOZDVA2HsXTLXs0u7A8WDXIrjjozaT5bs1j/TWj4HVx4u/UR51LJZttXhz1VLHJSG9aXXepwgu9KEHIT90SjtnfDHW51LGtXPKirstkzj/pe3oxNbSu3nUac1zpErW885YlvqJahuC48HWxxnTnBXyd5bLY3W8tjl1hRl/J7WYThZtj6ZT7Plu/BJyI7bZHn3eKh7I8+Mx1hwjywcKAhu0ILyzJuVMZQwHWZaq32vgufTGk5UH4gI2yCsN+vtxtXfMx3PhZwd1po3bXs+4fVa67kxqZ4p6ZgV8S+Sxmm5XpCRNYIxE838UfOasHlL2fjJdurSfpzslP983mDJL+mzzWXr2jIplj0Ghirln7sxWRuAOb8GIqLlUiTfK5N4C/PVPdbTiOpp6uOP15jeMD/nD57IYXmaQO0kD+Xo4xQtuMWaI/ExVrQV+5jnKH1MgWdgMA7MAK2PvhbhWbfU42dwR09oKNuwxSd9zU4cqneP9zNNzYzBdAgnXbj8hbYyUBlaYoHlZ0TnpM93ZVudKbDltZYP0o2HZovmsMvfI3Sg5QMkB0vaSVutnTQm1mVE6UBOG6E6fMCX56xW12cC0bkIBHLu6DxjCHsstx4cyIw1me5szY6MZ+B+WcskOue/Hk9GWdb5qNHynptVjKluyLzGU2SKKK/yAXejfdE+8FESpjwQH3d2sS2Zpct03cLfuxI6vif5z8yLU4eFP5iDgUlLK8QUOcuOcsbSXynyLRZdxyvXpzLpDsvocGcL/NoS1bVVtbSXsSXKMN1MDjGzZ+A6OUGExmiLssyo5BNyeofAe7auRGAwje3Jxm2f6GHUWqhKG/ucvDbHKHKaZV4Vz/sNvvHcqS0ndDnG+LW2Lv7zYsm32+sCslVujefLNnpySLjAdPstJ61VaocP2/VM6euRUzPuTW3laoPqtA6yrqcv3WzWuvblF/v5v7E9qsu3bJ2D2YHLjrM3f70f8do4tIkbmJ/FRWEH5HP3U0MjvPhwrsXplC9p3NT2o01tyL21//wLW4dOeKRqx9G651be+zXxYqPlZg+4RXNLujP3q7abXK6vgaeG9pjMNGw32GKNgt0bGsphzsbAZz2CCzwex3pQYL3mVm3SE1vldfjlLzw/8qyv2pF7rpXxN8suqxzRY+yQvCrKJPHOXRS4uNfG+yfybi88S7XW/kH9wzngqvUJHFOlpJxf6Mw3d7chYJiEUWP7uGPsDHevgrOxlYlcyxW29GZzy5OOLQaesEHts1dP4+eiTH9VnhN7eBO5eL5E/gRX1b2zMKp/DDZLG8b2XT2ul0/sD3lGaJd2vzW8JC5PADfewHy0B5xCc1X64tn8TNeZtKd5p9b5t2+cQ3lheXkJe1keTutqViPO1Kc59lcL/736Dc8cxXL0sevbF/IepgWt1Vm4hmYypi6V22cVW5gX95fXzbX3Yk20a028tb7ZOdbDbfT3/o9//rjq0/OuGkeTtc9Q+nOnj/04e7LYf95BYv+rTwC7/MCv1x/UdeHpXPWztU7O0wqs0H/AP2767969+x7+T29JjjK76VHe+NB9FOP30Ip91dkmfhTe9Hbs9xDz701vSZ/5HgZOatlWat18D3s9TKFyQPwbWcBBCbnV63kOCt4n3gfoWdu0/lSvZ8Xx+00GnG3opE7y3o8+tEfqesYPk9QK4bPPBVZouY79DuQ3vYmDguNzlfjix7ZZmPr1ryaxAwnsaR47N70K0fhS4iAHptH25q0nQNi9GPVdgVDMT/QKvX/TwzxdXSmY/6Z3L3wrLx75sXz4OaofJbqD8FYcJx+O1P9cPfsfzQC9nhWGUUoEsJwEkbDme+nWd11nm9z0/u+7Ev//KP/o9f59/LPX+95/2EbB9/5N4yq+jjH7vX/zvWUZvvev2k9i1JAnC7NEpfx7v/7cj6vWV30H2Vh5kxcxZ78vSf+e/ILVQY3/YP75nsyyPuKP8s9/1uSi1Iw3PbZDJgIrhd6szgAvo/NLKV3CX36uzof4P9T89JP893LYXs6Hl/DiS/mx16uQj/87Eq02zVJr7B1Q5wFC0nKskmdq9+uKpz1UXQGV/1Xf/naikaohuzQTIU3dA2h9s3IGbk6GIx9qw9I0662XCgt/OpSVeje9Dy/7Qv3My2qk4tDLm+cVK/E2qY/UoVnJ3Sbdj3qWxzcF8up/h6WlbnIxjTSqsE3R0dKMxb36BGp8TU9qLYciFlz0C8hd/8gS2danJL/Yjy5H2DoP5fdLv50AkG6t1HHzBgiUSwpRJn8vm4/1ethA1Bj2qam3Jl98+HiHBCE3vQr55V0n3HUojO/9z1/v5bv7fy3vb5X71bd/fVO+Tu+E+6ZlISc844etOKZ3KvtXei2Fv0T4VvCs0HWelxKinhIywQ4p6bC6Rymp4ea/Q0pQFG2ymEYMxWRQBC1008MhxvO4JkFMB5bp9TNYVvDN/xJ/v1Q8rVinLXClv15KeM3nLm1IWqGjFszd9HAsV/iTT4WDN70yGvwe9q/6O0ooEojWsxDQ2/pJGsVe/8f/CwAA//+hqYUMpacAAA==
type: helm.sh/release.v1

Decoded json:

{
  "name": "dotnet",
  "info": {
    "first_deployed": "2023-02-14T23:49:12.655951052+01:00",
    "last_deployed": "2023-02-14T23:49:12.655951052+01:00",
    "deleted": "",
    "description": "Install complete",
    "status": "deployed",
    "notes": "\nYour .NET app is building! To view the build logs, run:\n\noc logs bc/dotnet --follow\n\nNote that your Deployment will report \"ErrImagePull\" and \"ImagePullBackOff\" until the build is complete. Once the build is complete, your image will be automatically rolled out."
  },
  "chart": {
    "metadata": {
      "name": "dotnet",
      "version": "0.0.1",
      "description": "A Helm chart to build and deploy .NET applications",
      "keywords": [
        "runtimes",
        "dotnet"
      ],
      "apiVersion": "v2",
      "annotations": {
        "chart_url": "https://github.com/openshift-helm-charts/charts/releases/download/redhat-dotnet-0.0.1/redhat-dotnet-0.0.1.tgz"
      }
    },
    "lock": null,
    "templates": [
      /* removed */
    ],
    "values": {
      "build": {
        "contextDir": null,
        "enabled": true,
        "env": null,
        "imageStreamTag": {
          "name": "dotnet:3.1",
          "namespace": "openshift",
          "useReleaseNamespace": false
        },
        "output": {
          "kind": "ImageStreamTag",
          "pushSecret": null
        },
        "pullSecret": null,
        "ref": "dotnetcore-3.1",
        "resources": null,
        "startupProject": "app",
        "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex"
      },
      "deploy": {
        "applicationProperties": {
          "enabled": false,
          "mountPath": "/deployments/config/",
          "properties": "## Properties go here"
        },
        "env": null,
        "envFrom": null,
        "extraContainers": null,
        "initContainers": null,
        "livenessProbe": {
          "tcpSocket": {
            "port": "http"
          }
        },
        "ports": [
          {
            "name": "http",
            "port": 8080,
            "protocol": "TCP",
            "targetPort": 8080
          }
        ],
        "readinessProbe": {
          "httpGet": {
            "path": "/",
            "port": "http"
          }
        },
        "replicas": 1,
        "resources": null,
        "route": {
          "enabled": true,
          "targetPort": "http",
          "tls": {
            "caCertificate": null,
            "certificate": null,
            "destinationCACertificate": null,
            "enabled": true,
            "insecureEdgeTerminationPolicy": "Redirect",
            "key": null,
            "termination": "edge"
          }
        },
        "serviceType": "ClusterIP",
        "volumeMounts": null,
        "volumes": null
      },
      "global": {
        "nameOverride": null
      },
      "image": {
        "name": null,
        "tag": "latest"
      }
    },
    "schema": "removed",
    "files": [
      {
        "name": "README.md",
        "data": "removed"
      }
    ]
  },
  "config": {
    "build": {
      "enabled": true,
      "imageStreamTag": {
        "name": "dotnet:3.1",
        "namespace": "openshift",
        "useReleaseNamespace": false
      },
      "output": {
        "kind": "ImageStreamTag"
      },
      "ref": "dotnetcore-3.1",
      "startupProject": "app",
      "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex"
    },
    "deploy": {
      "applicationProperties": {
        "enabled": false,
        "mountPath": "/deployments/config/",
        "properties": "## Properties go here"
      },
      "livenessProbe": {
        "tcpSocket": {
          "port": "http"
        }
      },
      "ports": [
        {
          "name": "http",
          "port": 8080,
          "protocol": "TCP",
          "targetPort": 8080
        }
      ],
      "readinessProbe": {
        "httpGet": {
          "path": "/",
          "port": "http"
        }
      },
      "replicas": 1,
      "route": {
        "enabled": true,
        "targetPort": "http",
        "tls": {
          "enabled": true,
          "insecureEdgeTerminationPolicy": "Redirect",
          "termination": "edge"
        }
      },
      "serviceType": "ClusterIP"
    },
    "image": {
      "tag": "latest"
    }
  },
  "manifest": "---\n# Source: dotnet/templates/service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  type: ClusterIP\n  selector:\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n  ports:\n    - name: http\n      port: 8080\n      protocol: TCP\n      targetPort: 8080\n---\n# Source: dotnet/templates/deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\n  annotations:\n    image.openshift.io/triggers: |-\n      [\n        {\n          \"from\":{\n            \"kind\":\"ImageStreamTag\",\n            \"name\":\"dotnet:latest\"\n          },\n          \"fieldPath\":\"spec.template.spec.containers[0].image\"\n        }\n      ]\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app.kubernetes.io/name: dotnet\n      app.kubernetes.io/instance: dotnet\n  template:\n    metadata:\n      labels:\n        helm.sh/chart: dotnet\n        app.kubernetes.io/name: dotnet\n        app.kubernetes.io/instance: dotnet\n        app.kubernetes.io/managed-by: Helm\n        app.openshift.io/runtime: dotnet\n    spec:\n      containers:\n        - name: web\n          image: dotnet:latest\n          ports:\n            - name: http\n              containerPort: 8080\n              protocol: TCP\n          livenessProbe:\n            tcpSocket:\n              port: http\n          readinessProbe:\n            httpGet:\n              path: /\n              port: http\n          volumeMounts:\n      volumes:\n---\n# Source: dotnet/templates/buildconfig.yaml\napiVersion: build.openshift.io/v1\nkind: BuildConfig\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  output:\n    to:\n      kind: ImageStreamTag\n      name: dotnet:latest\n  source:\n    type: Git\n    git:\n      uri: https://github.com/redhat-developer/s2i-dotnetcore-ex\n      ref: dotnetcore-3.1\n  strategy:\n    type: Source\n    sourceStrategy:\n      from:\n        kind: ImageStreamTag\n        name: dotnet:3.1\n        namespace: openshift\n      env:\n        - name: \"DOTNET_STARTUP_PROJECT\"\n          value: \"app\"\n  triggers:\n    - type: ConfigChange\n---\n# Source: dotnet/templates/imagestream.yaml\napiVersion: image.openshift.io/v1\nkind: ImageStream\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  lookupPolicy:\n    local: true\n---\n# Source: dotnet/templates/route.yaml\napiVersion: route.openshift.io/v1\nkind: Route\nmetadata:\n  name: dotnet\n  labels:\n    helm.sh/chart: dotnet\n    app.kubernetes.io/name: dotnet\n    app.kubernetes.io/instance: dotnet\n    app.kubernetes.io/managed-by: Helm\n    app.openshift.io/runtime: dotnet\nspec:\n  to:\n    kind: Service\n    name: dotnet\n  port:\n    targetPort: http\n  tls:\n    termination: edge\n    insecureEdgeTerminationPolicy: Redirect\n",
  "version": 1
}

This is a clone of issue OCPBUGS-10558. The following is the description of the original issue:

Description of problem:

When running a cluster on application credentials, this event appears repeatedly:

ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"

Version-Release number of selected component (if applicable):

 

How reproducible:

Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).

Steps to Reproduce:

1. On a living cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` event with the message `could not find information for "name-of-the-flavour"` will appear.

If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o name | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)

The event signals that MAPO wasn't able to update flavour information on the MachineSet status.

Actual results:

 

Expected results:

No issue detecting the flavour details

Additional info:

Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116

Description of problem:

See https://github.com/metal3-io/baremetal-operator/issues/1045

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3501. The following is the description of the original issue:

Description of problem:

On clusters serving Route via CRD (i.e. MicroShift), .spec.host values are not automatically assigned during Route creation, as they are on OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF

route.route.openshift.io/hello-microshift serverside-applied

$ oc get route hello-microshift -o yaml

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None
 

Expected results:

...
metadata:
  annotations:
    openshift.io/host.generated: "true"
...
spec:
  host: hello-microshift-default.foo.bar.baz
...

Actual results:

Host and host.generated annotation are missing.

Additional info:

** This change will be inert on OCP, which already has the correct behavior. **
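The defaulting the report asks for, reduced to a minimal sketch with illustrative names (this is not the actual MicroShift code): derive the host from the Route name, its namespace, and the cluster's ingress domain, matching the `<name>-<namespace>.<ingress domain>` shape shown in the output above.

~~~
package main

import "fmt"

// defaultRouteHost sketches the host value that should be generated when a
// Route is created without .spec.host; function and parameter names are
// illustrative assumptions.
func defaultRouteHost(name, namespace, ingressDomain string) string {
	return fmt.Sprintf("%s-%s.%s", name, namespace, ingressDomain)
}

func main() {
	// Matches the host shown in the reproduce steps above.
	fmt.Println(defaultRouteHost("hello-microshift", "default", "cluster.local"))
}
~~~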

 

This is a clone of issue OCPBUGS-3990. The following is the description of the original issue:

Description of problem:

This PR fails HyperShift CI with:

=== RUN TestAutoscaling/EnsureNoPodsWithTooHighPriority
util.go:411: pod csi-snapshot-controller-7bb4b877b4-q5457 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
util.go:411: pod csi-snapshot-webhook-644b6dbfb-v4lj7 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000

How reproducible:

always

Steps to Reproduce:

  1. Install HyperShift + create a guest cluster with CSI Snapshot Controller and/or Cluster Storage Operator / AWS EBS CSI driver operator running in the HyperShift managed cluster
  2. Check priorityClass of the guest control plane pods in the hosted cluster.

Alternatively, ci/prow/e2e-aws in https://github.com/openshift/hypershift/pull/1698 and https://github.com/openshift/hypershift/pull/1748 must pass.

In order to start 4.12 development, we need to merge the agent-installer branch. We need to create a PR and engage the Installer team to get it approved.

Description of the problem:

During install, we assume every PV on a host has been added to a volume group and only remove the PVs that are. PVs that are not attached to a volume group can therefore persist and prevent CoreOS from installing properly (see the sketch after the links below).

Relevant assisted installer links:

https://github.com/openshift/assisted-installer/blob/9bec593930995220a2a4550b067f5a186de3b042/src/installer/installer.go#L809 

https://github.com/openshift/assisted-installer/blob/9bec593930995220a2a4550b067f5a186de3b042/src/ops/ops.go#L414
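A rough sketch of the expected cleanup, using illustrative function names and the standard LVM CLI tools rather than the assisted-installer's actual code: remove every PV found on the installation disk, whether or not it belongs to a volume group.

~~~
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// removeAllPVs wipes every physical volume on the installation disk, not only
// the ones that already belong to a volume group. The exact flags the
// assisted installer uses may differ.
func removeAllPVs(disk string) error {
	out, err := exec.Command("pvs", "--noheadings", "-o", "pv_name,vg_name").Output()
	if err != nil {
		return err
	}
	removedVGs := map[string]bool{}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || !strings.HasPrefix(fields[0], disk) {
			continue
		}
		if len(fields) > 1 && !removedVGs[fields[1]] {
			// The PV is part of a VG: remove the VG first.
			if err := exec.Command("vgremove", "-f", fields[1]).Run(); err != nil {
				return err
			}
			removedVGs[fields[1]] = true
		}
		// Remove the PV whether or not it was ever part of a VG.
		if err := exec.Command("pvremove", "-ff", "-y", fields[0]).Run(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(removeAllPVs("/dev/sda"))
}
~~~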

 

Found while investigating triage issue https://issues.redhat.com/browse/AITRIAGE-4017 

See slack thread for more details https://coreos.slack.com/archives/C02CP89N4VC/p1663263128420489 

How reproducible:

100%

Steps to reproduce:

1. Create a host with a PV w/o a volume group

2. Add host to cluster and install 

3. Observe the install fail

Actual results:

Installation fails with 

"Error: checking for exclusive access to /dev/sda 
Caused by:
| 0: couldn't reread partition table: device is in use |
| 1: EBUSY: Device or resource busy" 

Expected results:

All PVs and VGs are removed so that the installation will succeed

Description of problem:

When installing a private cluster, the installation fails at first; you then need to run
ibmcloud is security-group-rule-add "${infra}-sg-kube-api-lb" inbound tcp --port-min 6443 --port-max 6443 --remote $sg

and then run openshift-install wait-for again.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

 

Steps to Reproduce:

1. Try to create a cluster with BYON and publish: Internal in install-config.yaml; the install fails.

Actual results:

The first install attempt fails.

Expected results:

The install should succeed in a single attempt, without manually running security-group-rule-add.

Additional info:

https://coreos.slack.com/archives/C01U40AM37F/p1664439142279079?thread_ts=1663769891.358229&cid=C01U40AM37F

This issue blocks setting up private clusters automatically.


Description:

I was testing the DHCP scenario where only rendezvousIP is specified in the agent-config.yaml and no NMStateConfig is embedded. create-cluster-and-infraenv.service fails on node0 when networkConfig is missing from agent-config.yaml. /etc/assisted/manifests/nmstateconfig.yaml is an empty file.

agent-config.yaml used:

metadata:
  name: ostest
  namespace: cluster0
spec:
  rendezvousIP: 192.168.122.2

Steps to reproduce:

1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy cluster using agent.iso
3. Log into node0 and create-cluster-and-infraenv.service will be displayed as a failed unit.

Expected:

create-cluster-and-infraenv.service in success state

Actual:

create-cluster-and-infraenv.service in failed state

Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="releaseImage version 4.11.0-0.okd-2022-08-04-074610 cpuarch x86_64"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=info msg="Registered cluster with id: 1cc3ea1a-5bbc-4c4d-ad66-6e052800fb0c"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=info msg="Registering infraenv"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="Registered cluster with id: 1cc3ea1a-5bbc-4c4d-ad66-6e052800fb0c"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="Registering infraenv"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=fatal msg="Failed to register infraenv with assisted-service: nmstateconfig should have at least one label set matching the infra-env label selector"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=fatal msg="Failed to register infraenv with assisted-service: nmstateconfig should have at least one label set matching the infra-env label selector"
Aug 05 08:27:59 control1 systemd[1]: create-cluster-and-infraenv.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 08:27:59 control1 systemd[1]: create-cluster-and-infraenv.service: Failed with result 'exit-code'.
Aug 05 08:27:59 control1 systemd[1]: Failed to start Service that creates initial cluster and infraenv.

/etc/assisted/manifests/nmstateconfig.yaml is an empty file.

[core@control1 ~]$ sudo cat /etc/assisted/manifests/nmstateconfig.yaml
[core@control1 ~]$

This is a clone of issue OCPBUGS-3508. The following is the description of the original issue:

Exposed via the fact that the periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4 job is at 0% for at least the past two weeks over approximately 65 runs.

Testgrid shows that this job started failing in a very consistent way on Oct 25th at about 8am UTC: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4

6 disruption tests fail, all with alarming consistency, virtually always claiming exactly 8s of disruption against a max allowed of 1s.

And then the openshift-tests [sig-arch] "events should not repeat pathologically" test fails with an odd signature:

{  6 events happened too frequently

event happened 35 times, something is wrong: node/master-2 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-2 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-2 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientPID roles/control-plane,master Node master-2 status is now: NodeHasSufficientPID
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-1 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-1 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientPID roles/control-plane,master Node master-1 status is now: NodeHasSufficientPID}

The two types of tests started failing together exactly, and the disruption measurements are bizarrely consistent: every single time we see precisely 8s for kube-api, cache-kube-api, openshift-api, cache-openshift-api, oauth-api, cache-oauth-api. It's always these 6, and it seems to be always exactly 8 seconds. I cannot state enough how strange this is. It almost implies that something is happening on a very consistent schedule.

Occasionally these are accompanied by 1-2s of disruption for those backends with new connections, but sometimes not as well.

It looks like all of the disruption consistently happens within two very long tests:

4s within: [sig-network] services when running openshift ipv4 cluster ensures external ip policy is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]

4s within: [sig-network] services when running openshift ipv4 cluster on bare metal [apigroup:config.openshift.io] ensures external auto assign cidr is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]

Both tests appear to have run prior to Oct 25, so I don't think it's a matter of new tests breaking something or getting unskipped. Both tests also always pass, but appear to be impacting the cluster?

The masters going NotReady also appear to fall within the above two tests, though this does not seem to directly match when we measure disruption; bear in mind there is a 40s delay before a node goes NotReady.

Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4/1590640492373086208 where the above are from:

Two of the three master nodes appear to go NodeNotReady a couple of times throughout the run, as visible in the spyglass chart under the node state row on the left. master-0 does not appear here, but it does exist. (I suspect it holds the leader lease and thus is the node reporting the others going NotReady.)

From the master-0 kubelet log in must-gather we can see one of these examples where it reports that master-2 has not checked in:

2022-11-10T10:38:35.874090961Z I1110 10:38:35.873975       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.00700561s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletReady,Message:kubelet is posting ready status,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874056       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007097549s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874067       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007110285s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874076       1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007119541s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881705       1 controller_utils.go:181] "Recording status change event message for node" status="NodeNotReady" node="master-2"
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881733       1 controller_utils.go:120] "Update ready status of pods on node" node="master-2"
2022-11-10T10:38:35.881820988Z I1110 10:38:35.881799       1 controller_utils.go:138] "Updating ready status of pod to false" pod="metal3-b7b69fdbb-rfbdj"
2022-11-10T10:38:35.881893234Z I1110 10:38:35.881858       1 topologycache.go:179] Ignoring node master-2 because it has an excluded label
2022-11-10T10:38:35.881893234Z W1110 10:38:35.881886       1 topologycache.go:199] Can't get CPU or zone information for worker-0 node
2022-11-10T10:38:35.881903023Z I1110 10:38:35.881892       1 topologycache.go:215] Insufficient node info for topology hints (0 zones, %!s(int64=0) CPU, false)
2022-11-10T10:38:35.881932172Z I1110 10:38:35.881917       1 controller.go:271] Node changes detected, triggering a full node sync on all loadbalancer services
2022-11-10T10:38:35.882290428Z I1110 10:38:35.882270       1 event.go:294] "Event occurred" object="master-2" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node master-2 status is now: NodeNotReady"

Now from master-2's kubelet log around that time, 40 seconds earlier puts us at 10:37:55, so we'd be looking for something odd around there.

A few potential lines:

Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232495    1930 patch_prober.go:29] interesting pod/kube-controller-manager-guard-master-2 container/guard namespace/openshift-kube-controller-manager: Readiness probe status=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused" start-of-body=

Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232549    1930 prober.go:114] "Probe failed" probeType="Readiness" pod="openshift-kube-controller-manager/kube-controller-manager-guard-master-2" podUID=8be2c6c1-f8f6-4bf0-b26d-53ce487354bd containerName="guard" probeResult=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused"

Nov 10 10:38:12.238273 master-2 kubenswrapper[1930]: E1110 10:38:12.238229    1930 controller.go:187] failed to update lease, error: Put "https://api-int.ostest.test.metalkube.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master-2?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Nov 10 10:38:13.034109 master-2 kubenswrapper[1930]: E1110 10:38:13.034077    1930 kubelet_node_status.go:487] "Error updating node status, will retry" err="error getting node \"master-2\": Get \"https://api-int.ostest.test.metalkube.org:6443/api/v1/nodes/master-2?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

At 10:38:40 all kinds of master-2 watches time out with messages like:

Nov 10 10:38:40.244399 master-2 kubenswrapper[1930]: W1110 10:38:40.244272    1930 reflector.go:347] object-"openshift-oauth-apiserver"/"kube-root-ca.crt": watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

And then suddenly we're back online:

Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252131    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientMemory"
Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252156    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasNoDiskPressure"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252165    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientPID"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252177    1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeReady"
Nov 10 10:38:47.904430 master-2 kubenswrapper[1930]: I1110 10:38:47.904373    1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.904842 master-2 kubenswrapper[1930]: I1110 10:38:47.904662    1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.907900 master-2 kubenswrapper[1930]: I1110 10:38:47.907872    1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="started" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:48.431448 master-2 kubenswrapper[1930]: I1110 10:38:48.431414    1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764029    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube-scheduler-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764059    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/keepalived-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764077    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/coredns-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764086    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/haproxy-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764106    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764113    1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-controller-manager/kube-controller-manager-master-2" status=Running

Also curious:

Nov 10 10:37:50.318237 master-2 ovs-vswitchd[1324]: ovs|00251|connmgr|INFO|br0<->unix#468: 2 flow_mods in the last 0 s (2 deletes)
Nov 10 10:37:50.342965 master-2 ovs-vswitchd[1324]: ovs|00252|connmgr|INFO|br0<->unix#471: 4 flow_mods in the last 0 s (4 deletes)
Nov 10 10:37:50.364271 master-2 ovs-vswitchd[1324]: ovs|00253|bridge|INFO|bridge br0: deleted interface vethcb8d36e6 on port 41

Nov 10 10:37:53.579562 master-2 NetworkManager[1336]: <info>  [1668076673.5795] dhcp4 (enp2s0): state changed new lease, address=192.168.111.22

Could these be related to the tests that these problems appear to coincide with?

The job was in terrible shape even before, but it looks like upgrades started failing more consistently around Oct 2-4.

Sample failed run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1579289246391341056

It looks like we fully lose the API (service unavailable); no artifacts get gathered, and mass disruption is reported.

This is a clone of issue OCPBUGS-3612. The following is the description of the original issue:

Description of problem:

OCP 4.12 deployments making use of secondary bridge br-ex1 for CNI fail to start ovs-configuration service, with multiple failures.

Version-Release number of selected component (if applicable):

Openshift 4.12.0-rc.0 (2022-11-10)

How reproducible:

So far, at least one of the four worker nodes always fails; it is not always the same node, and sometimes several nodes fail.

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   - provisioningNetwork: enabled
   - machine network: single stack ipv4
   - disconnected installation
   - ovn-kubernetes with hybrid-networking setup
   - LACP bonding setup using MC manifests at day1
     * bond0 -> baremetal 192.168.32.0/24 (br-ex)
     * bond0.662  -> interface for secondary bridge (br-ex1) 192.168.66.128/26
   - secondary bridge defined in /etc/ovnk/extra_bridge using MC Manifest
   
3. deploy the cluster
- Usually the deployment completes
- Nodes show Ready status, but on some nodes ovs-configuration fails
- Subsequent MC changes fail because the MCP cannot roll out configurations to nodes with the failure.

NOTE: This impacts testing by our partners Verizon and F5, because we are validating their CNFs before the OCP 4.12 release and we need a secondary bridge for CNI.

Actual results:

br-ex1 and all its related ovs-ports and interfaces fail to activate, and the ovs-configuration service fails.

Expected results:

br-ex1 and all its related ovs-ports and interfaces activate successfully, and the ovs-configuration service starts successfully.

Additional info:
1. Nodes and MCP info

$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-1   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-2   Ready    control-plane,master   8h      v1.25.2+f33d98e
worker-0   Ready    worker                 7h26m   v1.25.2+f33d98e
worker-1   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-2   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-3   Ready    worker                 7h25m   v1.25.2+f33d98e
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE                         
master   rendered-master-210a69a0b40162b2f349ea3a5b5819e5   True      False      False      3              3                   3                     0                      7h57m                       
worker   rendered-worker-e8a62c86ce16e98e45e3166847484cf0   False     True       True       4              2                   2                     1                      7h57m 

2. When logging in to the nodes via SSH, we can see that ovs-configuration fails, and the ovs-configuration service logs show the following errors (full log attached as worker-0-ovs-configuration.log):

$ ssh core@worker-0
---
Last login: Sat Nov 12 21:33:58 2022 from 192.168.62.10
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  ovs-configuration.service
  stalld.service

[core@worker-0 ~]$ sudo journalctl -u ovs-configuration | less
...
Nov 12 15:27:54 worker-0 configure-ovs.sh[8237]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == vlan ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 178: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == bond ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 191: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == team ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 203: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + iface_type=802-3-ethernet
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' '!' '' = 0 ']'
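The "unary operator expected" errors above are the classic bash symptom of an unquoted variable that turned out empty: when the nmcli query prints nothing, the left-hand operand of the test vanishes. A minimal sketch of that failure mode (the variable name is illustrative, not the actual configure-ovs.sh code):

```
# hypothetical illustration of the error seen in the log above
iface_type=""                       # the nmcli query returned nothing
if [ $iface_type == vlan ]; then    # expands to: [ == vlan ]  ->  "[: ==: unary operator expected"
  echo "vlan connection"
fi
if [ "$iface_type" == vlan ]; then  # quoting keeps the test valid; it simply evaluates to false
  echo "vlan connection"
fi
```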

3. We observe that on the failed node (worker-0) the ovs-if-phys1 connection has an ethernet type, while a working node (worker-1) shows a vlan type for the same connection, including the VLAN info:

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection                                                                                                            
[connection]
id=ovs-if-phys1
uuid=aea14dc9-2d0c-4320-9c13-ddf3e64747bf
type=ethernet
autoconnect=false
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=e61c56f7-f3ba-40f7-a1c1-37921fc6c815
slave-type=ovs-port

[ethernet]
cloned-mac-address=B8:83:03:91:C5:2C
mtu=1500

[ovs-interface]
type=system

[core@worker-1 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=9a019885-3cc1-4961-9dfa-6b7f996556c4
type=vlan
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=877acf53-87d7-4cdf-a078-000af4f962c3
slave-type=ovs-port
timestamp=1668265640

[ethernet]
cloned-mac-address=B8:83:03:91:C5:E8
mtu=9000

[ovs-interface]
type=system

[vlan]
flags=1
id=662
parent=bond0

4. Another problem we observe is that we specifically disable IPv6 in the bond0.662 connection, but the generated connection for br-ex1 has ipv6 method=auto, when it should be disabled.

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/bond0.662.nmconnection 
[connection]
id=bond0.662
type=vlan
interface-name=bond0.662
autoconnect=true
autoconnect-priority=99

[vlan]
parent=bond0
id=662

[ethernet]
mtu=9000

[ipv4]
method=auto
dhcp-timeout=2147483647
never-default=true

[ipv6]
method=disabled
never-default=true

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/br-ex1.nmconnection
[connection]
id=br-ex1
uuid=df67dcd9-4263-4707-9abc-eda16e75ea0d
type=ovs-bridge
autoconnect=false
autoconnect-slaves=1
interface-name=br-ex1

[ethernet]
mtu=1500

[ovs-bridge]

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto

[proxy]
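
As a hedged manual check on an affected node (assuming the generated profile is named br-ex1, and keeping in mind that configure-ovs may regenerate it), one could inspect and temporarily correct the IPv6 method with nmcli:

```
# inspect the IPv6 method that configure-ovs generated for the bridge
sudo nmcli -g ipv6.method connection show br-ex1
# temporary manual workaround only; does not address why the profile was generated this way
sudo nmcli connection modify br-ex1 ipv6.method disabled
sudo nmcli connection up br-ex1
```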

5. All journals, the must-gather, and some deployment files can be found in our CI console (login with Red Hat SSO): https://www.distributed-ci.io/jobs/46459571-900f-43df-8798-d36b322d26f4/files
Some of the logs are also attached here to make things easier; the worker-0 files are from the node with the OVS issue, and the worker-1 files are from a healthy worker, in case you want to compare.

11_master-bonding.yaml
11_worker-bonding.yaml
install-config.yaml
journal-worker-0.log
journal-worker-1.log
must_gather.tar.gz
sosreport-worker-0-2022-11-12-csbyqfe.tar.xz
sosreport-worker-1-2022-11-12-ubltjdn.tar.xz
worker-0-ip-nmcli-info.log
worker-0-ovs-configuration.log
worker-1-ip-nmcli-info.log
worker-1-ovs-configuration.log

Please let us know if you need any additional information.

This is a clone of issue OCPBUGS-7780. The following is the description of the original issue:

Description of problem:

4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel for at least every other minor (if they're using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.

Version-Release number of selected component (if applicable):

4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.

How reproducible:

100%

Steps to Reproduce:

1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.

Actual results:

null

Expected results:

{"baselineCapabilitySet":"None"}

This is a clone of issue OCPBUGS-12722. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12165. The following is the description of the original issue:

Description of problem:

While updating a cluster to 4.12.11, which contains the bug fix for OCPBUGS-7999 (https://issues.redhat.com/browse/OCPBUGS-7999), the 4.12.z backport of OCPBUGS-2783 (https://issues.redhat.com/browse/OCPBUGS-2783), it seems that the older {Custom|Default}RouteSync{Degraded|Progressing} conditions are not cleaned up as they should be, as per the OCPBUGS-2783 resolution, while the newer ones are added.

Due to this, on an upgrade to 4.12.11 (or higher, until this bug is fixed), it is possible to hit a problem very similar to the one that led to OCPBUGS-2783 in the first place, but while upgrading to 4.12.11.

So, we need to do a proper cleanup of the older conditions.

Version-Release number of selected component (if applicable):

4.12.11 and higher

How reproducible:

Always, as far as the wrong conditions are concerned. It only leads to issues if one of the wrong conditions was in an unhealthy state.

Steps to Reproduce:

1. Upgrade
2.
3.

Actual results:

Both new (and correct) conditions plus older (and wrong) conditions.

Expected results:

Both new (and correct) conditions only.

Additional info:

The problem seems to be that the stale conditions controller is created[1] with a list that says CustomRouteSync and DefaultRouteSync, while that list should be CustomRouteSyncDegraded, CustomRouteSyncProgressing, DefaultRouteSyncDegraded and DefaultRouteSyncProgressing. Reading the controller's source code, it seems that it does not accept prefixes but performs a literal comparison of condition names.

[1] - https://github.com/openshift/console-operator/blob/0b54727/pkg/console/starter/starter.go#L403-L404

This is a clone of issue OCPBUGS-13765. The following is the description of the original issue:

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
Metal³ now allows these paths in the `name` hint (see OCPBUGS-13080), so the IPI installer's implementation using terraform must be changed to match.

This is a clone of issue OCPBUGS-5546. The following is the description of the original issue:

Description of problem:

The Machine API provider for Azure sets the MachineConfig.ObjectMeta.Name to the cluster name. The value of this field was never actually used anywhere, but it was mistakenly brought across in a refactor of the machine scope.

It causes a diff between the defaulted machines and the machines once the actuator has seen them, which in turn is causing issues with CPMS.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a machine with providerSpec.metadata.name unset
2. 
3.

Actual results:

Name gets populated to cluster name

Expected results:

Name should not be populated

Additional info:

 

Description of problem:

[OVN][OSP] After rebooting the egress node, the egress IP cannot be applied anymore.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-07-181244

How reproducible:

Happens frequently in automation, but we could not reproduce it manually.

Steps to Reproduce:

1. Label one node as egress node

2.
Configure one egressIP object.
STEP: Check that one EgressIP is assigned in the object.

Nov  8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]

3.
Reboot the node and wait for the node to become Ready.


Actual results:

The EgressIP cannot be applied anymore, even after waiting more than 1 hour.
 oc get egressip
NAME             EGRESSIPS       ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-47031   192.168.54.72    

Expected results:

The egressIP should be applied correctly.

Additional info:


Some logs
E1108 07:29:41.849149       1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288       1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable


W1108 07:33:37.401149       1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348       1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465       1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)

After this log, there seem to be no further log entries related to "192.168.54.72".
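
Given the NoMatchingNodeFound event above, a hedged first check after the reboot is whether the egress-assignable label survived, and to re-apply it if it did not (node and object names taken from the logs above):

```
oc get node huirwang-1108c-pg2mt-worker-0-2fn6q --show-labels | grep -o 'k8s.ovn.org/egress-assignable[^,]*'
oc label node huirwang-1108c-pg2mt-worker-0-2fn6q k8s.ovn.org/egress-assignable="" --overwrite
oc get egressip egressip-47031
```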

Customers have deployed OpenShift using CloudFormation following "Example 4.55. CloudFormation template for the VPC" in the document below:
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/installing/index#installing-restricted-networks-aws
The CloudFormation template uses Python 3.7 with Lambda.
Since Python 3.7 is reaching its EOL, what kind of effect will it have if it becomes unusable?
Is there any immediate effect? Will there be any impact when adding worker nodes?
OCP Version & Channel: 4.10
Cloud Platform: AWS

The application dropdown menu uses a custom component with a configuration to favorite applications, similar to how the Project selection menu favorites projects, but its UX is inconsistent in the way it looks and behaves.

 

The Project selection UI element uses the PatternFly Menu component. It would be better to make the Application dropdown menu's look and behavior consistent with the PatternFly Menu component.

Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached(attaching yaml manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

Since 4.11, OCP comes with an OperatorHub definition which declares a capability and enables all catalog sources. For OKD we want to enable just community-operators, as users may not have a Red Hat pull secret set. This commit ensures that the OKD version of the marketplace operator gets its own OperatorHub manifest with a custom set of operator catalogs enabled.
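
As a hedged illustration of the intended end state (expressed as a runtime patch rather than the actual OKD manifest), the OperatorHub config would disable the default sources and enable only community-operators:

```
oc patch operatorhub cluster --type merge \
  -p '{"spec":{"disableAllDefaultSources":true,"sources":[{"name":"community-operators","disabled":false}]}}'
```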

Description of the problem:

I installed a cluster with OCS and CNV.

The issue is that the cluster events contain repeated messages:

1/9/2022, 6:17:31 PM    Operator ocs status: available message: install strategy completed with no errors
1/9/2022, 6:17:30 PM    Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:17:30 PM    Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:17:06 PM    Successfully completed installing cluster
1/9/2022, 6:17:06 PM    Updated status of the cluster to installed
1/9/2022, 6:17:01 PM    Operator ocs status: available message: install strategy completed with no errors
1/9/2022, 6:17:00 PM    Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:17:00 PM    Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:16:31 PM    Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:16:30 PM    Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:16:30 PM    Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:16:01 PM    Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:16:00 PM    Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:16:00 PM    Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:15:31 PM    Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:15:31 PM    Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:15:30 PM    Operator cnv status: available message: install strategy completed with no errors

 

How reproducible:

100%

Steps to reproduce:

1. Install cluster with OCS and CNV

2. Watch cluster events

Actual results:

Repeated messages when an OLM operator completes installation.

Expected results:

One event record per OLM operator that finished successfully.

This is a clone of issue OCPBUGS-14784. The following is the description of the original issue:

Description of problem:

'hostedcluster.spec.configuration.ingress.loadBalancer.platform.aws.type' is ignored
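
A hedged example of setting the field on an existing HostedCluster (name and namespace are placeholders; whether the controller then honors the value is exactly what this bug is about):

```
oc patch hostedcluster my-cluster -n clusters --type merge \
  -p '{"spec":{"configuration":{"ingress":{"loadBalancer":{"platform":{"aws":{"type":"NLB"}}}}}}}'
```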

Version-Release number of selected component (if applicable):

 

How reproducible:

set field to 'NLB'

Steps to Reproduce:

1. set the field to 'NLB'
2.
3.

Actual results:

a classic load balancer is created

Expected results:

Should create a Network load balancer

Additional info:

 

This is a clone of issue OCPBUGS-13086. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12964. The following is the description of the original issue:

Description of problem:

While installing OCP on AWS, the user can set metadataService auth to Required in order to use IMDSv2; in that case all the VMs are required to use it.
Currently the bootstrap machine always runs with Optional, which can be blocked on the user's AWS account and will fail the installation process.

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Install aws cluster and set metadataService to Required

Steps to Reproduce:

1.
2.
3.

Actual results:

Bootstrap has IMDSv2 set to optional

Expected results:

All vms had IMDSv2 set to required

Additional info:

 

Description of problem:


We are facing the same issue as JIRA[1] in OCP 4.12 and need that bug's solution backported to OCP 4.12.

JIRA[1]: https://issues.redhat.com/browse/OCPBUGS-14064

Port 9447 is exposed from the cluster on one of the control-plane nodes and uses a weak cipher and TLS 1.0/TLS 1.1, which is incompatible with the security standards for our product release. Either we should be able to disable this port, or the cipher and TLS version should be updated to meet the security standards; as you are aware, TLS 1.0 and TLS 1.1 are old and already deprecated.

We confirmed that FIPS was enabled during cluster deployment by passing the following key-value pair in the config file:
fips: true
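
A hedged way to confirm which protocol versions the exposed port still accepts, from a host with network access to the node (the node address is a placeholder, and the client's crypto policy may need to permit TLS 1.1 for the first command):

```
# handshake completes if the server still accepts TLS 1.1
openssl s_client -connect <control-plane-node>:9447 -tls1_1 </dev/null
# compare with an explicit TLS 1.2 connection
openssl s_client -connect <control-plane-node>:9447 -tls1_2 </dev/null
```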

In JIRA[1] it is suggested to open a separate bug for the backport.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

For some reason, some of the packets of a DNS conversation to the openshift-dns/dns-default service cluster IP don't get properly de-NATed, i.e. the reply packet has the pod IP as its source IP instead of the service IP.

Version-Release number of selected component (if applicable):

4.10.25

How reproducible:

Sometimes

Steps to Reproduce:

1. Try to resolve DNS with cluster DNS

Actual results:

DNS timeout. Reply packets have the pod IP instead of the service IP the request was sent to.

Expected results:

DNS working.

Additional info:

I'll elaborate about this in the attachments, but I could find nothing wrong in nbdb or any OVN-Kubernetes or OVN logs that rang a bell.
The only interesting thing I could see was that `conntrack -L` had no reference to this conversation, so it makes some sense that the reply packet address is not translated back to the service IP, but I have not been able to find the reason for this.
The query/response packets can be correlated via DNS transaction ID.
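
A hedged capture recipe for the next occurrence, matching the conntrack and transaction-ID observations above (the service IP is a placeholder):

```
# check whether conntrack currently tracks the DNS conversation to the service IP
conntrack -L -p udp --dport 53 | grep <dns-service-ip>
# capture DNS traffic; the transaction ID in the payload lets you pair query and reply
tcpdump -ni any udp port 53 -w /tmp/dns.pcap
```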

Our CMO e2e tests create several containers besides the standard CMO deployment. These pods currently do not set any security context capabilities, which produces a warning like the following:

W0705 08:35:38.590283 15206 warnings.go:70] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "alertmanager-webhook-e2e-testutil" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "alertmanager-webhook-e2e-testutil" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "alertmanager-webhook-e2e-testutil" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "alertmanager-webhook-e2e-testutil" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

We should be proactive and set security capability constraints. From this run, this seems to impact the following pods/containers:

  • alertmanager-webhook-e2e-testutil
  • prometheus-example-app

Both are used more than once.

Relevant docs: https://docs.openshift.com/container-platform/4.10/authentication/managing-security-context-constraints.html#security-context-constraints-about_configuring-internal-oauth
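
As a hedged sketch of what the warning asks for (pod and image names are illustrative, not the actual e2e manifests), a test pod satisfying the restricted profile would set:

```
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-testutil
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: testutil
    image: quay.io/example/testutil:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
EOF
```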

Description of problem:

prometheus-k8s-0 ends up in CrashLoopBackOff with level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests

Version-Release number of selected component (if applicable):

4.11.6

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0 

Actual results:

NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m

Expected results:

Running

Additional info:

Attaching must-gather.

The pod recovers successfully after deleting/re-creating.
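
As a hedged illustration, that manual recovery amounts to deleting the pod and letting the statefulset recreate it:

```
oc -n openshift-monitoring delete pod prometheus-k8s-0
oc -n openshift-monitoring get pods prometheus-k8s-0 -w
```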


[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"

This is a clone of issue OCPBUGS-10433. The following is the description of the original issue:

Description of problem:

When the CNO is managed by Hypershift, multus-admission-controller does not have correct RollingUpdate parameters meeting the Hypershift requirements outlined here: https://github.com/openshift/hypershift/blob/646bcef53e4ecb9ec01a05408bb2da8ffd832a14/support/config/deployment.go#L81
```
There are two standard cases currently with hypershift: HA mode where there are 3 replicas spread across zones and then non ha with one replica. When only 3 zones are available you need to be able to set maxUnavailable in order to progress the rollout. However, you do not want to set that in the single replica case because it will result in downtime.
```
So when multus-admission-controller has more than one replica the RollingUpdate parameters should be
```
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
```

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster using Hypershift
2. Check the rolling update parameters of multus-admission-controller
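
A hedged way to inspect the deployed strategy (the hosted control plane namespace is a placeholder):

```
oc get deployment multus-admission-controller -n <hosted-control-plane-namespace> \
  -o jsonpath='{.spec.strategy}{"\n"}'
```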

Actual results:

the operator has default parameters: {"rollingUpdate":{"maxSurge":"25%","maxUnavailable":"25%"},"type":"RollingUpdate"}

Expected results:

{"rollingUpdate":{"maxSurge":0,"maxUnavailable":1},"type":"RollingUpdate"}

Additional info:

 

Description of problem:
When "Service Binding Operator" is successfully installed in the cluster for the first time, the page will automatically redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" " 

Notice: This issue only happened when the user installed "Service Binding Operator" for the first time. If the user uninstalls and re-installs the operator again, this issue will be gone 

Version-Release number of selected components (if applicable):
4.12.0-0.nightly-2022-08-12-053438

How reproducible:
Always

Steps to Reproduce:

  1. Login to OCP web console. Go to Operators -> OperatorHub page
  2. Install "Service Binding Operator", wait until finish, check the page
  3.  

Actual results:
The page will redirect to Operator installation page with the error message "A subscription for this Operator already exists in Namespace "XXX" " 
 
Expected results:
The page should stay on the install page, with the message "Installed operator- ready for use"

Additional info:

Please find the attached snap for more details 

This is a clone of issue OCPBUGS-15335. The following is the description of the original issue:

Description of problem:

After an upgrade, customers are observing anomalies in PipelineRun status and logs.
They observe this even when a PipelineRun is successful: the status and logs are still anomalous.
They are only getting logs like the one below:
Tasks Completed: 3 (Failed: 1, Cancelled 0), Skipped: 1.

Version-Release number of selected component (if applicable):

Red Hat Pipeline operator 1.11

How reproducible:

Red Hat Pipeline Operator 1.11 should be installed

Steps to Reproduce:

1. Import a repo using Import from git and enable the Pipeline
2. Rerun the Pipeline
3.

Actual results:

Pipeline failed with log Tasks Completed: 3 (Failed: 1, Cancelled 0), Skipped: 1.

Expected results:

Pipeline should succeed and full log should be shown

Additional info:

https://redhat-internal.slack.com/archives/CSPS1077U/p1687242065844079

This is a clone of issue OCPBUGS-10888. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10887. The following is the description of the original issue:

Description of problem:

Following https://bugzilla.redhat.com/show_bug.cgi?id=2102765 and https://issues.redhat.com/browse/OCPBUGS-2140, problems with OpenID group sync have been resolved.

Yet the problem documented in https://bugzilla.redhat.com/show_bug.cgi?id=2102765 still exists, and we see that groups that are being removed remain part of the cache in oauth-apiserver, causing a panic of the respective components and login failures for potentially affected users.

So in general, it looks like the oauth-apiserver cache is not properly refreshing or handling the OpenID groups being synced.

E1201 11:03:14.625799       1 runtime.go:76] Observed a panic: interface conversion: interface {} is nil, not *v1.Group
goroutine 3706798 [running]:
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:103 +0xb0
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:80 +0x2a
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1.1()
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:89 +0x250
panic({0x1aeab00, 0xc001400390})
    runtime/panic.go:838 +0x207
github.com/openshift/library-go/pkg/oauth/usercache.(*GroupCache).GroupsFor(0xc00081bf18?, {0xc000c8ac03?, 0xc001400360?})
    github.com/openshift/library-go@v0.0.0-20211013122800-874db8a3dac9/pkg/oauth/usercache/groups.go:47 +0xe7
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).processGroups(0xc0002c8880, {0xc0005d4e60, 0xd}, {0xc000c8ac03, 0x7}, 0x1?)
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:101 +0xb5
github.com/openshift/oauth-server/pkg/groupmapper.(*UserGroupsMapper).UserFor(0xc0002c8880, {0x20f3c40, 0xc000e18bc0})
    github.com/openshift/oauth-server/pkg/groupmapper/groupmapper.go:83 +0xf4
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).login(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0xc0015d8200, 0xc001438140?, {0xc0000e7ce0, 0x150})
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:209 +0x74f
github.com/openshift/oauth-server/pkg/oauth/external.(*Handler).ServeHTTP(0xc00022bc20, {0x20eebb0, 0xc00041b058}, 0x0?)
    github.com/openshift/oauth-server/pkg/oauth/external/handler.go:180 +0x74a
net/http.(*ServeMux).ServeHTTP(0x1c9dda0?, {0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    net/http/server.go:2462 +0x149
github.com/openshift/oauth-server/pkg/server/headers.WithRestoreAuthorizationHeader.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:27 +0x10f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0005e0280?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authorization.go:64 +0x498
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x2f6cea0?, {0x20eebb0?, 0xc00041b058?}, 0x3?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/maxinflight.go:187 +0x2a4
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x11?, {0x20eebb0?, 0xc00041b058?}, 0x1aae340?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/impersonation.go:50 +0x21c
net/http.HandlerFunc.ServeHTTP(0xc000d52120?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20eebb0?, 0xc00041b058?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x20eebb0, 0xc00041b058}, 0xc0015d8200)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0xc0015d8100?, {0x20eebb0?, 0xc00041b058?}, 0xc000531930?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAudit.func1({0x7fae682a40d8?, 0xc00041b048}, 0x9dbbaa?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit.go:111 +0x549
net/http.HandlerFunc.ServeHTTP(0xc00003def0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:79 +0x178
net/http.HandlerFunc.ServeHTTP(0x0?, {0x7fae682a40d8?, 0xc00041b048?}, 0x0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackCompleted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:103 +0x1a5
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfd00?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x7fae682a40d8, 0xc00041b048}, 0xc0015d8100)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/authentication.go:80 +0x8b9
net/http.HandlerFunc.ServeHTTP(0x20f0f20?, {0x7fae682a40d8?, 0xc00041b048?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filterlatency/filterlatency.go:88 +0x46b
net/http.HandlerFunc.ServeHTTP(0xc0019f5890?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc000848764?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithCORS.func1({0x7fae682a40d8, 0xc00041b048}, 0xc000e69e00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/cors.go:75 +0x10b
net/http.HandlerFunc.ServeHTTP(0xc00149a380?, {0x7fae682a40d8?, 0xc00041b048?}, 0xc0008487d0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:108 +0xa2
created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:94 +0x2cc

goroutine 3706802 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19eb780?, 0xc001206e20})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:74 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0016aec60, 0x1, 0x1560f26?})
    k8s.io/apimachinery@v0.22.2/pkg/util/runtime/runtime.go:48 +0x75
panic({0x19eb780, 0xc001206e20})
    runtime/panic.go:838 +0x207
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0005047c8, {0x20eecd0?, 0xc0010fae00}, 0xdf8475800?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/timeout.go:114 +0x452
k8s.io/apiserver/pkg/endpoints/filters.withRequestDeadline.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_deadline.go:101 +0x494
net/http.HandlerFunc.ServeHTTP(0xc0016af048?, {0x20eecd0?, 0xc0010fae00?}, 0xc0000bc138?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69d00)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/waitgroup.go:59 +0x177
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x7fae705daff0?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1({0x20eecd0, 0xc0010fae00}, 0xc000e69c00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/audit_annotations.go:37 +0x230
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1({0x20eecd0?, 0xc0010fae00}, 0xc000e69b00)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/warning.go:35 +0x2bb
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20eecd0?, 0xc0010fae00?}, 0xd?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1({0x20eecd0, 0xc0010fae00}, 0x0?)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/cachecontrol.go:31 +0x126
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20eecd0?, 0xc0010fae00?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/httplog.WithLogging.func1({0x20ef480?, 0xc001c20620}, 0xc000e69a00)
    k8s.io/apiserver@v0.22.2/pkg/server/httplog/httplog.go:103 +0x518
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0x20cfc08?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1({0x20ef480, 0xc001c20620}, 0xc000e69900)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/requestinfo.go:39 +0x316
net/http.HandlerFunc.ServeHTTP(0x20f0f58?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3f70?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withRequestReceivedTimestampWithClock.func1({0x20ef480, 0xc001c20620}, 0xc000e69800)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/request_received_time.go:38 +0x27e
net/http.HandlerFunc.ServeHTTP(0x419e2c?, {0x20ef480?, 0xc001c20620?}, 0xc0007c3e40?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1({0x20ef480?, 0xc001c20620?}, 0xc0004ff600?)
    k8s.io/apiserver@v0.22.2/pkg/server/filters/wrap.go:74 +0xb1
net/http.HandlerFunc.ServeHTTP(0x1c05260?, {0x20ef480?, 0xc001c20620?}, 0x8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/endpoints/filters.withAuditID.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    k8s.io/apiserver@v0.22.2/pkg/endpoints/filters/with_auditid.go:66 +0x40d
net/http.HandlerFunc.ServeHTTP(0x1c9dda0?, {0x20ef480?, 0xc001c20620?}, 0xd?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithPreserveAuthorizationHeader.func1({0x20ef480, 0xc001c20620}, 0xc000e69600)
    github.com/openshift/oauth-server/pkg/server/headers/oauthbasic.go:16 +0xe8
net/http.HandlerFunc.ServeHTTP(0xc0016af9d0?, {0x20ef480?, 0xc001c20620?}, 0x16?)
    net/http/server.go:2084 +0x2f
github.com/openshift/oauth-server/pkg/server/headers.WithStandardHeaders.func1({0x20ef480, 0xc001c20620}, 0x4d55c0?)
    github.com/openshift/oauth-server/pkg/server/headers/headers.go:30 +0x18f
net/http.HandlerFunc.ServeHTTP(0x0?, {0x20ef480?, 0xc001c20620?}, 0xc0016afac8?)
    net/http/server.go:2084 +0x2f
k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00098d622?, {0x20ef480?, 0xc001c20620?}, 0xc000401000?)
    k8s.io/apiserver@v0.22.2/pkg/server/handler.go:189 +0x2b
net/http.serverHandler.ServeHTTP({0xc0019f5170?}, {0x20ef480, 0xc001c20620}, 0xc000e69600)
    net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc0002b1720, {0x20f0f58, 0xc0001e8120})
    net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
    net/http/server.go:3071 +0x4db

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.11.13

How reproducible:

- Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.11
2. Configure OpenID Group Sync (as per https://docs.openshift.com/container-platform/4.11/authentication/identity_providers/configuring-oidc-identity-provider.html#identity-provider-oidc-CR_configuring-oidc-identity-provider)
3. Have users with hundrets of groups
4. Login and after a while, remove some Groups from the user in the IDP and from OpenShift Container Platform 
5. Try to login again and see the panic in oauth-apiserver

Actual results:

User is unable to login and oauth pods are reporting a panic as shown above

Expected results:

oauth-apiserver should invalidate the cache quickly to remove potentially invalid references to non-existing groups

Additional info:

 

This is a clone of issue OCPBUGS-8035. The following is the description of the original issue:

Description of problem:

When installing a disconnected private cluster, SSH to the master/bootstrap nodes from the bastion on the VPC failed.

Version-Release number of selected component (if applicable):

Pre-merge build https://github.com/openshift/installer/pull/6836
registry.build05.ci.openshift.org/ci-ln-5g4sj02/release:latest
Tag: 4.13.0-0.ci.test-2023-02-27-033047-ci-ln-5g4sj02-latest

How reproducible:

always

Steps to Reproduce:

1. Create the bastion instance maxu-ibmj-p1-int-svc
2. Create a VPC on the bastion host
3. Install a private disconnected cluster on the bastion host with a mirror registry
4. SSH to the bastion
5. SSH to the master/bootstrap nodes from the bastion

Actual results:

[core@maxu-ibmj-p1-int-svc ~]$ ssh -i ~/openshift-qe.pem core@10.241.0.5 -v
OpenSSH_8.8p1, OpenSSL 3.0.5 5 Jul 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.241.0.5 [10.241.0.5] port 22.
debug1: connect to address 10.241.0.5 port 22: Connection timed out
ssh: connect to host 10.241.0.5 port 22: Connection timed out

Expected results:

SSH succeeds.

Additional info:

$ibmcloud is sg-rules r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 --vpc maxu-ibmj-p1-vpc
Listing rules of security group r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 under account OpenShift-QE as user ServiceId-dff277a9-b608-410a-ad24-c544e59e3778...
ID                                          Direction   IP version   Protocol                      Remote   
r014-6739d68f-6827-41f4-b51a-5da742c353b2   outbound    ipv4         all                           0.0.0.0/0   
r014-06d44c15-d3fd-4a14-96c4-13e96aa6769c   inbound     ipv4         all                           shakiness-perfectly-rundown-take   
r014-25b86956-5370-4925-adaf-89dfca9fb44b   inbound     ipv4         tcp Ports:Min=22,Max=22       0.0.0.0/0   
r014-e18f0f5e-c4e5-44a5-b180-7a84aa59fa97   inbound     ipv4         tcp Ports:Min=3128,Max=3129   0.0.0.0/0   
r014-7e79c4b7-d0bb-4fab-9f5d-d03f6b427d89   inbound     ipv4         icmp Type=8,Code=0            0.0.0.0/0   
r014-03f23b04-c67a-463d-9754-895b8e474e75   inbound     ipv4         tcp Ports:Min=5000,Max=5000   0.0.0.0/0   
r014-8febe8c8-c937-42b6-b352-8ae471749321   inbound     ipv4         tcp Ports:Min=6001,Max=6002   0.0.0.0/0   

This is a clone of issue OCPBUGS-16124. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9404. The following is the description of the original issue:

Version:

$ openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-13-131410
built from commit cdb9627de7efb43ad7af53e7804ddd3434b0dc58
release image registry.ci.openshift.org/ocp/release@sha256:c5413c0fdd0335e5b4063f19133328fee532cacbce74105711070398134bb433
release architecture amd64

Platform:

  • Azure IPI

What happened?
When one creates an IPI Azure cluster with an `internal` publishing method, it creates a standard load balancer with an empty definition. This load balancer doesn't serve a purpose as far as I can tell since the configuration is completely empty. Because it doesn't have a public IP address and backend pools it's not providing any outbound connectivity, and there are no frontend IP configurations for ingress connectivity to the cluster.

Below is the ARM template that is deployed by the installer (through terraform)

```
{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "loadBalancers_mgahagan411_7p82n_name": {
      "defaultValue": "mgahagan411-7p82n",
      "type": "String"
    }
  },
  "variables": {},
  "resources": [
    {
      "type": "Microsoft.Network/loadBalancers",
      "apiVersion": "2020-11-01",
      "name": "[parameters('loadBalancers_mgahagan411_7p82n_name')]",
      "location": "northcentralus",
      "sku": {
        "name": "Standard",
        "tier": "Regional"
      },
      "properties": {
        "frontendIPConfigurations": [],
        "backendAddressPools": [],
        "loadBalancingRules": [],
        "probes": [],
        "inboundNatRules": [],
        "outboundRules": [],
        "inboundNatPools": []
      }
    }
  ]
}
```

What did you expect to happen?

  • Don't create the standard load balancer on an internal Azure IPI cluster (as it appears to serve no purpose)

How to reproduce it (as minimally and precisely as possible)?
1. Create an IPI cluster with the `publish` installation config set to `Internal` and the `outboundType` set to `UserDefinedRouting`.
```
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure: {}
  replicas: 3
metadata:
  name: mgahaganpvt
platform:
  azure:
    region: northcentralus
    baseDomainResourceGroupName: os4-common
    outboundType: UserDefinedRouting
    networkResourceGroupName: mgahaganpvt-rg
    virtualNetwork: mgahaganpvt-vnet
    controlPlaneSubnet: mgahaganpvt-master-subnet
    computeSubnet: mgahaganpvt-worker-subnet
pullSecret: HIDDEN
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
publish: Internal
proxy:
  httpProxy: http://proxy-user1:password@10.0.0.0:3128
  httpsProxy: http://proxy-user1:password@10.0.0.0:3128
baseDomain: qe.azure.devcluster.openshift.com
```

2. Show that the JSON content of the standard load balancer is completely empty:
`az network lb show -g myResourceGroup -n myLbName`

```
{
  "name": "mgahagan411-7p82n",
  "id": "/subscriptions/00000000-0000-0000-00000000/resourceGroups/mgahagan411-7p82n-rg/providers/Microsoft.Network/loadBalancers/mgahagan411-7p82n",
  "etag": "W/\"40468fd2-e56b-4429-b582-6852348b6a15\"",
  "type": "Microsoft.Network/loadBalancers",
  "location": "northcentralus",
  "tags": {},
  "properties": {
    "provisioningState": "Succeeded",
    "resourceGuid": "6fb11ec9-d89f-4c05-b201-a61ea8ed55fe",
    "frontendIPConfigurations": [],
    "backendAddressPools": [],
    "loadBalancingRules": [],
    "probes": [],
    "inboundNatRules": [],
    "inboundNatPools": []
  },
  "sku": {
    "name": "Standard"
  }
}
```

This is a clone of issue OCPBUGS-16804. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15327. The following is the description of the original issue:

Description of problem:

On OpenShift Container Platform, the etcd Pod is showing messages like the following:

2023-06-19T09:10:30.817918145Z {"level":"warn","ts":"2023-06-19T09:10:30.817Z","caller":"fileutil/purge.go:72","msg":"failed to lock file","path":"/var/lib/etcd/member/wal/000000000000bc4b-00000000183620a4.wal","error":"fileutil: file already locked"}


This is described in KCS https://access.redhat.com/solutions/7000327

Version-Release number of selected component (if applicable):

any currently supported version (> 4.10) running with 3.5.x

How reproducible:

always

Steps to Reproduce:

happens after running etcd for a while

 

This has been discussed in https://github.com/etcd-io/etcd/issues/15360

It's not a harmful error message, it merely indicates that some WALs have not been included in snapshots yet.

This was caused by changing default numbers: https://github.com/etcd-io/etcd/issues/13889

This was fixed in https://github.com/etcd-io/etcd/pull/15408/files but never backported to 3.5.

To mitigate that error and stop confusing people, we should also supply the corresponding argument explicitly when starting etcd in: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/pod.yaml#L170-L187

That way we're not surprised by changes of the default values upstream.

This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:

Description of problem:

Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.

Version-Release number of selected component (if applicable):

4.11

How reproducible:

Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters

Steps to Reproduce:

1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name

Actual results:

See error message:

encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]

Expected results:

Create a cluster/machineset using the existing and valid DiskEncryptionSet

Additional info:

I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513

This is a clone of issue OCPBUGS-3214. The following is the description of the original issue:

Description of problem:

The installer has logic that avoids adding the router CAs to the kubeconfig if the console is not available.  It's not clear why it does this, but it means that the router CAs don't get added when the console is deliberately disabled (it is now an optional capability in 4.12).

Version-Release number of selected component (if applicable):

Seen in 4.12+4.13

How reproducible:

Always, when starting a cluster w/o the Console capability

Steps to Reproduce:

1. Edit the install-config to set:
capabilities:
  baselineCapabilitySet: None
2. install the cluster
3. check the CAs in the kubeconfig, the wildcard route CA will be missing (compare it w/ a normal cluster)

Actual results:

router CAs missing

Expected results:

router CAs should be present

Additional info:

This needs to be backported to 4.12.

This is a clone of issue OCPBUGS-4089. The following is the description of the original issue:

The kube-state-metrics pod inside the openshift-monitoring namespace is not running as expected.

On checking the logs we can see a panic caused by a nil pointer dereference:

~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] levelinfomsgTLS is disabled.http2false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
2022-11-22T09:57:17.917758187Z panic({0x1635600, 0x2696e10})
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1({0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:175 +0x49
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x17fe520, 0xc00063b590})
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~

Logs are attached to the support case

This is a clone of issue OCPBUGS-11257. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9964. The following is the description of the original issue:

Description of problem:

egressip cannot be assigned on hypershift hosted cluster node

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-09-162945

How reproducible:

100%

Steps to Reproduce:

1. setup hypershift env


2. Label egress IP nodes on the hosted cluster
% oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-129-175.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-129-244.us-east-2.compute.internal   Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-141-41.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae
ip-10-0-142-54.us-east-2.compute.internal    Ready    worker   3h20m   v1.26.2+bc894ae

% oc label node/ip-10-0-129-175.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-175.us-east-2.compute.internal labeled
% oc label node/ip-10-0-129-244.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-129-244.us-east-2.compute.internal labeled
% oc label node/ip-10-0-141-41.us-east-2.compute.internal k8s.ovn.org/egress-assignable=""
node/ip-10-0-141-41.us-east-2.compute.internal labeled
% oc label node/ip-10-0-142-54.us-east-2.compute.internal  k8s.ovn.org/egress-assignable=""
node/ip-10-0-142-54.us-east-2.compute.internal labeled


3. create egressip
% cat egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-1
spec:
  egressIPs: [ "10.0.129.180" ]
  namespaceSelector:
    matchLabels:
      env: ovn-tests
% oc apply -f egressip.yaml 
egressip.k8s.ovn.org/egressip-1 created


4. check egressip assignment
             

Actual results:

egress IP cannot be assigned to a node
% oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   10.0.129.180

Expected results:

egress IP can be assigned to one of the hosted cluster nodes

Additional info:

 

This is a clone of issue OCPBUGS-10647. The following is the description of the original issue:

Description of problem:

Cluster Network Operator managed component multus-admission-controller does not conform to Hypershift control plane expectations.

When CNO is managed by Hypershift, multus-admission-controller must run with a non-root security context. If Hypershift runs the control plane on a Kubernetes (as opposed to OpenShift) management cluster, it adds a pod or container security context with a runAsUser clause to most deployments.

In Hypershift CPO, the security context of deployment containers, including CNO, is set when it detects that SCCs are not available, see https://github.com/openshift/hypershift/blob/9d04882e2e6896d5f9e04551331ecd2129355ecd/support/config/deployment.go#L96-L100. In such a case, CNO should do the same and set the security context for its managed deployment multus-admission-controller to meet the Hypershift standard.
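
A minimal sketch of that idea, assuming an illustrative helper name and UID (this is not the CNO or Hypershift code): when SCCs are unavailable, stamp a non-root runAsUser onto every container of the managed deployment.

```
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// setNonRoot stamps an illustrative non-root UID onto every container of the
// deployment when SCCs are not available (Kubernetes management cluster).
func setNonRoot(dep *appsv1.Deployment, sccAvailable bool) {
	if sccAvailable {
		return // OpenShift management cluster: SCCs assign the UID instead.
	}
	uid := int64(1001) // illustrative value, not taken from the source
	for i := range dep.Spec.Template.Spec.Containers {
		c := &dep.Spec.Template.Spec.Containers[i]
		if c.SecurityContext == nil {
			c.SecurityContext = &corev1.SecurityContext{}
		}
		c.SecurityContext.RunAsUser = &uid
	}
}

func main() {
	dep := &appsv1.Deployment{}
	dep.Spec.Template.Spec.Containers = []corev1.Container{{Name: "multus-admission-controller"}}
	setNonRoot(dep, false)
	fmt.Println(*dep.Spec.Template.Spec.Containers[0].SecurityContext.RunAsUser) // 1001
}
```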

 

How reproducible:

Always

Steps to Reproduce:

1. Create an OCP cluster using Hypershift with a Kubernetes management cluster
2. Check the pod security context of multus-admission-controller

Actual results:

no pod security context is set

Expected results:

pod security context is set with runAsUser: xxxx

Additional info:

This is the highest priority item from https://issues.redhat.com/browse/OCPBUGS-7942 and it needs to be fixed ASAP as it is a security issue preventing IBM from releasing Hypershift-managed Openshift service.

This is a clone of issue OCPBUGS-13927. The following is the description of the original issue:

Description of problem:

When trying to delete a BMH object that is unmanaged, Metal3 cannot delete it. The BMH object is unmanaged because it does not provide information about the BMC (neither address nor credentials).

In this case Metal3 tries to delete the host but fails and never finalizes, so the BMH deletion gets stuck.
This is the log from Metal3:

{"level":"info","ts":1676531586.4898946,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.4980938,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5050912,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676531586.5105371,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.51569,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                            
{"level":"info","ts":1676531586.5191178,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676531586.525755,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                 
{"level":"info","ts":1676531586.5356712,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}                                
{"level":"info","ts":1676532186.5117555,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5195107,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.526355,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org"}                                                                                           
{"level":"info","ts":1676532186.5317476,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/worker-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5361836,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532186.5404322,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.5482726,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-2.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532186.555394,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"openshift-machine-api/master-0.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged","requeue":true,"after":600}
{"level":"info","ts":1676532532.3448665,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}                                                                                          
{"level":"info","ts":1676532532.344922,"logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org"}
{"level":"info","ts":1676532532.3656478,"logger":"controllers.BareMetalHost","msg":"Initiating host deletion","baremetalhost":"openshift-machine-api/worker-1.el8k-ztp-1.hpecloud.org","provisioningState":"unmanaged"}
{"level":"error","ts":1676532532.3656952,"msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","bareMetalHost":{"name":"worker-1.el8k-ztp-1.hpecloud.org","namespace":"openshift-machine-api"},
"namespace":"openshift-machine-api","name":"worker-1.el8k-ztp-1.hpecloud.org","reconcileID":"525a5b7d-077d-4d1e-a618-33d6041feb33","error":"action \"unmanaged\" failed: failed to determine current provisioner capacity: failed to parse BMC address informa
tion: missing BMC address","errorVerbose":"missing BMC address\ngithub.com/metal3-io/baremetal-operator/pkg/hardwareutils/bmc.NewAccessDetails\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/github.com/metal3-io/baremetal-operator/pkg/hardwareu
tils/bmc/access.go:145\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:112\ngithub.com/metal3-io/baremetal-operator/pkg/pro
visioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/githu
b.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/meta
l3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal
3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareM
etalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremet
al-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/contr
oller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/contro
ller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\
n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to parse BMC address information\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).bmcAccess\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/iro
nic/ironic.go:114\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).HasCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1922\ngithub.com/metal3-io/baremetal-operator/controlle
rs/metal3%2eio.(*hostStateMachine).ensureCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:83\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n
\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator
/controllers/metal3.io/host_state_machine.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithu
b.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controll
er.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/sr
c/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-
operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-
runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\nfailed to determine current provisioner capacity\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ensur
eCapacity\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:85\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).updateHostStateFrom\n\t/go/src/github.com/metal3-io/baremetal
-operator/controllers/metal3.io/host_state_machine.go:106\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState.func1\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machin
e.go:175\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:186\ngithub.com/metal3-io/baremetal-operator/contr
ollers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:226\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/gi
thub.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operato
r/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-r
untime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controll
er.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594\naction \"unmanaged\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operato
r/controllers/metal3.io/baremetalhost_controller.go:230\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/contr
oller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller
-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.
(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1594","stacktrace":"sigs.k8s.io/cont
roller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/contr
oller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Provide a BMH object with no BMC credentials. The BMH is set unmanaged.

Steps to Reproduce:

1. Delete the BMH object
2. The deletion gets stuck
3.

Actual results:

The deletion gets stuck

Expected results:

Metal3 detects the BMH is unmanaged and does not try to deprovision it.

Additional info:

 

Description of problem:

Data race seen in unit tests:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1448/pull-ci-openshift-ovn-kubernetes-release-4.11-unit/1604898712423763968/artifacts/test/build-log.txt
 

Description of problem:

OLM has a dependency on openshift/cluster-policy-controller. This project had dependencies with v0.0.0 versions, which, due to a bug in ART, were causing issues building the OLM image. To fix this, we have to update the dependencies in the cluster-policy-controller project to point to actual versions.

This was already done:
 * https://github.com/openshift/cluster-policy-controller/pull/103
 * https://github.com/openshift/cluster-policy-controller/pull/101

And these changes already made it to 4.14 and 4.13 branches of the cluster-policy-controller.

The backport to 4.12 is: https://github.com/openshift/cluster-policy-controller/pull/102

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Sample archive with both resources:

archives/compressed/3c/3cc4318d-e564-450b-b16e-51ef279b87fa/202209/30/200617.tar.gz

Sample query to find more archives:

with t as (
  select
    cluster_id,
    file_path,
    json_extract_scalar(content, '$.kind') as kind
  from raw_io_archives
  where date = '2022-09-30' and file_path like 'config/storage/%'
)
select cluster_id, count(*) as cnt
from t
group by cluster_id
order by cnt desc;

Description of problem:

After editing a MachineSet on AWS (just changed an annotation) it shows a warning

[~] $ oc -n openshift-machine-api edit machineset.machine.openshift.io/ci-ln-hlf4lft-76ef8-p7rc4-worker-us-west-1b
W1111 16:06:32.385856   88719 warnings.go:70] incorrect GroupVersionKind for AWSMachineProviderConfig object: machine.openshift.io/v1beta1, Kind=AWSMachineProviderConfig
machineset.machine.openshift.io/ci-ln-hlf4lft-76ef8-p7rc4-worker-us-west-1b edited

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Add an annotation or label to a machine
2.
3.

Actual results:

There is a warning about incorrect GroupVersionKind for AWSMachineProviderConfig object

Expected results:

No warnings shown

Additional info:

 

Description of problem:


Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-21-135326
How reproducible:

Steps to Reproduce:

See https://bugzilla.redhat.com/show_bug.cgi?id=2118563#c5.
The following messages are "normal" on startup, but they are very misleading as error statements; suggest suppressing them or updating them to clearer context so it is apparent they are part of the normal process.

E0818 02:18:53.709223       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.715530       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.735885       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.775984       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.790449       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.856911       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.950782       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.017583       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.271967       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.338944       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.916988       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.982211       1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue


Actual results:


Expected results:


Additional info:


Name: Routing
Description: Please change the "Routing" component to be a subcomponent "router" of the "Networking" component.

Component: change to "Networking".
Subcomponent: change to "router".

Existing fields (default assignee, default QA contact, default CC email list, etc.) should remain the same as they currently are.
Default Assignee: aos-network-edge-staff@bot.bugzilla.redhat.com
Default QA Contact: hongli@redhat.com
Default CC List: aos-network-edge-staff@bot.bugzilla.redhat.com
Additional Notes:
I filled in "Default CC email list" because the form validation would not permit me to omit it. However, it can be left empty in Bugzilla (it is currently empty).

If possible, we would like this change to be done prior to the Bugzilla-to-Jira migration to avoid the need to make the change after the migration.

This is a clone of issue OCPBUGS-10414. The following is the description of the original issue:

Description of problem:

CoreDNS template implementations use an incorrect regex for resolving the dot [.] character

Version-Release number of selected component (if applicable):

NA

How reproducible:

100% when you use router sharding with domains including apps

Steps to Reproduce:

1. Create an additional IngressController with domain names including apps, for example: example.test-apps.<clustername>.<clusterdomain>
2. Create and configure the external LB corresponding to the additional IngressController
3. Configure the corporate DNS server and create records for this additional IngressController resolving to the LB IP set up in step 2 above.
4. Try resolving the additional domain routes from outside the cluster and within the cluster. The DNS resolution works fine from outside the cluster. However, within the cluster all additional domains containing apps in the domain name resolve to the default ingress VIP instead of their corresponding LB IPs configured on the corporate DNS server.

As a simpler alternative, you can reproduce it by using the dig command on a cluster node with the additional domain

for ex: 
sh-4.4# dig test.apps-test..<clustername>.<clusterdomain> 

Actual results:

DNS resolves all the domains containing apps to the default ingress VIP. For example: example.test-apps.<clustername>.<clusterdomain> resolves to the default ingress VIP instead of its actual corresponding LB IP.

Expected results:

DNS should resolve it to the corresponding LB IP configured at the DNS server.

Additional info:

The DNS resolution happens via the Corefile templates used on the node, which treat the dot (.) as a regex wildcard character instead of a literal dot [.]. This is a regex configuration bug inside the Corefile used on vSphere IPI clusters.
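
A minimal, self-contained sketch of why the unescaped dot over-matches, using hypothetical domain names (this is not the actual Corefile template): with "." as a wildcard the sharded "test-apps" domain is also captured by the pattern meant only for the default apps domain, while escaping the dots restricts the match to the literal apps subdomain.

```
package main

import (
	"fmt"
	"regexp"
)

func main() {
	defaultDomain := "myroute.apps.mycluster.example.com"      // should match
	shardedDomain := "example.test-apps.mycluster.example.com" // should NOT match

	unescaped := regexp.MustCompile(`.apps.mycluster.example.com$`)  // "." matches any character
	escaped := regexp.MustCompile(`\.apps\.mycluster\.example\.com$`) // "\." matches a literal dot

	fmt.Println(unescaped.MatchString(defaultDomain), unescaped.MatchString(shardedDomain)) // true true  (bug)
	fmt.Println(escaped.MatchString(defaultDomain), escaped.MatchString(shardedDomain))     // true false (fixed)
}
```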

Description of problem:

We discovered an issue before code freeze that caused many CI issues. This is resolved with this PR: https://github.com/openshift/cluster-network-operator/pull/1579

Version-Release number of selected component (if applicable):

4.12

How reproducible:

NA

Steps to Reproduce:

1.NA
2.
3.

Actual results:

Severity is set too low for various OVN-K alerts

Expected results:

Alerts work as expected at the correct severity level and CI runs are clear including for hypershift clusters.

Additional info:

This is resolved with this PR: https://github.com/openshift/cluster-network-operator/pull/1579 Here is my testing with `e2e-all` and `e2e-serial` and there are no issues after 10 runs each: https://docs.google.com/spreadsheets/d/1FZON8-d3m7D_2-z3XetODA-ucbXKJzCioC-zRMArHlY/edit?usp=sharing

This is a clone of issue OCPBUGS-13152. The following is the description of the original issue:

Description of problem:
With OCPBUGS-11099 our Pipeline Plugin supports the TektonConfig "embedded-status: minimal" option that will be the default in OpenShift Pipelines 1.11+.

But since this change, the Pipeline pages load the TaskRuns for all Pipeline and PipelineRun rows. To decrease the risk of a performance issue, we should make this call only if status.tasks isn't defined.

Version-Release number of selected component (if applicable):

  • 4.12-4.14, as soon as OCPBUGS-11099 is backported.
  • Tested with Pipelines operator 1.10.1

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines operator
  2. Import a Git repository and enable the Pipeline option
  3. Open the browser network inspector
  4. Navigate to the Pipeline page

Actual results:
The list page loads a list of TaskRuns for each Pipeline / PipelineRun even if the PipelineRun contains the related data already (status.tasks)

Expected results:
No unnecessary network calls. When the admin changes the TektonConfig "embedded-status" option to minimal, the UI should still work and load the TaskRuns as it does today.

Additional info:
None

This is a clone of issue OCPBUGS-3084. The following is the description of the original issue:

Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603

Long log lines get corrupted when using '--timestamps' by the Kubelet.

The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turned on, the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.

https://github.com/kubernetes/kubernetes/blob/f892ab1bd7fd97f1fcc2e296e85fdb8e3e8fb82d/pkg/kubelet/kuberuntime/logs/logs.go#L325
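
A minimal, self-contained sketch of the partial-line handling idea described above (this is not the kubelet's ReadLogs implementation; all names and the buffer size are illustrative): accumulate the chunks that bufio's ReadLine returns with isPrefix set, and only emit a timestamp once per logical line.

```
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// printWithTimestamps prepends one timestamp per logical log line, even when
// bufio returns the line in several partial chunks (isPrefix == true).
func printWithTimestamps(r *bufio.Reader) error {
	var pending []byte
	for {
		chunk, isPrefix, err := r.ReadLine()
		pending = append(pending, chunk...)
		if err != nil {
			if len(pending) > 0 {
				fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339Nano), pending)
			}
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
		if isPrefix {
			continue // line longer than the buffer: keep accumulating, no timestamp yet
		}
		fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339Nano), pending)
		pending = pending[:0]
	}
}

func main() {
	long := strings.Repeat("x", 5000) // much longer than the 16-byte buffer below
	r := bufio.NewReaderSize(strings.NewReader(long+"\nshort line\n"), 16)
	_ = printWithTimestamps(r)
}
```

The pod spec and `kubectl logs logs --timestamps` command below are the original reproduction of the corruption.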

apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'
kubectl logs logs --timestamps

Description of problem:

Currently in 4.11, MAPI nutanix machine-controller does not provide the machine (VM)’s instance-type, region, zone, etc. labels to the Machine CR. And these columns are empty when viewing the Machine CRs, via cli “oc get Machine” or from the OCP cluster web console. 
$ oc -n openshift-machine-api get machine 
NAME                                  PHASE      TYPE REGION ZONE   AGE 
demo-ocp-cluster-g1-77nws-master-0   Running                        133m 
demo-ocp-cluster-g1-77nws-master-1   Running                        133m 
demo-ocp-cluster-g1-77nws-master-2   Running                        133m 
demo-ocp-cluster-g1-77nws-worker-2bsxn Running                      129m 
demo-ocp-cluster-g1-77nws-worker-75hr5 Running                      129m 
demo-ocp-cluster-g1-77nws-worker-rg7b9 Running                      129m

We can add something like the below labels to the Machine CR in the mapi-nutanix when reconciling for the Machine CRs: 
machine.openshift.io/instance-type: AHV 
machine.openshift.io/region: <prism-central-address> 
machine.openshift.io/zone: <prism-element-name/uuid>

Version-Release number of selected component (if applicable):

 

How reproducible:

Run the CLI command "oc get Machine" or view the Machine resources from the OCP cluster web console.

Steps to Reproduce:

1.
2.
3.

Actual results:

The "Type", "Region", "Zone" columns are empty for each Machine CR.

Expected results:

The "Type", "Region", "Zone" columns showing data for each Machine CR.

Additional info:

 

The issue was found while testing HOSTEDCP-400 and HOSTEDCP-401.

Hypershift operator installed with flags:

 

--platform-monitoring=operator-only
--enable-uwm-telemetry-remote-write=true
--metrics-set=telemetry

 

Service monitors and pod monitors in the control plane:

 

[jiezhao@cube hypershift]$ oc get servicemonitor -n clusters-jz-test
NAME                                  AGE
catalog-operator                      45m
cluster-version-operator              45m
etcd                                  46m
kube-apiserver                        46m
kube-controller-manager               45m
monitor-multus-admission-controller   43m
monitor-ovn-master-metrics            43m
node-tuning-operator                  45m
olm-operator                          45m
openshift-apiserver                   45m
openshift-controller-manager          45m

[jiezhao@cube hypershift]$ oc get podmonitor -n clusters-jz-test
NAME                              AGE
cluster-image-registry-operator   46m
controlplane-operator             47m
hosted-cluster-config-operator    46m
ignition-server                   47m

 

In OCP management web console, go to Observe->Targets:

 

1. Status of service monitor 'monitor-multus-admission-controller' is Down, error:
   Scrape failed: server returned HTTP status 401 Unauthorized.
   It doesn't have a cluster id in its target labels
2. Target of pod monitor 'cluster-image-registry-operator' is missing, not shown

 

 – NOT A BUG –
This was a story, but CI is not working for the OLM project, so it was moved to OCPBUGS, where CI is working.

----------------------------

Upstream, the `opm alpha diff` functionality was moved to the `oc-mirror` team by a non-RH actor.

This story is to track downstreaming the two PRs.

The only thing to verify here is that there is no more `opm alpha diff` command. 

Other changes in the PRs are to externalize some interfaces and implement an undocumented alpha-level internal channel-level property list.

 

Description of problem:

See the Insights nomination https://issues.redhat.com/browse/INSIGHTOCP-1197
and the KCS article https://access.redhat.com/solutions/7008996

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-18312. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17864. The following is the description of the original issue:

Description of problem:

A cluster recently upgraded to OCP 4.12.19 is experiencing serious slowness issues with the Project > Project access page.
The loading time of that page grows significantly faster than the number of entries, and is very noticeable even at a relatively low number of entries.

Version-Release number of selected component (if applicable):

4.12.19

How reproducible:

Easily 

Steps to Reproduce:

1. Create a namespace, and add RoleBindings for multiple users, for instance with :
$ oc -n test-namespace create rolebinding test-load --clusterrole=view --user=user01 --user=user02 --user=...
2. In Developer view of that namespace, navigate to "Project"->"Project access". The page will take a long time to load compared to the time an "oc get rolebinding" would take.

Actual results:

0 RB => instantaneous loading
40 RB => about 10 seconds until page loaded
100 RB => one try took 50 seconds, another 110 seconds
200 RB => nothing for 8 minutes, after which my web browser (Firefox) proposed to stop the page since it slowed the browser down, and after 10 minutes I stopped the attempt without ever seeing the page load. 

Expected results:

Page should load almost instantly with only a few hundred role bindings

This is a clone of issue OCPBUGS-2384. The following is the description of the original issue:

Version:
$ openshift-install version
openshift-install 4.10.0-0.nightly-2021-12-23-153012
built from commit 94a3ed9cbe4db66dc50dab8b85d2abf40fb56426
release image registry.ci.openshift.org/ocp/release@sha256:39cacdae6214efce10005054fb492f02d26b59fe9d23686dc17ec8a42f428534
release architecture amd64

Platform: alibabacloud

Please specify:

  • IPI (automated install with `openshift-install`. If you don't know, then it's IPI)

What happened?
Unexpected error of 'Internal publish strategy is not supported on "alibabacloud" platform', because Internal publish strategy should be supported for "alibabacloud", please clarify otherwise, thanks!

$ openshift-install create install-config --dir work
? SSH Public Key /home/jiwei/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region us-east-1
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-uu
? Pull Secret [? for help] *********
INFO Install-Config created in: work
$
$ vim work/install-config.yaml
$ yq e '.publish' work/install-config.yaml
Internal
$ openshift-install create cluster --dir work --log-level info
FATAL failed to fetch Metadata: failed to load asset "Install Config": invalid "install-config.yaml" file: publish: Invalid value: "Internal": Internal publish strategy is not supported on "alibabacloud" platform
$

What did you expect to happen?
"publish: Internal" should be supported for platform "alibabacloud".

How to reproduce it (as minimally and precisely as possible)?
Always

Description of problem:

When the log line number is too large, it overlaps with the cut-off line in the log viewer.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-15-150248

How reproducible:

Always

Steps to Reproduce:
1.Go to a pod log page with lots of logs, such as pod in openshift-cluster-version namespace. Check log line numbers.
2.
3.

Actual results:

1. When line number is too big, it will overlap with cut-off line.

Expected results:

1. Should have no overlaps in logs

Additional info:

Description of problem:

Deployed an IPI cluster on a multi-datacenter/cluster vSphere environment; the installer failed for some reason. We then tried to destroy the cluster and found that one VM folder under one of the datacenters is not deleted.

When the installer exits, the following objects are attached with tag jima15b-cq7z7:
sh-4.4$ govc tags.attached.ls jima15b-cq7z7 | xargs govc ls -L
/IBMCloud/vm/jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-west-us-west-1a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-2a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-3a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-1a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-bootstrap

sh-4.4$ ./openshift-install destroy cluster --dir ipi_missingzones/
INFO Destroyed                                     VirtualMachine=jima15b-cq7z7-rhcos-us-west-us-west-1a
INFO Destroyed                                     VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-2a
INFO Destroyed                                     VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-3a
INFO Destroyed                                     VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-1a
INFO Destroyed                                     VirtualMachine=jima15b-cq7z7-bootstrap
INFO Destroyed                                     Folder=jima15b-cq7z7
INFO Deleted                                       Tag=jima15b-cq7z7
INFO Deleted                                       TagCategory=openshift-jima15b-cq7z7
INFO Time elapsed: 55s       

After destroying cluster, folder jima15b-cq7z7 is still there, not deleted.
sh-4.4$ govc ls /datacenter-2/vm/ | grep jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7                    

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-18-141547

How reproducible:

Always when the installer fails to create infrastructure; it works when installation is successful.

Steps to Reproduce:

1. Deploy an IPI cluster on a vSphere env configured with multiple datacenters/clusters
2. The installer fails to create infrastructure for some reason
3. Destroy the cluster
4. One folder is not deleted

Actual results:

one folder is not deleted

Expected results:

All infrastructures created by installer should be removed

Additional info:

 

Platform:

IPI on Baremetal

What happened?

In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754

What did you expect to happen?

Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.

However, we know the BMH name in the image-customization-controller, so it would be possible to configure the ignition to set a default hostname if we don't have one from DHCP/DNS.
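
A hedged sketch of that idea (not the actual image-customization-controller code): build a minimal Ignition v3 config that writes the BMH name to /etc/hostname. The helper name and overall structure are assumptions made for illustration.

```
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// hostnameIgnition returns a minimal Ignition config that writes the BMH name
// to /etc/hostname via a data URL, so the host does not come up as "localhost".
func hostnameIgnition(bmhName string) ([]byte, error) {
	cfg := map[string]any{
		"ignition": map[string]any{"version": "3.2.0"},
		"storage": map[string]any{
			"files": []map[string]any{{
				"path":      "/etc/hostname",
				"mode":      0644,
				"overwrite": true,
				"contents": map[string]any{
					// data URL carrying the hostname, e.g. "data:,worker-0"
					"source": "data:," + url.PathEscape(bmhName),
				},
			}},
		},
	}
	return json.MarshalIndent(cfg, "", "  ")
}

func main() {
	out, _ := hostnameIgnition("worker-0.example.lab")
	fmt.Println(string(out))
}
```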

If not, we should at least fail the installation with a specific error message to this situation.

----------
30/01/22 - adding how to reproduce
----------

How to Reproduce:

1)prepare and installation with day-1 static ip.

add to install-config under one of the nodes:
networkConfig:
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.123.1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - 192.168.123.1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv4:
      address:
      - ip: 192.168.123.110
        prefix-length: 24
      enabled: true

2)Ensure a DNS PTR for the address IS NOT configured.

3)create manifests and cluster from install-config.yaml

installation should either:
1) fail as early as possible, and provide some sort of feedback as to the fact that no hostname was provided.
2) derive the hostname from the BMH or the ignition files

Description of problem:

Ingress Controller is missing a required AWS resource permission for SC2S region us-isob-east-1

During the OpenShift 4 installation in SC2S region us-isob-east-1, the ingress operator degrades due to missing "route53:ListTagsForResources" permission from the "openshift-ingress" CredentialsRequest for which customer proactively raised a PR.
--> https://github.com/openshift/cluster-ingress-operator/pull/868

The code disables part of the logic for C2S isolated regions here: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L167-L168
By not setting tagConfig, it results in the m.tags field to be set nil: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L212-L222
This then drives the logic in the getZoneID method to use either lookupZoneID or lookupZoneIDWithoutResourceTagging: https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L280-L284
Note: the lookupZoneIDWithoutResourceTagging method is only ever called for endpoints.AwsIsoPartitionID, endpoints.AwsIsoBPartitionID regions.

Version-Release number of selected component (if applicable):

 

How reproducible:

Everytime

Steps to Reproduce:

1. Create an IPI cluster in  SC2S region us-isob-east-1.

Actual results:

Ingress operator degrades due to missing "route53:ListTagsForResources" permission with following error.
~~~
The DNS provider failed to ensure the record: failed to find hosted zone for record: failed to get tagged resources: AccessDenied: User ....... rye... is not authorized to perform: route53:ListTagsForResources on resource.... hostedzone/.. because no identify based policy allows the route53:ListTagsForResources
~~~

Expected results:

Ingress operator should be in available state for new installation.

Additional info:

 

Description of problem: As discovered in https://issues.redhat.com/browse/OCPBUGS-2795, gophercloud fails to list swift containers when the endpoint speaks HTTP2. This means that CIRO will provision a 100GB cinder volume even though swift is available to the tenant.

We're for example seeing this behavior in our CI on vexxhost.

The gophercloud commit that fixed this issue is https://github.com/gophercloud/gophercloud/commit/b7d5b2cdd7ffc13e79d924f61571b0e5f74ec91c, specifically the `|| ct == ""` part on line 75 of openstack/objectstorage/v1/containers/results.go. This commit made it in gophercloud v0.18.0.

CIRO still depends on gophercloud v0.17.0. We should bump gophercloud to fix the bug.
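
For illustration only (this is not the gophercloud source), the sketch below mimics the content-type check that the referenced commit fixes: an empty Content-Type, as returned over HTTP/2, has to be accepted alongside text/plain before falling back to JSON handling. The function and variable names are assumptions.

```
package main

import (
	"fmt"
	"strings"
)

// extractNames decides how to parse a container listing based on the response
// Content-Type, mirroring the behaviour described above.
func extractNames(contentType, body string) []string {
	ct := strings.TrimSpace(contentType)
	if strings.HasPrefix(ct, "text/plain") || ct == "" { // the `|| ct == ""` part of the fix
		var names []string
		for _, line := range strings.Split(body, "\n") {
			if line != "" {
				names = append(names, line)
			}
		}
		return names
	}
	// A JSON listing would be handled here in the real client.
	return nil
}

func main() {
	// An HTTP/2 response with no Content-Type header still parses correctly.
	fmt.Println(extractNames("", "container-a\ncontainer-b\n"))
}
```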

Version-Release number of selected component (if applicable):

All versions. Fix should go to 4.8 - 4.12.

How reproducible:

Always, when swift speaks HTTP2.

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This bug is a backport clone of [Bugzilla Bug 2100181](https://bugzilla.redhat.com/show_bug.cgi?id=2100181). The following is the description of the original bug:

Created attachment 1891950
log

Description of problem:

Prior to OCP 4.7.48, the configure-ovs script picked the correct bonded interface for br-ex. In OCP 4.7.48 it consistently fails and picks one of the slave interfaces (ens3f0) instead.

Version-Release number of selected component (if applicable):
OCP Release > OCP 4.7.37

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OCP cluster with bonding
2.
3.

Actual results:

Expected results:

configure-ovs should not fail and assign the correct interface to br-ex (bond1)

Additional info:

There appears to be a new default NM profile from 4.7.37 to 4.7.38 that was not there before.

When multi-cluster is enabled, it is possible to get into a situation where you can't cancel login. If you select a cluster you don't know the credentials for, the console will remember the last cluster and repeatedly send you to the login page with no way to cancel or go back. If we decide to set the last cluster in the user's preferences, it might be possible to get stuck even if you clear cookies and localStorage.

There are similar issues logging into clusters that are hibernating. See attached video.

cc Scott Berens

This is a clone of issue OCPBUGS-10678. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10655. The following is the description of the original issue:

Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples don't include a git repository reference and cannot be created.

Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all of them (the oldest tested frontend was 4.8) show the sample without a git repository.

But the result also depends on the installed samples operator and installed ImageStreams.

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the Developer perspective
  2. Navigate to Add > All Samples
  3. Search for Jboss
  4. Click on "JBoss EAP XP 4.0 with OpenJDK 11" (for example)

Actual results:
The git repository is not filled and the create button is disabled.

Expected results:
Samples without git repositories should not be displayed in the list.

Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.

Description of problem:

Network policy code has some problems, most of them races, so they can be difficult to reproduce and verify. Here is the list:

1. all kinds of add/delete port to/from default deny port group failures, possible symptoms:
  - port should’ve been added to default deny port group, but wasn’t: connections that should’ve been dropped are allowed
  - port should’ve been deleted from default deny port group, but wasn’t: connections that should be allowed are dropped
  - db ops failures when an attempt to add/delete port to/from default deny port group fails, e.g. because this operation already was done
2. default deny port group was overwritten when 2 network policies are created in a namespace at the same time. Can lead to ports not being added to the default deny port group => denied connections will be allowed
3. handle error when getting local pod from the cache fails, possible symptoms
  - "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
  - pod is not added to netpol port groups, network policy is not applied
4. creating deleted namespace via ensureNamespaceLocked, symptoms:
  - namespace was deleted, but address set is present in the db
5. policy acl loglevel update wasn’t applied, possible symptoms:
  - netpol acl log level isn’t set/updated to namespace loglevel
6. netpol cleanup failures, symptoms:
  - network policy failed to be deleted, something is still left in the db, error messages like
  - "failed to destroy network policy"
  - "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
7. concurrent write to sets.String - this will panic, you won't miss it (see the sketch after this list)
8. retry for network policy handler after network policy was deleted, you should see failures saying that some network policy related object is nil or doesn’t exist, e.g.
  - "peer AddressSet is nil, cannot add <object>"
9. host network and completed pods selected by network policy can produce error logs, no real harm
  - "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
10. namespace pod handlers are never stopped, can affect memory usage and look like a memory leak
11. add local pod failure, since netpol port group is not committed to db yet, error looks like
  - "Failed to create *factory.localPodSelector <name>, error: object not found"

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Example 1
1. Create network policy with [in/e]gress selector that applies to a namespace labeled project: myproject
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              project: myproject

2. Use oc apply to delete the network policy and create a pod in the project: myproject namespace at the same time
3. check ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)", this should retry with the same error 15 times
4. This may not work from the first try, since we need to hit specific order of network policy delete and pod add handling
5. With the new version no error messages should be present

Example 2
1. create network policy that applies to a namespace test
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
2. Create host network pod in namespace test
3. Check 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
4. check final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
5. With the new version no error message should be present

All the other cases are difficult to reproduce, maybe just running some standard network policy tests and making sure everything works will be a good verification.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Create a LoadBalancer type service within the OCP 4.11.x OVNKubernetes cluster to expose the API server endpoint; the service does not respond to normal oc requests.
But some of them are working, like "oc whoami" and "oc get --raw /api".

Version-Release number of selected component (if applicable):

4.11.8 with OVNKubernetes

How reproducible:

always

Steps to Reproduce:

1. Setup openshift cluster 4.11 on AWS with OVNKubernetes as the default network
2. Create the following service under openshift-kube-apiserver namespace to expose the api
----
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1800"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  name: test-api
  namespace: openshift-kube-apiserver
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerSourceRanges:
  - <my_ip>/32
  ports:
  - nodePort: 31248
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    apiserver: "true"
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer

3. Setup the DNS resolution for the access
xxx.mydomain.com ---> <elb-auto-generated-dns>

4. Try to access the cluster api via the service above by updating the kubeconfig to use the custom dns name

Actual results:

No response from the server side.

$ time oc get node -v8
I1025 08:29:10.284069  103974 loader.go:375] Config loaded from file:  bmeng.kubeconfig
I1025 08:29:10.294017  103974 round_trippers.go:420] GET https://rh-api.bmeng-ccs-ovn.3o13.s1.devshift.org:6443/api/v1/nodes?limit=500
I1025 08:29:10.294035  103974 round_trippers.go:427] Request Headers:
I1025 08:29:10.294043  103974 round_trippers.go:431]     Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json
I1025 08:29:10.294052  103974 round_trippers.go:431]     User-Agent: oc/openshift (linux/amd64) kubernetes/e40bd2d
I1025 08:29:10.365119  103974 round_trippers.go:446] Response Status: 200 OK in 71 milliseconds
I1025 08:29:10.365142  103974 round_trippers.go:449] Response Headers:
I1025 08:29:10.365148  103974 round_trippers.go:452]     Audit-Id: 83b9d8ae-05a4-4036-bff6-de371d5bec12
I1025 08:29:10.365155  103974 round_trippers.go:452]     Cache-Control: no-cache, private
I1025 08:29:10.365161  103974 round_trippers.go:452]     Content-Type: application/json
I1025 08:29:10.365167  103974 round_trippers.go:452]     X-Kubernetes-Pf-Flowschema-Uid: 2abc2e2d-ada3-4cb8-a86f-235df3a4e214
I1025 08:29:10.365173  103974 round_trippers.go:452]     X-Kubernetes-Pf-Prioritylevel-Uid: 02f7a188-43c7-4827-af58-5ebe861a1891
I1025 08:29:10.365179  103974 round_trippers.go:452]     Date: Tue, 25 Oct 2022 08:29:10 GMT
^C
real    17m4.840s
user    0m0.567s
sys    0m0.163s


However, the correct response is returned when using --raw for the request, e.g.:
$ oc get --raw /api/v1  --kubeconfig bmeng.kubeconfig 
{"kind":"APIResourceList","groupVersion":"v1","resources":[{"name":"bindings","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"componentstatuses","singularName":"","namespaced":false,"kind":"ComponentStatus","verbs":["get","list"],"shortNames":["cs"]},{"name":"configmaps","singularName":"","namespaced":true,"kind":"ConfigMap","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["cm"],"storageVersionHash":"qFsyl6wFWjQ="},{"name":"endpoints","singularName":"","namespaced":true,"kind":"Endpoints","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ep"],"storageVersionHash":"fWeeMqaN/OA="},{"name":"events","singularName":"","namespaced":true,"kind":"Event","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ev"],"storageVersionHash":"r2yiGXH7wu8="},{"name":"limitranges","singularName":"","namespaced":true,"kind":"LimitRange","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["limits"],"storageVersionHash":"EBKMFVe6cwo="},{"name":"namespaces","singularName":"","namespaced":false,"kind":"Namespace","verbs":["create","delete","get","list","patch","update","watch"],"shortNames":["ns"],"storageVersionHash":"Q3oi5N2YM8M="},{"name":"namespaces/finalize","singularName":"","namespaced":false,"kind":"Namespace","verbs":["update"]},{"name":"namespaces/status","singularName":"","namespaced":false,"kind":"Namespace","verbs":["get","patch","update"]},{"name":"nodes","singularName":"","namespaced":false,"kind":"Node","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["no"],"storageVersionHash":"XwShjMxG9Fs="},{"name":"nodes/proxy","singularName":"","namespaced":false,"kind":"NodeProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"nodes/status","singularName":"","namespaced":false,"kind":"Node","verbs":["get","patch","update"]},{"name":"persistentvolumeclaims","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pvc"],"storageVersionHash":"QWTyNDq0dC4="},{"name":"persistentvolumeclaims/status","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["get","patch","update"]},{"name":"persistentvolumes","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pv"],"storageVersionHash":"HN/zwEC+JgM="},{"name":"persistentvolumes/status","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["get","patch","update"]},{"name":"pods","singularName":"","namespaced":true,"kind":"Pod","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["po"],"categories":["all"],"storageVersionHash":"xPOwRZ+Yhw8="},{"name":"pods/attach","singularName":"","namespaced":true,"kind":"PodAttachOptions","verbs":["create","get"]},{"name":"pods/binding","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"pods/ephemeralcontainers","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"pods/eviction","singularName":"","namespaced":true,"group":"policy","version":"v1","kind":"Eviction","verbs":["create"]},{"name":"pods/exec","singularName":"","namespaced":true,"kind":"PodExecOptions","verbs":["cre
ate","get"]},{"name":"pods/log","singularName":"","namespaced":true,"kind":"Pod","verbs":["get"]},{"name":"pods/portforward","singularName":"","namespaced":true,"kind":"PodPortForwardOptions","verbs":["create","get"]},{"name":"pods/proxy","singularName":"","namespaced":true,"kind":"PodProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"pods/status","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"podtemplates","singularName":"","namespaced":true,"kind":"PodTemplate","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"LIXB2x4IFpk="},{"name":"replicationcontrollers","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["rc"],"categories":["all"],"storageVersionHash":"Jond2If31h0="},{"name":"replicationcontrollers/scale","singularName":"","namespaced":true,"group":"autoscaling","version":"v1","kind":"Scale","verbs":["get","patch","update"]},{"name":"replicationcontrollers/status","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["get","patch","update"]},{"name":"resourcequotas","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["quota"],"storageVersionHash":"8uhSgffRX6w="},{"name":"resourcequotas/status","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["get","patch","update"]},{"name":"secrets","singularName":"","namespaced":true,"kind":"Secret","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"S6u1pOWzb84="},{"name":"serviceaccounts","singularName":"","namespaced":true,"kind":"ServiceAccount","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["sa"],"storageVersionHash":"pbx9ZvyFpBE="},{"name":"serviceaccounts/token","singularName":"","namespaced":true,"group":"authentication.k8s.io","version":"v1","kind":"TokenRequest","verbs":["create"]},{"name":"services","singularName":"","namespaced":true,"kind":"Service","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["svc"],"categories":["all"],"storageVersionHash":"0/CO1lhkEBI="},{"name":"services/proxy","singularName":"","namespaced":true,"kind":"ServiceProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"services/status","singularName":"","namespaced":true,"kind":"Service","verbs":["get","patch","update"]}]}
 

Expected results:

The normal oc request should be working.

Additional info:

There is no such issue for clusters with openshift-sdn with the same OpenShift version and same LoadBalancer service.

We suspected that it might be related to the MTU setting, but this cannot explain why OpenShiftSDN works well.

Another possibly related difference is that OpenShiftSDN uses iptables for service load balancing while OVN handles it within the OVN services.

 

Please let me know if any debug log/info is needed.

Description of problem:

E2E CI feature files are failing because the Mocha version couldn't be determined

Version-Release number of selected component (if applicable):

 

How reproducible:

CI Search : https://search.ci.openshift.org/?search=Couldn%27t+determine+Mocha+version&maxAge=336h&context=1&type=bug%2Bjunit&name=pull-ci-openshift-console-operator-master-e2e-aws-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1.
2.
3.

Actual results:

E2E tests failing with `Couldn't determine Mocha version` error

Expected results:

E2E tests should pass without any failures

Additional info:

 

When we get telemetry from connected clusters, we want to be able to tell whether they were created with the agent installer or the hosted assisted service. Currently there is no way to distinguish them.

It's not clear whether any particular group owns the namespace of installation methods, or whom we need to notify when we create one.

The AWS CPMS changes made here cause single-node cluster installations to fail
https://github.com/openshift/installer/pull/6172

 

We need to fix the issue by checking the installation type and not creating the CPMS manifest when it is single node.

This is a clone of issue OCPBUGS-7555. The following is the description of the original issue:

Description of problem:

Enable default sysctls for kubelet.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

Description of problem:

Agent-based installation fails during a 3+1 deployment. I found that the machine-api-operator is degraded because its minimum worker replica count is 2, while a 3+1 deployment defines only one worker node.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create agent.iso (openshift-install agent create image) using install-config.yaml and agent-config.yaml (PFA sample files)
2. Deploy a 3+1 cluster using agent.iso
3. Execute "openshift-install agent wait-for install-complete" command to wait for install complete. 

Actual results:

Getting below error:
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host 
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.12.0-0.nightly-2022-10-05-053337 
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.12.0-0.nightly-2022-10-05-053337 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for [] 
INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing 
INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error. 
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas 
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
INFO Cluster operator network ManagementStateDegraded is False with :  
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR 				The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR 				https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 

Expected results:

3+1 deployment should be successful.

Additional info:

I found that there is a condition in the machine-api-operator that checks that the worker node count is at least 2, which is preventing the 3+1 deployment.
https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L322 

The DVO metrics gatherer in the Insights operator relies on the "deployment-validation-operator" namespace name, but this is fragile, because the DVO can be installed in other namespaces (e.g. it is installed in the "openshift-operators" namespace when installed through OperatorHub).

Both `[sig-devex][Feature:ImageEcosystem][mysql][Slow] openshift mysql image Creating from a template should instantiate the template [apigroup:apps.openshift.io]` and `[sig-devex][Feature:ImageEcosystem][mariadb][Slow] openshift mariadb image Creating from a template should instantiate the template [apigroup:image.openshift.io][apigroup:operator.openshift.io][apigroup:config.openshift.io][apigroup:apps.openshift.io]` are repeatedly failing over multiple PRs.

More links in https://github.com/openshift/origin/pull/27502#issuecomment-1304613482

Opening this issue to temporarily skip the broken tests to unblock merging PRs in openshift/origin:master

More details in https://issues.redhat.com/browse/OCPBUGS-3339

Description of problem:

There's an argument count mismatch in the release_vif() call while reverting
the port association.

Version-Release number of selected component (if applicable):

 

How reproducible:

It's clear in the code, no need to reproduce this.

Steps to Reproduce:

1.
2.
3.

Actual results:

TypeError

Expected results:

KuryrPort released

Additional info:

 

This is a clone of issue OCPBUGS-10657. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10207. The following is the description of the original issue:

Description of problem:

When the releaseImage is a digest, for example quay.io/openshift-release-dev/ocp-release@sha256:bbf1f27e5942a2f7a0f298606029d10600ba0462a09ab654f006ce14d314cb2c, a spurious warning is output when running
openshift-install agent create image

It's not calculating the releaseImage properly (see the '@sha' suffix below), so it causes this spurious message:
WARNING The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value quay.io/openshift-release-dev/ocp-release@sha256 

This can cause confusion for users.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Every time when using a release image with a digest is used

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

A nil-pointer dereference occurred in the TestRouterCompressionOperation test in the e2e-gcp-operator CI job for the openshift/cluster-ingress-operator repository.

Version-Release number of selected component (if applicable):

4.12.

How reproducible:

Observed once. However, we run e2e-gcp-operator infrequently.

Steps to Reproduce:

1. Run the e2e-gcp-operator CI job on a cluster-ingress-operator PR.

Actual results:

 panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x14cabef]
goroutine 8048 [running]:
testing.tRunner.func1.2({0x1624920, 0x265b870})
	/usr/lib/golang/src/testing/testing.go:1389 +0x24e
testing.tRunner.func1()
	/usr/lib/golang/src/testing/testing.go:1392 +0x39f
panic({0x1624920, 0x265b870})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x40e43e5698?})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd8
panic({0x1624920, 0x265b870})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/cluster-ingress-operator/test/e2e.getHttpHeaders(0xc0002b9380?, 0xc0000e4540, 0x1)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:257 +0x2ef
github.com/openshift/cluster-ingress-operator/test/e2e.testContentEncoding.func1()
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:220 +0x57
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc00003f000})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:222 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x1b25d40?, 0xc000138000?}, 0xc000befe08?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:235 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x1b25d40, 0xc000138000}, 0x48?, 0xc4fa25?, 0x30?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:582 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext({0x1b25d40, 0xc000138000}, 0xc000b1da00?, 0xc000befe98?, 0x414207?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:528 +0x4a
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0xc00088cea0?, 0x3b9aca00?, 0xc000138000?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:514 +0x50
github.com/openshift/cluster-ingress-operator/test/e2e.testContentEncoding(0xc00088cea0, 0xc000a8a270, 0xc0000e4540, 0x1, {0x17fe569, 0x4})
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:219 +0xfc
github.com/openshift/cluster-ingress-operator/test/e2e.TestRouterCompressionOperation(0xc00088cea0)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:208 +0x454
testing.tRunner(0xc00088cea0, 0x191cdd0)
	/usr/lib/golang/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
	/usr/lib/golang/src/testing/testing.go:1486 +0x35f 

Expected results:

The test should pass.

Additional info:

The faulty logic was introduced in https://github.com/openshift/cluster-ingress-operator/pull/679/commits/211b9c15b1fd6217dee863790c20f34c26c138aa.
The test was subsequently marked as a parallel test in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc.
The job history shows that the e2e-gcp-operator job has only run once since June: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator. I see failures in May, but none of those failures shows the panic.

 

 

This is a clone of issue OCPBUGS-860. The following is the description of the original issue:

Description of problem:

In GCP, once an external IP address is assigned to a master/infra node through the GCP console, the number of pending CSRs from kubernetes.io/kubelet-serving keeps increasing, and the following errors are reported:

I0902 10:48:29.254427       1 controller.go:121] Reconciling CSR: csr-q7bwd
I0902 10:48:29.365774       1 csr_check.go:157] csr-q7bwd: CSR does not appear to be client csr
I0902 10:48:29.371827       1 csr_check.go:545] retrieving serving cert from build04-c92hb-master-1.c.openshift-ci-build-farm.internal (10.0.0.5:10250)
I0902 10:48:29.375052       1 csr_check.go:188] Found existing serving cert for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
I0902 10:48:29.375152       1 csr_check.go:192] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I0902 10:48:29.375166       1 csr_check.go:193] Current SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5], CSR SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5 35.211.234.95]
I0902 10:48:29.375175       1 csr_check.go:202] Falling back to machine-api authorization for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
E0902 10:48:29.375184       1 csr_check.go:420] csr-q7bwd: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.375193       1 csr_check.go:205] Could not use Machine for serving cert authorization: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.379457       1 csr_check.go:218] Falling back to serving cert renewal with Egress IP checks
I0902 10:48:29.382668       1 csr_check.go:221] Could not use current serving cert and egress IPs for renewal: CSR Subject Alternate Names includes unknown IP addresses
I0902 10:48:29.382702       1 controller.go:233] csr-q7bwd: CSR not authorized

Version-Release number of selected component (if applicable):

4.11.2

Steps to Reproduce:

1. Assign external IPs to master/infra node in GCP
2. oc get csr | grep kubernetes.io/kubelet-serving

Actual results:

CSRs are not approved

Expected results:

CSRs are approved

Additional info:

This issue only happens in GCP. The same OpenShift installations in AWS do not have this issue.

It looks like the CSRs are created using the external IP addresses once they are assigned.

Ref: https://coreos.slack.com/archives/C03KEQZC1L2/p1662122007083059

TL;DR

4.12 requires backport of commit:

commit 0111e1faec20d16505a110449966273b430b7ad1
Author: Surya Seetharaman <suryaseetharaman.9@gmail.com>
Date:   Tue Sep 6 21:20:57 2022 +0200

    Support AllocateLoadBalancerNodePortsFalse
    
    This PR supports having allocateloadbalancernodeports
    set to false along with etp=local on lgw mode.
    
    Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>

Analysis

A missing backport of LoadBalancerServiceHasNodePortAllocation into 4.12 causes problems with flow creation for these services, even in shared gateway mode.

This issue affects services with `allocateLoadBalancerNodePorts: false` in OCP 4.12.

Any deletion of services with `allocateLoadBalancerNodePorts: false` will fail and go into a 15 minute long retry loop. When one recreates a service while a failed deletion is still in progress, the flows on br-ex are not recreated.

Deletion will fail with:

(...)
obj_retry.go:257] Retry object setup: *factory.serviceForGateway <ns>/<service>
obj_retry.go:290] Removing old object: *factory.serviceForGateway <ns>/<service> (failed: %!s(uint8=<retry>))
(...)
obj_retry.go: 298] Retry delete failed for *factory.serviceForGateway <ns><service>, will try again later: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0

And while a deletion is still ongoing, add will fail with:

obj_retry.go: 476] Failed to delete old object <ns>/<service> of type *factory.serviceForGateway, during add event: error removing port claim for service: <ns>/<service>: invalid service port <service>, err: invalid port number: 0

ovnkube-node will retry 15 times with a 1 minute backoff before it gives up, and while this fails, the object cannot be recreated.

That also means that there are currently 2 workarounds for this (tested):

  • restart all ovnkube-node pods --> this will get rid of the bad cache entries and recreate the br-ex flows (see the sketch below)
  • delete the service. Wait for +15 minutes (until you no longer see the error message about failed deletion and retries) and recreate the service
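
A minimal sketch of the first workaround, assuming the usual app=ovnkube-node label on the ovnkube-node pods:

oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node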

The problem can easily be reproduced in 4.12, I tested this on 4.12.17 with SNO:

$ cat fedora-test.yaml 
---
apiVersion: v1
kind: Service
metadata:
  name: fedora-service
  labels:
    app: fedora-deployment
spec:
  selector:
    app: fedora-pod
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  sessionAffinity: None
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fedora-deployment
  labels:
    app: fedora-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fedora-pod
  template:
    metadata:
      labels:
        app: fedora-pod
    spec:
      containers:
      - name: fedora-a
        image: registry.fedoraproject.org/fedora:latest
        imagePullPolicy: Always
        command:
        - sleep
        - infinity
      - name: fedora-b
        image: registry.fedoraproject.org/fedora:latest
        imagePullPolicy: Always
        command:
        - sleep
        - infinity
oc apply -f fedora-test.yaml
oc delete svc fedora-service
oc apply -f fedora-test.yaml

Logs:

oc logs -n openshift-ovn-kubernetes ovnkube-node-4xg6w -c ovnkube-node -f | grep fedora-service
(...)
I0714 01:59:30.867309    9291 obj_retry.go:491] Creating *factory.serviceForGateway default/fedora-service took: 70.803µs
I0714 01:59:30.875170    9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-5bmf8 took: 15.941µs
I0714 01:59:30.875210    9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-5bmf8 took: 169ns
E0714 01:59:52.496754    9291 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:02.969493    9291 obj_retry.go:471] Detected stale object during new object add of type *factory.serviceForGateway with the same key: default/fedora-service
W0714 02:00:02.969523    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
I0714 02:00:02.971917    9291 obj_retry.go:491] Creating *factory.endpointSliceForGateway default/fedora-service-74vf8 took: 62.416µs
I0714 02:00:02.971926    9291 obj_retry.go:491] Creating *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-74vf8 took: 255ns
E0714 02:00:03.086557    9291 obj_retry.go:476] Failed to delete old object default/fedora-service of type *factory.serviceForGateway, during add event: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:13.982590    9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
I0714 02:00:13.982621    9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
I0714 02:00:14.104772    9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:22.397338    9291 obj_retry.go:571] Found retry entry for *factory.serviceForGateway default/fedora-service marked for deletion: will delete the object
W0714 02:00:22.397400    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
E0714 02:00:22.601603    9291 obj_retry.go:575] Failed to delete stale object default/fedora-service, during update: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0
I0714 02:00:43.980921    9291 obj_retry.go:257] Retry object setup: *factory.serviceForGateway default/fedora-service
I0714 02:00:43.980948    9291 obj_retry.go:290] Removing old object: *factory.serviceForGateway default/fedora-service (failed: %!s(uint8=1))
W0714 02:00:43.980976    9291 gateway_shared_intf.go:656] Delete service: no service found in cache for endpoint fedora-service in namespace default
I0714 02:00:44.199215    9291 obj_retry.go:298] Retry delete failed for *factory.serviceForGateway default/fedora-service, will try again later: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0

And the following watch shows that the flows are created initially, then upon deletion the flows vanish, then as the service is recreated the flows do not reappear:

watch "ovs-ofctl dump-flows br-ex | grep 192.168.18.100"

I can delete the ovnkube-node pod to recreate the flows:

oc delete pod -n openshift-ovn-kubernetes ovnkube-node-4xg6w

And the flows reappear:

[root@sno ~]# ovs-ofctl dump-flows br-ex | grep 192.168.18.100 
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,arp,in_port=1,arp_tpa=192.168.18.100,arp_op=1 actions=LOCAL
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=1,nw_dst=192.168.18.100,tp_dst=80 actions=output:2
 cookie=0x849b956ca97beaee, duration=27.925s, table=0, n_packets=0, n_bytes=0, idle_age=27, priority=110,tcp,in_port=2,nw_src=192.168.18.100,tp_src=80 actions=output:1

--------------------------------------

The problem does not manifest in 4.13. The difference between 4.12 and 4.13 is a missing backport of 0111e1faec20d16505a110449966273b430b7ad1.

Log for service deletion in OCP 4.13:

I0718 13:27:35.699982  334002 obj_retry.go:656] Delete event received for *factory.serviceForGateway default/fedora-service
I0718 13:27:35.700010  334002 gateway_shared_intf.go:679] Deleting service fedora-service in namespace default
I0718 13:27:35.769565  334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForGateway default/fedora-service-6hhds
I0718 13:27:35.769596  334002 gateway_shared_intf.go:856] Deleting endpointslice fedora-service-6hhds in namespace default
I0718 13:27:35.769610  334002 gateway_shared_intf.go:431] No serviceConfig found for service fedora-service in namespace default
I0718 13:27:35.769618  334002 obj_retry.go:656] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-6hhds

Log for service deletion in OCP 4.12:

I0718 13:28:14.253695   52007 obj_retry.go:653] Delete event received for *factory.serviceForGateway default/fedora-service
I0718 13:28:14.253717   52007 port_claim.go:197] Handle NodePort service fedora-service port 0
I0718 13:28:14.253726   52007 gateway_shared_intf.go:649] Deleting service fedora-service in namespace default
I0718 13:28:14.288844   52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForGateway default/fedora-service-2m857
I0718 13:28:14.288870   52007 gateway_shared_intf.go:817] Deleting endpointslice fedora-service-2m857 in namespace default
I0718 13:28:14.288876   52007 gateway_shared_intf.go:407] No serviceConfig found for service fedora-service in namespace default
I0718 13:28:14.288881   52007 obj_retry.go:653] Delete event received for *factory.endpointSliceForStaleConntrackRemoval default/fedora-service-2m857
E0718 13:28:14.402407   52007 obj_retry.go:673] Failed to delete *factory.serviceForGateway default/fedora-service, error: error removing port claim for service: default/fedora-service: invalid service port fedora-service, err: invalid port number: 0

Both 4.12 and 4.13 have similar code, and `handleService` looks the same as well:

  189 func handleService(svc *kapi.Service, handler handler) []error {                                                        
  190     errors := []error{}                                                                                                 
  191     if !util.ServiceTypeHasNodePort(svc) && len(svc.Spec.ExternalIPs) == 0 {                                            
  192         return errors                                                                                                   
  193     }                                                                                                                   
  194                                                                                                                         
  195     for _, svcPort := range svc.Spec.Ports {                                                                            
  196         if util.ServiceTypeHasNodePort(svc) {                                                                           
  197             klog.V(5).Infof("Handle NodePort service %s port %d", svc.Name, svcPort.NodePort) 

But ServiceTypeHasNodePort in 4.13 correctly takes allocateLoadBalancerNodePorts into account, whereas 4.12 does not:

go-controller/pkg/util/kube.go

  273 func LoadBalancerServiceHasNodePortAllocation(service *kapi.Service) bool {                                             
  274     return service.Spec.AllocateLoadBalancerNodePorts == nil || *service.Spec.AllocateLoadBalancerNodePorts             
  275 }   

  277 // ServiceTypeHasNodePort checks if the service has an associated NodePort or not                                       
  278 func ServiceTypeHasNodePort(service *kapi.Service) bool {                                                               
  279     return service.Spec.Type == kapi.ServiceTypeNodePort ||                                                             
  280         (service.Spec.Type == kapi.ServiceTypeLoadBalancer && LoadBalancerServiceHasNodePortAllocation(service))        
  281 }

In OCP 4.12:

  221 // ServiceTypeHasNodePort checks if the service has an associated NodePort or not                                       
  222 func ServiceTypeHasNodePort(service *kapi.Service) bool {                                                               
  223     return service.Spec.Type == kapi.ServiceTypeNodePort || service.Spec.Type == kapi.ServiceTypeLoadBalancer           
  224 }
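
For reference, a quick sketch to confirm the field on the reproduction service from fedora-test.yaml above (prints "false" when node port allocation is disabled):

oc get svc fedora-service -o jsonpath='{.spec.allocateLoadBalancerNodePorts}{"\n"}'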

This is a clone of issue OCPBUGS-2513. The following is the description of the original issue:

Description of problem:

Agent-based installation is failing in a disconnected environment because a pull secret is required for registry.ci.openshift.org. Since we are installing the cluster in a disconnected environment, only the mirror registry secrets should be needed for pulling the image.

Version-Release number of selected component (if applicable):

registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406

How reproducible:

Always

Steps to Reproduce:

1. Setup mirror registry with this registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406 release. 
2. Add the ICSP information to the install-config file (see the sketch after these steps)
3. Create agent.iso using install-config.yaml and agent-config.yaml
4. SSH to node zero to see the error in create-cluster-and-infraenv.service.
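
A minimal sketch of the ICSP information from step 2, assuming a hypothetical mirror registry hostname; the source matches the registry.ci.openshift.org release used above:

cat >> install-config.yaml <<'EOF'
imageContentSources:
- mirrors:
  - mirror.example.com:8443/ocp/release
  source: registry.ci.openshift.org/ocp/release
EOF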

Actual results:

create-cluster-and-infraenv.service is failing with below error:
 
time="2022-10-18T09:36:13Z" level=fatal msg="Failed to register cluster with assisted-service: AssistedServiceError Code: 400 Href:  ID: 400 Kind: Error Reason: pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\""

Expected results:

create-cluster-and-infraenv.service should be successfully started.

Additional info:

Refer this similar bug https://bugzilla.redhat.com/show_bug.cgi?id=1990659

This is a clone of issue OCPBUGS-8701. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8232. The following is the description of the original issue:

Description of problem:

oc patch project command is failing to annotate the project

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Run the below patch command to update the annotation on existing project
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "This is a new project"}}}'
~~~


Actual results:

It produces the error output below:
~~~
The Project "<PROJECT_NAME>" is invalid: * metadata.namespace: Invalid value: "<PROJECT_NAME>": field is immutable * metadata.namespace: Forbidden: not allowed on this type 
~~~ 

Expected results:

The `oc patch project` command should patch the project with specified annotation.

Additional info:

Tried to patch the project with OCP 4.11.26 version, and it worked as expected.
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "New project"}}}'

project.project.openshift.io/<PROJECT_NAME> patched
~~~

The issue is with OCP 4.12, where it is not working. 

 

This is a clone of issue OCPBUGS-14943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14668. The following is the description of the original issue:

Description of problem:

Visiting the global configurations page returns an error after 'Red Hat OpenShift Serverless' is installed; the error persists even after the operator is uninstalled.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-06-212044

How reproducible:

Always

Steps to Reproduce:

1. Subscribe 'Red Hat OpenShift Serverless' from OperatorHub, wait for the operator to be successfully installed
2. Visit Administration -> Cluster Settings -> Configurations tab

Actual results:

react_devtools_backend_compact.js:2367 unhandled promise rejection: TypeError: Cannot read properties of undefined (reading 'apiGroup') 
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
overrideMethod @ react_devtools_backend_compact.js:2367
window.onunhandledrejection @ main-chunk-e70ea3b3d562514df486.min.js:1

main-chunk-e70ea3b3d562514df486.min.js:1 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'apiGroup')
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1

 

Expected results:

no errors

Additional info:

 

Description of problem:

The cluster cannot be installed when updating the join network CIDR using v6InternalSubnet fdxx::/64 in manifests/cluster-network-03-config.yml

Version-Release number of selected component (if applicable):

v4.12

How reproducible:

Always

Steps to Reproduce:

Using v6InternalSubnet: fd66::/48 in manifests/cluster-network-03-config.yml to install a dual stack cluster:

cp manifests/cluster-network-02-config.yml manifests/cluster-network-03-config.yml
 sed -i 's/config.openshift.io\/v1/operator.openshift.io\/v1/g' manifests/cluster-network-03-config.yml
cat > ovn_kube_config <<HEREDOC
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      v6InternalSubnet: fd66::/48
HEREDOC
  sed -i $'/^status/{e cat ovn_kube_config\n}' manifests/cluster-network-03-config.yml 

Actual results:

Installation fail

Expected results:

Installation pass

Additional info:

 

Description of problem:

On the Make Serverless page, the values of the minpod, maxpod and concurrency input fields can only be changed by clicking the ‘+’ or ‘-’ buttons; they cannot be changed by typing into the fields.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

always

Steps to Reproduce:

1. Create a deployment workload from import from git
2. Right click on workload and select Make Serverless option
3. Check functioning of inputs minpod, maxpod etc.

Actual results:

The values of the minpod, maxpod and concurrency input fields can only be changed by clicking the ‘+’ or ‘-’ buttons; they cannot be changed by typing into the fields.

Expected results:

The values of the minpod, maxpod and concurrency input fields can be changed both by clicking the ‘+’ or ‘-’ buttons and by typing into the fields.

Additional info:

Works fine in v4.11

Assisted installations default to setting platform: baremetal. Using the REST API, it is possible to select vsphere (or ovirt) as the platform type. In every case, the actual platform data is filled in by assisted-service and cannot be specified by the user.

The ClusterDeployment resource (from Hive) contains a Platform field. We could look for a platform specified in this field and set that platform when creating the cluster in the create-cluster-and-infraenv service. If ZTP were ever to support other deployment methods, this would probably be a good choice for that also.

We should probably warn the user if they attempt to put any data inside the platform settings, as this will be ignored. This shouldn't be an error, though, as it would prevent users from using existing install configs. Perhaps it should be an error if they specify a platform we don't support.

 

Note: https://issues.redhat.com/browse/AGENT-284?focusedCommentId=21019997&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21019997 

[Pawan]: We can simply use the PlatformType from ACI and then no assisted service client changes are required. We will throw an error if the user provides an unsupported platformType ( aws, gcp, etc)

 

Ignoring the unwanted Platform settings from install-config.yaml to be handled in https://issues.redhat.com/browse/AGENT-348

Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-13940.

This is a clone of issue OCPBUGS-18472. The following is the description of the original issue:

Description of problem:

During OCP 4.12 to 4.13 upgrades, some pods are not able to reach the default kubernetes service at 172.30.0.1 and hang forever until the pods are manually restarted. This mainly affects dns-default-* pods, but sometimes also dns-operator-* pods.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Very often, 8 out of 10 upgrades.

Steps to Reproduce:

1. Deploy OCP 4.12 with latest GA on a baremetal cluster with IPI and OVN-K
2. Upgrade to latest 4.13 GA
3. Check cluster version status during the upgrade, at some point upgrade hangs for long time, usually with message "Working towards 4.13.X: 694 of 842 done (82% complete), waiting on dns"
4. Check for non-running pods and you might see pods in Crashing status
5. Check pod logs, it will show "https://172.30.0.1:443/api?timeout=32: dial tcp 172.30.0.1:443: i/o timeout"

Actual results:

Upgrade gets stuck or requires manual intervention to continue when pods remain in Crashing status.

Expected results:

Upgrade should be completed without issues, and pods should not remain stuck in Crashing status.

Additional info:

  • We have tested this with latest GA versions today: 4.12.31 to 4.13.10, but we have been observing this since 4.12.28
  • Our deployments have dualstack, but even with single stack IPv4 we have observed the issue.
  • The work-around has been to identify the pods in Crashing status and restart them, after which the upgrade continues; we haven't found additional errors in the journal logs of the nodes with the crashing pods, nor other pods misbehaving

This is an example of the latest run, upgrading 4.12.31 to 4.13.10. After some minutes the upgrade got stuck because the dns operator was Degraded, and when checking the pods of the openshift-dns namespace, the pod dns-default-k8hfl was Running but with only one container ready:

$ oc get clusterversion
NAME                                         VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS                                                              
clusterversion.config.openshift.io/version   4.12.31   True        True          102m    Working towards 4.13.10: 694 of 842 done (82% complete), waiting on dns

$  oc get co | grep 4.12.31
dns                                        4.12.31   True        True          False      3h45m   DNS "default" reports Progressing=True: "Have 6 available DNS pods, want 7.\nHave 6 up-to-date DNS pods, want 7."...
machine-config                             4.12.31   True        False         False      161m

$ oc -n openshift-dns get pods -o wide
NAME                  READY   STATUS    RESTARTS       AGE     IP              NODE       NOMINATED NODE   READINESS GATES
dns-default-7pghb     2/2     Running   0              14m     10.128.0.5      master-1   <none>           <none>
dns-default-b25vj     2/2     Running   0              15m     10.129.0.4      master-0   <none>           <none>
dns-default-k8hfl     1/2     Running   5 (119s ago)   12m     10.130.0.3      master-2   <none>           <none>
dns-default-mrvh9     2/2     Running   0              13m     10.128.2.5      worker-3   <none>           <none>
dns-default-pnf8w     2/2     Running   0              15m     10.130.2.4      worker-1   <none>           <none>
dns-default-px4cn     2/2     Running   4              3h16m   10.129.2.6      worker-2   <none>           <none>
dns-default-rvj6k     2/2     Running   0              14m     10.131.0.4      worker-0   <none>           <none>
node-resolver-p6465   1/1     Running   0              16m     192.168.22.24   worker-0   <none>           <none>
node-resolver-q8t6l   1/1     Running   0              16m     192.168.22.23   master-2   <none>           <none>
node-resolver-qb8sm   1/1     Running   0              16m     192.168.22.21   master-0   <none>           <none>
node-resolver-rklnq   1/1     Running   0              16m     192.168.22.25   worker-1   <none>           <none>
node-resolver-rlbxc   1/1     Running   0              16m     192.168.22.22   master-1   <none>           <none>
node-resolver-w7x4b   1/1     Running   0              16m     192.168.22.27   worker-3   <none>           <none>
node-resolver-wb8tt   1/1     Running   0              16m     192.168.22.26   worker-2   <none>           <none>

When checking the pod logs we see the dns container complains about not being able to reach the endpoint https://172.30.0.1:443/version, and when testing from the other container we confirmed we cannot reach that URL. But when we test directly from the node running that pod we can reach the endpoint, and other pods running on the node do not have issues either.

$ oc -n openshift-dns logs dns-default-k8hfl
Defaulted container "dns" out of: dns, kube-rbac-proxy
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:5353
hostname.bind.:5353
[INFO] plugin/reload: Running configuration SHA512 = e100c1081a47648310f72de96fbdbe31f928f02784eda1155c53be749ad04c434e50da55f960a800606274fb080d8a1f79df7effa47afa9a02bddd9f96192e18
CoreDNS-1.10.1
linux/amd64, go1.19.10 X:strictfipsruntime, 
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

$ oc -n openshift-dns exec -ti dns-default-k8hfl -c kube-rbac-proxy -- /bin/bash                                      
bash-4.4$ curl https://172.30.0.1:443/readyz
curl: (7) Failed to connect to 172.30.0.1 port 443: Connection timed out

[core@master-2 ~]$ curl -k https://172.30.0.1:443/readyz
ok

If we delete that pod, it gets recreated and this time both containers in the pod are running, which unblocks the upgrade and continues with the next cluster operator, until it finishes.

$ oc -n openshift-dns delete pod dns-default-k8hfl
pod "dns-default-k8hfl" deleted 

[kni@provisioner.cluster2.dfwt5g.lab ~]$ oc -n openshift-dns get pods -o wide                                                                                                                                                                
NAME                  READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES                                                                                                                       
dns-default-7pghb     2/2     Running   0          21m     10.128.0.5      master-1   <none>           <none>                                                                                                                                
dns-default-b25vj     2/2     Running   0          22m     10.129.0.4      master-0   <none>           <none>                                                                                                                                
dns-default-l7v9r     2/2     Running   0          51s     10.130.0.5      master-2   <none>           <none>                                                                                                                                
dns-default-mrvh9     2/2     Running   0          20m     10.128.2.5      worker-3   <none>           <none>                                                                                                                                
dns-default-pnf8w     2/2     Running   0          23m     10.130.2.4      worker-1   <none>           <none>
dns-default-vlgtb     2/2     Running   0          14s     10.129.2.6      worker-2   <none>           <none>
dns-default-rvj6k     2/2     Running   0          22m     10.131.0.4      worker-0   <none>           <none>                                                                                                                                
node-resolver-p6465   1/1     Running   0          23m     192.168.22.24   worker-0   <none>           <none>                                                                                                                                
node-resolver-q8t6l   1/1     Running   0          23m     192.168.22.23   master-2   <none>           <none>                                                                                                                                
node-resolver-qb8sm   1/1     Running   0          23m     192.168.22.21   master-0   <none>           <none>                                                                                                                                
node-resolver-rklnq   1/1     Running   0          23m     192.168.22.25   worker-1   <none>           <none>                                                                                                                                
node-resolver-rlbxc   1/1     Running   0          23m     192.168.22.22   master-1   <none>           <none>                                                                                                                                
node-resolver-w7x4b   1/1     Running   0          23m     192.168.22.27   worker-3   <none>           <none>                                                                                                                                
node-resolver-wb8tt   1/1     Running   0          23m     192.168.22.26   worker-2   <none>           <none> 

$ oc get co | grep 4.12.31
Cluster Operators
machine-config                             4.12.31   True        False         False      169m

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.31   True        True          112m    Working towards 4.13.10: 714 of 842 done (84% complete), waiting on machine-config

https://github.com/openshift/origin/pull/27444 was intended to move the scaling test out of serial to its own test suite, but it added it to parallel, meaning it runs in all our normal upgrade jobs, causing them to frequently fail with repeating pathological events as well as greatly increasing their run time.

See https://github.com/openshift/origin/pull/27444#discussion_r991296925 for more info

This is a clone of issue OCPBUGS-7102. The following is the description of the original issue:

Description of problem:

https://github.com/openshift/operator-framework-olm/blob/7ec6b948a148171bd336750fed98818890136429/staging/operator-lifecycle-manager/pkg/controller/operators/olm/plugins/downstream_csv_namespace_labeler_plugin_test.go#L309

has a dependency on creation of a next-version release branch.

 

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1. clone operator-framework/operator-framework-olm
2. make unit/olm
3. deal with a really bumpy first-time kubebuilder/envtest install experience
4. profit

 

 

Actual results:

error

Expected results:

pass

Additional info:

 

 

This is a clone of issue OCPBUGS-15721. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14875. The following is the description of the original issue:

Description of problem:

If a JSON schema used by a chart contains an unknown value format (non-standard in JSON Schema but valid in the OpenAPI spec, for example), the Helm form view hangs on validation and stays in the "submitting" state.

 

As per the JSON Schema standard, the "format" keyword should only take an advisory role (like an annotation) and should not affect validation.

https://json-schema.org/understanding-json-schema/reference/string.html#format 
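
A minimal sketch of the kind of schema fragment that triggers this, written to a hypothetical values.schema.json; the property name is an assumption, and "int32" is an OpenAPI-style format that is not part of standard JSON Schema:

cat > values.schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "replicas": {
      "type": "integer",
      "format": "int32"
    }
  }
}
EOF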

Version-Release number of selected component (if applicable):

Verified against 4.13, but probably applies to others.

How reproducible:

100%

Steps to Reproduce:

1. Go to Helm tab.
2. Click create in top right and select Repository
3. Paste following into YAML view and click Create:

apiVersion: helm.openshift.io/v1beta1
kind: ProjectHelmChartRepository
metadata:
  name: reproducer
spec:
  connectionConfig:
    url: 'https://raw.githubusercontent.com/tumido/helm-backstage/repo-multi-schema2'

4. Go to the Helm tab again (if redirected elsewhere)
5. Click create in top right and select Helm Release
6. In catalog filter select Chart repositories: Reproducer
7. Click on the single tile available (Backstage) and click Create
8. Switch to Form view
9. Leave default values and click Create
10. Stare at the always loading screen that never proceeds further.

Actual results:

The form never finishes submitting and never displays any error in the UI.

Expected results:

An unknown format should not result in rejected validation. The JSON Schema standard says that formats should not be used for validation.

Additional info:

This is not a schema violation by itself since Helm itself is happy about it and doesn't complain. The same chart can be successfully deployed via the YAML view.

This is a clone of issue OCPBUGS-1704. The following is the description of the original issue:

Description of problem:

According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-25-071630

How reproducible:

Always, if the Service Usage API is disabled in the GCP project.

Steps to Reproduce:

1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project. 

Actual results:

The installation would fail finally, without any worker machines launched.

Expected results:

Installation should succeed, or the OCP doc should be updated.

Additional info:

Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. 
FYI, if the API is enabled, without changing anything else, the installation succeeds.
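
For reference, a minimal sketch of checking and enabling the API with gcloud (the project ID is a placeholder):

gcloud services list --enabled --project <project-id> | grep serviceusage
gcloud services enable serviceusage.googleapis.com --project <project-id>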

Grafana has been removed in 4.11 and we can safely remove any logic in CMO that deals with Grafana (except dashboards since they are used by OCP console).

Another point to clarify is to communicate to ProdSec and ART that Grafana isn't part of OCP anymore.

This is a clone of issue OCPBUGS-10391. The following is the description of the original issue:

Description of problem:

When installing SNO with bootstrap-in-place, CVO hangs for 6 minutes waiting for the lease.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.Run the POC using the makefile here https://github.com/eranco74/bootstrap-in-place-poc
2. Observe the CVO logs post reboot
3.

Actual results:

I0102 09:45:53.131061       1 leaderelection.go:248] attempting to acquire leader lease openshift-cluster-version/version...
I0102 09:51:37.219685       1 leaderelection.go:258] successfully acquired lease openshift-cluster-version/version

Expected results:

Expected the bootstrap CVO to release the lease so that the CVO running post-reboot won't have to wait for the lease duration.

Additional info:

POC (hack) that remove the lease and allows CVO to start immediately:
https://github.com/openshift/installer/pull/6757/files#diff-f12fbadd10845e6dab2999e8a3828ba57176db10240695c62d8d177a077c7161R38-R48
  
Slack thread:
https://redhat-internal.slack.com/archives/C04HSKR4Y1X/p1673345953183709
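A minimal sketch of the workaround idea (this is not the merged fix; the lease name and namespace are taken from the CVO log above): deleting the stale lease lets the post-reboot CVO acquire leadership immediately instead of waiting out the lease duration.

oc -n openshift-cluster-version delete lease version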

This is a clone of issue OCPBUGS-17107. Grant Spence manually made this bug for tracking 4.12 backport.

The following is the description of the original issue:

Description of problem:

per oc set route-backends -h output:
Routes may have one or more optional backend services with weights controlling how much traffic flows to each service.
[...]
**If all weights are zero the route will not send traffic to any backends.**

This is no longer the case for a route with a single backend.

Version-Release number of selected component (if applicable):

at least from OCP 4.12 onward

How reproducible:

all the time

Steps to Reproduce:

1. kubectl create -f example/
2. kubectl patch route example -p '{"spec":{"to": {"weight": 0}}}' --type merge
3. curl http://localhost -H "Host: example.local" 

Actual results:

curl succeeds

Expected results:

curl fails

Additional info:

https://access.redhat.com/support/cases/#/case/03567697

This is a regression following NE-822. Reverting
https://github.com/openshift/router/commit/9656da7d5e2ac0962f3eaf718ad7a8c8b2172cfa makes it work again.
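For reference, a minimal Route manifest (hypothetical names) with a single backend whose weight is 0; per the documented behavior quoted above, such a route should send no traffic:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example
spec:
  host: example.local
  to:
    kind: Service
    name: example
    weight: 0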

This is a clone of issue OCPBUGS-10892. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8203. The following is the description of the original issue:

When processing an install-config containing either BMC passwords in the baremetal platform config, or a vSphere password in the vsphere platform config, we log a warning message to say that the value is ignored.

This warning currently includes the value in the password field, which may be inconvenient for users reusing IPI configs who don't want their password values to appear in logs.
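For context, a hedged sketch of the kind of install-config fragment involved (hypothetical values); the bmc password below is the value that currently ends up echoed in the warning:

platform:
  baremetal:
    hosts:
    - name: master-0
      role: master
      bmc:
        address: ipmi://192.168.111.1
        username: admin
        password: not-so-secret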

Description of problem:

If a master fails and is drained, the old copy of the metal3 pod gets stuck in Terminating state for some (possibly long) time. While the new pod works correctly, CBO expects only one pod to exist and thus cannot determine the applicable Ironic IP address.

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. On dev-scripts: virsh destroy <VM with metal3 pod>
2. Wait for drain to happen or trigger it manually
3. Check CBO logs

Actual results:

"unable to determine Ironic's IP to pass to the machine-image-customization-controller: there should be only one pod listed for the given label"

Expected results:

CBO reconfigures its pods with the new Ironic IP

Additional info:

I don't know how to filter out pods in Terminating state...
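One possible approach (a sketch, not from the report): pods being deleted carry a deletionTimestamp, so they can be filtered out client-side, for example:

oc -n openshift-machine-api get pods -l <metal3-pod-label> -o json \
  | jq '.items[] | select(.metadata.deletionTimestamp == null) | .metadata.name'

where <metal3-pod-label> is a placeholder for the label selector CBO already uses.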

This is a clone of issue OCPBUGS-6647. The following is the description of the original issue:

Description of problem:

Resource type drop-down menu item 'Last used' is in English

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Navigate to kube:admin -> User Preferences -> Applications
2. Click on the Resource type drop-down

Actual results:

Content is in English

Expected results:

Content should be in target language

Additional info:

Screenshot reference provided

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.

This is particularly pronounced in SNO topologies, leading to installation failures. 

The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.

This is a clone of issue OCPBUGS-8381. The following is the description of the original issue:

Description of problem:

On a hypershift cluster that has public certs for OAuth configured, the console reports a x509 certificate error when attempting to display a token

Version-Release number of selected component (if applicable):

4.12.z

How reproducible:

always

Steps to Reproduce:

1. Create a hosted cluster configured with a letsencrypt certificate for the oauth endpoint.
2. Go to the console of the hosted cluster. Click on the user icon and get token.

Actual results:

The console displays an oauth cert error

Expected results:

The token displays

Additional info:

The hcco reconciles the oauth cert into the console namespace. However, it is only reconciling the self-signed one and not the one that was configured through .spec.configuration.apiserver of the hostedcluster. It needs to detect the actual cert used for oauth and send that one.
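For context, a hedged sketch of the HostedCluster fragment referenced above (hypothetical names); the named certificate configured here is the one HCCO should propagate instead of the self-signed one:

spec:
  configuration:
    apiserver:
      servingCerts:
        namedCertificates:
        - names:
          - oauth.example.hypershift.example.com
          servingCertificate:
            name: oauth-serving-cert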

 

Description of problem:

The IPI installation in some regions fails during bootstrap, with no node available/ready.

Version-Release number of selected component (if applicable):

12-22 16:22:27.970  ./openshift-install 4.12.0-0.nightly-2022-12-21-202045
12-22 16:22:27.970  built from commit 3f9c38a5717c638f952df82349c45c7d6964fcd9
12-22 16:22:27.970  release image registry.ci.openshift.org/ocp/release@sha256:2d910488f25e2638b6d61cda2fb2ca5de06eee5882c0b77e6ed08aa7fe680270
12-22 16:22:27.971  release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. try the IPI installation in the problem regions (so far tried and failed with ap-southeast-2, ap-south-1, eu-west-1, ap-southeast-6, ap-southeast-3, ap-southeast-5, eu-central-1, cn-shanghai, cn-hangzhou and cn-beijing) 

Actual results:

Bootstrap failed to complete

Expected results:

Installation in those regions should succeed.

Additional info:

FYI the QE flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/166672/

No node is available/ready, and no operator is available.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          30m     Unable to apply 4.12.0-0.nightly-2022-12-21-202045: an unknown error has occurred: MultipleErrors
$ oc get nodes
No resources found
$ oc get machines -n openshift-machine-api -o wide
NAME                         PHASE   TYPE   REGION   ZONE   AGE   NODE   PROVIDERID   STATE
jiwei-1222f-v729x-master-0                                  30m                       
jiwei-1222f-v729x-master-1                                  30m                       
jiwei-1222f-v729x-master-2                                  30m                       
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication
baremetal
cloud-controller-manager                                                                          
cloud-credential                                                                                  
cluster-autoscaler                                                                                
config-operator                                                                                   
console                                                                                           
control-plane-machine-set                                                                         
csi-snapshot-controller                                                                           
dns                                                                                               
etcd                                                                                              
image-registry                                                                                    
ingress                                                                                           
insights                                                                                          
kube-apiserver                                                                                    
kube-controller-manager                                                                           
kube-scheduler                                                                                    
kube-storage-version-migrator                                                                     
machine-api                                                                                       
machine-approver                                                                                  
machine-config                                                                                    
marketplace                                                                                       
monitoring                                                                                        
network                                                                                           
node-tuning                                                                                       
openshift-apiserver                                                                               
openshift-controller-manager                                                                      
openshift-samples                                                                                 
operator-lifecycle-manager                                                                        
operator-lifecycle-manager-catalog                                                                
operator-lifecycle-manager-packageserver
service-ca
storage
$

Master nodes don't run, for example, the kubelet and crio services.
[core@jiwei-1222f-v729x-master-0 ~]$ sudo crictl ps
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/crio/crio.sock: connect: no such file or directory" 
[core@jiwei-1222f-v729x-master-0 ~]$ 

The machine-config-daemon firstboot reports "failed to update OS".
[jiwei@jiwei log-bundle-20221222085846]$ grep -Ei 'error|failed' control-plane/10.0.187.123/journals/journal.log 
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:16 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Dec 22 16:24:18 localhost ignition[867]: failed to fetch config: resource requires networking
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:18 localhost ignition[891]: GET error: Get "http://100.100.100.200/latest/user-data": dial tcp 100.100.100.200:80: connect: network is unreachable
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <info>  [1671726259.0329] hostname: hostname: hostnamed not used as proxy creation failed with: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain NetworkManager[919]: <warn>  [1671726259.0464] sleep-monitor-sd: failed to acquire D-Bus proxy: Could not connect: No such file or directory
Dec 22 16:24:19 localhost.localdomain ignition[891]: GET error: Get "https://api-int.jiwei-1222f.alicloud-qe.devcluster.openshift.com:22623/config/master": dial tcp 10.0.187.120:22623: connect: connection refused
...repeated logs omitted...
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-ctl[1888]: 2022-12-22T16:27:46Z|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 ovs-vswitchd[1888]: ovs|00001|dns_resolve|WARN|Failed to read /etc/resolv.conf: No such file or directory
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 dbus-daemon[1669]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1924]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[1937]: Error: Device '' not found.
Dec 22 16:27:46 jiwei-1222f-v729x-master-0 nm-dispatcher[2037]: Error: Device '' not found.
Dec 22 08:35:32 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:35:32.477770    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-910221290 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 rpm-ostree[2288]: Txn Rebase on /org/projectatomic/rpmostree1/rhcos failed: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: W1222 08:56:06.785425    2181 firstboot_complete_machineconfig.go:46] error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511 : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411e6e3be017538859cfbd7b5cd57fc87e5fee58f15df19ed3ec11044ebca511: Warning: The unit file, source configuration file or drop-ins of rpm-ostreed.service changed on disk. Run 'systemctl daemon-reload' to reload units.
Dec 22 08:56:06 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: error: remote error: Get "https://quay.io/v2/openshift-release-dev/ocp-v4.0-art-dev/blobs/sha256:27f262e70d98996165748f4ab50248671d4a4f97eb67465cd46e1de2d6bd24d0": net/http: TLS handshake timeout
Dec 22 08:57:31 jiwei-1222f-v729x-master-0 machine-config-daemon[2181]: Warning: failed, retrying in 1s ... (1/2)I1222 08:57:31.244684    2181 run.go:19] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-extensions/os-extensions-content-4021566291 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:259d8c6b9ec714d53f0275db9f2962769f703d4d395afb9d902e22cfe96021b0
Dec 22 08:59:20 jiwei-1222f-v729x-master-0 systemd[2353]: /usr/lib/systemd/user/podman-kube@.service:10: Failed to parse service restart specifier, ignoring: never
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2437]: Error: open default: no such file or directory
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 podman[2450]: Error: failed to start API service: accept unixgram @00026: accept4: operation not supported
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman-kube@default.service: Failed with result 'exit-code'.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: Failed to start A template for running K8s workloads via podman-play-kube.
Dec 22 08:59:21 jiwei-1222f-v729x-master-0 systemd[2353]: podman.service: Failed with result 'exit-code'.
[jiwei@jiwei log-bundle-20221222085846]$ 

 

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

An intra-namespace allow network policy doesn't work after applying an ingress & egress deny-all network policy.

Version-Release number of selected component (if applicable):

  OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
1. Define a deny-all network policy for egress and ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress 

3. Test the connectivity between two pods from the namespace.
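For step 3, a hypothetical connectivity check between two pods in the namespace (pod names, IP, and port are placeholders):

oc -n <namespace> exec pod-a -- curl -sS --max-time 5 http://<pod-b-ip>:8080/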

Actual results:

   The connectivity is not allowed

Expected results:

  The connectivity should be allowed between pods from the same namespace.

Additional info:

  After performing a test and analyzing SDN flows for the namespace: 

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30 

 

Description of problem:

It seems that we don't correctly update the network data secret version in the PreprovisioningImage, resulting in BMO assuming that the image is still stale, while the image-customization-controller assumes it's done. As a result, the host is stuck in inspecting.

How reproducible:

What I think I did is to add a network data secret to a host which already has a preprovisioningimage previously created. I need to check if I can repeat it.

Actual results:

Host in inspecting, BMO logs show

{"level":"info","ts":"2023-05-11T11:52:52.348Z","logger":"controllers.BareMetalHost","msg":"network data in pre-provisioning image is out of date","baremetalhost":"openshift-machine-api/oste
st-extraworker-0","provisioningState":"inspecting","latestVersion":"9055823","currentVersion":"9055820"}

Indeed, the image has the old version:

status:
  architecture: x86_64
  conditions:
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: Generated image
    observedGeneration: 1
    reason: ImageSuccess
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-05-11T11:27:51Z"
    message: ""
    observedGeneration: 1
    reason: ImageSuccess
    status: "False"
    type: Error
  format: iso
  imageUrl: http://metal3-image-customization-service.openshift-machine-api.svc.cluster.local/231b39d5-1b83-484c-9096-aa87c56a222a
  networkData:
    name: ostest-extraworker-0-network-config-secret
    version: "9055820"

What I find puzzling is that we even have two versions of the secret. I only created it once.

Description of problem:
`oc-mirror` does not work as expected with a relative path for OCI format copy.

How reproducible:
always

Steps to Reproduce:
Copy the operator image with OCI format to localhost with the relative path my-oci-catalog:

cat imageset-copy.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.12
    packages:
    - name: aws-load-balancer-operator

`oc mirror --config imageset-copy.yaml --use-oci-feature --oci-feature-action=copy oci://my-oci-catalog`

Actual results:
2. A directory named my-file-catalog is created, but the user-specified directory my-oci-catalog is not used.
ls -tl
total 20
drwxr-xr-x. 3 root root 4096 Dec 6 13:58 oc-mirror-workspace
drwxr-xr-x. 3 root root 4096 Dec 6 13:58 olm_artifacts
drwxr-x---. 3 root root 4096 Dec 6 13:58 my-file-catalog
drwxr-xr-x. 2 root root 4096 Dec 6 13:58 my-oci-catalog
-rw-r--r--. 1 root root 206 Dec 6 12:39 imageset-copy.yaml

Expected results:
2. The user-specified directory should be used.

Additional info:
`oc-mirror --config config-operator.yaml oci:///home/ocmirrortest/noo --use-oci-feature --oci-feature-action=copy` with a full path works well.

This is a clone of issue OCPBUGS-6651. The following is the description of the original issue:

Description of problem:

When running a hypershift HostedCluster with a publicAndPrivate / private setup behind a proxy, Nodes never go ready.

ovn-kubernetes pods fail to run because the init container fails.

[root@ip-10-0-129-223 core]# crictl logs cf142bb9f427d
+ [[ -f /env/ ]]
++ date -Iseconds
2023-01-25T12:18:46+00:00 - checking sbdb
+ echo '2023-01-25T12:18:46+00:00 - checking sbdb'
+ echo 'hosts: dns files'
+ proxypid=15343
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ sbdb_ip=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645
+ retries=0
+ ovn-sbctl --no-leader-only --timeout=5 --db=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt get-connection
+ exec socat TCP-LISTEN:9645,reuseaddr,fork PROXY:10.0.140.167:ovnkube-sbdb.apps.agl-proxy.hypershift.local:443,proxyport=3128
ovn-sbctl: ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645: database connection failed ()
+ ((  retries += 1  ))


Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always.

Steps to Reproduce:

1. Create a publicAndPrivate hypershift HostedCluster behind a proxy, e.g.:
➜  hypershift git:(main) ✗ ./bin/hypershift create cluster \
aws --pull-secret ~/www/pull-secret-ci.txt \
--ssh-key ~/.ssh/id_ed25519.pub \
--name agl-proxy \
--aws-creds ~/www/config/aws-osd-hypershift-creds \
--node-pool-replicas=3 \
--region=us-east-1 \
--base-domain=agl.hypershift.devcluster.openshift.com \
--zones=us-east-1a \
--endpoint-access=PublicAndPrivate \
--external-dns-domain=agl-services.hypershift.devcluster.openshift.com --enable-proxy=true

2. Get the kubeconfig for the guest cluster. E.g
kubectl get secret -nclusters agl-proxy-admin-kubeconfig  -oyaml

3. Get pods in the guest cluster.
See ovnkube-node pods init container failing with
[root@ip-10-0-129-223 core]# crictl logs cf142bb9f427d
+ [[ -f /env/ ]]
++ date -Iseconds
2023-01-25T12:18:46+00:00 - checking sbdb
+ echo '2023-01-25T12:18:46+00:00 - checking sbdb'
+ echo 'hosts: dns files'
+ proxypid=15343
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ sbdb_ip=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645
+ retries=0
+ ovn-sbctl --no-leader-only --timeout=5 --db=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt get-connection
+ exec socat TCP-LISTEN:9645,reuseaddr,fork PROXY:10.0.140.167:ovnkube-sbdb.apps.agl-proxy.hypershift.local:443,proxyport=3128
ovn-sbctl: ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645: database connection failed ()
+ ((  retries += 1  ))

To create a bastion and ssh into the Nodes, see https://hypershift-docs.netlify.app/how-to/debug-nodes/

Actual results:

Nodes unready

Expected results:

Nodes go ready

Additional info:

 

Description of problem:

The alertmanager pod is stuck in ContainerCreating state on OCP 4.11 with OVN.

From oc describe alertmanager pod:
...
Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  16s (x459 over 17h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift

Version-Release number of selected component (if applicable):

OCP 4.11 with OVN

How reproducible:

100%

Steps to Reproduce:

1. Terminate the node on which the alertmanager pod is running
2. The pod gets stuck in ContainerCreating state
3.

Actual results:

The alertmanager pod is stuck in ContainerCreating state

Expected results:

Alertmanager pod is ready

Additional info:

The workaround would be to terminate the alertmanager pod
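A sketch of that workaround, using the namespace and pod name from the event above:

oc -n openshift-storage delete pod alertmanager-managed-ocs-alertmanager-0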

Similar to how we generate the kubeconfig at the same time as the ISO, we should also generate the admin password.

This will require changes to the installer to allow assisted-service to pass at least the hash of the password in to the installer process that generates the bootstrap ignition, similar in concept to the changes made to pass the kubeconfig.

This is a clone of issue OCPBUGS-3441. The following is the description of the original issue:

Update the cluster-authentication-operator to not go degraded when it can’t determine the console url.  This risks masking certain cases where we would want to raise an error to the admin, but the expectation is that this failure mode is rare.

Risk could be avoided by looking at ClusterVersion's enabledCapabilities to decide if missing Console was expected or not (unclear if the risk is high enough to be worth this amount of effort).

AC: Update the cluster-authentication-operator to not go degraded when console config CRD is missing and ClusterVersion config has Console in enabledCapabilities.

This is a clone of issue OCPBUGS-4026. The following is the description of the original issue:

Description of problem:
There is an endless re-render loop, and the browser becomes slow and eventually gets stuck when opening the add page or the topology.

Saw also endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds

Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)

How reproducible:
Always with installed SBO

But the "stuck feeling" depends on the browser (Firefox feels more stuck) and your locale machine power

Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"

apiVersion: binding.operators.coreos.com/v1alpha1
kind: BindableKinds
metadata:
  name: bindable-kinds

3. Open the browser console log
4. Open the console UI and navigate to the add page

Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and get stuck after some time
3. The page crashes after some time

Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash

Additional info:
Get introduced after we watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161

It looks like this happens only if the SBO is installed and the bindable-kinds resource exists but doesn't contain any status.

The status list all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.

I haven't gone back to pin down all affected versions, but I wouldn't be surprised if we've had this exposure for a while. On a 4.12.0-ec.2 cluster, we have:

cluster:usage:resources:sum{resource="podnetworkconnectivitychecks.controlplane.operator.openshift.io"}

currently clocking in around 67983. I've gathered a dump with:

$ oc --as system:admin -n openshift-network-diagnostics get podnetworkconnectivitychecks.controlplane.operator.openshift.io | gzip >checks.gz

And many, many of these reference nodes which no longer exist (the cluster is aggressively autoscaled, with nodes coming and going all the time). We should fix garbage collection on this resource, to avoid consuming excessive amounts of memory in the Kube API server and etcd as they attempt to list the large resource set.
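A rough inspection sketch (not from the report, and only approximate since it relies on node names being embedded in the check names) to gauge how many checks reference nodes that no longer exist:

oc get nodes -o name | sed 's|node/||' > /tmp/nodes.txt
oc --as system:admin -n openshift-network-diagnostics \
  get podnetworkconnectivitychecks.controlplane.operator.openshift.io -o name \
  | grep -v -F -f /tmp/nodes.txt | wc -l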

Description of problem:

Installing a single-node cluster on AWS and then enabling TechPreview breaks the cluster.
The CMA and CAPI CMA shouldn't be on the same port.

Version-Release number of selected component (if applicable):

4.11.9

How reproducible:

always

Steps to Reproduce:

1.Launch 4.11.9 single node cluster on AWS
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.9    True        False         34m     Cluster version is 4.11.9
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.9    True        False         False      31m     
baremetal                                  4.11.9    True        False         False      49m     
cloud-controller-manager                   4.11.9    True        False         False      52m     
cloud-credential                           4.11.9    True        False         False      53m     
cluster-autoscaler                         4.11.9    True        False         False      48m     
config-operator                            4.11.9    True        False         False      50m     
console                                    4.11.9    True        False         False      37m     
csi-snapshot-controller                    4.11.9    True        False         False      49m     
dns                                        4.11.9    True        False         False      48m     
etcd                                       4.11.9    True        False         False      47m     
image-registry                             4.11.9    True        False         False      43m     
ingress                                    4.11.9    True        False         False      86s     
insights                                   4.11.9    True        False         False      43m     
kube-apiserver                             4.11.9    True        False         False      43m     
kube-controller-manager                    4.11.9    True        False         False      47m     
kube-scheduler                             4.11.9    True        False         False      44m     
kube-storage-version-migrator              4.11.9    True        False         False      50m     
machine-api                                4.11.9    True        False         False      44m     
machine-approver                           4.11.9    True        False         False      49m     
machine-config                             4.11.9    True        False         False      49m     
marketplace                                4.11.9    True        False         False      48m     
monitoring                                 4.11.9    True        False         False      56s     
network                                    4.11.9    True        False         False      52m     
node-tuning                                4.11.9    True        False         False      49m     
openshift-apiserver                        4.11.9    True        False         False      72s     
openshift-controller-manager               4.11.9    True        False         False      39m     
openshift-samples                          4.11.9    True        False         False      43m     
operator-lifecycle-manager                 4.11.9    True        False         False      49m     
operator-lifecycle-manager-catalog         4.11.9    True        False         False      49m     
operator-lifecycle-manager-packageserver   4.11.9    True        False         False      104s    
service-ca                                 4.11.9    True        False         False      50m     
storage                                    4.11.9    True        False         False      49m     
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-137-222.us-east-2.compute.internal   Ready    master,worker   53m   v1.24.0+dc5a2fd

2.Enable TechPreview
spec:
  featureSet: TechPreviewNoUpgrade

liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate                           
featuregate.config.openshift.io/cluster edited

3.Check the cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get pod  -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS       AGE
aws-cloud-controller-manager-5888c85fc6-28tgt   1/1     Running   12 (10m ago)   55m
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion                            
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.9    True        False         111m    Error while reconciling 4.11.9: the workload openshift-cluster-machine-approver/machine-approver-capi has not yet successfully rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.9    False       False         False      9m44s   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.huliu-aws411arn2.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
baremetal                                  4.11.9    True        False         False      128m    
cloud-controller-manager                   4.11.9    True        False         False      131m    
cloud-credential                           4.11.9    True        False         False      133m    
cluster-api                                4.11.9    True        False         False      41m     
cluster-autoscaler                         4.11.9    True        False         False      128m    
config-operator                            4.11.9    True        False         False      129m    
console                                    4.11.9    False       True          False      10m     DeploymentAvailable: 0 replicas available for console deployment...
csi-snapshot-controller                    4.11.9    True        False         False      4m52s   
dns                                        4.11.9    True        False         False      128m    
etcd                                       4.11.9    True        False         False      127m    
image-registry                             4.11.9    True        False         False      123m    
ingress                                    4.11.9    True        False         False      3m15s   
insights                                   4.11.9    True        False         False      122m    
kube-apiserver                             4.11.9    True        False         False      123m    
kube-controller-manager                    4.11.9    True        False         False      126m    
kube-scheduler                             4.11.9    True        False         False      124m    
kube-storage-version-migrator              4.11.9    True        False         False      129m    
machine-api                                4.11.9    True        False         False      124m    
machine-approver                           4.11.9    True        False         False      128m    
machine-config                             4.11.9    True        False         False      129m    
marketplace                                4.11.9    True        False         False      128m    
monitoring                                 4.11.9    True        False         False      5m1s    
network                                    4.11.9    True        False         False      131m    
node-tuning                                4.11.9    True        False         False      128m    
openshift-apiserver                        4.11.9    True        False         False      23s     
openshift-controller-manager               4.11.9    True        False         False      118m    
openshift-samples                          4.11.9    True        False         False      122m    
operator-lifecycle-manager                 4.11.9    True        False         False      128m    
operator-lifecycle-manager-catalog         4.11.9    True        False         False      128m    
operator-lifecycle-manager-packageserver   4.11.9    True        False         False      2m43s   
service-ca                                 4.11.9    True        False         False      129m    
storage                                    4.11.9    True        False         False      69m     
liuhuali@Lius-MacBook-Pro huali-test %  

Actual results:

Cluster is broken

CMA is complaining,
 message: '0/1 nodes are available: 1 node(s) didn''t have free ports for the requested
      pod ports. preemption: 0/1 nodes are available: 1 node(s) didn''t have free
      ports for the requested pod ports.'

Expected results:

Cluster should be healthy

Additional info:

Talked with dev here https://coreos.slack.com/archives/GE2HQ9QP4/p1666178083034159?thread_ts=1666176493.224399&cid=GE2HQ9QP4

Must-Gather https://drive.google.com/file/d/1Q7Ddnhbg3Cq4ptBA2ycJnGKK01As1JcF/view?usp=sharing 

If TechPreview is enabled during installation on a single-node cluster, the cluster installation fails.

This bug is a backport clone of [Bugzilla Bug 1948666](https://bugzilla.redhat.com/show_bug.cgi?id=1948666). The following is the description of the original bug:

Description of problem:

When users try to deploy an application from the git method on the dev console, it throws a warning message for specific public repos: `URL is valid but cannot be reached. If this is a private repository, enter a source secret in Advanced Git Options.` If we ignore the warning and go ahead, the build succeeds, so the warning message is misleading.

Actual results:
Getting a warning for the URL while trying to deploy an application from a public repo via the git method on the dev console

Expected results:
It should show validated

This is a clone of issue OCPBUGS-13138. The following is the description of the original issue:

This is a clone of issue OCPBUGS-12951. The following is the description of the original issue:

Description of problem:

4.13.0-RC.6 enters 'Cluster status: error' while trying to install a cluster with the agent-based installer.
After the read-disk stage, the cluster status turns to "error".

Version-Release number of selected component (if applicable):


How reproducible:

Create an image with the attached install-config and agent-config files and boot the node with this image

Steps to Reproduce:

1. Create an image with the attached install-config and agent-config files and boot the node with this image

Actual results:

Cluster status: error

Expected results:

Should continue with cluster status: installing 

Additional info:


Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Currently, we have this validation https://github.com/openshift/installer/blob/master/pkg/asset/agent/installconfig_test.go#L103 which checks that if the platform is none, then the number of control planes must be 1 and the number of workers must be zero.

We need another validation for the reverse direction: if the number of control planes is 1 and workers are zero, then in install-config.yaml the platform can only be set to none, and in agent-cluster-install.yaml the platformType should only be set to none. If we try to do SNO (i.e. control planes is 1 and workers are zero) with e.g. platform: baremetal, then assisted will reject it, so we should catch it as early as possible. A sketch of the expected combination is shown below.
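A hedged sketch of the combination this validation should require (abridged install-config.yaml with hypothetical names; fields such as baseDomain and pullSecret are omitted):

controlPlane:
  name: master
  replicas: 1
compute:
- name: worker
  replicas: 0
platform:
  none: {}

with platformType: None set correspondingly in agent-cluster-install.yaml.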

The 4.12 builds fail all the time. The last successful build was from May 31.

Error:

# Root Suite.Entire pipeline flow from Builder page "before all" hook for "Background Steps"
AssertionError: Timed out retrying after 80000ms: Expected to find element: `[data-test-id="PipelineResource"]`, but never found it.

Full error:

  Running:  e2e/pipeline-ci.feature                                                         (1 of 1)
Couldn't determine Mocha version


  Logging in as kubeadmin
      Installing operator: "Red Hat OpenShift Pipelines"
      Operator Red Hat OpenShift Pipelines was not yet installed.
      Performing Pipelines post-installation steps
      Verify the CRD's for the "Red Hat OpenShift Pipelines"
  1) "before all" hook for "Background Steps"
      Deleting "" namespace

  0 passing (3m)
  1 failing

  1) Entire pipeline flow from Builder page
       "before all" hook for "Background Steps":
     AssertionError: Timed out retrying after 80000ms: Expected to find element: `[data-test-id="PipelineResource"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping all of the remaining tests.
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.waitForCRDs (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17156:77)
      at performPostInstallationSteps (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17242:21)
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.verifyAndInstallOperator (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17268:5)
      at ../../dev-console/integration-tests/support/pages/functions/installOperatorOnCluster.ts.exports.verifyAndInstallPipelinesOperator (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:17272:13)
      at Context.eval (https://console-openshift-console.apps.ci-op-issiwkzy-bc347.XXXXXXXXXXXXXXXXXXXXXX/__cypress/tests?p=support/commands/index.ts:20848:13)



[mochawesome] Report JSON saved to /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress_report_pipelines.json


  (Results)

  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ Tests:        13                                                                               │
  │ Passing:      0                                                                                │
  │ Failing:      1                                                                                │
  │ Pending:      0                                                                                │
  │ Skipped:      12                                                                               │
  │ Screenshots:  1                                                                                │
  │ Video:        true                                                                             │
  │ Duration:     2 minutes, 58 seconds                                                            │
  │ Spec Ran:     e2e/pipeline-ci.feature                                                          │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘


  (Screenshots)

  -  /go/src/github.com/openshift/console/frontend/gui_test_screenshots/cypress/scree     (1280x720)
     nshots/e2e/pipeline-ci.feature/Background Steps -- before all hook (failed).png                


  (Video)

  -  Started processing:  Compressing to 32 CRF                                                     
  -  Finished processing: /go/src/github.com/openshift/console/frontend/gui_test_scre   (16 seconds)
                          enshots/cypress/videos/e2e/pipeline-ci.feature.mp4                        

    Compression progress:  100%

====================================================================================================

  (Run Finished)


       Spec                                              Tests  Passing  Failing  Pending  Skipped  
  ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
  │ ✖  e2e/pipeline-ci.feature                  02:58       13        -        1        -       12 │
  └────────────────────────────────────────────────────────────────────────────────────────────────┘
    ✖  1 of 1 failed (100%)                     02:58       13        -        1        -       12  

See also

  1. https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-console-release-4.12-e2e-gcp-console
  2. https://search.ci.openshift.org/?search=Expected+to+find+element&maxAge=336h&context=1&type=all&name=pull-ci-openshift-console-release-4.12-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job (not exact match, but couldn't create a better filter)

Failures like:

$ oc login --token=...

Logged into "https://api..." as "..." using the token provided.

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)

break login, which tries to gather information before saving the configuration, including a giant project list.

Ideally login would be able to save the successful login credentials, even when the informative gathering had difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use-cases where the context was not needed.

This is a clone of issue OCPBUGS-4724. The following is the description of the original issue:

Description of problem: Installing OCP4.12 on top of Openstack 16.1 following the multi-availabilityZone installation creates a cluster where the egressIP annotations ("cloud.network.openshift.io/egress-ipconfig") are created with an empty value for the workers:

$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-kncvv-master-0         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-1         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-2         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-worker-0-qxr5g   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-1-bmvvv   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-2-pbgww   Ready    worker                 8h    v1.25.4+86bd4ff
$ oc get node ostest-kncvv-worker-0-qxr5g -o json | jq -r '.metadata.annotations' 
{
  "alpha.kubernetes.io/provided-node-ip": "10.196.2.156",
  "cloud.network.openshift.io/egress-ipconfig": "null",
  "csi.volume.kubernetes.io/nodeid": "{\"cinder.csi.openstack.org\":\"8327aef0-c6a7-4bf6-8f8f-d25c9abd9bce\",\"manila.csi.openstack.org\":\"ostest-kncvv-worker-0-qxr5g\"}",
  "k8s.ovn.org/host-addresses": "[\"10.196.2.156\",\"172.17.5.154\"]",
  "k8s.ovn.org/l3-gateway-config": "{\"default\":{\"mode\":\"shared\",\"interface-id\":\"br-ex_ostest-kncvv-worker-0-qxr5g\",\"mac-address\":\"fa:16:3e:7e:b5:70\",\"ip-addresses\":[\"10.196.2.156/16\"],\"ip-address\":\"10.196.2.156/16\",\"next-hops\":[\"10.196.0.1\"],\"next-hop\":\"10.196.0.1\",\"node-port-enable\":\"true\",\"vlan-id\":\"0\"}}",
  "k8s.ovn.org/node-chassis-id": "fd777b73-aa64-4fa5-b0b1-70c3bebc2ac6",
  "k8s.ovn.org/node-gateway-router-lrp-ifaddr": "{\"ipv4\":\"100.64.0.6/16\"}",
  "k8s.ovn.org/node-mgmt-port-mac-address": "42:e8:4f:42:9f:7d",
  "k8s.ovn.org/node-primary-ifaddr": "{\"ipv4\":\"10.196.2.156/16\"}",
  "k8s.ovn.org/node-subnets": "{\"default\":\"10.128.2.0/23\"}",
  "machine.openshift.io/machine": "openshift-machine-api/ostest-kncvv-worker-0-qxr5g",
  "machineconfiguration.openshift.io/controlPlaneTopology": "HighlyAvailable",
  "machineconfiguration.openshift.io/currentConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/lastAppliedDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/reason": "",
  "machineconfiguration.openshift.io/state": "Done",
  "volumes.kubernetes.io/controller-managed-attach-detach": "true"
}

Furthermore, Below is observed on openshift-cloud-network-config-controller:

$ oc logs -n openshift-cloud-network-config-controller          cloud-network-config-controller-5fcdb6fcff-6sddj | grep egress
I1212 00:34:14.498298       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-pbgww
I1212 00:34:15.777129       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-qxr5g
I1212 00:38:13.115115       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-bmvvv
I1212 01:58:54.414916       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-drd5l
I1212 02:01:03.312655       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-h976w
I1212 02:04:11.656408       1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-zxwrv

Version-Release number of selected component (if applicable):

RHOS-16.1-RHEL-8-20221206.n.1
4.12.0-0.nightly-2022-12-09-063749

How reproducible:

Always

Steps to Reproduce:

1. Run AZ job on D/S CI (Openshift on Openstack QE CI)
2. Run conformance/serial tests

Actual results:

conformance/serial TCs are failing because they do not find the egressIP annotation on the workers

Expected results:

Tests passing

Additional info:

Links provided on private comment.

This is a clone of issue OCPBUGS-11442. The following is the description of the original issue:

Description of problem:

Currently, HyperShift squashes any user-configured proxy configuration, based on these lines: https://github.com/openshift/hypershift/blob/main/support/globalconfig/proxy.go#L21-L28, https://github.com/openshift/hypershift/blob/release-4.11/control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go#L487-L493. Because of this, any user changes to the cluster-wide proxy configuration documented here: https://docs.openshift.com/container-platform/4.12/networking/enable-cluster-wide-proxy.html are squashed and remain valid for no more than a few seconds. That blocks some functionality in the OpenShift cluster from working, including application builds from the OpenShift samples provided in the cluster. An example of the kind of edit that gets squashed is shown below.
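For example (hypothetical proxy endpoints; the object name and fields follow the cluster-wide proxy documentation linked above):

apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .cluster.local,.svc,10.0.0.0/16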

 

Version-Release number of selected component (if applicable):

4.13 4.12 4.11

How reproducible:

100%

Steps to Reproduce:

1. Make a change to the Proxy object in the cluster with kubectl edit proxy cluster
2. Save the change
3. Wait a few seconds

Actual results:

HostedClusterConfig operator will go in and squash the value

Expected results:

The value the user provides remains in the configuration and is not squashed to an empty value

Additional info:

 

Description of problem:

container_network* metrics stop reporting after a container restarts. Other container_* metrics continue to report for the same pod. 

How reproducible:

Issue can be reproduced by triggering a container restart 

Steps to Reproduce:

1.Restart container 
2.Check metrics and see container_network* not reporting

Additional info:
Ticket with more detailed debugging process OHSS-16739
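For example, a query like the following (hypothetical namespace and pod selector) goes stale for the restarted container, while other container_* series for the same pod keep reporting:

rate(container_network_receive_bytes_total{namespace="<namespace>", pod="<pod-name>"}[5m])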

Due to OVN's usage of learn flows for ECMP symmetric reply handling, benchmark tools that start many short-lived connections can overload OVS.

Instead, OVN can use the kernel's CT functionality to determine the reply next-hop without adding learn flows. This greatly reduces OVS CPU usage in this scenario and allows higher throughput.

Upstream OVN patch:
https://patchwork.ozlabs.org/project/ovn/patch/20230901105557.970938-1-dceara@redhat.com/

These downstream builds and later include the patch:
ovn23.09-23.09.0-beta.31.el9fdp
ovn23.06-23.06.1-8.el9fdp
ovn23.03-23.03.1-10.el9fdp
ovn22.12-22.12.1-8.el8fdp

This is a clone of issue OCPBUGS-7438. The following is the description of the original issue:

Description of problem:

The egress service nodeSelector parsing does not take into account invalid values that cause errors (such as "name part must consist of alphanumeric characters"), and the controller does not handle them gracefully given a bad input. When a bad input is given, it should log an error and ignore the service.

 

Version-Release number of selected component (if applicable):

 

How reproducible:

Create an egress service with a bad nodeSelector:
"{"nodeSelector":{"matchLabels":{"a:b": "c&"}}}"

The ovnkube-master controller does not handle it gracefully.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The current version of OpenShift's coredns is based on Kubernetes 1.24 packages. OpenShift 4.12 is based on Kubernetes 1.25.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/coredns/blob/release-4.12/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.

Expected results:

Kubernetes packages are at version v0.25.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.

This is a clone of issue OCPBUGS-1627. The following is the description of the original issue:

Description of problem:
Two issues when setting a user-defined folder in failureDomain.
1. The installer gets an error when the folder is set to the path of a user-defined folder in failureDomain.

failureDomains setting in install-config.yaml:

    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-1
        folder: /IBMCloud/vm/qe-jima
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-2
        folder: /IBMCloud/vm/qe-jima
    - name: us-east-3
      region: us-east
      zone: us-east-3a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR
        folder: /IBMCloud/vm/qe-jima
    - name: us-west-1
      region: us-west
      zone: us-west-1a
      server: ibmvcenter.vmc-ci.devcluster.openshift.com
      topology:
        datacenter: datacenter-2
        computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR

Error message in terraform after completing ova image import:

DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m40s elapsed] 
DEBUG vsphereprivate_import_ova.import[3]: Creation complete after 1m40s [id=vm-367860] 
DEBUG vsphereprivate_import_ova.import[1]: Creation complete after 1m49s [id=vm-367863] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m50s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [1m50s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [2m0s elapsed] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m0s elapsed] 
DEBUG vsphereprivate_import_ova.import[2]: Creation complete after 2m2s [id=vm-367862] 
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m10s elapsed] 
DEBUG vsphereprivate_import_ova.import[0]: Creation complete after 2m20s [id=vm-367861] 
DEBUG data.vsphere_virtual_machine.template[0]: Reading... 
DEBUG data.vsphere_virtual_machine.template[3]: Reading... 
DEBUG data.vsphere_virtual_machine.template[1]: Reading... 
DEBUG data.vsphere_virtual_machine.template[2]: Reading... 
DEBUG data.vsphere_virtual_machine.template[3]: Read complete after 1s [id=42054e33-85d6-e310-7f4f-4c52a73f8338] 
DEBUG data.vsphere_virtual_machine.template[1]: Read complete after 2s [id=42053e17-cc74-7c89-f5d1-059c9030ecc7] 
DEBUG data.vsphere_virtual_machine.template[2]: Read complete after 2s [id=4205019f-26d8-f9b4-ac0c-2c073fd70b35] 
DEBUG data.vsphere_virtual_machine.template[0]: Read complete after 2s [id=4205eaf2-c727-c647-ad44-bd9ad7023c56] 
ERROR                                              
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found 
ERROR                                              
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"], 
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder": 
ERROR   61: resource "vsphere_folder" "folder" {   
ERROR                                              
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "pre-bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1 
ERROR                                              
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found 
ERROR                                              
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"], 
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder": 
ERROR   61: resource "vsphere_folder" "folder" {   
ERROR                                              
ERROR   

2. The installer panics when folder is set to a bare user-defined folder name in failure domains.

failure domain in install-config.yaml

    failureDomains:
    - name: us-east-1
      region: us-east
      zone: us-east-1a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-1
        folder: qe-jima
    - name: us-east-2
      region: us-east
      zone: us-east-2a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
        networks:
        - multi-zone-qe-dev-1
        datastore: multi-zone-ds-2
        folder: qe-jima
    - name: us-east-3
      region: us-east
      zone: us-east-3a
      server: xxx
      topology:
        datacenter: IBMCloud
        computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR
        folder: qe-jima
    - name: us-west-1
      region: us-west
      zone: us-west-1a
      server: xxx
      topology:
        datacenter: datacenter-2
        computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
        networks:
        - multi-zone-qe-dev-1
        datastore: workload_share_vcsmdcncworkload3_joYiR                                  

panic error message in installer:

INFO Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/releases/rhcos-4.12/412.86.202208101039-0/x86_64/rhcos-412.86.202208101039-0-vmware.x86_64.ova?sha256=' 
INFO The file was found in cache: /home/user/.cache/openshift-installer/image_cache/rhcos-412.86.202208101039-0-vmware.x86_64.ova. Reusing... 
panic: runtime error: index out of range [1] with length 1goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/vsphere.TFVars({{0xc0013bd068, 0x3, 0x3}, {0xc000b11dd0, 0x12}, {0xc000b11db8, 0x14}, {0xc000b11d28, 0x14}, {0xc000fe8fc0, ...}, ...})
    /go/src/github.com/openshift/installer/pkg/tfvars/vsphere/vsphere.go:79 +0x61b
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1d1ed360, 0x5?)
    /go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:847 +0x4798
 

Based on the explanation of the folder field, a bare folder name should be acceptable. If using a folder name is not allowed, the installer needs to validate the folder and update the explain output.

 

sh-4.4$ ./openshift-install explain installconfig.platform.vsphere.failureDomains.topology.folder
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  folder is the name or inventory path of the folder in which the virtual machine is created/located.
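
A hypothetical Go sketch of how the installer could normalize the folder value so that both a bare name and a full inventory path work, avoiding the doubled "/IBMCloud/vm//IBMCloud/vm" path and the index-out-of-range panic above (an illustration only, not the actual openshift-install code):

package main

import (
    "fmt"
    "path"
    "strings"
)

// resolveFolder accepts either a full inventory path ("/IBMCloud/vm/qe-jima")
// or a bare folder name ("qe-jima") and normalizes it relative to the
// datacenter's vm folder.
func resolveFolder(datacenter, folder string) string {
    if strings.HasPrefix(folder, "/") {
        return folder // already an inventory path; do not prepend anything
    }
    return path.Join("/", datacenter, "vm", folder)
}

func main() {
    fmt.Println(resolveFolder("IBMCloud", "qe-jima"))              // /IBMCloud/vm/qe-jima
    fmt.Println(resolveFolder("IBMCloud", "/IBMCloud/vm/qe-jima")) // unchanged
}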
 

 

 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-095559

How reproducible:

always

Steps to Reproduce:

see description

Actual results:

installation has errors when a user-defined folder is set

Expected results:

installation is successful when a user-defined folder is set

Additional info:

 

This is a clone of issue OCPBUGS-3924. The following is the description of the original issue:

The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. We'll need the components setting manifests for these deprecated APIs to move to modern APIs. And then we should drop our ability to reconcile the deprecated APIs, to avoid having other components leak back into using them.

Specifically cluster-monitoring-operator touches:

Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times

Full output of the test at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:

[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4

This is required to unblock https://github.com/openshift/origin/pull/27561
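
For components that still list HPAs through the removed group, a minimal client-go sketch of the autoscaling/v2 replacement (assuming in-cluster config and the openshift-monitoring namespace purely for illustration; this is not the cluster-monitoring-operator's actual code):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// listHPAs lists HorizontalPodAutoscalers via autoscaling/v2 instead of the
// removed autoscaling/v2beta2 group.
func listHPAs() error {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }
    hpas, err := cs.AutoscalingV2().HorizontalPodAutoscalers("openshift-monitoring").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return err
    }
    fmt.Printf("found %d HPAs\n", len(hpas.Items))
    return nil
}

func main() {
    if err := listHPAs(); err != nil {
        fmt.Println(err)
    }
}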

This is a clone of issue OCPBUGS-2579. The following is the description of the original issue:

On disabling the helm and import-from-samples actions in customization, Helm Charts and Samples options are still enabled in topology add actions.

Under 

spec:
    customization:
        addPage:
           disabledActions:

Insert snippet of Add page actions. (attached screenshot for reference)

Actual result:

Helm Charts and Samples options are still enabled in topology add actions even after disabling them in customization

Expected result:

Helm Charts and Samples options should be disabled(hidden)

Description of problem:

The current version of openshift/cluster-dns-operator vendors Kubernetes 1.24 packages.  OpenShift 4.12 is based on Kubernetes 1.25.  

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/cluster-dns-operator/blob/release-4.12/go.mod  

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.

Expected results:

Kubernetes packages are at version v0.25.0 or later.

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.

Description of problem:

We have ODF bug for it here: https://bugzilla.redhat.com/show_bug.cgi?id=2169779

Discussed in formu-storage with Hemant here:
https://redhat-internal.slack.com/archives/CBQHQFU0N/p1677085216391669

And asked to open bug for it.

This is currently blocking ODF 4.13 deployment over vSphere.

Version-Release number of selected component (if applicable):

 

How reproducible:

YES

Steps to Reproduce:

1. Deploy ODF 4.13 on vSphere with `thin-csi` SC
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4997. The following is the description of the original issue:

The fix for OCPBUGS-3382 ensures that we pass the proxy settings from the install-config through to the final cluster. However, nothing in the agent ISO itself uses proxy settings (at least until bootstrapping starts).

It is probably less likely for the agent-based installer that proxies will be needed than e.g. for assisted (where agents running on-prem need to call back to assisted-service in the cloud), but we should be consistent about using any proxy config provided. There may certainly be cases where the registry is only reachable via a proxy.

This can be easily set system-wide by configuring default environment variables in the systemd config. An example (from the bootstrap ignition) is: https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/etc/systemd/system.conf.d/10-default-env.conf.template
Note that currently the agent service explicitly overrides these environment variables to be empty, so that will have to be cleared.

This is a clone of issue OCPBUGS-7617. The following is the description of the original issue:

Description of problem:

Azure Disk volume is taking time to attach/detach
Version-Release number of selected component (if applicable):

Openshift ARO 4.10.30
How reproducible:

While performing scale-down and scale-up of a StatefulSet, the pod takes time to attach and detach the volume from nodes.

Reviewed the must-gather and test output; will share my findings in comments.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Name: DNS
Description: Please change the "DNS" component to be a subcomponent "DNS" of the "Networking" component.

Component: change to "Networking".
Subcomponent: change to "DNS".

Existing fields (default assignee, default QA contact, default CC email list, etc.) should remain the same as they currently are.
Default Assignee: aos-network-edge-staff@bot.bugzilla.redhat.com
Default QA Contact: hongli@redhat.com
Default CC List: aos-network-edge-staff@bot.bugzilla.redhat.com
Additional Notes:
I filled in "Default CC email list" because the form validation would not permit me to omit it. However, it can be left empty in Bugzilla (it is currently empty).

If possible, we would like this change to be done prior to the Bugzilla-to-Jira migration to avoid the need to make the change after the migration.

This is a clone of issue OCPBUGS-14644. The following is the description of the original issue:

Description of problem:

Trying to run must-gather in production, I encountered these errors:

Error running must-gather collection:
    gather did not start for pod must-gather-sbrwj: timed out waiting for the condition
Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:164","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 15m30s timeout","severity":"error","time":"2023-06-06T17:42:20Z"}
error running backup collection: errors occurred while gathering data:
    [skipping gathering secrets/support due to error: secrets "support" not found, skipping gathering sharedconfigmaps.sharedresource.openshift.io due to error: the server doesn't have a resource type "sharedconfigmaps", skipping gathering sharedsecrets.sharedresource.openshift.io due to error: the server doesn't have a resource type "sharedsecrets"]
error: gather did not start for pod must-gather-sbrwj: timed out waiting for the condition
{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:251","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 15s grace period","severity":"error","time":"2023-06-06T17:42:25Z"}
{"component":"entrypoint","error":"process timed out","file":"k8s.io/test-infra/prow/entrypoint/run.go:79","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2023-06-06T17:42:25Z"}
error: failed to execute wrapped command: exit status 127 
INFO[2023-06-06T17:42:26Z] Step XXXXXX-perfscale-ci-tests-rosa-hypershift-cluster-density-v2-gather-must-gather failed after 15m43s. 
INFO[2023-06-06T17:42:26Z] Running step XXXXXX-perfscale-ci-tests-rosa-hypershift-cluster-density-v2-gather-extra. 

followed by.. 

error: the server doesn't have a resource type "machineconfigpools"
...
error: the server doesn't have a resource type "machineconfigs"
...
INFO: gathering the audit logs for each master
error: the server doesn't have a resource type "machineconfigs"
error: the server doesn't have a resource type "machinesets"
error: the server doesn't have a resource type "machines"
error: the server doesn't have a resource type "machineconfigpools"
error: the server doesn't have a resource type "machines"
error: the server doesn't have a resource type "machinesets"
error: the server doesn't have a resource type "controlplanemachinesets"
error: the server doesn't have a resource type "controlplanemachinesets"
/logs/artifacts/network/multus_logs /

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Run must-gather on a Rosa/Hypershift cluster 
2. See the errors reported
3.

Actual results:

Must-gather errors out.

Expected results:

Must-gather should run successfully.

Additional info:

Ran into the issue when testing in production, but it applies to all of 4.12.

This is a clone of issue OCPBUGS-18677. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18608. The following is the description of the original issue:

Description of problem:

UPSTREAM: <carry>: Force using host go always and use host libraries introduced a build failure for the Windows kubelet that is showing up only in release-4.11 for an unknown reason but could potentially occur on other releases too.

Version-Release number of selected component (if applicable):

WMCO version: 9.0.0 and below
 

How reproducible:

Always on release-4.11
 

Steps to Reproduce:

1. Clone the WMCO repo
2. Build the WMCO image

Actual results:

WMCO image build fails

Expected results:

 WMCO image build should succeed

Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".

But the permission section shows nothing when open this second expandable section, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or BitBucket URL.

Version-Release number of selected component (if applicable):
4.11-4.13

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines operator
  2. Navigate to the Developer perspective > Pipelines
  3. Press "Create" and select "Repository"
  4. Click on "Show configuration options"
  5. Click on "See Git permissions"
  6. Click on "Read more about setting up webhook"

Actual results:

  1. The Git permission section shows no git permissions.
  2. The Read more link doesn't open any new page.

Expected results:

  1. The Git permission section should show some info or must not be disabled.
  2. The Read more link should open a page or must not be displayed as well.

Additional info:

  1. None

Copied from an upstream issue: https://github.com/operator-framework/operator-lifecycle-manager/issues/2830

What did you do?

When attempting to reinstall an operator that uses conversion webhooks by

  • Deleting the operator subscription and any CSVs associated with it
  • Recreating the operator subscription

The resulting InstallPlan enters a failed state with message similar to

error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found

When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.

This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml

What did you expect to see?
Operator is able to be reinstalled without removing CRDs and all instances.

What did you see instead? Under which circumstances?
It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)

Environment

operator-lifecycle-manager version: 4.10.24

Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)

Kubernetes cluster kind: OpenShift

Manoj noticed that the cluster registration fails for SNO clusters when the network type is set to OpenShiftSDN. We should add some validation to prevent this combination.

Failed to register cluster with assisted-service: AssistedServiceError Code: 400 Href: ID: 400 Kind: Error Reason: OpenShiftSDN network type is not allowed in single node mode

Documentation also indicates OpenShiftSDN is not compatible: https://docs.openshift.com/container-platform/4.11/installing/installing_sno/install-sno-preparing-to-install-sno.html
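
A minimal sketch of the validation being requested, with hypothetical names (this is not assisted-service or installer code):

package main

import "fmt"

// validateNetworkType rejects the OpenShiftSDN network type for single-node
// (one control plane replica) clusters.
func validateNetworkType(controlPlaneReplicas int, networkType string) error {
    if controlPlaneReplicas == 1 && networkType == "OpenShiftSDN" {
        return fmt.Errorf("networkType OpenShiftSDN is not allowed in single node mode; use OVNKubernetes")
    }
    return nil
}

func main() {
    fmt.Println(validateNetworkType(1, "OpenShiftSDN"))
}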

This is a clone of issue OCPBUGS-13731. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8271. The following is the description of the original issue:

Description of problem:

The cluster-policy-controller container of the kube-controller-manager pod shows unusual error logs, such as:

I0214 10:49:34.698154       1 interface.go:71] Couldn't find informer for template.openshift.io/v1, Resource=templateinstances
I0214 10:49:34.698159       1 resource_quota_monitor.go:185] QuotaMonitor unable to use a shared informer for resource "template.openshift.io/v1, Resource=templateinstances": no informer found for template.openshift.io/v1, Resource=templateinstances

Version-Release number of selected component (if applicable):

 

How reproducible:

When the cluster-policy-controller restarts, you will see these logs.

Steps to Reproduce:

1.oc logs kube-controller-manager-master0 -n openshift-kube-controller-manager -c cluster-policy-controller  

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:

Description of problem:

While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has an ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the upgrade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Version-Release number of selected component (if applicable):

4.11+ openshift-tests

How reproducible:

nondeterministic, wild guess is ~30% of upgrade jobs

Steps to Reproduce:

1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test

Actual results:

Jan 23 00:47:43.842: INFO: Admin Ack verified
Jan 23 00:57:43.836: INFO: Admin Ack verified
Jan 23 01:07:43.839: INFO: Admin Ack verified
Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8

No check done after the upgrade

Expected results:

Jan 23 00:57:37.894: INFO: Admin Ack verified
Jan 23 01:07:37.894: INFO: Admin Ack verified
Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d
Jan 23 01:17:57.937: INFO: Admin Ack verified

One or more checks done after upgrade
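
A minimal Go sketch of the polling pattern that would guarantee at least one check after the upgrade completes, regardless of the 10-minute tick timing (names are hypothetical, not the openshift-tests implementation):

package main

import (
    "context"
    "fmt"
    "time"
)

// pollUntilDone runs check on every tick and, crucially, runs one final check
// when done fires, so a post-upgrade check always happens before returning.
func pollUntilDone(ctx context.Context, interval time.Duration, done <-chan struct{}, check func()) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            check()
        case <-done:
            check() // guaranteed check after the upgrade completed
            return
        case <-ctx.Done():
            return
        }
    }
}

func main() {
    done := make(chan struct{})
    go func() { time.Sleep(150 * time.Millisecond); close(done) }()
    pollUntilDone(context.Background(), 100*time.Millisecond, done, func() { fmt.Println("Admin Ack verified") })
}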

This is a clone of issue OCPBUGS-6621. The following is the description of the original issue:

Description of problem:

Image registry pods panic while deploying OCP in ap-southeast-4 AWS region

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Deploy OCP in AWS ap-southeast-4 region

Steps to Reproduce:

Deploy OCP in AWS ap-southeast-4 region 

Actual results:

panic: Invalid region provided: ap-southeast-4

Expected results:

Image registry pods should come up with no errors

Additional info:

 

This is a clone of issue OCPBUGS-15892. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15835. The following is the description of the original issue:

Description of problem:

https://search.ci.openshift.org/?search=error%3A+tag+latest+failed%3A+Internal+error+occurred%3A+registry.centos.org&maxAge=48h&context=1&type=build-log&name=okd&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

all currently tested versions

How reproducible:

~ 9% of jobs fail on this test

 

 ! error: Import failed (InternalError): Internal error occurred: registry.centos.org/dotnet/dotnet-31-runtime-centos7:latest: Get "https://registry.centos.org/v2/": dial tcp: lookup registry.centos.org on 172.30.0.10:53: no such host   782 31 minutes ago 

 

This is a clone of issue OCPBUGS-2088. The following is the description of the original issue:

The rendezvous host must be one of the control plane nodes.

The user has control over the rendezvous IP in the agent-config. In the case where they also provide NMState config with static IPs, we are able to verify whether the rendezvous IP points to a control plane node or not. If it does not, we should fail.
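
A minimal sketch of the proposed check, assuming the control plane hosts' static IPs have already been extracted from the NMState config (names are hypothetical):

package main

import "fmt"

// validateRendezvousIP fails when the rendezvous IP does not belong to any
// control plane host.
func validateRendezvousIP(rendezvousIP string, controlPlaneIPs []string) error {
    for _, ip := range controlPlaneIPs {
        if ip == rendezvousIP {
            return nil
        }
    }
    return fmt.Errorf("rendezvousIP %s does not match any control plane host", rendezvousIP)
}

func main() {
    fmt.Println(validateRendezvousIP("192.0.2.10", []string{"192.0.2.20", "192.0.2.21"}))
}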

Description of problem:

NPE on topology when creating a k8s Service and a KSVC that has no metadata in its template

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. create a KSVC from admin -> serving -> create service
2. create a k8s svc from search service (create)

Actual results:

topology breaks (see attached screenshot)

Expected results:

topology shouldn't break

Additional info:

This is a clone of issue OCPBUGS-4950. The following is the description of the original issue:

Description of problem:

A PR bumping OLM's k8s dependencies to 1.25 wasn't merged into openshift 4.12

Version-Release number of selected component (if applicable):

openshift-4.12

How reproducible:

Always

Steps to Reproduce:

1. Check OLM's repository for k8s dependencies in the 4.12 branch

Actual results:

Has 1.24 k8s dependencies

Expected results:

Has 1.25 k8s dependencies

Additional info:

 

 

In many cases, the /dev/disk/by-path symlink is the only way to stably identify a disk without having prior knowledge of the hardware from some external source (e.g. a spreadsheet of disk serial numbers). It should be possible to specify this path in the root device hints.
This is fixed by the first commit in the upstream Metal³ PR https://github.com/metal3-io/baremetal-operator/pull/1264

Description of problem:

The platform-operators-aggregated cluster operator wasn't created after enabling "TechPreviewNoUpgrade" featureGate, as follows,

MacBook-Pro:~ jianzhang$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge
featuregate.config.openshift.io/cluster patched

MacBook-Pro:~ jianzhang$ oc wait --for=condition=Available=True clusteroperators.config.openshift.io/platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-095559

How reproducible:

always

Steps to Reproduce:

1. Install OCP 4.12 cluster.

2. Enable "TechPreviewNoUpgrade" feature gate.
MacBook-Pro:~ jianzhang$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge
featuregate.config.openshift.io/cluster patched 

3. Check platform-operators-aggregated cluster operator.
 

Actual results:

MacBook-Pro:~ jianzhang$ oc wait --for=condition=Available=True clusteroperators.config.openshift.io/platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found

Expected results:

The platform-operators-aggregated cluster operator can be created successfully.

Additional info:

The openshift-platform-operators pods are running well:

MacBook-Pro:~ jianzhang$ oc get deploy -n openshift-platform-operators
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
platform-operators-controller-manager   1/1     1            1           126m
platform-operators-rukpak-core          1/1     1            1           126m
platform-operators-rukpak-webhooks      2/2     2            2           126m
MacBook-Pro:~ jianzhang$ oc get co platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found

Description of problem:

OVN-Kubernetes master is crashing during upgrade from 4.11.5 to 4.11.6

Version-Release number of selected component (if applicable):

4.11.5 to 4.11.6
cannot clean up egress default deny ACL name: cannot update old NetworkPolicy ACLs for namespace ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6: error in transact with ops [{Op:update Table:ACL Row:map[action:drop direction:from-lport external_ids:{GoMap:map[default-deny-policy-type:Egress]} log:false match:inport == @a12995145443578534523_egressDefaultDeny meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6_egressDefaultDeny]} options:{GoMap:map[apply-after-lb:true]} priority:1000 severity:{GoSet:[info]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {5277db54-dd96-4c4d-bbed-99142cab91e7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:0 Error:constraint violation Details:"ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6_egressDefaultDeny" length 65 is greater than maximum allowed length 63 UUID:{GoUUID:} Rows:[]}] and errors 
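
The constraint violation is OVN's 63-character limit on ACL names; the generated "<namespace>_egressDefaultDeny" name is 65 characters here. One simple way to stay within the limit while keeping names unique is to truncate and append a short hash, sketched below (an illustration of the constraint, not the actual ovn-kubernetes fix):

package main

import (
    "crypto/sha256"
    "fmt"
)

// ovnACLNameMaxLen is the OVN limit the constraint violation above refers to.
const ovnACLNameMaxLen = 63

// clampACLName truncates over-long ACL names and appends a short hash suffix
// so distinct long names stay distinct.
func clampACLName(name string) string {
    if len(name) <= ovnACLNameMaxLen {
        return name
    }
    suffix := fmt.Sprintf("-%x", sha256.Sum256([]byte(name)))[:9] // "-" plus 8 hex chars
    return name[:ovnACLNameMaxLen-len(suffix)] + suffix
}

func main() {
    long := "ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6_egressDefaultDeny"
    fmt.Println(len(clampACLName(long)), clampACLName(long))
}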


Description of problem:

If a customer creates a machine with a networks section like this

networks:
- filter: {}
  noAllowedAddressPairs: false
  subnets:
  - filter: {}
    uuid: primary-subnet-uuid
- filter: {}
  noAllowedAddressPairs: true
  subnets:
  - filter: {}
    uuid: other-subnet-uuid
primarySubnet: primary-subnet-uuid

Then all the ports are created without the allowed address pairs.

Doing some research in the source code, I have found that:
- For each entry on the networks: section, networks are filtered as per its filter: section[1]
- Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above[2], 2 things are done that are relevant for this situation:
  - The net ID is saved on a netsWithoutAllowedAddressPairs[3]. That map is later checked while creating any port[4].
  - For each subnet entry that matches the network ID, a port is created[5].

So, the problematic behavior happens due to the following:

- Both entries in the networks array have empty filters. This means that both entries selected all the neutron networks.
- This configuration results in one port per subnet as expected because, in the later traversal of the subnets array of each entry[5], it is filtering by subnet and creating a single port as expected.
- However, the entry with "noAllowedAddressPairs: true" is selecting all the neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map[3], regardless of the subnets filtering.
- As all the networks are in the netsWithoutAllowedAddressPairs map, all the ports created for the VM have their allowed address pairs removed[4].

Why do we consider this behavior undesired?

I understand that, if we create a port for a network that has no allowed pairs, we create all the other ports in the same networks without the pairs. However, it is surprising that a port in a network has its allowed address pairs removed due to a setting in an entry that yielded no port on that network. In other words, one would expect the same subnet filtering that determines which ports each network entry yields for the VM to also apply to the noAllowedAddressPairs parameter.

Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

Always

Steps to Reproduce:

1. Create a machineset like in the description
2.
3.

Actual results:

All ports have no address pairs

Expected results:

Only the port on the secondary subnet has no address pairs.

Additional info:

A simple workaround would be to just fill the filter so that a single network is selected for each network entry.

References:
[1] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576
[2] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580
[3] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583
[4] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660
[5] - https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625
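
A simplified Go sketch, with hypothetical types standing in for the machineservice.go structures, of the behavior the reporter expects: the noAllowedAddressPairs bookkeeping would only cover the networks for which the entry's subnets actually yield a port, rather than every network matched by an empty filter:

package main

import "fmt"

// Hypothetical, simplified stand-ins for the provider's network/subnet params.
type SubnetParam struct{ UUID string }
type NetworkParam struct {
    NoAllowedAddressPairs bool
    Subnets               []SubnetParam
}

// netsWithoutAllowedAddressPairs records only the networks that back one of
// the entry's listed subnets, i.e. the networks that actually get a port from
// this entry.
func netsWithoutAllowedAddressPairs(entries []NetworkParam, subnetToNet map[string]string) map[string]bool {
    out := map[string]bool{}
    for _, e := range entries {
        if !e.NoAllowedAddressPairs {
            continue
        }
        for _, s := range e.Subnets {
            if netID, ok := subnetToNet[s.UUID]; ok {
                out[netID] = true
            }
        }
    }
    return out
}

func main() {
    subnetToNet := map[string]string{
        "primary-subnet-uuid": "net-a",
        "other-subnet-uuid":   "net-b",
    }
    entries := []NetworkParam{
        {NoAllowedAddressPairs: false, Subnets: []SubnetParam{{UUID: "primary-subnet-uuid"}}},
        {NoAllowedAddressPairs: true, Subnets: []SubnetParam{{UUID: "other-subnet-uuid"}}},
    }
    fmt.Println(netsWithoutAllowedAddressPairs(entries, subnetToNet)) // map[net-b:true]
}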

Description of problem:

revert "force cert rotation every couple days for development" in 4.12

We want short expiry times during development and long expiry times when we ship.

--- Additional comment from Eric Paris on 2020-04-02 19:57:29 CEST ---

This bug has been set to target the 4.5.0 release without specifying a severity. As part of triage when determining the priority of bugs a severity should be specified. Since these bugs have no been properly triaged I am removing the target release. Teams will need to add a severity before deferring these bugs again.

--- Additional comment from Michal Fojtik on 2020-05-12 12:45:25 CEST ---

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing the severity. 

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--- Additional comment from Standa Laznicka on 2020-05-12 14:53:12 CEST ---

you don't really want to close this

--- Additional comment from Stefan Schimanski on 2020-05-19 13:11:00 CEST ---

Waiting for master to open. We will fix it then on the release branch.

--- Additional comment from Stefan Schimanski on 2020-06-18 12:23:34 CEST ---

Will be done when 4.6 branches from master.

--- Additional comment from Michal Fojtik on 2020-07-09 14:46:02 CEST ---

Stefan is PTO, adding UpcomingSprint to his bugs to fulfill the duty.

--- Additional comment from Michal Fojtik on 2020-08-24 15:12:08 CEST ---

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

--- Additional comment from Michal Fojtik on 2020-08-31 15:59:33 CEST ---

This bug hasn't had any activity 7 days after it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create new bug.

--- Additional comment from Michal Fojtik on 2020-08-31 17:00:25 CEST ---

The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

--- Additional comment from Stefan Schimanski on 2020-09-11 13:00:27 CEST ---

This is waiting for Eric Paris to stop fast forwarding release-4.6 from master.

--- Additional comment from Michal Fojtik on 2020-10-30 11:12:07 CET ---

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

--- Additional comment from Nick Stielau on 2021-01-20 18:49:09 CET ---

Can we get some context on why this is blocker+?  Would we further delay the release if we don't get a fix in for this?

--- Additional comment from Stefan Schimanski on 2021-03-16 17:28:08 CET ---

--- Additional comment from Eric Paris on 2021-06-08 14:00:16 CEST ---

This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+

--- Additional comment from Michal Fojtik on 2021-06-10 10:49:36 CEST ---

This is a blocker? until we have Target Release 4.9 (it is a blocker+ for 4.9).

--- Additional comment from Wally on 2021-06-11 15:14:26 CEST ---

Setting blocker- until next week to clear reports heading to code freeze.  Will reset once 4.9 opens.

--- Additional comment from Wally on 2021-08-31 19:26:13 UTC ---

Setting blocker- until next week to clear reports heading to code freeze.  Will reset once 4.10 opens.

--- Additional comment from Michal Fojtik on 2022-02-03 21:53:15 UTC ---

** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold.
Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

--- Additional comment from Wally on 2022-02-09 20:11:25 UTC ---

Setting blocker- for now but will add reminder and keep in my queue for visibility.

--- Additional comment from Red Hat Bugzilla on 2022-05-09 08:32:21 UTC ---

Account disabled by LDAP Audit for extended failure

--- Additional comment from OpenShift Automated Release Tooling on 2022-06-24 01:06:13 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

--- Additional comment from Ke Wang on 2022-06-24 15:24:03 UTC ---

To verify the bug, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1921139#c6

--- Additional comment from OpenShift BugZilla Robot on 2022-06-25 12:40:12 UTC ---

Bugfix included in accepted release 4.11.0-0.nightly-2022-06-25-081133
Bug will not be automatically moved to VERIFIED for the following reasons:
- PR openshift/cluster-kube-apiserver-operator#1307 not approved by QA contact

This bug must now be manually moved to VERIFIED by dpunia@redhat.com

--- Additional comment from Deepak Punia on 2022-06-27 08:20:33 UTC ---

Below are the steps to verify this bug:

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator
  cluster-kube-apiserver-operator                https://github.com/openshift/cluster-kube-apiserver-operator                7764681777edfa3126981a0a1d390a6060a840a3

# git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307"
08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation

# oc get clusterversions.config.openshift.io 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         64m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133

$ cat scripts/check_secret_expiry.sh
FILE="$1"
if [ ! -f "$1" ]; then
  echo "must provide \$1" && exit 0
fi
export IFS=$'\n'
for i in `cat "$FILE"`
do
  if `echo "$i" | grep "^#" > /dev/null`; then
    continue
  fi
  NS=`echo $i | cut -d ' ' -f 1`
  SECRET=`echo $i | cut -d ' ' -f 2`
  rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null
  echo "Check cert dates of $SECRET in project $NS:"
  openssl x509 -noout --dates -in tls.crt; echo
done

$ cat certs.txt
openshift-kube-controller-manager-operator csr-signer-signer
openshift-kube-controller-manager-operator csr-signer
openshift-kube-controller-manager kube-controller-manager-client-cert-key
openshift-kube-apiserver-operator aggregator-client-signer
openshift-kube-apiserver aggregator-client
openshift-kube-apiserver external-loadbalancer-serving-certkey
openshift-kube-apiserver internal-loadbalancer-serving-certkey
openshift-kube-apiserver service-network-serving-certkey
openshift-config-managed kube-controller-manager-client-cert-key
openshift-config-managed kube-scheduler-client-cert-key
openshift-kube-scheduler kube-scheduler-client-cert-key

Checking the certs, they have one-day expiry times; this is as expected.
# ./check_secret_expiry.sh certs.txt
Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:41:38 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of csr-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:52:21 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator:
notBefore=Jun 27 04:41:37 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of aggregator-client in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT

Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:49 2022 GMT
notAfter=Jul 27 04:52:50 2022 GMT

Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:28 2022 GMT
notAfter=Jul 27 04:52:29 2022 GMT

Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT

Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
# 

# cat check_secret_expiry_within.sh
#!/usr/bin/env bash
# usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year
WITHIN=${1:-24hours}
echo "Checking validity within $WITHIN ..."
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before")  \(.metadata.annotations."auth.openshift.io/certificate-not-after")  \(.metadata.namespace)\t\(.metadata.name)"'

# ./check_secret_expiry_within.sh 1day
Checking validity within 1day ...
2022-06-27T04:41:37Z  2022-06-28T04:41:37Z  openshift-kube-apiserver-operator	aggregator-client-signer
2022-06-27T04:52:26Z  2022-06-28T04:41:37Z  openshift-kube-apiserver	aggregator-client
2022-06-27T04:52:21Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer
2022-06-27T04:41:38Z  2022-06-28T04:41:38Z  openshift-kube-controller-manager-operator	csr-signer-signer

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7696. The following is the description of the original issue:

Description of problem:

Not able to deploy a machine with publicIP: true on an Azure disconnected cluster.

Version-Release number of selected component (if applicable):

Cluster version is 4.13.0-0.nightly-2023-02-16-120330

How reproducible:

Always

Steps to Reproduce:

1.Create a machineset with publicIp true

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-17T09:54:35Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: machineset-36489
  namespace: openshift-machine-api
  resourceVersion: "227215"
  uid: e9213148-0bdf-48f1-84be-1e1a36af43c1
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
      machine.openshift.io/cluster-api-machineset: machineset-36489
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: huliu-az17a-vk8wq
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: machineset-36489
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/huliu-az17a-vk8wq-rg/providers/Microsoft.Compute/galleries/gallery_huliu_az17a_vk8wq/images/huliu-az17a-vk8wq-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: westus
          managedIdentity: huliu-az17a-vk8wq-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: huliu-az17a-vk8wq-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: true
          publicLoadBalancer: huliu-az17a-vk8wq
          resourceGroup: huliu-az17a-vk8wq-rg
          subnet: huliu-az17a-vk8wq-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: huliu-az17a-vk8wq-vnet
          zone: ""
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1

Machine in failed status with below error :
 Error Message:           failed to reconcile machine "machineset-36489-hhjfc": network.PublicIPAddressesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidDomainNameLabel" Message="The domain name label -machineset-36489-hhjfc is invalid. It must conform to the following regular expression: ^[a-z][a-z0-9-]{1,61}[a-z0-9]$." Details=[]
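
The generated domain name label starts with "-", so it fails Azure's pattern quoted in the error. A small Go sketch of validating/sanitizing such a label against that pattern (the helper is hypothetical, not machine-api-provider-azure code):

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// azureLabelRE is the pattern from the error message above.
var azureLabelRE = regexp.MustCompile(`^[a-z][a-z0-9-]{1,61}[a-z0-9]$`)

// sanitizeLabel lower-cases and trims leading/trailing hyphens so a generated
// label such as "-machineset-36489-hhjfc" becomes valid, and reports labels
// that still do not match.
func sanitizeLabel(s string) (string, error) {
    s = strings.Trim(strings.ToLower(s), "-")
    if !azureLabelRE.MatchString(s) {
        return "", fmt.Errorf("invalid Azure DNS label %q", s)
    }
    return s, nil
}

func main() {
    fmt.Println(sanitizeLabel("-machineset-36489-hhjfc"))
}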

 

Actual results:

The machine is in Failed state with the InvalidDomainNameLabel error above, even though a publicZone exists in the cluster DNS:
oc edit dns cluster

apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  creationTimestamp: "2023-02-17T02:26:41Z"
  generation: 1
  name: cluster
  resourceVersion: "529"
  uid: a299c3d8-e8ed-4266-b842-7585d5c0632d
spec:
  baseDomain: huliu-az17a.qe.azure.devcluster.openshift.com
  privateZone:
    id: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huliu-az17a-vk8wq-rg/providers/Microsoft.Network/privateDnsZones/huliu-az17a.qe.azure.devcluster.openshift.com
  publicZone:
    id: /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com
status: {}

Expected results:

machine should be running successfully 

Additional info:

Must Gather https://drive.google.com/file/d/1cPkFrTh7veO1Ph24GmVAyrs6mI3dmWYR/view?usp=sharing

Description of problem:

"Failed to open directory, disabling udev device properties" in node-exporter logs

$ for i in $(oc -n openshift-monitoring get pod | grep node-exporter | awk '{print $1}'); do echo $i; oc -n openshift-monitoring logs -c node-exporter $i | grep "Failed to open directory, disabling udev device properties"; echo -e "\n"; done
node-exporter-4279b
ts=2022-10-17T01:16:05.833Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

node-exporter-9tq64
ts=2022-10-17T01:16:04.642Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

node-exporter-dwtwh
ts=2022-10-17T01:16:04.936Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

node-exporter-nrznc
ts=2022-10-17T01:16:05.601Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

node-exporter-q87s4
ts=2022-10-17T01:16:05.228Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

node-exporter-twtxj
ts=2022-10-17T01:16:05.249Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data

Debugging on the node shows that /run/udev/data is readable:

# oc debug node/ip-10-0-138-107.us-east-2.compute.internal
Temporary namespace openshift-debug-dhvqv is created for debugging node...
Starting pod/ip-10-0-138-107us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.138.107
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls -l /run/udev/
total 0
srw-------.  1 root root    0 Oct 17 01:04 control
drwxr-xr-x.  2 root root 3780 Oct 17 01:26 data
drwxr-xr-x. 40 root root  800 Oct 17 01:04 links
drwxr-xr-x.  3 root root   60 Oct 17 01:04 static_node-tags
drwxr-xr-x.  5 root root  100 Oct 17 01:04 tags
drwxr-xr-x.  2 root root  140 Oct 17 01:04 watch
sh-4.4# ls -l /run/udev/data
total 304
-rw-r--r--. 1 root root   55 Oct 17 01:04 +acpi:AMZN0000:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXCPU:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXCPU:01
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXCPU:02
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXCPU:03
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXPWRBN:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXSLPBN:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXSYBUS:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXSYBUS:01
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:LNXSYSTM:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0103:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0303:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0400:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0501:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0A03:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0B00:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0C0F:00
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0C0F:01
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0C0F:02
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0C0F:03
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0C0F:04
-rw-r--r--. 1 root root   57 Oct 17 01:04 +acpi:PNP0F13:00
-rw-r--r--. 1 root root  142 Oct 17 01:04 +input:input0
-rw-r--r--. 1 root root  142 Oct 17 01:04 +input:input1
-rw-r--r--. 1 root root  218 Oct 17 01:04 +input:input2
-rw-r--r--. 1 root root  198 Oct 17 01:04 +input:input4
-rw-r--r--. 1 root root  143 Oct 17 01:04 +input:input5
-rw-r--r--. 1 root root   60 Oct 17 01:04 +module:configfs
-rw-r--r--. 1 root root   66 Oct 17 01:04 +module:fuse
-rw-r--r--. 1 root root  188 Oct 17 01:04 +pci:0000:00:00.0
-rw-r--r--. 1 root root  195 Oct 17 01:04 +pci:0000:00:01.0
-rw-r--r--. 1 root root  213 Oct 17 01:04 +pci:0000:00:01.3
-rw-r--r--. 1 root root  207 Oct 17 01:04 +pci:0000:00:03.0
-rw-r--r--. 1 root root  259 Oct 17 01:04 +pci:0000:00:04.0
-rw-r--r--. 1 root root  208 Oct 17 01:04 +pci:0000:00:05.0
-rw-r--r--. 1 root root   55 Oct 17 01:04 +platform:AMZN0000:00
-rw-r--r--. 1 root root  825 Oct 17 01:04 b259:0
-rw-r--r--. 1 root root 1357 Oct 17 01:04 b259:1
-rw-r--r--. 1 root root 1568 Oct 17 01:04 b259:2
-rw-r--r--. 1 root root 1619 Oct 17 01:04 b259:3
-rw-r--r--. 1 root root 1602 Oct 17 01:04 b259:4
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:144
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:183
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:227
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:228
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:229
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:231
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:235
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:236
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:62
-rw-r--r--. 1 root root    0 Oct 17 01:04 c10:63
-rw-r--r--. 1 root root  193 Oct 17 01:04 c13:32
-rw-r--r--. 1 root root    0 Oct 17 01:04 c13:63
-rw-r--r--. 1 root root  113 Oct 17 01:04 c13:64
-rw-r--r--. 1 root root  113 Oct 17 01:04 c13:65
-rw-r--r--. 1 root root  232 Oct 17 01:04 c13:66
-rw-r--r--. 1 root root  199 Oct 17 01:04 c13:67
-rw-r--r--. 1 root root  143 Oct 17 01:04 c13:68
-rw-r--r--. 1 root root    0 Oct 17 01:04 c162:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:11
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:3
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:4
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:5
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:7
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:8
-rw-r--r--. 1 root root    0 Oct 17 01:04 c1:9
-rw-r--r--. 1 root root    0 Oct 17 01:04 c202:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c202:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c202:2
-rw-r--r--. 1 root root    0 Oct 17 01:04 c202:3
-rw-r--r--. 1 root root    0 Oct 17 01:04 c203:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c203:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c203:2
-rw-r--r--. 1 root root    0 Oct 17 01:04 c203:3
-rw-r--r--. 1 root root    0 Oct 17 01:04 c241:0
-rw-r--r--. 1 root root  259 Oct 17 01:04 c242:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c246:0
-rw-r--r--. 1 root root   23 Oct 17 01:04 c251:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:10
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:11
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:12
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:13
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:14
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:15
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:16
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:17
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:18
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:19
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:2
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:20
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:21
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:22
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:23
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:24
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:25
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:26
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:27
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:28
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:29
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:3
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:30
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:31
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:32
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:33
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:34
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:35
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:36
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:37
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:38
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:39
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:4
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:40
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:41
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:42
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:43
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:44
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:45
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:46
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:47
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:48
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:49
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:5
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:50
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:51
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:52
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:53
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:54
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:55
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:56
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:57
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:58
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:59
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:6
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:60
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:61
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:62
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:63
-rw-r--r--. 1 root root   20 Oct 17 01:04 c4:64
-rw-r--r--. 1 root root   20 Oct 17 01:04 c4:65
-rw-r--r--. 1 root root   20 Oct 17 01:04 c4:66
-rw-r--r--. 1 root root   20 Oct 17 01:04 c4:67
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:7
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:8
-rw-r--r--. 1 root root    0 Oct 17 01:04 c4:9
-rw-r--r--. 1 root root    0 Oct 17 01:04 c5:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c5:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c5:2
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:0
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:1
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:128
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:129
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:130
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:131
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:132
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:133
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:134
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:2
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:3
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:4
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:5
-rw-r--r--. 1 root root    0 Oct 17 01:04 c7:6
-rw-r--r--. 1 root root   87 Oct 17 01:04 n1
-rw-r--r--. 1 root root  360 Oct 17 01:06 n10
-rw-r--r--. 1 root root  360 Oct 17 01:06 n11
-rw-r--r--. 1 root root  360 Oct 17 01:06 n13
-rw-r--r--. 1 root root  360 Oct 17 01:07 n14
-rw-r--r--. 1 root root  595 Oct 17 01:04 n2
-rw-r--r--. 1 root root  360 Oct 17 01:09 n25
-rw-r--r--. 1 root root  360 Oct 17 01:10 n29
-rw-r--r--. 1 root root  195 Oct 17 01:04 n3
-rw-r--r--. 1 root root  360 Oct 17 01:10 n30
-rw-r--r--. 1 root root  360 Oct 17 01:11 n31
-rw-r--r--. 1 root root  360 Oct 17 01:14 n35
-rw-r--r--. 1 root root  360 Oct 17 01:14 n37
-rw-r--r--. 1 root root  360 Oct 17 01:14 n39
-rw-r--r--. 1 root root  188 Oct 17 01:04 n4
-rw-r--r--. 1 root root  360 Oct 17 01:15 n41
-rw-r--r--. 1 root root  193 Oct 17 01:04 n5
-rw-r--r--. 1 root root  360 Oct 17 01:18 n50
-rw-r--r--. 1 root root  362 Oct 17 01:26 n54
-rw-r--r--. 1 root root  189 Oct 17 01:04 n6
-rw-r--r--. 1 root root  357 Oct 17 01:05 n7
-rw-r--r--. 1 root root  357 Oct 17 01:05 n8
-rw-r--r--. 1 root root  359 Oct 17 01:05 n9 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-15-094115
node-exporter version=1.4.0

How reproducible:

always

Steps to Reproduce:

1. check node-exporter logs
2.
3.

Actual results:

"Failed to open directory, disabling udev device properties" in node-exporter logs

Expected results:

no error logs

Additional info:

no functional impact on the cluster
code:
https://github.com/prometheus/node_exporter/blob/release-1.4/collector/diskstats_linux.go#L262-L270
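Since the directory is readable on the host but the collector still cannot open it, one plausible direction for a fix, offered here only as a sketch and not necessarily what the cluster-monitoring-operator actually shipped, is to expose the host's udev database to the node-exporter container via a read-only hostPath mount (the volume name below is made up):

containers:
- name: node-exporter
  volumeMounts:
  - name: run-udev-data
    mountPath: /run/udev/data
    readOnly: true
volumes:
- name: run-udev-data
  hostPath:
    path: /run/udev/data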

This is a clone of issue OCPBUGS-3987. The following is the description of the original issue:

Description of problem:

When the user supplies nmstateConfig in agent-config.yaml, invalid configurations may not be detected.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

every time

Steps to Reproduce:

1. Create an invalid NM config. In this case an interface was defined with a route but no IP address (see the sketch after these steps).
2. The ISO can be generated with no errors.
3. At run time the invalid configuration was detected by assisted-service, and create-cluster-and-infraenv.service logged the error "failed to validate network yaml for host 0, invalid yaml, error:"
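For illustration only, this is roughly what such an invalid hosts section could look like in agent-config.yaml (the hostname, MAC address and NIC name are made up; the route references eno1 even though eno1 is never given an IP address):

hosts:
- hostname: master-0
  interfaces:
  - name: eno1
    macAddress: 52:54:00:aa:aa:aa
  networkConfig:
    interfaces:
    - name: eno1
      type: ethernet
      state: up
    routes:
      config:
      - destination: 0.0.0.0/0
        next-hop-address: 192.168.111.1
        next-hop-interface: eno1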
 

Actual results:

Installation failed

Expected results:

Invalid configuration would be detected when ISO is created

Additional info:

It looks like the ValidateStaticConfigParams check is ONLY done when the nmstateconfig is provided in nmstateconfig.yaml, not when the file is generated (supplied in agent-config.yaml). https://github.com/openshift/installer/blob/master/pkg/asset/agent/manifests/nmstateconfig.go#L188

 

 

CI is failing due to the updated pod security admission controller. We need to update the console test pods with the correct security values.

Error: Command failed: echo '{"apiVersion":"v1","kind":"Pod","metadata":{"name":"test-jxlpt-event-test-pod","namespace":"test-jxlpt"},"spec":{"containers":[{"name":"httpd","image":"image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest"}]}}' | kubectl create -n test-jxlpt -f -
Error from server (Forbidden): error when creating "STDIN": pods "test-jxlpt-event-test-pod" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "httpd" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "httpd" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "httpd" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "httpd" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
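The admission error itself lists the fields the test pod is missing. A hedged sketch of the adjusted test pod, with the securityContext values taken directly from the error above (the image must also actually run as a non-root user for runAsNonRoot to pass):

oc create -n test-jxlpt -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-jxlpt-event-test-pod
  namespace: test-jxlpt
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: httpd
    image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
EOF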

Description of problem:
In 4.13 we ship the Collection Profiles feature as Tech Preview. This introduced a change to our default selectors for Prometheus *Monitors and to the corresponding labels in some CMO-controlled service monitors, and the change is in effect even without Tech Preview being enabled.
In order to avoid double scraping on update, we need to backport the selector change to 4.12.

Description of problem:

This is the original bug: https://bugzilla.redhat.com/show_bug.cgi?id=2098054

It was fixed in https://github.com/openshift/kubernetes/pull/1340 but was reverted as it introduced a bug that meant we did not register instances on create for NLB services.

Need to fix the issue and reintroduce the fix

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3280. The following is the description of the original issue:

I have a script that does continuous installs using AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, just starting a new install after the previous one completes. What I'm seeing is that eventually I end up getting installation failures due to the container-images-available validation failure. What gets logged in wait-for bootstrap-complete is:

level=debug msg=Host master-0: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. result: failure. 

level=debug msg=Host master-0: validation 'container-images-available' that used to succeed is now failing
level=debug msg=Host master-0: updated status from preparing-for-installation to preparing-failed (Host failed to prepare for installation due to following failing validation(s): Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. This may be due to a network hiccup. Retry to install again. If this problem persists, check your network settings to make sure you’re not blocked. ; Host couldn't synchronize with any NTP server)

Sometimes the image gets loaded onto the other masters OK and sometimes there are failures with more than one host. In either case the install stalls at this point.

When using a disconnected environment (MIRROR_IMAGES=true) I don't see this occurring.

Containers on host0
[core@master-0 ~]$ sudo podman ps
CONTAINER ID  IMAGE                                                                                                                   COMMAND               CREATED       STATUS           PORTS       NAMES
00a0eebb989c  localhost/podman-pause:4.2.0-1661537366                                                                                                       11 hours ago  Up 11 hours ago              cef65dd7f170-infra
5d0eced94979  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589  /bin/bash start_d...  11 hours ago  Up 11 hours ago              assisted-db
813bef526094  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589  /assisted-service     11 hours ago  Up 11 hours ago              service
edde1028a542  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e43558e28be8fbf6fe4529cf9f9beadbacbbba8c570ecf6cb81ae732ec01807f  next_step_runner ...  11 hours ago  Up 11 hours ago              next-step-runner

Some relevant logs from assisted-service for this container image:
time="2022-11-03T01:48:44Z" level=info msg="Submitting step <container-image-availability> id <container-image-availability-b72665b1> to infra_env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <df170326-772b-43b5-87ef-3dfff91ba1a9>  Arguments: <[{\"images\":[\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\"],\"timeout\":960}]>" func=github.com/openshift/assisted-service/internal/host/hostcommands.logSteps file="/src/internal/host/hostcommands/instruction_manager.go:285" go-id=841 host_id=df170326-772b-43b5-87ef-3dfff91ba1a9 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=instructions request_id=47cc221f-4f47-4d0d-8278-c0f5af933567

time="2022-11-03T01:49:35Z" level=error msg="Received step reply <container-image-availability-9788cfa7> from infra-env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <845f1e3c-c286-4d2f-ba92-4c5cab953641> exit-code <2> stderr <> stdout <{\"images\":[

{\"name\":\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"result\":\"success\"}

,{\"download_rate\":159.65409925994226,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"result\":\"success\",\"size_bytes\":523130669,\"time\":3.276650405},{\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"result\":\"failure\"},{\"download_rate\":278.8962416008878,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\",\"result\":\"success\",\"size_bytes\":402688178,\"time\":1.443863767}]}>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/src/internal/bminventory/inventory.go:3287" go-id=845 host_id=845f1e3c-c286-4d2f-ba92-4c5cab953641 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=Inventory request_id=3a571ba6-5175-4bbe-b89a-20cdde30b884                         

time="2022-11-03T01:49:35Z" level=info msg="Adding new image status for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6 with status failure to host 845f1e3c-c286-4d2f-ba92-4c5cab953641" func="github.com/openshift/assisted-service/internal/host.(*Manager).UpdateImageStatus" file="/src/internal/host/host.go:805" pkg=host-state

 

This is a clone of issue OCPBUGS-3524. The following is the description of the original issue:

Description of problem:

Installing a fully private cluster on Azure against 4.12.0-0.nightly-2022-11-10-033725, the storage account (sa) for the CoreOS image has public access.

$ az storage account list -g jima-azure-11a-f58lp-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterptkpx    True
imageregistryjimaazrsgcc    False

With the same profile on 4.11.0-0.nightly-2022-11-10-202051, the sa for the CoreOS image is not publicly accessible.

$ az storage account list -g jima-azure-11c-kf9hw-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterr8wv9    False
imageregistryjimaaz9btdx    False 

We checked that the terraform-provider-azurerm version differs between 4.11 and 4.12.

4.11: v2.98.0

4.12: v3.19.1

In terraform-provider-azurerm v2.98.0, the property allow_blob_public_access controls the sa's public access, and its default value is false.

In terraform-provider-azurerm v3.19.1, the property allow_blob_public_access is renamed to allow_nested_items_to_be_public, and its default value is true.

https://github.com/hashicorp/terraform-provider-azurerm/blob/main/CHANGELOG.md#300-march-24-2022
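Until the installer sets this explicitly, the setting can be checked and tightened after install with the same az CLI used above; a hedged sketch (the resource group and account names are from this reproduction and will differ per cluster):

$ az storage account list -g jima-azure-11a-f58lp-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
$ az storage account update -g jima-azure-11a-f58lp-rg -n clusterptkpx --allow-blob-public-access false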

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-10-033725

How reproducible:

always on 4.12

Steps to Reproduce:

1. Install fully private cluster on azure against 4.12 payload
2. 
3.

Actual results:

The sa for the CoreOS image is publicly accessible.

Expected results:

The sa for the CoreOS image should not be publicly accessible.

Additional info:

This only happens on 4.12.

 

 

For a disconnected installation, we should not be able to provision machines successfully with publicIP: true. This was the behavior up through 4.11 and the 4.12 nightlies until around 17 August, but 4.12 has since started allowing creation of machines with publicIP: true set in the MachineSet.

Issue reproduced on - Cluster version - 4.12.0-0.nightly-2022-08-23-223922

It is always reproducible.

Steps:
Create a MachineSet from YAML containing
{"spec":{"providerSpec":{"value":{"publicIP": true}}}}

The MachineSet is created successfully and the machine is provisioned successfully.

This seems to be a regression; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1889620

Here is the must gather log - https://drive.google.com/file/d/1UXjiqAx7obISTxkmBsSBuo44ciz9HD1F/view?usp=sharing

Here is the test that passed on 4.11 with exactly the same profile, where machine creation correctly failed with an InvalidConfiguration error: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/575822/console

We can confirm this is a disconnected cluster using the command below; many mirrors are configured in the ImageContentSourcePolicy:

oc get ImageContentSourcePolicy image-policy-aosqe -o yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  creationTimestamp: "2022-08-24T09:08:47Z"
  generation: 1
  name: image-policy-aosqe
  resourceVersion: "34648"
  uid: 20e45d6d-e081-435d-b6bb-16c4ca21c9d6
spec:
  repositoryDigestMirrors:
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/olmqe
    source: quay.io/olmqe
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/openshifttest
    source: quay.io/openshifttest
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/openshift-qe-optional-operators
    source: quay.io/openshift-qe-optional-operators
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: registry.redhat.io
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: registry.stage.redhat.io
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: brew.registry.redhat.io

 

 

Description of problem:

A large OpenShift Container Platform 4.10.24 cluster is failing to update the router-certs secret in the openshift-config-managed namespace because the secret has become too big.

2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266  Reconciler error  {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}

The OpenShift Container Platform 4 cluster has 180 IngressControllers configured with endpointPublishingStrategy set to private.

Now the default certificate needs to be replaced, but it is not properly replicated to the openshift-authentication namespace (and potentially other locations) because of the problem mentioned above, since the required secret cannot be updated.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.10.24

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.10
2. Create 180 IngressController with specific certificates
3. Check openshift-ingress-operator logs to see how it fails to update/create the necessary secret in openshift-config-managed

Actual results:

2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266  Reconciler error  {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}

Expected results:

No matter how many IngressControllers are created, the secret management handled by the operators needs to keep working, even if the data exceeds the 1 MB size limit. In that case an approach is needed to split the data into multiple secrets or handle it otherwise.

Additional info:

 

Catastrophic job runs where high numbers of tests fail are common. There are likely many root causes, but let's try to find one. This is a hard task because it's not "this one test failed, figure out why."

Clusters of failures are more common on certain platforms, it may be fruitful to start with the worst.

NURP's that average > 5 openshift-tests or openshift-tests-upgrade failures:

                      variants                       |          avg           
-----------------------------------------------------+------------------------
 {azure,amd64,ovn,upgrade,upgrade-micro,single-node} |   124.5294117647058824
 {azure,amd64,ovn,upgrade,upgrade-minor,single-node} |    92.9090909090909091
 {openstack,amd64,ovn,ha}                            |    49.2105263157894737
 {azure,amd64,sdn,ha,fips}                           |    25.6666666666666667
 {metal-ipi,amd64,ovn,ha}                            |    24.6000000000000000
 {openstack,amd64,ovn,ha,fips}                       |    23.5000000000000000
 {azure,amd64,ovn,ha,hypershift}                     |    22.6666666666666667
 {s390x,sdn,ha}                                      |    22.5454545454545455
 {gcp,amd64,ovn,ha}                                  |    21.5714285714285714
 {ppc64le,sdn,ha}                                    |    17.9545454545454545
 {metal-ipi,amd64,sdn,ha}                            |    17.6000000000000000
 {openstack,amd64,ovn,ha,serial}                     |    15.3333333333333333
 {azure,amd64,ovn,ha}                                |    15.1627906976744186
 {promote}                                           |    15.0000000000000000
 {aws,amd64,ovn,ha}                                  |    14.2558139534883721
 {metal-ipi,amd64,ovn,upgrade,upgrade-minor,ha}      |    13.9375000000000000
 {gcp,amd64,ovn,upgrade,upgrade-minor,ha,realtime}   |    11.2000000000000000
 {azure,amd64,sdn,upgrade,upgrade-minor,ha}          |     9.6842105263157895
 {never-stable}                                      |     9.0740740740740741
 {aws,amd64,ovn,single-node}                         |     8.8666666666666667
 {metal-ipi,amd64,sdn,upgrade,upgrade-micro,ha}      |     7.9090909090909091
 {azure,amd64,sdn,upgrade,upgrade-micro,ha}          |     6.4000000000000000
 {aws,amd64,sdn,ha}                                  |     5.7800000000000000
 {vsphere-ipi,amd64,ovn,ha}                          |     5.6458333333333333
 {openstack,amd64,ovn,upgrade,upgrade-minor,ha}      |     5.6250000000000000
 {metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}      |     5.5882352941176471
 {aws,amd64,sdn,upgrade,upgrade-micro,ha}            |     5.5789473684210526

Here's a sippy link for 4.12 job runs with > 50 failures: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.12/runs?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522test_failures%2522%252C%2522operatorValue%2522%253A%2522%253E%2522%252C%2522value%2522%253A%252250%2522%257D%252C%257B%2522columnField%2522%253A%2522overall_result%2522%252C%2522operatorValue%2522%253A%2522equals%2522%252C%2522value%2522%253A%2522F%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=desc&sortField=timestamp

Description of problem:

Creating network policies in a namespace whose name is of the maximum allowed length can end up causing this error:

2023-06-22T17:34:40.804880959Z I0622 17:34:40.804851       1 obj_retry.go:318] Retry add failed for *v1.NetworkPolicy ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident/kas, will try again later: failed to create Network Policy ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident/kas: failed to create default deny port groups: error in transact with ops [
{Op:update Table:ACL Row:map[action:drop direction:to-lport external_ids:{GoMap:map[default-deny-policy-type:Ingress]} log:false match:outport == @a7686019953911959437_ingressDefaultDeny meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident_]} priority:1000] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {08cc8026-4c22-4c52-99cd-e8cd1469c8bd}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:update Table:ACL Row:map[action:allow direction:to-lport external_ids:{GoMap:map[default-deny-policy-type:Ingress]} log:false match:outport == @a7686019953911959437_ingressDefaultDeny && (arp || nd) meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident_]} priority:1001] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {08cc8026-4c22-4c52-99cd-e8cd1469c8bd}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:update Table:ACL Row:map[action:drop direction:from-lport external_ids:{GoMap:map[default-deny-policy-type:Egress]} log:false match:inport == @a7686019953911959437_egressDefaultDeny meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident_]} options:{GoMap:map[apply-after-lb:true]} priority:1000] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {f324353c-a47b-4044-9cd9-dbeef058ada3}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}{Op:update Table:ACL Row:map[action:allow direction:from-lport external_ids:{GoMap:map[default-deny-policy-type:Egress]} log:false match:inport == @a7686019953911959437_egressDefaultDeny && (arp || nd) meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-production-24gfm4t0rtdsg01bcqgihdrceh3t59na-mshen-incident_]} options:{GoMap:map[apply-after-lb:true]} priority:1001] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {f324353c-a47b-4044-9cd9-dbeef058ada3}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}{Op:update Table:Port_Group Row:map[acls:{GoSet:[{GoUUID:08cc8026-4c22-4c52-99cd-e8cd1469c8bd} {GoUUID:08cc8026-4c22-4c52-99cd-e8cd1469c8bd}]} external_ids:{GoMap:map[name:a7686019953911959437_ingressDefaultDeny]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d3b52500-963a-4f7b-8928-d869f298d2e8}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}{Op:update Table:Port_Group Row:map[acls:{GoSet:[{GoUUID:f324353c-a47b-4044-9cd9-dbeef058ada3} {GoUUID:f324353c-a47b-4044-9cd9-dbeef058ada3}]} external_ids:{GoMap:map[name:a7686019953911959437_egressDefaultDeny]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {b128baec-6acd-4683-8c12-5b968bf73bd8}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:ovsdb error Details:set contains duplicate UUID:{GoUUID:} Rows:[]} {Count:0 Error: Details: UUID:{GoUUID:} Rows:[]}] and errors [ovsdb error: set contains duplicate]: 1 ovsdb operations failed

 

This is not a problem in 4.14, as we moved to ACL indexes, but in 4.13 and earlier we compare the ACL name and the external IDs. For default deny ACLs we simply store the direction in the external ID, and the name of the ACL is limited to 63 characters in OVN. When we create the default deny ACLs, we create one that denies everything, and we also create some allow ACLs to permit ARP and neighbor discovery traffic. These two ACLs may be recognized as duplicates because their truncated names (namespace only) and the directions in their external IDs match.

 

Description of problem:

The customer has identified that we are seeing packets leave from two egressIPs to the same target address and port, splitting traffic instead of selecting a primary interface to use as egress when multiple egress IPs are assigned.

Version-Release number of selected component (if applicable):

OCP 4.10.30

How reproducible:

Every time on the customer's endpoint.

Steps to Reproduce:

1. Deploy egressIP object with two selected IPs in valid range, scope eip to namespace with pods reaching to upstream source.
2. Capture packets at target and observe incoming packets from two separate sources attempting to continue conversation with continued ACKs instead of starting a new conversation with SYN on first contact from new IP.
3. Traffic is fragmented and dropped/rejected by the host because requests from OpenShift-hosted services through the EIP(s) do not come from the same origination point.

Actual results:

Traffic is dropped at the target because it arrives from two origin points.

Expected results:

Traffic should flow from a single EIP acting as the leader.

Additional info:

The issue is mitigated when the EIP is set to include only a single IP address. (It occurs with multiple egressIPs deployed across multiple projects; multiple clusters are affected in the customer environment.)

See next comments for specific information/case number/data sets and conversation.
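As a hedged illustration of that mitigation, assuming an OVN-Kubernetes EgressIP object (the name, IP and selector below are placeholders), the object can be reduced to a single egress IP:

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-example
spec:
  egressIPs:
  - 192.0.2.10
  namespaceSelector:
    matchLabels:
      egress: "true"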

This is a clone of issue OCPBUGS-10427. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9969. The following is the description of the original issue:

Description of problem:

An OCP cluster born on 4.1 fails to scale up nodes due to the older podman version 1.0.2 present in the 4.1 boot image. This was observed while testing bug https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21889975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21889975

Journal log:
- Unit machine-config-daemon-update-rpmostree-via-container.service has finished starting up.
--
-- The start-up result is RESULT.
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: flag provided but not defined: -authfile
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: See 'podman run --help'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=125/n/a
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 24ms CPU time

Version-Release number of selected component (if applicable):

OCP 4.12 and later

Steps to Reproduce:

1. Upgrade a 4.1-based cluster to 4.12 or a later version
2. Try to scale up a node
3. The node fails to join

 

Additional info:  https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21890647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21890647

This is a clone of issue OCPBUGS-4049. The following is the description of the original issue:

Description of problem:

In the case of CRC we provision the cluster first, then create the disk image out of it, and that is what we share with our users. Until now we have always removed the pull secret from the cluster after provisioning it, using https://github.com/crc-org/snc/blob/master/snc.sh#L241-L258, and this worked without any issue up to 4.11.x, but with 4.12.0-rc.1 we are seeing that the MCO is not able to reconcile.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a single node cluster using cluster bot `launch 4.12.0-rc.1 aws,single-node` 

2. Once the cluster is provisioned, replace the pull secret in the config:

```
$ cat pull-secret.yaml 
apiVersion: v1
data:
  .dockerconfigjson: e30K
kind: Secret
metadata:
  name: pull-secret
  namespace: openshift-config
type: kubernetes.io/dockerconfigjson
$ oc replace -f pull-secret.yaml
```
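(The replacement .dockerconfigjson value is just an empty JSON object; decoding the base64 shows this.)

$ echo 'e30K' | base64 -d
{}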

3. Wait for the MCO to reconcile and you will see the reconciliation fail

Actual results:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-66086aa249a9f92b773403f7c3745ea4   False     True       True       1              0                   0                     1                      94m
worker   rendered-worker-0c07becff7d3c982e24257080cc2981b   True      False      False      0              0                   0                     0                      94m


$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.12.0-rc.1   True        False         True       93m     Failed to resync 4.12.0-rc.1 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 0)]

$ oc logs machine-config-daemon-nf9mg -n openshift-machine-config-operator
[...]
I1123 15:00:37.864581   10194 run.go:19] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba
Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
W1123 15:00:39.186103   10194 run.go:45] podman failed: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba failed: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
: exit status 125; retrying...

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-6222. The following is the description of the original issue:

Please review the following PR: https://github.com/openshift/alibaba-cloud-csi-driver/pull/20

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

At runtime we know the version of OpenShift that we're installing, so we can dynamically generate the OS_IMAGES environment variable to point at the image for the current release. This will prevent having to add to the hard-coded list for every release.
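A hedged sketch of what that could look like in the test setup (the image URL is a placeholder, and the JSON field names follow the usual assisted-service OS_IMAGES convention, so treat them as assumptions):

# Derive major.minor from the installer being run, then build OS_IMAGES for
# just that release instead of appending to a hard-coded list every release.
VERSION="$(openshift-install version | awk '/^openshift-install/ {print $2}' | cut -d. -f1,2)"
export OS_IMAGES="[{\"openshift_version\": \"${VERSION}\", \"cpu_architecture\": \"x86_64\", \"version\": \"${VERSION}\", \"url\": \"https://example.com/rhcos-${VERSION}-live.x86_64.iso\"}]"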

And possibly other alerts.  Declaring namespace labels on alerts makes it easy to find the source or affected resource, as described here. But because Insights alerts are based on metrics exported by the cluster-version operator, they inherit source information from the CVO, and end up looking like:

ALERTS{alertname="SimpleContentAccessNotAvailable", alertstate="firing", condition="SCAAvailable", endpoint="metrics", instance="10.58.57.116:9099", job="cluster-version-operator", name="insights", namespace="openshift-cluster-version", pod="cluster-version-operator-5d8579fb58-p5hfn", prometheus="openshift-monitoring/k8s", reason="NotFound", receive="true", service="cluster-version-operator", severity="info"}

Adding namespace: openshift-insights to the labels block for InsightsDisabled and SimpleContentAccessNotAvailable would avoid this confusion.

You might also want to clear the job and service labels as irrelevant source information. And you might want to clear the pod label to avoid churning alerts when the CVO rolls out a new pod. You can get the label clearing by wrapping the expr with max without (job, pod, service) (...) or similar.
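As an illustration only (this is not the CVO's shipped manifest, and the expression below is a placeholder approximating the real one), the suggested shape of the rule would be:

- alert: SimpleContentAccessNotAvailable
  expr: max without (job, pod, service) (cluster_operator_conditions{name="insights",condition="SCAAvailable",reason="NotFound"} == 0)
  labels:
    namespace: openshift-insights
    severity: info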

This is a clone of issue OCPBUGS-12729. The following is the description of the original issue:

Description of problem:

This came out of the investigation of https://issues.redhat.com/browse/OCPBUGS-11691 . The nested node configs used to support dual stack VIPs do not correctly respect the EnableUnicast setting. This is causing issues on EUS upgrades where the unicast migration cannot happen until all nodes are on 4.12. This is blocking both the workaround and the eventual proper fix.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Deploy 4.11 with unicast explicitly disabled (via MCO patch)
2. Write /etc/keepalived/monitor-user.conf to suppress unicast migration
3. Upgrade to 4.12

Actual results:

Nodes come up in unicast mode

Expected results:

Nodes remain in multicast mode until monitor-user.conf is removed

Additional info:

 

This is a clone of issue OCPBUGS-18582. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18257. The following is the description of the original issue:

Description of problem:

The fix for https://issues.redhat.com/browse/OCPBUGS-15947 seems to have introduced a problem in our keepalived-monitor logic. What I'm seeing is that at some point all of the apiservers became unavailable, which caused haproxy-monitor to drop the redirect firewall rule since it wasn't able to reach the API and we normally want to fall back to direct, un-loadbalanced API connectivity in that case.

However, due to the fix linked above we now short-circuit the keepalived-monitor update loop if we're unable to retrieve the node list, which is what will happen if the node holding the VIP has neither a local apiserver nor the HAProxy firewall rule. Because of this we will also skip updating the status of the firewall rule and thus the keepalived priority for the node won't be dropped appropriately.

Version-Release number of selected component (if applicable):

We backported the fix linked above to 4.11 so I expect this goes back at least that far.

How reproducible:

Unsure. It's clearly not happening every time, but I have a local dev cluster in this state so it can happen.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

I think the solution here is just to move the firewall rule check earlier in the update loop so it will have run before we try to retrieve nodes. There's no dependency on the ordering of those two steps so I don't foresee any major issues.

To workaround this I believe we can just bounce keepalived on the affected node until the VIP ends up on the node with a local apiserver.
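One way to apply that workaround, sketched only as an assumption about the setup (keepalived runs as a static pod on on-prem platforms, so stopping its containers makes the kubelet restart them and forces a new VRRP election):

# On the node currently holding the VIP; repeat until the VIP lands on a node
# with a local apiserver. <affected-node> is a placeholder.
oc debug node/<affected-node> -- chroot /host sh -c 'crictl stop $(crictl ps -q --name keepalived)'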

Description of problem:

When the cluster install finished, the 'wait-for install-complete' command did not exit as expected.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Get the latest agent-based installer source and build it:
git clone https://github.com/openshift/installer.git
cd installer/
hack/build.sh
Edit the agent-config and install-config YAML files
Create the agent.iso image:
OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release:4.12.0-ec.3-x86_64 bin/openshift-install agent create image --log-level debug

2. Install SNO cluster
virt-install --connect qemu:///system -n control-0 -r 33000 --vcpus 8 --cdrom ./agent.iso --disk pool=installer,size=120 --boot uefi,hd,cdrom --os-variant=rhel8.5 --network network=default,mac=52:54:00:aa:aa:aa --wait=-1 

3. Run 'bin/openshift-install agent wait-for bootstrap-complete --log-level debug'; the command finished as expected.

4. After bootstrap completion, run 'bin/openshift-install agent wait-for install-complete --log-level debug'; the command did not finish as expected.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:

Description of problem: This is a follow-up to OCPBUGS-3933.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

In looking at jobs on an accepted payload at https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.ci/release/4.12.0-0.ci-2022-08-30-122201 , I observed this job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1564589538850902016 with "Undiagnosed panic detected in pod" "pods/openshift-controller-manager-operator_openshift-controller-manager-operator-74bf985788-8v9qb_openshift-controller-manager-operator.log.gz:E0830 12:41:48.029165       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)" 

Version-Release number of selected component (if applicable):

4.12

How reproducible:

probably relatively easy to reproduce (but not consistently) given it's happened several times according to this search: https://search.ci.openshift.org/?search=Observed+a+panic%3A+%22invalid+memory+address+or+nil+pointer+dereference%22&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1. let nightly payloads run or run one of the presubmit jobs mentioned in the search above
2.
3.

Actual results:

Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}

Expected results:

no panics

Additional info:

 

In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1

The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, or replacing them with a more modern resource type, or some such. Definition of done is that new 4.13 (and with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.

This can be noticed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:

[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
  github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4

This is required to unblock https://github.com/openshift/origin/pull/27561
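As a rough illustration of the mechanical part of the manifest change described above, assuming flowcontrol.apiserver.k8s.io/v1beta2 (served by the Kubernetes 1.25 that ships in 4.12) is an acceptable replacement; the real changes live in each operator's repository:

# Bump the extracted manifests to v1beta2, then confirm nothing still
# references v1beta1.
sed -i 's|flowcontrol.apiserver.k8s.io/v1beta1|flowcontrol.apiserver.k8s.io/v1beta2|g' manifests/*flowschema*.yaml
grep -r 'flowcontrol.apiserver.k8s.io/v1beta1' manifests || echo 'no v1beta1 flowcontrol manifests remain'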

Description of problem:

NodePort port not accessible

Version-Release number of selected component (if applicable):

OCP 4.8.20

How reproducible:

$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry

$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

In a newly created namespace the same deployment works:

[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000

[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000

[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s

[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.

Actual results:

  • Works in a newly created namespace
  • Does not work in an already existing namespace

Expected results:

  • It should work in all namespaces.

Additional info:

  • This cluster was upgraded from 4.7.x to 4.8 and then OVN was enabled manually.
  • The issue was happening in all namespaces, but after restarting the ovnkube-master-xxxx pods only newly created namespaces work.

This is a clone of issue OCPBUGS-3993. The following is the description of the original issue:

Description of problem:
On OpenShift on OpenStack CI, we deploy an OCP cluster with an additional network on the workers (configured in install-config.yaml) for integration with OpenStack Manila.

compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['0eeae16f-bbc7-4e49-90b2-d96419b7c30d']
  replicas: 3

As a result, the egressIP annotation includes two interface definitions:

$ oc get node ostest-hp9ld-worker-0-gdp5k -o json | jq -r '.metadata.annotations["cloud.network.openshift.io/egress-ipconfig"]' | jq .                                 
[
  {
    "interface": "207beb76-5476-4a05-b412-d0cc53ab00a7",
    "ifaddr": {
      "ipv4": "10.46.44.64/26"
    },
    "capacity": {
      "ip": 8
    }
  },
  {
    "interface": "2baf2232-87f7-4ad5-bd80-b6586de08435",
    "ifaddr": {
      "ipv4": "172.17.5.0/24"
    },
    "capacity": {
      "ip": 10
    }
  }
]

According to Huiran Wang, egressIP only works for the primary interface on the node.

Version-Release number of selected component (if applicable):

 4.12.0-0.nightly-2022-11-22-012345
RHOS-16.1-RHEL-8-20220804.n.1

How reproducible:

Always

Steps to Reproduce:

Deploy cluster with additional Network on the workers

Actual results:

It is possible to select an egressIP network for a secondary interface

Expected results:

Only the primary subnet can be chosen for egressIP
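
For reference, a minimal sketch of an EgressIP object that selects an address from the primary subnet (the object name and namespaceSelector labels are hypothetical):

apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip-example
spec:
  egressIPs:
  # 10.46.44.100 falls inside the primary 10.46.44.64/26 subnet,
  # not the additional 172.17.5.0/24 Manila network
  - 10.46.44.100
  namespaceSelector:
    matchLabels:
      env: qa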

Additional info:

https://issues.redhat.com/browse/OCPQE-12968

Description of problem:

4.2 AWS boot images such as ami-01e7fdcb66157b224 include the old ignition.platform.id=ec2 kernel command line parameter. When launched against 4.12.0-rc.3, new machines fail with:

  1. The old user-data and old AMI successfully get to the machine-config-server request stage.
  2. The new instance will then request the full Ignition from /config/worker , and the machine-config server translates that to the old Ignition v2 spec format.
  3. The instance will lay down that Ignition-formatted content, and then try and reboot into the new state.
  4. Coming back up in the new state, the modern Afterburn comes up to try and figure out a node name for the kubelet, and this fails with unknown provider 'ec2'.

Version-Release number of selected component (if applicable):

coreos-assemblers used ignition.platform.id=ec2, but pivoted to =aws here. It's not clear when that made its way into new AWS boot images. Some time after 4.2 and before 4.6.

Afterburn dropped support for legacy command-line options like the ec2 slug in 5.0.0. But it's not clear when that shipped into RHCOS. The release controller points at this RHCOS diff, but that has afterburn-0-5.3.0-1 builds on both sides.

How reproducible:

100%, given a sufficiently old AMI and a sufficiently new OpenShift release target.

Steps to Reproduce:

  1. Install 4.12.0-rc.3 or similar new OpenShift on AWS in us-east-1.
  2. Create Ignition v2 user-data in a Secret in openshift-machine-api. I'm fuzzy on how to do that portion easily, since it's basically RFE-3001 backwards.
  3. Edit a compute MachineSet to set spec.template.spec.providerSpec.value.ami to id: ami-01e7fdcb66157b224 and also point it at your v2 user-data Secret (see the sketch after these steps).
  4. Possibly delete an existing Machine in that MachineSet, or raise replicas, or otherwise talk the MachineSet controller into provisioning a new Machine to pick up the reconfigured AMI.
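
A minimal sketch of the MachineSet edit from step 3 (a fragment of the MachineSet spec only; the Secret name worker-user-data-v2 is hypothetical):

spec:
  template:
    spec:
      providerSpec:
        value:
          ami:
            id: ami-01e7fdcb66157b224   # old 4.2 boot image
          userDataSecret:
            name: worker-user-data-v2   # Secret containing the Ignition v2 user-data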

Actual results:

The new Machine will get to Provisioned but fail to progress to Running. systemd journal logs will include unknown provider 'ec2' for Afterburn units.

Expected results:

Old boot-image AMIs can successfully update to 4.12.

Alternatively, we pin down the set of exposed boot images sufficiently that users with older clusters can audit for exposure and avoid the issue by updating to more modern boot images (although updating boot images is not trivial, see RFE-3001 and the Ignition spec 2 to 3 transition discussed in kcs#5514051).

Description of problem:
Kebab menu for helm repository is showing inconsistent behavior

Version-Release number of selected component (if applicable): 4.12

How reproducible: Always

Steps to Reproduce:
1. Create some helm chart repository
2. Go to the Helm page and switch to the repositories tab
3. Open kebab menu for different repos

Actual results:
Menus are overlapping

Expected results:
The menu should work properly; one menu should close before opening a new one

Additional info:
Video has been added for the reference

Description of problem:

node_exporter collects network metrics for "virtual" interfaces like br-*. When OVN is used, it also reports metrics for ovs-*, ovn, and genev_sys_* interfaces.

Version-Release number of selected component (if applicable):

4.12 (and before)

How reproducible:

Always

Steps to Reproduce:

1. Launch a 4.12 cluster.
2. Run the following PromQL query: "group by(device) (node_network_info)"
3.

Actual results:

Expected results:

Only real host interfaces should be present.
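
For comparison, a query that filters out the known virtual interface name patterns (the regex is only illustrative) could look like:

group by(device) (node_network_info{device!~"br-.*|ovs-.*|ovn|genev_sys_.*"})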

Additional info:


This is a clone of issue OCPBUGS-2841. The following is the description of the original issue:

Currently the agent installer supports only the x86_64 architecture. The image creation command must fail if an architecture other than x86_64 is configured.

We want to have an allowed list of architectures.

allowed = ['x86_64', 'amd64']

Description of problem: upon attempting to install OCP 4.10 UPI on baremetal ppc64le, the openshift-install gather command returns `panic: unsupported platform "none"`

Version-Release number of selected component (if applicable):

OCP 4.10.16

openshift-install 4.10.24 

How reproducible:

easily

Steps to Reproduce:
1. create install config
2. create manifests
3. create ignition configs

4. openshift-install gather bootstrap --log-level "debug"

Actual results:

DEBUG OpenShift Installer 4.10.24                  
DEBUG Built from commit d63a12ba0ec33d492093a8fc0e268a01a075f5da 
DEBUG Fetching Bootstrap SSH Key Pair...           
DEBUG Loading Bootstrap SSH Key Pair...            
DEBUG Using Bootstrap SSH Key Pair loaded from state file 
DEBUG Reusing previously-fetched Bootstrap SSH Key Pair 
DEBUG Fetching Install Config...                   
DEBUG Loading Install Config...                    
DEBUG   Loading SSH Key...                         
DEBUG   Loading Base Domain...                     
DEBUG     Loading Platform...                      
DEBUG   Loading Cluster Name...                    
DEBUG     Loading Base Domain...                   
DEBUG     Loading Platform...                      
DEBUG   Loading Networking...                      
DEBUG     Loading Platform...                      
DEBUG   Loading Pull Secret...                     
DEBUG   Loading Platform...                        
DEBUG Loading Install Config from both state file and target directory 
DEBUG On-disk Install Config matches asset in state file 
DEBUG Using Install Config loaded from state file  
DEBUG Reusing previously-fetched Install Config    
panic: unsupported platform "none"

goroutine 1 [running]:
github.com/openshift/installer/pkg/terraform/stages/platform.StagesForPlatform({0x146f2d0a, 0x1619aa08})
        /go/src/github.com/openshift/installer/pkg/terraform/stages/platform/stages.go:55 +0x2ff
main.runGatherBootstrapCmd({0x14d8e028, 0x1})
        /go/src/github.com/openshift/installer/cmd/openshift-install/gather.go:115 +0x2d6
main.newGatherBootstrapCmd.func1(0xc001364500, {0xc0005a0b40, 0x2, 0x2})
        /go/src/github.com/openshift/installer/cmd/openshift-install/gather.go:65 +0x59
github.com/spf13/cobra.(*Command).execute(0xc001364500, {0xc0005a0b20, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc001334c80)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:72 +0x29e
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:50 +0x125

Expected results:

I'm not really sure what I expected to happen.  I've never used that gather before..

I would assume at least no panicking.

Additional info:

This is a clone of issue OCPBUGS-4954. The following is the description of the original issue:

Description of problem:
During the cluster destroy process for IBM Cloud IPI, failures can occur when COS Instances are deleted but Reclamations are created for the COS deletions, which prevents cleanup of the ResourceGroup.

Version-Release number of selected component (if applicable):
4.13.0 (and 4.12.0)

How reproducible:
Sporadic, it depends on IBM Cloud COS

Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Delete the IPI cluster on IBM Cloud
3. COS Reclamation may be created, and can cause the destroy cluster to fail

Actual results:

time="2022-12-12T16:50:06Z" level=debug msg="Listing resource groups"
time="2022-12-12T16:50:06Z" level=debug msg="Deleting resource group \"eu-gb-reclaim-1-zc6xg\""
time="2022-12-12T16:50:07Z" level=debug msg="Failed to delete resource group eu-gb-reclaim-1-zc6xg: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands \"ibmcloud resource service-instances --type all\" and \"ibmcloud resource reclamations\" to check for remaining instances, then delete the instances and try again."

Expected results:
Successful destroy cluster (including deletion of ResourceGroup)

Additional info:
IBM Cloud is testing a potential fix currently.

It was also identified that the destroy stages are not in the proper order.
https://github.com/openshift/installer/blob/9377cb3974986a08b531a5e807fd90a3a4e85ebf/pkg/destroy/ibmcloud/ibmcloud.go#L128-L155

Changes are being made in an attempt to resolve this along with a fix for this bug as well.

Description of problem:
Users on a disconnected cluster with a proxy could not import a Devfile (from GitHub).

The API call /api/devfile/ takes 30 seconds until it fails with 504 Gateway timeout.

Version-Release number of selected component (if applicable):
This might happen since 4.8

So far this has been tested only on 4.12.0-0.nightly-2022-09-07-112008

How reproducible:
Always

Steps to Reproduce:

  1. Start a disconnected cluster with a proxy
  2. Open the browser network inspector and filter for /api/devfile
  3. Switch to Developer perspective
  4. Navigate to Add > Developer Catalog (All Services) > Devfiles
  5. Select a Devfile like Basic Go (https://github.com/devfile-samples/devfile-sample-go-basic.git)
  6. Press Create

Actual results:

  • Network call fails after 30 seconds
  • Import doesn't work

Expected results:

  • Import should create a Deployment and switch to topology view

Additional info:
The console Pod log contains this error:

E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This is a clone of issue OCPBUGS-16617. The following is the description of the original issue:

Description of problem:

LB skip_snat improperly applied with affinity_timeout

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

https://issues.redhat.com/browse/FD-3041

Description of problem: Knative tests were disabled due to https://issues.redhat.com/browse/OCPBUGS-190 to unblock the queue and should be enabled again

https://coreos.slack.com/archives/C6A3NV5J9/p1660659719046909 

https://github.com/openshift/console/pull/11956#discussion_r948075848 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Description of problem:

mapi_machinehealthcheck_short_circuit does not properly reconcile its state when a MachineHealthCheck that was failing because of unhealthy Machines is removed.

When using two MachineSets (called blue and green, where only one has running Machines at a given point in time) with a MachineAutoscaler and MachineHealthCheck, mapi_machinehealthcheck_short_circuit will continue to report 1 for a MachineHealthCheck that was actually removed because of a switch from blue to green.

$ oc get machineset | egrep 'blue|green'
housiocp4-wvqbx-worker-blue-us-east-2a    0         0                             2d17h
housiocp4-wvqbx-worker-green-us-east-2a   1         1         1       1           2d17h

$ oc get machineautoscaler
NAME                      REF KIND     REF NAME                                   MIN   MAX   AGE
worker-green-us-east-1a   MachineSet   housiocp4-wvqbx-worker-green-us-east-2a   1     4     2d17h

$ oc get machinehealthcheck
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
machine-api-termination-handler   100%           0                  0
worker-green-us-east-1a           40%            1                  1

      {
        "name": "machine-health-check-unterminated-short-circuit",
        "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml",
        "rules": [
          { 
            "state": "firing",
            "name": "MachineHealthCheckUnterminatedShortCircuit",
            "query": "mapi_machinehealthcheck_short_circuit == 1",
            "duration": 1800,
            "labels": {
              "severity": "warning"
            },
            "annotations": {
              "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
              "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
            },
            "alerts": [
              { 
                "labels": {
                  "alertname": "MachineHealthCheckUnterminatedShortCircuit",
                  "container": "kube-rbac-proxy-mhc-mtrc",
                  "endpoint": "mhc-mtrc",
                  "exported_namespace": "openshift-machine-api",
                  "instance": "10.128.0.58:8444",
                  "job": "machine-api-controllers",
                  "name": "worker-blue-us-east-1a",
                  "namespace": "openshift-machine-api",
                  "pod": "machine-api-controllers-779dcb8769-8gcn6",
                  "service": "machine-api-controllers",
                  "severity": "warning"
                },
                "annotations": {
                  "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
                  "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes"
                },
                "state": "firing",
                "activeAt": "2022-12-09T15:59:25.1287541Z",
                "value": "1e+00"
              }
            ],
            "health": "ok",
            "evaluationTime": 0.000648129,
            "lastEvaluation": "2022-12-12T09:35:55.140174009Z",
            "type": "alerting"
          }
        ],
        "interval": 30,
        "limit": 0,
        "evaluationTime": 0.000661589,
        "lastEvaluation": "2022-12-12T09:35:55.140165629Z"
      },

As we can see above, worker-blue-us-east-1a is no longer available and active; worker-green-us-east-1a is. But worker-blue-us-east-1a existed before the switch to green happened and was actually reporting some unhealthy Machines. Since it is now gone, mapi_machinehealthcheck_short_circuit should reconcile properly, as otherwise this is a false-positive alert.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous version)

How reproducible:

- Always

Steps to Reproduce:

1. Setup OpenShift Container Platform 4 on AWS for example
2. Create blue and green MachineSet with MachineAutoScaler and MachineHealthCheck
3. Have active Machines for blue only
4. Trigger unhealthy Machines in blue MachineSet
5. Switch to the green MachineSet by removing the MachineHealthCheck and MachineAutoscaler and setting the replicas of the blue MachineSet to 0
6. Create the green MachineHealthCheck and MachineAutoscaler and scale the green MachineSet to 1
7. Observe how mapi_machinehealthcheck_short_circuit continues to report unhealthy state for blue MachineHealthCheck which no longer exists.

Actual results:

mapi_machinehealthcheck_short_circuit reporting problematic MachineHealthCheck even though the faulty MachineHealthCheck does no longer exist.

Expected results:

mapi_machinehealthcheck_short_circuit to properly reconcile its state and remove MachineHealthChecks that have been removed at the OpenShift Container Platform level

Additional info:

It kind of looks like similar to the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 respectively https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be super relevant)

This is a clone of issue OCPBUGS-11218. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10950. The following is the description of the original issue:

Description of problem: 

"pipelines-as-code-pipelinerun-go" configMap is not been used for the Go repository while creating Pipeline Repository. "pipelines-as-code-pipelinerun-generic" configMap has been used.

Prerequisites (if any, like setup, operators/versions):

Install Red Hat Pipeline operator

Steps to Reproduce

  1. Navigate to Create Repository form 
  2. Enter the Git URL `https://github.com/vikram-raj/hello-func-go`
  3. Click on Add

Actual results:

`pipelines-as-code-pipelinerun-generic` PipelineRun template has been shown on the overview page 

Expected results:

`pipelines-as-code-pipelinerun-go` PipelineRun template should show on the overview page

Reproducibility (Always/Intermittent/Only Once):

Build Details:

4.13

Workaround:

Additional info:

This is a clone of issue OCPBUGS-4377. The following is the description of the original issue:

Description of problem:


--> Service name search ability while creating the Route from the console

2. What is the nature and description of the request?
--> While creating the route from the console (OCP dashboard) there is no option to search the service by name; the service can only be selected from the drop-down list. We need search ability so that the user can type the service name and select the service that comes up at the top of the search results.

3. Why does the customer need this? (List the business requirements here)
--> Selecting the service from the drop-down list can be very tedious. In one customer case they have 150 services in the namespace and need to scroll for a long time to select the service.

4. List any affected packages or components.
--> OCP console

5. Expected result.
--> Have the ability to type the service name while creating the route.

This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:

Description of problem:

The liveness probe of the ipsec pods fails on large clusters. Currently the command that is executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with the "ipsec/status" subcommand. In clusters with a high node count this command returns a list with all the node daemons of the cluster, so as the node count rises the completion time of the command rises too.

This makes the main command 

ovs-appctl -t ovs-monitor-ipsec

hang until the subcommand is finished.

As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container (https://github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml), the liveness timeout of 60 seconds starts to become insufficient as the node list grows. This resulted in a cluster with 170+ nodes having 15+ ipsec pods in a CrashLoopBackOff state.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.10, but the same will likely be visible in other versions too.

How reproducible:

I was not able to reproduce this because an extremely high amount of resources is needed, and there is little point since we have already spotted the issue.

Steps to Reproduce:

1. Install an Openshift cluster with IPSEC enabled
2. Scale to 170+ nodes or more
3. Notice that the ipsec pods start getting into a CrashLoopBackOff state with failed liveness/readiness probes.

Actual results:

IPsec pods are stuck in a CrashLoopBackOff state

Expected results:

IPsec pods work normally

Additional info:

We have provided a workaround where the CVO and CNO operators are scaled to 0 replicas so that the liveness probe timeout can be increased to a value of 600, which recovered the cluster.
As a next step the customer will try to reduce the node count and restore the default liveness timeout value, along with bringing the operators back, to see if the cluster stabilizes.
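
A rough sketch of what the adjusted probe in the ipsec container spec could look like (only the raised timeout is the point; the other field values are illustrative, not the actual CNO defaults):

livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
  periodSeconds: 60
  timeoutSeconds: 600   # raised from 60 as a stop-gap on very large clusters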

 

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Go to the detail page of some Deployments with PDB connected to it
2. Click Edit PDB from the kebab menu
3. Inspect the second input box under the `Availability requirement `

Actual results: The name and aria-label attributes always show minAvailable

Expected results: They should be consistent with the first input box

Additional info:

This is a clone of issue OCPBUGS-15457. The following is the description of the original issue:

This is a clone of issue OCPBUGS-15239. The following is the description of the original issue:

We've removed SR-IOV code that was using python-grpcio and python-protobuf. These are gone from Python's requirements.txt, but we never removed them from the RPM spec we use to build Kuryr in OpenShift. This should be fixed.

Description of problem:

While running scale tests with ACM provisioning 1200+ SNOs via ZTP, converged flow was enabled. With converged flow, the rate at which clusters begin installing is much slower than what was witnessed without converged flow.

Example:
Without converged flow - 1250/1269 SNOs completed install in 3hrs and 11m
With converged flow - 487/1250 SNOs completed install in 10hours

The test actually hit timeouts so we don't know exactly how long it took, but you can see we only managed to provision 487 SNOs in 10 hours.

The concurrency measurement scripts show that converged flow ran at a concurrency of 68 SNOs installing at a time vs non-converged flow peaking at 507. Something within the converged flow is bottlenecking the SNO installs.

Version-Release number of selected component (if applicable):

Hub/SNO OCP 4.11.8
ACM 2.6.1-DOWNSTREAM-2022-09-08-02-53-38

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

converged flow to match previous provisioning speeds/rates

Additional info:

Must gather will be provided.

Description of problem:

This is just a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2105570 for purposes of cherry-picking.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5184. The following is the description of the original issue:

Description of problem:

Failed to deploy an IPI Azure cluster with the region set to westus3 and the VM type set to NV8as_v4. The master node is running according to the Azure portal, but SSH login is not possible. The serial log shows the errors below:

[ 3009.547219] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:de0
[ 3011.982399] mlx5_core 6637:00:02.0 enP26167s1: TX timeout detected
[ 3011.987010] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 0, SQ: 0x170, CQ: 0x84d, SQ Cons: 0x823 SQ Prod: 0x840, usecs since last trans: 2418884000
[ 3011.996946] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 1, SQ: 0x175, CQ: 0x852, SQ Cons: 0x248c SQ Prod: 0x24a7, usecs since last trans: 2148366000
[ 3012.006980] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 2, SQ: 0x17a, CQ: 0x857, SQ Cons: 0x44a1 SQ Prod: 0x44c0, usecs since last trans: 2055000000
[ 3012.016936] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 3, SQ: 0x17f, CQ: 0x85c, SQ Cons: 0x405f SQ Prod: 0x4081, usecs since last trans: 1913890000
[ 3012.026954] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 4, SQ: 0x184, CQ: 0x861, SQ Cons: 0x39f2 SQ Prod: 0x3a11, usecs since last trans: 2020978000
[ 3012.037208] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 5, SQ: 0x189, CQ: 0x866, SQ Cons: 0x1784 SQ Prod: 0x17a6, usecs since last trans: 2185513000
[ 3012.047178] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 6, SQ: 0x18e, CQ: 0x86b, SQ Cons: 0x4c96 SQ Prod: 0x4cb3, usecs since last trans: 2124353000
[ 3012.056893] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 7, SQ: 0x193, CQ: 0x870, SQ Cons: 0x3bec SQ Prod: 0x3c0f, usecs since last trans: 1855857000
[ 3021.535888] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:e15
[ 3021.545955] BUG: unable to handle kernel paging request at ffffb57b90159000
[ 3021.550864] PGD 100145067 P4D 100145067 PUD 100146067 PMD 0 

From the Azure doc https://learn.microsoft.com/en-us/azure/virtual-machines/nvv4-series, it looks like the NVv4 series only supports Windows VMs.

 

Version-Release number of selected component (if applicable):

4.12 nightly build

How reproducible:

Always

Steps to Reproduce:

1. Prepare install-config.yaml with the region set to westus3 and the VM type set to NV8as_v4 (see the fragment after these steps)
2. Install the cluster
3.
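
An illustrative install-config.yaml fragment for step 1 (cluster name, networking and the other required fields are omitted; where the size is applied is shown for both pools here only as an example):

platform:
  azure:
    region: westus3
compute:
- name: worker
  platform:
    azure:
      type: Standard_NV8as_v4
controlPlane:
  name: master
  platform:
    azure:
      type: Standard_NV8as_v4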

Actual results:

installation failed

Expected results:

If the NVv4 series is not supported for Linux VMs, the installer should validate this and show a message that such a size is not supported.

Additional info:

 

 

 

 

 

This is a clone of issue OCPBUGS-4684. The following is the description of the original issue:

Description of problem:

In DeploymentConfig, the Form view and YAML view are not in sync

Version-Release number of selected component (if applicable):

4.11.13

How reproducible:

Always

Steps to Reproduce:

1. Create a DC with selector and labels as given below
spec:
  replicas: 1
  selector:
    app: apigateway
    deploymentconfig: qa-apigateway
    environment: qa
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      timeoutSeconds: 600
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      labels:
        app: apigateway
        deploymentconfig: qa-apigateway
        environment: qa

2. Now go to GUI --> Workloads --> DeploymentConfig --> Actions --> Edit DeploymentConfig, first go to the Form view and then switch to the YAML view; the selector and labels show as app: ubi8 while they should display app: apigateway

  selector:
    app: ubi8
    deploymentconfig: qa-apigateway
    environment: qa
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ubi8
        deploymentconfig: qa-apigateway
        environment: qa

3. Now in the YAML view just click Reload and the value is displayed as it was when it was created (app: apigateway).

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

On the MachineSets, we configured Azure tags that should be assigned to the newly created nodes. VMs and disks have those tags assigned, while NICs don't have the configured Azure tags assigned to them.

Version-Release number of selected component (if applicable):


OCP 4.11

How reproducible:


It can be reproducible

Steps to Reproduce:


1. We need to acquire Azure tags
2. Create machine set configs with Azure tags configured
3. Create VMs through the machine set

Actual results:


NICs created by the MachineSets don't have the Azure tags configured on the MachineSet.

Expected results:


NICs should automatically pick up these tags.

Additional info:


Since in Azure NICs are treated as separate resources, it may work if we assign the tags for the NICs in the main machine config file.
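
For reference, a sketch of where the tags are configured on the MachineSet today (assuming the Azure providerSpec tags map; the tag keys and values here are made up); the ask is that NICs created from such a MachineSet receive them as well:

spec:
  template:
    spec:
      providerSpec:
        value:
          tags:
            environment: qa
            cost-center: "1234"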

This is a clone of issue OCPBUGS-17823. The following is the description of the original issue:

This is a clone of issue OCPBUGS-17813. The following is the description of the original issue:

Description of problem:

GCP bootimage override is not available in 4.13, 4.12 or 4.11

Feature CORS-2445

Version-Release number of selected component (if applicable):


How reproducible: Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


This relates to the recovery of a cluster following an etcd outage.

The ingress path to kube-apiserver is:

───────────> VIP ─────────────────> Local HAProxy ────┬─> kube-apiserver-master-0
    (managed by keepalived)                           │
                                                      ├─> kube-apiserver-master-1
                                                      │
                                                      └─> kube-apiserver-master-2

Each master is running an HAProxy which load balances between the 3 kube-apiservers. Each HAProxy is running health checks against each kube-apiserver, and will add or remove it from the available pool based on its health.

We only use keepalived to ensure that HAProxy is not a single point of failure. It is the job of keepalived to ensure that incoming traffic is being directed to an HAProxy which is functioning correctly.

The current health check we are using for keepalived involves polling /readyz against the local HAProxy. While this seems intuitively correct it is in fact testing the wrong thing. It is testing whether the kube-apiserver it connects to is functioning correctly. However, this is not the purpose of keepalived. HAProxy runs health checks against kube-apiserver backends. keepalived simply selects a correctly functioning HAProxy.

This becomes important during recovery from an outage. When none of the kube-apiservers are healthy this health check will fail continuously, and the API VIP will move uselessly between masters. However the situation is much worse when only one of the kube-apiservers is up. In this case there is a high probability that it is overloaded and at least rate limiting incoming connections. This may lead us to fail the keepalived health check and fail the VIP over to the next HAProxy. This will cause all open kube-apiserver connections to reset, even the established ones. This increases the load on the kube-apiserver and increases the probability that the health check will fail again.

Ideally the keepalived health check would check only the health of HAProxy itself, not the health of the pool of kube-apiservers. In practice it will probably never be necessary to move the VIP while the master is up, regardless of the health of the cluster. A network partition affecting HAProxy would already be handled by VRRP between the masters, so it may be sufficient to check that the local HAProxy pod is healthy.
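
A rough sketch of that direction, assuming a keepalived vrrp_script that only verifies the local HAProxy process rather than polling /readyz (the script path, timings and instance name are illustrative):

vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"   # succeeds as long as the local HAProxy process is alive
    interval 2
    fall 3
    rise 2
}

vrrp_instance API {
    # existing VRRP settings unchanged
    track_script {
        chk_haproxy
    }
}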

Description of problem:

The current version of openshift/router vendors Kubernetes 1.24 packages.  OpenShift 4.12 is based on Kubernetes 1.25.  

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Check https://github.com/openshift/router/blob/release-4.12/go.mod 

Actual results:

Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.

Expected results:

Kubernetes packages are at version v0.25.0 or later.
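
The fix amounts to bumping the vendored modules in go.mod, roughly as follows (exact patch versions may differ):

require (
    k8s.io/api v0.25.0
    k8s.io/apimachinery v0.25.0
    k8s.io/client-go v0.25.0
)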

Additional info:

Using old Kubernetes API and client packages brings risk of API compatibility issues.

Description of problem:

Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.

Version-Release number of selected component (if applicable):

Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107.

Network Type: OVNKubernetes

How reproducible:

Twice out of two attempts.

Steps to Reproduce:

1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1.
   The cluster is up and running with three workers:
   $ oc get clusterversion
   NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532

2. Run the OC command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 

3. The upgrade does not succeed: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network

One node degraded to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f

$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS       AGE
openshift-openstack-infra                          coredns-ostest-9vllk-worker-0-4x4pt                          0/2     Init:0/1    0              18h
 
$ oc get events
LAST SEEN   TYPE      REASON                                        OBJECT            MESSAGE
7m15s       Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s       Warning   MachineConfigDaemonFailed                     /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[0] http://pastebin.test.redhat.com/1074531

Actual results:

OCP 4.11 --> 4.12 upgrade fails.

Expected results:

OCP 4.11 --> 4.12 upgrade success.

Additional info:

Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]

This is a clone of issue OCPBUGS-16160. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16135. The following is the description of the original issue:

Description of problem:

The control-plane-operator pod gets stuck deleting an awsendpointservice if its hostedzone is already gone:

Logs:

{"level":"error","ts":"2023-07-13T03:06:58Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","aWSEndpointService":{"name":"private-router","namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2"},"namespace":"ocm-staging-24u87gg3qromrf8mg2r2531m41m0c1ji-diegohcp-west2","name":"private-router","reconcileID":"59eea7b7-1649-4101-8686-78113f27567d","error":"failed to delete resource: NoSuchHostedZone: No hosted zone found with ID: Z05483711XJV23K8E97HK\n\tstatus code: 404, request id: f8686dd6-a906-4a5e-ba4a-3dd52ad50ec3","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"} 

Version-Release number of selected component (if applicable):

4.12.24

How reproducible:

Have not tried to reproduce yet, but should be fairly reproducible

Steps to Reproduce:

1. Install a PublicAndPrivate or Private HCP
2. Delete the Route53 Hosted Zone defined in its awsendpointservice's .status.dnsZoneID field
3. Observe the control-plane-operator looping on the above logs and the uninstall hanging

Actual results:

Uninstall hangs due to CPO being unable to delete the awsendpointservice

Expected results:

The awsendpointservice cleans up; if the hosted zone is already gone, the CPO shouldn't care that it can't list hosted zones.

Additional info:

 

Description of problem:

NPE on topology for the ns which just got deleted, see screenshot below

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Login as regular user
2. Create a ns and delete the ns
3. visit the deleted ns in topology

Actual results:

console breaks due to NPE

Expected results:

console shouldn't break

Additional info:

 

This is a clone of issue OCPBUGS-14315. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4501. The following is the description of the original issue:

Description of problem:

The IPv6 interface and IP are missing in all pods created in OCP 4.12 EC-2.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Every time

Steps to Reproduce:

We create network-attachment-definitions.k8s.cni.cncf.io in the OCP cluster at namespace scope for our software pods to get IPv6 IPs.
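
An illustrative NetworkAttachmentDefinition of that kind (the interface name, address range and IPAM plugin are placeholders, not the actual customer configuration):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ipv6-net
  namespace: example-ns
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "whereabouts",
        "range": "fd00:10:1::/64"
      }
    }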

Actual results:

Pods do not receive IPv6 addresses

Expected results:

Pods receive IPv6 addresses

Additional info:

This has been working flawlessly up to OCP 4.10.21; however, when we try the same code in OCP 4.12-ec2 we notice all our pods are missing the IPv6 address, and we have to restart pods a couple of times for them to get an IPv6 address.

Description of problem:

With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with 730 network policies affecting a pod, issuing 7 pod updates would result in over 5k transactions.

Description of problem:

We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one created to pick up the revision. There is a 5 min timer on the creation, if this timeout is exceeded the rollout will revert to the previous revision.

The customer has been seeing failed rollouts due to this 5 min timer being exceeded. There is load on the platform cpus with the biggest contributor being exec probe overhead, but there is still significant idle ~ 50%.

While not able to reproduce to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform cpu overhead.

From the logs, we see slow container creation times.

I added some instrumentation to the low_latency_hooks.sh script

snip
pid=$(jq '.pid' /dev/stdin 2>&1)
logger "Start low_latency_hooks ${pid}"
[[ $? -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
snip
if [ "${mode}" = "ro" ]; then
ip netns exec "${ns}" mount -o remount,ro /sys
[ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
fi
logger "Stop low_latency_hooks ${pid}"

Analysing the logs for the five running containers in the apiserver we see that the bulk of the time is being spent in the hook.

insecure-readyz
total container create time: 35s
hook time: 29s

cert-syncer
total container create time: 41s
hook time: 32s

cert-regeneration-controller
total container create time: 73s
hook time: 54s

kube-apiserver
total container create time: 18s
hook time: 16s

check-endpoints
total container create time: 31s
hook time: 31s

I ran another test where I removed the oci hook and kept everything else the same, the results were dramatically different.

Container create times:
insecure-readyz - 1s
cert-syncer - 1s
cert-regeneration-controller - 1s
kube-apiserver -1s
check-endpoints - 5s

I was then able to run the same test in the customer's lab. In some joint testing we did with the customer we originally saw 4-5 mins for a rollout. Without the hook in the exact same environment, the total rollout time dropped to <=2 mins.

Version-Release number of selected component (if applicable):
4.9.37
Issue likely in later releases as well, have not timed yet

How reproducible:
100%

Steps to Reproduce:
1. Force a rollout with a platform cpu load representative of the application
2.
3.

Actual results:
Slow rollout times sometimes exceeding the timeout

Expected results:
Rollout should fit into the timeout window

Additional info:

The CPO does not currently respect the CVO runlevels as standalone OCP does.

The CPO reconciles everything all at once during upgrades, which results in FeatureSet-aware components trying to start because the FeatureSet status is set for that version, leading to pod restarts.

It should roll things out in the following order for both initial install and upgrade, waiting between stages until rollout is complete:

  • etcd
  • kas
  • kcm and ks
  • everything else

Description of problem:

The TestReloadInterval E2E test has completely wrong validations in which the min value should be 1s, not 5s.

But there is a race condition which allows these tests to sometimes pass due to the last test condition.

Therefore, failures in CI are actually correct, and successes are wrong based on the E2E conditions.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

50%

Steps to Reproduce:

1.Run TestReloadInterval E2E test (make test-e2e TEST=TestReloadInterval)

Actual results:

Sometimes fails on 5us test case:

reloadinterval_test.go:106: router deployment not updated with RELOAD_INTERVAL=5s: timed out waiting for the condition

Expected results:

Should pass E2E

Additional info:

 

 

 

 

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.

Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Always

Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"

Actual results:
Only deployment will be deleted and IS, svc, route will not be deleted

Expected results:
We should either change the description of this option, or really delete the IS, svc, route, and any other objects created under this Application.

Additional info:

This is a clone of issue OCPBUGS-11004. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8349. The following is the description of the original issue:

Description of problem:

On a freshly installed cluster, the control-plane-machineset-operator begins rolling a new master node, but the machine remains in a Provisioned state and never joins as a node.

Its status is:
Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]

The cluster is left in this state until an admin manually removes the stuck master node, at which point a new master machine is provisioned and successfully joins the cluster.
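
The manual workaround currently looks roughly like this (the machine name is a placeholder):

$ oc -n openshift-machine-api get machines
$ oc -n openshift-machine-api delete machine <stuck-master-machine>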

Version-Release number of selected component (if applicable):

4.12.4

How reproducible:

Observed at least 4 times over the last week, but unsure on how to reproduce.

Actual results:

A master node remains in a stuck Provisioned state and requires manual deletion to unstick the control plane machine set process.

Expected results:

No manual interaction should be necessary.

Additional info:

 

Description of problem:

openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd, reporting certificate validation errors:

}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673      15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-18-041406

How reproducible:

100%

Steps to Reproduce:

1. Deploy SNO with single stack IPv6 via ZTP procedure

Actual results:

Deployment times out and some of the operators aren't deployed successfully.

NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-18-041406   False       False         True       124m    APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      112m    
cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
config-operator                            4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
console                                                                                                                      
control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
dns                                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
etcd                                       4.12.0-0.nightly-2022-10-18-041406   True        False         True       121m    ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry                             4.12.0-0.nightly-2022-10-18-041406   False       True          True       104m    Available: The registry is removed...
ingress                                    4.12.0-0.nightly-2022-10-18-041406   True        True          True       111m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights                                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      118s    
kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      102m    
kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True        False         True       107m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True        False         False      117m    
machine-api                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
machine-config                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
marketplace                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      116m    
monitoring                                                                      False       True          True       98m     deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      104m    
openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
openshift-samples                                                               False       True          False      103m    The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True        False         False      106m    
service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
storage                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m  

Expected results:

Deployment succeeds without issues.

Additional info:

I was unable to run must-gather, so I am attaching the pod logs copied from the host file system.

Description of problem:

OCP cluster installation (SNO) using assisted installer running on ACM hub cluster. 
Hub cluster is OCP 4.10.33
ACM is 2.5.4

When a cluster fails to install we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining. 

To resolve the hang we can remove the finalizers from both the secret referenced by BareMetalHost.spec.bmc.credentialsName and the BareMetalHost CR itself. Once these finalizers are removed, the namespace termination completes within a few seconds.
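
For reference, a minimal way to clear these finalizers with oc patch (the namespace, secret, and BareMetalHost names are placeholders):

$ oc -n <cluster-namespace> patch secret <bmc-credentials-secret> --type=merge -p '{"metadata":{"finalizers":null}}'
$ oc -n <cluster-namespace> patch baremetalhost <bmh-name> --type=merge -p '{"metadata":{"finalizers":null}}'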

Version-Release number of selected component (if applicable):

OCP 4.10.33
ACM 2.5.4

How reproducible:

Always

Steps to Reproduce:

1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
  a. Invalid rootDeviceHint in BareMetalHost CR
  b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace

Actual results:

Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE    
cnfocto1   Terminating   17h

Expected results:

Cluster namespace (and all installation CRs in it) are successfully removed.

Additional info:

The installation CRs are applied to and removed from the hub cluster using argocd. The CRs have the following waves applied to them, which affect the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2

 

Description:

I was testing the DHCP scenario where only rendezvousIP is specified in agent-config.yaml and no NMStateConfig is embedded. pre-network-manager-config.service fails on node0 when networkConfig is missing from agent-config.yaml, because /usr/local/bin/pre-network-manager-config.sh is not found on node0.

If NMStateConfig is not provided, then perhaps the service should not be included and activated in the ignition.
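
For illustration, the service is delivered as a systemd unit in the generated ignition; a trimmed sketch of what such a fragment can look like is below (the ignition version and the unit contents shown here are assumptions). The suggestion amounts to either omitting the unit entirely or including it with enabled set to false when no networkConfig / NMStateConfig is provided.

{
  "ignition": { "version": "3.2.0" },
  "systemd": {
    "units": [
      {
        "name": "pre-network-manager-config.service",
        "enabled": true
      }
    ]
  }
}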

agent-config.yaml used:

metadata:
  name: ostest
  namespace: cluster0
spec:
  rendezvousIP: 192.168.122.2

Steps to reproduce:

1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy cluster using agent.iso
3. Log into node0 and pre-network-manager-config.service will be displayed as a failed unit.

Expected:

pre-network-manager-config.service in success state

Actual:

pre-network-manager-config.service in failed state

Aug 05 08:27:18 localhost systemd[1]: Starting Prepare network manager config content...
Aug 05 08:27:18 localhost systemd[1]: pre-network-manager-config.service: Main process exited, code=exited, status=203/EXEC
Aug 05 08:27:18 localhost systemd[1]: pre-network-manager-config.service: Failed with result 'exit-code'.
Aug 05 08:27:18 localhost systemd[1]: Failed to start Prepare network manager config content.

This is a clone of issue OCPBUGS-7015. The following is the description of the original issue:

Description of problem:

Fails to create a vSphere 4.12.2 IPI cluster because apiVIP and ingressVIP are not in the machine networks.

# ./openshift-install create cluster --dir=/tmp
? SSH Public Key /root/.ssh/id_rsa.pub
? Platform vsphere
? vCenter vcenter.vmware.gsslab.pnq2.redhat.com
? Username administrator@gsslab.pnq
? Password [? for help] ************
INFO Connecting to vCenter vcenter.vmware.gsslab.pnq2.redhat.com
INFO Defaulting to only available datacenter: OpenShift-DC
INFO Defaulting to only available cluster: OCP
? Default Datastore OCP-PNQ-Datastore
? Network PNQ2-25G-PUBLIC-PG
? Virtual IP Address for API [? for help] 192.168.1.10
X Sorry, your reply was invalid: IP expected to be in one of the machine networks: 10.0.0.0/16
? Virtual IP Address for API [? for help]


As the user cannot define the CIDR for machineNetwork when creating the cluster or the install-config file interactively, the default value 10.0.0.0/16 is used, so creating the cluster or the install-config fails when the apiVIP and ingressVIP provided are outside of the default machineNetwork.

The error is thrown from https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L655-L666, which seems to be a new validation introduced in PR https://github.com/openshift/installer/pull/5798

The issue should also impact Nutanix platform.

I don't understand why the installer is expecting/validating VIPs from the 10.0.0.0/16 machine network by default when it is not even asking for the machine networks during the survey. This validation was not mandatory in previous OCP installers.


 

Version-Release number of selected component (if applicable):

# ./openshift-install version
./openshift-install 4.12.2
built from commit 7fea1c4fc00312fdf91df361b4ec1a1a12288a97
release image quay.io/openshift-release-dev/ocp-release@sha256:31c7741fc7bb73ff752ba43f5acf014b8fadd69196fc522241302de918066cb1
release architecture amd64

How reproducible:

Always

Steps to Reproduce:

1. create install-config.yaml file by running command "./openshift-install create install-config --dir ipi"
2. failed with above error

Actual results:

Fails to create the install-config.yaml file

Expected results:

Succeeds in creating the install-config.yaml file

Additional info:

The current workaround is to use dummy VIPs from the 10.0.0.0/16 machineNetwork to create the install-config first, and then modify the machineNetwork and VIPs as required, which is extra overhead and creates a negative experience.
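
To make the intent of the workaround concrete, the end state is an install-config.yaml whose machineNetwork actually contains the VIPs. A minimal sketch with placeholder addresses (the plural apiVIPs/ingressVIPs field names are an assumption based on recent install-config versions):

networking:
  machineNetwork:
  - cidr: 192.168.1.0/24
platform:
  vsphere:
    apiVIPs:
    - 192.168.1.10
    ingressVIPs:
    - 192.168.1.11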


There was already a bug reported which seems to have only fixed the VIP validation: https://issues.redhat.com/browse/OCPBUGS-881
 

This is a clone of issue OCPBUGS-95. The following is the description of the original issue:

In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

 

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy (a minimal example is shown after these steps).

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no longer present.
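
A minimal NodeNetworkConfigurationPolicy of the kind referenced in step 4 might look like the following (illustrative only; the dummy interface and the nmstate.io/v1 apiVersion are assumptions, and older operator releases may use v1beta1):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: example-policy
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: dummy0
      type: dummy
      state: up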

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact, since any change processed by the NMstate operator can disrupt application traffic. For example, in the customer environment the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.

This is a clone of issue OCPBUGS-10990. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10526. The following is the description of the original issue:

Description of problem:


Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-17-161027 

How reproducible:

Always

Steps to Reproduce:

1. Create a GCP XPN cluster with flexy job template ipi-on-gcp/versioned-installer-xpn-ci, then run 'oc describe node'

2. Check logs for cloud-network-config-controller pods

Actual results:


 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE    VERSION
huirwang-0309d-r85mj-master-0.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-1.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-2.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-b-5txgq.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
In the `oc describe node` output, there are no related egressIP annotations:
% oc describe node huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal 
Name:               huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n2-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    machine.openshift.io/interruptible-instance=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=n2-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/huirwang-0309d-r85mj-worker-a-wsrls"}
                    k8s.ovn.org/host-addresses: ["10.0.32.117"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal","mac-address":"42:01:0a:00:...
                    k8s.ovn.org/node-chassis-id: 7fb1870c-4315-4dcb-910c-0f45c71ad6d3
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 16:52:e3:8c:13:e2
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.117/32"}
                    k8s.ovn.org/node-subnets: {"default":["10.131.0.0/23"]}
                    machine.openshift.io/machine: openshift-machine-api/huirwang-0309d-r85mj-worker-a-wsrls
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true


 % oc logs cloud-network-config-controller-5cd96d477d-2kmc9  -n openshift-cloud-network-config-controller  
W0320 03:00:08.981493       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0320 03:00:08.982280       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
E0320 03:00:38.982868       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com: i/o timeout
E0320 03:01:23.863454       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: read udp 10.129.0.14:52109->172.30.0.10:53: read: connection refused
I0320 03:02:19.249359       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0320 03:02:19.250662       1 controller.go:88] Starting node controller
I0320 03:02:19.250681       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0320 03:02:19.250693       1 controller.go:88] Starting secret controller
I0320 03:02:19.250703       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0320 03:02:19.250709       1 controller.go:88] Starting cloud-private-ip-config controller
I0320 03:02:19.250715       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0320 03:02:19.258642       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258671       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258682       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal to node workqueue
I0320 03:02:19.351258       1 controller.go:96] Starting node workers
I0320 03:02:19.351303       1 controller.go:102] Started node workers
I0320 03:02:19.351298       1 controller.go:96] Starting secret workers
I0320 03:02:19.351331       1 controller.go:102] Started secret workers
I0320 03:02:19.351265       1 controller.go:96] Starting cloud-private-ip-config workers
I0320 03:02:19.351508       1 controller.go:102] Started cloud-private-ip-config workers
E0320 03:02:19.589704       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.615551       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.644628       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.774047       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.783309       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.816430       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue

Expected results:

EgressIP should work

Additional info:

It can be reproduced in  4.12 as well, not regression issue.

Description of problem:

Pod and PDB list pages just report "Not found" when no resources are found

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-15-094115

How reproducible:

Always

Steps to Reproduce:

1. normal user has a new empty project
2. normal user visit PDB list page via Workloads ->  PodDisruptionBudgets 
3.

Actual results:

2. it just reports 'Not found'

Expected results:

2. for other workloads, it will report "No <resource> found", for example
No HorizontalPodAutoscalers found
No StatefulSets found
No Deployments found

so for the Pods and PodDisruptionBudgets list pages, when no resource can be found, it's better that we also report "No pods found" and "No PodDisruptionBudgets found"

Additional info:

 

Description of problem:

This PR: https://github.com/openshift/cluster-network-operator/pull/1612/files removed the fallback logic of checking for the host's kubeconfig file when apiserver-url.env is not populated on the machine. In IBM Cloud ROKS (both public cloud and Satellite (Hypershift)) this file is not populated, which means that any upgrade to 4.12 will cause the cluster network operator to fail and will impact the cluster.

I am proposing the following plan: first, this PR is held until 4.13. Second, the IBM Cloud ROKS team will ensure from the initial release of 4.12 that this file is populated across its entire fleet of workers (4.12 and beyond). Holding this to 4.13 allows a seamless upgrade experience when the user upgrades the control plane to 4.12 while the workers are still on 4.11. Then, when the user upgrades to 4.13, their workers will all be on 4.12, which is guaranteed to have this file, and the fallback check for the host kubeconfig can then be removed.

For full disclosure, it was brought up that we could push a daemonset across our entire fleet of 16000+ ROKS clusters that simply lays down the file, but that still introduces race conditions with the network operator and significantly increases cluster workload resource usage across our entire fleet, which the plan proposed above would avoid.

Example on a ROKS on Satellite worker showing that this file does not exist (yet): 
[root@tyler-test-24 ~]# ls /etc/kubernetes/apiserver-url.env
ls: cannot access '/etc/kubernetes/apiserver-url.env': No such file or directory
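
For context, on clusters where this file is populated it typically carries the internal API server address as environment variables, along these lines (illustrative values only; the exact format is an assumption):

KUBERNETES_SERVICE_HOST='api-int.example-cluster.example.com'
KUBERNETES_SERVICE_PORT='6443'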

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-10239. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8082. The following is the description of the original issue:

Description of problem:

Currently, some of the ServiceAccounts are lost during the gathering. This task fixes that problem.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

We added server groups for the control plane and computes as part of OSASINFRA-2570, except for UPI, which only creates a server group for the control plane.

We need to update the UPI scripts to create a server group for computes, to be consistent with IPI and to have the instructions at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.

Related to OCPCLOUD-1135.

This is a clone of issue OCPBUGS-12854. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11550. The following is the description of the original issue:

Description of problem:

`cluster-reader` ClusterRole should have ["get", "list", "watch"] permissions for a number of privileged CRs, but lacks them for the API Group "k8s.ovn.org", which includes CRs such as EgressFirewalls, EgressIPs, etc.
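
One possible way to address this (a sketch; the aggregate-to-cluster-reader label and the resource list are assumptions) would be an aggregated ClusterRole such as:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ovn-cluster-reader
  labels:
    rbac.authorization.k8s.io/aggregate-to-cluster-reader: "true"
rules:
- apiGroups:
  - k8s.ovn.org
  resources:
  - egressfirewalls
  - egressips
  verbs:
  - get
  - list
  - watch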

Version-Release number of selected component (if applicable):

OCP 4.10 - 4.12 OVN

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster with OVN components, e.g. EgressFirewall
2. Check permissions of ClusterRole `cluster-reader`

Actual results:

No permissions for OVN resources 

Expected results:

Get, list, and watch verb permissions for OVN resources

Additional info:

Looks like a similar bug was opened for "network-attachment-definitions" in OCPBUGS-6959 (whose closure is being contested).

Description of problem:

The icon color of Alerts in the Topology list view should be based on alert type.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a deployment
2. Create a resource quota so that the quota alert will be visible on the topology list page (a minimal quota example is shown after these steps)
3. Navigate to the topology list page
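
A minimal quota of the kind referenced in step 2 could look like the following (the name and limits are placeholders; choose limits that the deployment from step 1 will actually hit so the alert shows up):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota
spec:
  hard:
    pods: "1"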

Actual results:

Alert icon color is black and white. See the screenshots

Expected results:

Alert icon color should be based on the alert type.

Additional info:

 

This is a clone of issue OCPBUGS-1565. The following is the description of the original issue:

Description of problem:

We've observed a split brain case for keepalived unicast, where two worker nodes were fighting for the ingress VIP. 
One of these nodes failed to register itself with the cluster, so it was missing from the output of the node list. That, in turn, caused it to be missing from the unicast_peer list in keepalived. This one node believed it was the master (since it received no VRRP advertisements from the other nodes), while the other nodes were constantly re-electing a master.

This behavior was observed in a QE-deployed cluster on PSI. It caused constant VIP flapping and a huge load on OVN.
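
For context, keepalived in unicast mode only exchanges VRRP advertisements with the peers listed explicitly, so a node missing from that list never hears from the real master. A heavily simplified sketch of the relevant configuration (interface, addresses, and router id are placeholders):

vrrp_instance VI_INGRESS {
    state BACKUP
    interface br-ex
    virtual_router_id 51
    unicast_src_ip 192.0.2.10
    unicast_peer {
        192.0.2.11
        192.0.2.12
    }
    virtual_ipaddress {
        192.0.2.100
    }
}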

Version-Release number of selected component (if applicable):


How reproducible:

Not sure. We don't know why the worker node failed to register with the cluster (the cluster is gone now) or what the QE were testing at the time. 

Steps to Reproduce:

1.
2.
3.

Actual results:

The cluster was unhealthy due to the constant Ingress VIP failover. It was also putting a huge load on PSI cloud.

Expected results:

The flapping VIP can be very expensive for the underlying infrastructure. In no way should we allow OCP to bring the underlying infra down.

The node should not be able to claim the VIP when using keepalived in unicast mode unless they have correctly registered with the cluster and they appear in the node list.

Additional info:


In the Known Issues section of the OpenStack-specific Installer docs, there is a point about control plane anti-affinity.

The known issue has several problems:

  • it is in the UPI section, when it is not a UPI-specific issue
  • it mentions Control plane scale-out, when OCP only supports exactly 3 masters
  • it is now possible to set anti-affinity from the install-config.yaml, and that should be the recommended solution when VM distribution across hosts is required.

This is a clone of issue OCPBUGS-1604. The following is the description of the original issue:

Description of problem:

When viewing a resource that exists for multiple clusters, the data may be from the wrong cluster for a short time after switching clusters using the multicluster switcher.

Version-Release number of selected component (if applicable):

4.10.6

How reproducible:

Always

Steps to Reproduce:

1. Install RHACM 2.5 on OCP 4.10 and enable the FeatureGate to get multicluster switching
2. From the local-cluster perspective, view a resource that would exist on all clusters, like /k8s/cluster/config.openshift.io~v1~Infrastructure/cluster/yaml
3. Switch to a different cluster in the cluster switcher 

Actual results:

Content for the resource may start out correct, but then switches back to the local-cluster version before switching to the correct cluster several moments later.

Expected results:

Content should always be shown from the selected cluster.

Additional info:

Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075657

Description of problem:

The Machine does not go into the Failed phase when an invalid vmSize is provided; it is stuck in Provisioning, and the error message is not accurate.

The case works well in 4.11 and previous versions; it is a regression in 4.12 that seems to have been introduced here:
https://github.com/openshift/machine-api-provider-azure/pull/32/files#diff-af805e1e45f03df0b5b56ff4413e5ad52cd31904a94d37e8e916751953e4687dR565

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-28-204419

How reproducible:

always

Steps to Reproduce:

1. Create a machineset with invalid vmSize

vmSize: invalid

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml               
machineset.machine.openshift.io/huliu-azure02pr-jmvl2-1 created

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                 PHASE          TYPE              REGION           ZONE   AGE
huliu-azure02pr-jmvl2-1-6gbdw                        Provisioning                                             4m58s
huliu-azure02pr-jmvl2-master-0                       Running        Standard_D8s_v3   southcentralus   1      5h11m
huliu-azure02pr-jmvl2-master-1                       Running        Standard_D8s_v3   southcentralus   2      5h11m
huliu-azure02pr-jmvl2-master-2                       Running        Standard_D8s_v3   southcentralus   3      5h11m
huliu-azure02pr-jmvl2-worker-southcentralus1-9hgmk   Running        Standard_D4s_v3   southcentralus   1      4h56m
huliu-azure02pr-jmvl2-worker-southcentralus2-44mf6   Running        Standard_D4s_v3   southcentralus   2      4h56m
huliu-azure02pr-jmvl2-worker-southcentralus3-4m9b7   Running        Standard_D4s_v3   southcentralus   3      4h56m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-azure02pr-jmvl2-1-6gbdw  -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  creationTimestamp: "2022-09-29T06:36:03Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: huliu-azure02pr-jmvl2-1-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-azure02pr-jmvl2
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: huliu-azure02pr-jmvl2-1
  name: huliu-azure02pr-jmvl2-1-6gbdw
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: huliu-azure02pr-jmvl2-1
    uid: f729cb01-274a-4c6e-8f69-808cff412fe3
  resourceVersion: "174604"
  uid: 2c4b9dd4-5666-47cd-8fc5-38bac0b9cad1
spec:
  lifecycleHooks: {}
  metadata: {}
  providerSpec:
    value:
      acceleratedNetworking: true
      apiVersion: machine.openshift.io/v1beta1
      credentialsSecret:
        name: azure-cloud-credentials
        namespace: openshift-machine-api
      diagnostics: {}
      image:
        offer: ""
        publisher: ""
        resourceID: /resourceGroups/huliu-azure02pr-jmvl2-rg/providers/Microsoft.Compute/images/huliu-azure02pr-jmvl2-gen2
        sku: ""
        version: ""
      kind: AzureMachineProviderSpec
      location: southcentralus
      managedIdentity: huliu-azure02pr-jmvl2-identity
      metadata:
        creationTimestamp: null
        name: huliu-azure02pr-jmvl2
      networkResourceGroup: huliu-azure02pr-jmvl2-rg
      osDisk:
        diskSettings: {}
        diskSizeGB: 128
        managedDisk:
          storageAccountType: Premium_LRS
        osType: Linux
      publicIP: false
      publicLoadBalancer: huliu-azure02pr-jmvl2
      resourceGroup: huliu-azure02pr-jmvl2-rg
      subnet: huliu-azure02pr-jmvl2-worker-subnet
      userDataSecret:
        name: worker-user-data
      vmSize: invalid
      vnet: huliu-azure02pr-jmvl2-vnet
      zone: "1"
status:
  conditions:
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    status: "True"
    type: Terminable
  lastUpdated: "2022-09-29T06:36:03Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastTransitionTime: "2022-09-29T06:36:03Z"
      message: 'failed to create nic huliu-azure02pr-jmvl2-1-6gbdw-nic for machine
        huliu-azure02pr-jmvl2-1-6gbdw: failed to find sku invalid'
      reason: MachineCreationFailed
      status: "True"
      type: MachineCreated
    metadata: {}

machine-controller log:
...
W0929 11:38:25.817887       1 controller.go:382] huliu-azure02pr-jmvl2-invalid-lzzb2: failed to create machine: requeue in: 20s
I0929 11:38:25.817905       1 controller.go:412] Actuator returned requeue-after error: requeue in: 20s
I0929 11:38:25.817984       1 logr.go:252] events "msg"="Warning"  "message"="CreateError: failed to reconcile machine \"huliu-azure02pr-jmvl2-invalid-lzzb2\"s: failed to create nic huliu-azure02pr-jmvl2-invalid-lzzb2-nic for machine huliu-azure02pr-jmvl2-invalid-lzzb2: failed to find sku invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-azure02pr-jmvl2-invalid-lzzb2","uid":"bab43f44-7da9-4b62-bbdc-01a180cc1de7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"316506"} "reason"="FailedCreate"
I0929 11:38:25.817989       1 controller.go:187] huliu-azure02pr-jmvl2-invalid-lzzb2: reconciling Machine
I0929 11:38:25.818015       1 actuator.go:213] huliu-azure02pr-jmvl2-invalid-lzzb2: actuator checking if machine exists
W0929 11:38:25.916417       1 virtualmachines.go:99] vm huliu-azure02pr-jmvl2-invalid-lzzb2 not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/huliu-azure02pr-jmvl2-invalid-lzzb2' under resource group 'huliu-azure02pr-jmvl2-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
I0929 11:38:25.916463       1 controller.go:380] huliu-azure02pr-jmvl2-invalid-lzzb2: reconciling machine triggers idempotent create
I0929 11:38:25.916476       1 actuator.go:85] Creating machine huliu-azure02pr-jmvl2-invalid-lzzb2
I0929 11:38:25.917540       1 machine_scope.go:176] huliu-azure02pr-jmvl2-invalid-lzzb2: status unchanged
I0929 11:38:25.917596       1 machine_scope.go:192] huliu-azure02pr-jmvl2-invalid-lzzb2: patching machine
E0929 11:38:25.941083       1 actuator.go:79] Machine error: failed to reconcile machine "huliu-azure02pr-jmvl2-invalid-lzzb2"s: failed to create nic huliu-azure02pr-jmvl2-invalid-lzzb2-nic for machine huliu-azure02pr-jmvl2-invalid-lzzb2: failed to find sku invalid

Actual results:

The Machine is stuck in Provisioning, and the error message is not accurate.

Expected results:

The Machine goes into the Failed phase with an InvalidConfiguration error, as in previous versions.

Additional info:

test result on previous version:

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE              REGION   ZONE   AGE
jfan49-jn66b-master-0              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-master-1              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-master-2              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-worker-1-tdpdt        Failed                                      61s
jfan49-jn66b-worker-westus-2fz6b   Running   Standard_D4s_v3   westus          6h21m
jfan49-jn66b-worker-westus-6fkgb   Running   Standard_D4s_v3   westus          6h21m
jfan49-jn66b-worker-westus-k74gf   Running   Standard_D4s_v3   westus          6h21m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine jfan49-jn66b-worker-1-tdpdt  -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: Unknown
  creationTimestamp: "2022-09-29T08:50:13Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: jfan49-jn66b-worker-1-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: jfan49-jn66b
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: jfan49-jn66b-worker-1
  name: jfan49-jn66b-worker-1-tdpdt
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: jfan49-jn66b-worker-1
    uid: 4319d2e2-3ee2-4cb2-a7b4-5a0d4e1ea3d7
  resourceVersion: "128119"
  uid: 7d9e4bbe-7c37-416e-a133-577476937b7a
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: azureproviderconfig.openshift.io/v1beta1
      credentialsSecret:
        name: azure-cloud-credentials
        namespace: openshift-machine-api
      image:
        offer: ""
        publisher: ""
        resourceID: /resourceGroups/jfan49-jn66b-rg/providers/Microsoft.Compute/images/jfan49-jn66b
        sku: ""
        version: ""
      kind: AzureMachineProviderSpec
      location: westus
      managedIdentity: jfan49-jn66b-identity
      metadata:
        creationTimestamp: null
        name: jfan49-jn66b
      networkResourceGroup: jfan49-jn66b-rg
      osDisk:
        diskSizeGB: 128
        managedDisk:
          storageAccountType: Premium_LRS
        osType: Linux
      publicIP: false
      publicLoadBalancer: jfan49-jn66b
      resourceGroup: jfan49-jn66b-rg
      subnet: jfan49-jn66b-worker-subnet
      userDataSecret:
        name: worker-user-data
      vmSize: invalid
      vnet: jfan49-jn66b-vnet
      zone: ""
status:
  conditions:
  - lastTransitionTime: "2022-09-29T08:50:13Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: 'failed to reconcile machine "jfan49-jn66b-worker-1-tdpdt": failed
    to create vm jfan49-jn66b-worker-1-tdpdt: failure sending request for machine
    jfan49-jn66b-worker-1-tdpdt: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate:
    Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter"
    Message="The value invalid provided for the VM size is not valid. The valid sizes
    in the current region are: Standard_B1ls,Standard_B1ms,Standard_B1s,Standard_B2ms,Standard_B2s,Standard_B4ms,Standard_B8ms,Standard_B12ms,Standard_B16ms,Standard_B20ms,Standard_E2_v4,Standard_E4_v4,Standard_E8_v4,Standard_E16_v4,Standard_E20_v4,Standard_E32_v4,Standard_E2d_v4,Standard_E4d_v4,Standard_E8d_v4,Standard_E16d_v4,Standard_E20d_v4,Standard_E32d_v4,Standard_E2s_v4,Standard_E4-2s_v4,Standard_E4s_v4,Standard_E8-2s_v4,Standard_E8-4s_v4,Standard_E8s_v4,Standard_E16-4s_v4,Standard_E16-8s_v4,Standard_E16s_v4,Standard_E20s_v4,Standard_E32-8s_v4,Standard_E32-16s_v4,Standard_E32s_v4,Standard_E2ds_v4,Standard_E4-2ds_v4,Standard_E4ds_v4,Standard_E8-2ds_v4,Standard_E8-4ds_v4,Standard_E8ds_v4,Standard_E16-4ds_v4,Standard_E16-8ds_v4,Standard_E16ds_v4,Standard_E20ds_v4,Standard_E32-8ds_v4,Standard_E32-16ds_v4,Standard_E32ds_v4,Standard_D2d_v4,Standard_D4d_v4,Standard_D8d_v4,Standard_D16d_v4,Standard_D32d_v4,Standard_D48d_v4,Standard_D64d_v4,Standard_D2_v4,Standard_D4_v4,Standard_D8_v4,Standard_D16_v4,Standard_D32_v4,Standard_D48_v4,Standard_D64_v4,Standard_D2ds_v4,Standard_D4ds_v4,Standard_D8ds_v4,Standard_D16ds_v4,Standard_D32ds_v4,Standard_D48ds_v4,Standard_D64ds_v4,Standard_D2s_v4,Standard_D4s_v4,Standard_D8s_v4,Standard_D16s_v4,Standard_D32s_v4,Standard_D48s_v4,Standard_D64s_v4,Standard_D1_v2,Standard_D2_v2,Standard_D3_v2,Standard_D4_v2,Standard_D5_v2,Standard_D11_v2,Standard_D12_v2,Standard_D13_v2,Standard_D14_v2,Standard_D15_v2,Standard_D2_v2_Promo,Standard_D3_v2_Promo,Standard_D4_v2_Promo,Standard_D5_v2_Promo,Standard_D11_v2_Promo,Standard_D12_v2_Promo,Standard_D13_v2_Promo,Standard_D14_v2_Promo,Standard_F1,Standard_F2,Standard_F4,Standard_F8,Standard_F16,Standard_DS1_v2,Standard_DS2_v2,Standard_DS3_v2,Standard_DS4_v2,Standard_DS5_v2,Standard_DS11-1_v2,Standard_DS11_v2,Standard_DS12-1_v2,Standard_DS12-2_v2,Standard_DS12_v2,Standard_DS13-2_v2,Standard_DS13-4_v2,Standard_DS13_v2,Standard_DS14-4_v2,Standard_DS14-8_v2,Standard_DS14_v2,Standard_DS15_v2,Standard_DS2_v2_Promo,Standard_DS3_v2_Promo,Standard_DS4_v2_Promo,Standard_DS5_v2_Promo,Standard_DS11_v2_Promo,Standard_DS12_v2_Promo,Standard_DS13_v2_Promo,Standard_DS14_v2_Promo,Standard_F1s,Standard_F2s,Standard_F4s,Standard_F8s,Standard_F16s,Standard_A1_v2,Standard_A2m_v2,Standard_A2_v2,Standard_A4m_v2,Standard_A4_v2,Standard_A8m_v2,Standard_A8_v2,Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3,Standard_D48_v3,Standard_D64_v3,Standard_D2s_v3,Standard_D4s_v3,Standard_D8s_v3,Standard_D16s_v3,Standard_D32s_v3,Standard_D48s_v3,Standard_D64s_v3,Standard_E2_v3,Standard_E4_v3,Standard_E8_v3,Standard_E16_v3,Standard_E20_v3,Standard_E32_v3,Standard_E2s_v3,Standard_E4-2s_v3,Standard_E4s_v3,Standard_E8-2s_v3,Standard_E8-4s_v3,Standard_E8s_v3,Standard_E16-4s_v3,Standard_E16-8s_v3,Standard_E16s_v3,Standard_E20s_v3,Standard_E32-8s_v3,Standard_E32-16s_v3,Standard_E32s_v3,Standard_F2s_v2,Standard_F4s_v2,Standard_F8s_v2,Standard_F16s_v2,Standard_F32s_v2,Standard_F48s_v2,Standard_F64s_v2,Standard_F72s_v2,Standard_E48_v4,Standard_E64_v4,Standard_E48d_v4,Standard_E64d_v4,Standard_E48s_v4,Standard_E64-16s_v4,Standard_E64-32s_v4,Standard_E64s_v4,Standard_E80is_v4,Standard_E48ds_v4,Standard_E64-16ds_v4,Standard_E64-32ds_v4,Standard_E64ds_v4,Standard_E80ids_v4,Standard_E48_v3,Standard_E64_v3,Standard_E48s_v3,Standard_E64-16s_v3,Standard_E64-32s_v3,Standard_E64s_v3,Standard_A0,Standard_A1,Standard_A2,Standard_A3,Standard_A5,Standard_A4,Standard_A6,Standard_A7,Basic_A0,Basic_A1,Basic_A2,Basic_A3,Basic_A4,Standard_NC4as_T4_v3,Standard_N
C8as_T4_v3,Standard_NC16as_T4_v3,Standard_NC64as_T4_v3,Standard_M64,Standard_M64m,Standard_M128,Standard_M128m,Standard_M8-2ms,Standard_M8-4ms,Standard_M8ms,Standard_M16-4ms,Standard_M16-8ms,Standard_M16ms,Standard_M32-8ms,Standard_M32-16ms,Standard_M32ls,Standard_M32ms,Standard_M32ts,Standard_M64-16ms,Standard_M64-32ms,Standard_M64ls,Standard_M64ms,Standard_M64s,Standard_M128-32ms,Standard_M128-64ms,Standard_M128ms,Standard_M128s,Standard_M32ms_v2,Standard_M64ms_v2,Standard_M64s_v2,Standard_M128ms_v2,Standard_M128s_v2,Standard_M192ims_v2,Standard_M192is_v2,Standard_M32dms_v2,Standard_M64dms_v2,Standard_M64ds_v2,Standard_M128dms_v2,Standard_M128ds_v2,Standard_M192idms_v2,Standard_M192ids_v2,Standard_E64i_v3,Standard_E64is_v3,Standard_D1,Standard_D2,Standard_D3,Standard_D4,Standard_D11,Standard_D12,Standard_D13,Standard_D14,Standard_DS1,Standard_DS2,Standard_DS3,Standard_DS4,Standard_DS11,Standard_DS12,Standard_DS13,Standard_DS14,Standard_DC8_v2,Standard_DC1s_v2,Standard_DC2s_v2,Standard_DC4s_v2,Standard_L8s_v2,Standard_L16s_v2,Standard_L32s_v2,Standard_L48s_v2,Standard_L64s_v2,Standard_L80s_v2,Standard_NV4as_v4,Standard_NV8as_v4,Standard_NV16as_v4,Standard_NV32as_v4,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5,Standard_GS1,Standard_GS2,Standard_GS3,Standard_GS4,Standard_GS4-4,Standard_GS4-8,Standard_GS5,Standard_GS5-8,Standard_GS5-16,Standard_L4s,Standard_L8s,Standard_L16s,Standard_L32s,Standard_DC2as_v5,Standard_DC4as_v5,Standard_DC8as_v5,Standard_DC16as_v5,Standard_DC32as_v5,Standard_DC48as_v5,Standard_DC64as_v5,Standard_DC96as_v5,Standard_DC2ads_v5,Standard_DC4ads_v5,Standard_DC8ads_v5,Standard_DC16ads_v5,Standard_DC32ads_v5,Standard_DC48ads_v5,Standard_DC64ads_v5,Standard_DC96ads_v5,Standard_EC2as_v5,Standard_EC4as_v5,Standard_EC8as_v5,Standard_EC16as_v5,Standard_EC20as_v5,Standard_EC32as_v5,Standard_EC48as_v5,Standard_EC64as_v5,Standard_EC96as_v5,Standard_EC96ias_v5,Standard_EC2ads_v5,Standard_EC4ads_v5,Standard_EC8ads_v5,Standard_EC16ads_v5,Standard_EC20ads_v5,Standard_EC32ads_v5,Standard_EC48ads_v5,Standard_EC64ads_v5,Standard_EC96ads_v5,Standard_EC96iads_v5,Standard_D2ds_v5,Standard_D4ds_v5,Standard_D8ds_v5,Standard_D16ds_v5,Standard_D32ds_v5,Standard_D48ds_v5,Standard_D64ds_v5,Standard_D96ds_v5,Standard_D2d_v5,Standard_D4d_v5,Standard_D8d_v5,Standard_D16d_v5,Standard_D32d_v5,Standard_D48d_v5,Standard_D64d_v5,Standard_D96d_v5,Standard_D2s_v5,Standard_D4s_v5,Standard_D8s_v5,Standard_D16s_v5,Standard_D32s_v5,Standard_D48s_v5,Standard_D64s_v5,Standard_D96s_v5,Standard_D2_v5,Standard_D4_v5,Standard_D8_v5,Standard_D16_v5,Standard_D32_v5,Standard_D48_v5,Standard_D64_v5,Standard_D96_v5,Standard_E2ds_v5,Standard_E4-2ds_v5,Standard_E4ds_v5,Standard_E8-2ds_v5,Standard_E8-4ds_v5,Standard_E8ds_v5,Standard_E16-4ds_v5,Standard_E16-8ds_v5,Standard_E16ds_v5,Standard_E20ds_v5,Standard_E32-8ds_v5,Standard_E32-16ds_v5,Standard_E32ds_v5,Standard_E48ds_v5,Standard_E64-16ds_v5,Standard_E64-32ds_v5,Standard_E64ds_v5,Standard_E96-24ds_v5,Standard_E96-48ds_v5,Standard_E96ds_v5,Standard_E104ids_v5,Standard_E2d_v5,Standard_E4d_v5,Standard_E8d_v5,Standard_E16d_v5,Standard_E20d_v5,Standard_E32d_v5,Standard_E48d_v5,Standard_E64d_v5,Standard_E96d_v5,Standard_E104id_v5,Standard_E2s_v5,Standard_E4-2s_v5,Standard_E4s_v5,Standard_E8-2s_v5,Standard_E8-4s_v5,Standard_E8s_v5,Standard_E16-4s_v5,Standard_E16-8s_v5,Standard_E16s_v5,Standard_E20s_v5,Standard_E32-8s_v5,Standard_E32-16s_v5,Standard_E32s_v5,Standard_E48s_v5,Standard_E64-16s_v5,Standard_E64-32s_v5,Standard_E64s_v5,Standard_E96-24s_v5,Standard_E96
-48s_v5,Standard_E96s_v5,Standard_E104is_v5,Standard_E2_v5,Standard_E4_v5,Standard_E8_v5,Standard_E16_v5,Standard_E20_v5,Standard_E32_v5,Standard_E48_v5,Standard_E64_v5,Standard_E96_v5,Standard_E104i_v5,Standard_E2bs_v5,Standard_E4bs_v5,Standard_E8bs_v5,Standard_E16bs_v5,Standard_E32bs_v5,Standard_E48bs_v5,Standard_E64bs_v5,Standard_E2bds_v5,Standard_E4bds_v5,Standard_E8bds_v5,Standard_E16bds_v5,Standard_E32bds_v5,Standard_E48bds_v5,Standard_E64bds_v5,Standard_D2a_v4,Standard_D4a_v4,Standard_D8a_v4,Standard_D16a_v4,Standard_D32a_v4,Standard_D48a_v4,Standard_D64a_v4,Standard_D96a_v4,Standard_D2as_v4,Standard_D4as_v4,Standard_D8as_v4,Standard_D16as_v4,Standard_D32as_v4,Standard_D48as_v4,Standard_D64as_v4,Standard_D96as_v4,Standard_E2a_v4,Standard_E4a_v4,Standard_E8a_v4,Standard_E16a_v4,Standard_E20a_v4,Standard_E32a_v4,Standard_E48a_v4,Standard_E64a_v4,Standard_E96a_v4,Standard_E2as_v4,Standard_E4-2as_v4,Standard_E4as_v4,Standard_E8-2as_v4,Standard_E8-4as_v4,Standard_E8as_v4,Standard_E16-4as_v4,Standard_E16-8as_v4,Standard_E16as_v4,Standard_E20as_v4,Standard_E32-8as_v4,Standard_E32-16as_v4,Standard_E32as_v4,Standard_E48as_v4,Standard_E64-16as_v4,Standard_E64-32as_v4,Standard_E64as_v4,Standard_E96-24as_v4,Standard_E96-48as_v4,Standard_E96as_v4,Standard_E96ias_v4,Standard_NC6s_v3,Standard_NC12s_v3,Standard_NC24rs_v3,Standard_NC24s_v3,Standard_NV6s_v2,Standard_NV12s_v2,Standard_NV24s_v2,Standard_NV12s_v3,Standard_NV24s_v3,Standard_NV48s_v3,Standard_H8,Standard_H8_Promo,Standard_H16,Standard_H16_Promo,Standard_H8m,Standard_H8m_Promo,Standard_H16m,Standard_H16m_Promo,Standard_H16r,Standard_H16r_Promo,Standard_H16mr,Standard_H16mr_Promo,Standard_M208ms_v2,Standard_M208s_v2,Standard_M416-208s_v2,Standard_M416s_v2,Standard_M416-208ms_v2,Standard_M416ms_v2,Standard_DC1s_v3,Standard_DC2s_v3,Standard_DC4s_v3,Standard_DC8s_v3,Standard_DC16s_v3,Standard_DC24s_v3,Standard_DC32s_v3,Standard_DC48s_v3,Standard_DC1ds_v3,Standard_DC2ds_v3,Standard_DC4ds_v3,Standard_DC8ds_v3,Standard_DC16ds_v3,Standard_DC24ds_v3,Standard_DC32ds_v3,Standard_DC48ds_v3,Standard_NC24ads_A100_v4,Standard_NC48ads_A100_v4,Standard_NC96ads_A100_v4,Standard_D2as_v5,Standard_D4as_v5,Standard_D8as_v5,Standard_D16as_v5,Standard_D32as_v5,Standard_D48as_v5,Standard_D64as_v5,Standard_D96as_v5,Standard_E2as_v5,Standard_E4-2as_v5,Standard_E4as_v5,Standard_E8-2as_v5,Standard_E8-4as_v5,Standard_E8as_v5,Standard_E16-4as_v5,Standard_E16-8as_v5,Standard_E16as_v5,Standard_E20as_v5,Standard_E32-8as_v5,Standard_E32-16as_v5,Standard_E32as_v5,Standard_E48as_v5,Standard_E64-16as_v5,Standard_E64-32as_v5,Standard_E64as_v5,Standard_E96-24as_v5,Standard_E96-48as_v5,Standard_E96as_v5,Standard_E112ias_v5,Standard_D2ads_v5,Standard_D4ads_v5,Standard_D8ads_v5,Standard_D16ads_v5,Standard_D32ads_v5,Standard_D48ads_v5,Standard_D64ads_v5,Standard_D96ads_v5,Standard_E2ads_v5,Standard_E4-2ads_v5,Standard_E4ads_v5,Standard_E8-2ads_v5,Standard_E8-4ads_v5,Standard_E8ads_v5,Standard_E16-4ads_v5,Standard_E16-8ads_v5,Standard_E16ads_v5,Standard_E20ads_v5,Standard_E32-8ads_v5,Standard_E32-16ads_v5,Standard_E32ads_v5,Standard_E48ads_v5,Standard_E64-16ads_v5,Standard_E64-32ads_v5,Standard_E64ads_v5,Standard_E96-24ads_v5,Standard_E96-48ads_v5,Standard_E96ads_v5,Standard_E112iads_v5,Standard_L8s_v3,Standard_L16s_v3,Standard_L32s_v3,Standard_L48s_v3,Standard_L64s_v3,Standard_L80s_v3.
    Find out more on the valid VM sizes in each region at https://aka.ms/azure-regionservices."
    Target="vmSize"'
  errorReason: InvalidConfiguration
  lastUpdated: "2022-09-29T08:50:19Z"
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2022-09-29T08:50:19Z"
      lastTransitionTime: "2022-09-29T08:50:19Z"
      message: 'failed to create vm jfan49-jn66b-worker-1-tdpdt: failure sending request
        for machine jfan49-jn66b-worker-1-tdpdt: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate:
        Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter"
        Message="The value invalid provided for the VM size is not valid. The valid
        sizes in the current region are: Standard_B1ls,Standard_B1ms,Standard_B1s,Standard_B2ms,Standard_B2s,Standard_B4ms,Standard_B8ms,Standard_B12ms,Standard_B16ms,Standard_B20ms,Standard_E2_v4,Standard_E4_v4,Standard_E8_v4,Standard_E16_v4,Standard_E20_v4,Standard_E32_v4,Standard_E2d_v4,Standard_E4d_v4,Standard_E8d_v4,Standard_E16d_v4,Standard_E20d_v4,Standard_E32d_v4,Standard_E2s_v4,Standard_E4-2s_v4,Standard_E4s_v4,Standard_E8-2s_v4,Standard_E8-4s_v4,Standard_E8s_v4,Standard_E16-4s_v4,Standard_E16-8s_v4,Standard_E16s_v4,Standard_E20s_v4,Standard_E32-8s_v4,Standard_E32-16s_v4,Standard_E32s_v4,Standard_E2ds_v4,Standard_E4-2ds_v4,Standard_E4ds_v4,Standard_E8-2ds_v4,Standard_E8-4ds_v4,Standard_E8ds_v4,Standard_E16-4ds_v4,Standard_E16-8ds_v4,Standard_E16ds_v4,Standard_E20ds_v4,Standard_E32-8ds_v4,Standard_E32-16ds_v4,Standard_E32ds_v4,Standard_D2d_v4,Standard_D4d_v4,Standard_D8d_v4,Standard_D16d_v4,Standard_D32d_v4,Standard_D48d_v4,Standard_D64d_v4,Standard_D2_v4,Standard_D4_v4,Standard_D8_v4,Standard_D16_v4,Standard_D32_v4,Standard_D48_v4,Standard_D64_v4,Standard_D2ds_v4,Standard_D4ds_v4,Standard_D8ds_v4,Standard_D16ds_v4,Standard_D32ds_v4,Standard_D48ds_v4,Standard_D64ds_v4,Standard_D2s_v4,Standard_D4s_v4,Standard_D8s_v4,Standard_D16s_v4,Standard_D32s_v4,Standard_D48s_v4,Standard_D64s_v4,Standard_D1_v2,Standard_D2_v2,Standard_D3_v2,Standard_D4_v2,Standard_D5_v2,Standard_D11_v2,Standard_D12_v2,Standard_D13_v2,Standard_D14_v2,Standard_D15_v2,Standard_D2_v2_Promo,Standard_D3_v2_Promo,Standard_D4_v2_Promo,Standard_D5_v2_Promo,Standard_D11_v2_Promo,Standard_D12_v2_Promo,Standard_D13_v2_Promo,Standard_D14_v2_Promo,Standard_F1,Standard_F2,Standard_F4,Standard_F8,Standard_F16,Standard_DS1_v2,Standard_DS2_v2,Standard_DS3_v2,Standard_DS4_v2,Standard_DS5_v2,Standard_DS11-1_v2,Standard_DS11_v2,Standard_DS12-1_v2,Standard_DS12-2_v2,Standard_DS12_v2,Standard_DS13-2_v2,Standard_DS13-4_v2,Standard_DS13_v2,Standard_DS14-4_v2,Standard_DS14-8_v2,Standard_DS14_v2,Standard_DS15_v2,Standard_DS2_v2_Promo,Standard_DS3_v2_Promo,Standard_DS4_v2_Promo,Standard_DS5_v2_Promo,Standard_DS11_v2_Promo,Standard_DS12_v2_Promo,Standard_DS13_v2_Promo,Standard_DS14_v2_Promo,Standard_F1s,Standard_F2s,Standard_F4s,Standard_F8s,Standard_F16s,Standard_A1_v2,Standard_A2m_v2,Standard_A2_v2,Standard_A4m_v2,Standard_A4_v2,Standard_A8m_v2,Standard_A8_v2,Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3,Standard_D48_v3,Standard_D64_v3,Standard_D2s_v3,Standard_D4s_v3,Standard_D8s_v3,Standard_D16s_v3,Standard_D32s_v3,Standard_D48s_v3,Standard_D64s_v3,Standard_E2_v3,Standard_E4_v3,Standard_E8_v3,Standard_E16_v3,Standard_E20_v3,Standard_E32_v3,Standard_E2s_v3,Standard_E4-2s_v3,Standard_E4s_v3,Standard_E8-2s_v3,Standard_E8-4s_v3,Standard_E8s_v3,Standard_E16-4s_v3,Standard_E16-8s_v3,Standard_E16s_v3,Standard_E20s_v3,Standard_E32-8s_v3,Standard_E32-16s_v3,Standard_E32s_v3,Standard_F2s_v2,Standard_F4s_v2,Standard_F8s_v2,Standard_F16s_v2,Standard_F32s_v2,Standard_F48s_v2,Standard_F64s_v2,Standard_F72s_v2,Standard_E48_v4,Standard_E64_v4,Standard_E48d_v4,Standard_E64d_v4,Standard_E48s_v4,Standard_E64-16s_v4,Standard_E64-32s_v4,Standard_E64s_v4,Standard_E80is_v4,Standard_E48ds_v4,Standard_E64-16ds_v4,Standard_E64-32ds_v4,Standard_E64ds_v4,Standard_E80ids_v4,Standard_E48_v3,Standard_E64_v3,Standard_E48s_v3,Standard_E64-16s_v3,Standard_E64-32s_v3,Standard_E64s_v3,Standard_A0,Standard_A1,Standard_A2,Standard_A3,Standard_A5,Standard_A4,Standard_A6,Standard_A7,Basic_A0,Basic_A1,Basic_A2,Basic_A3,Basic_A4,Standard_NC4as_T4_v3,
Standard_NC8as_T4_v3,Standard_NC16as_T4_v3,Standard_NC64as_T4_v3,Standard_M64,Standard_M64m,Standard_M128,Standard_M128m,Standard_M8-2ms,Standard_M8-4ms,Standard_M8ms,Standard_M16-4ms,Standard_M16-8ms,Standard_M16ms,Standard_M32-8ms,Standard_M32-16ms,Standard_M32ls,Standard_M32ms,Standard_M32ts,Standard_M64-16ms,Standard_M64-32ms,Standard_M64ls,Standard_M64ms,Standard_M64s,Standard_M128-32ms,Standard_M128-64ms,Standard_M128ms,Standard_M128s,Standard_M32ms_v2,Standard_M64ms_v2,Standard_M64s_v2,Standard_M128ms_v2,Standard_M128s_v2,Standard_M192ims_v2,Standard_M192is_v2,Standard_M32dms_v2,Standard_M64dms_v2,Standard_M64ds_v2,Standard_M128dms_v2,Standard_M128ds_v2,Standard_M192idms_v2,Standard_M192ids_v2,Standard_E64i_v3,Standard_E64is_v3,Standard_D1,Standard_D2,Standard_D3,Standard_D4,Standard_D11,Standard_D12,Standard_D13,Standard_D14,Standard_DS1,Standard_DS2,Standard_DS3,Standard_DS4,Standard_DS11,Standard_DS12,Standard_DS13,Standard_DS14,Standard_DC8_v2,Standard_DC1s_v2,Standard_DC2s_v2,Standard_DC4s_v2,Standard_L8s_v2,Standard_L16s_v2,Standard_L32s_v2,Standard_L48s_v2,Standard_L64s_v2,Standard_L80s_v2,Standard_NV4as_v4,Standard_NV8as_v4,Standard_NV16as_v4,Standard_NV32as_v4,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5,Standard_GS1,Standard_GS2,Standard_GS3,Standard_GS4,Standard_GS4-4,Standard_GS4-8,Standard_GS5,Standard_GS5-8,Standard_GS5-16,Standard_L4s,Standard_L8s,Standard_L16s,Standard_L32s,Standard_DC2as_v5,Standard_DC4as_v5,Standard_DC8as_v5,Standard_DC16as_v5,Standard_DC32as_v5,Standard_DC48as_v5,Standard_DC64as_v5,Standard_DC96as_v5,Standard_DC2ads_v5,Standard_DC4ads_v5,Standard_DC8ads_v5,Standard_DC16ads_v5,Standard_DC32ads_v5,Standard_DC48ads_v5,Standard_DC64ads_v5,Standard_DC96ads_v5,Standard_EC2as_v5,Standard_EC4as_v5,Standard_EC8as_v5,Standard_EC16as_v5,Standard_EC20as_v5,Standard_EC32as_v5,Standard_EC48as_v5,Standard_EC64as_v5,Standard_EC96as_v5,Standard_EC96ias_v5,Standard_EC2ads_v5,Standard_EC4ads_v5,Standard_EC8ads_v5,Standard_EC16ads_v5,Standard_EC20ads_v5,Standard_EC32ads_v5,Standard_EC48ads_v5,Standard_EC64ads_v5,Standard_EC96ads_v5,Standard_EC96iads_v5,Standard_D2ds_v5,Standard_D4ds_v5,Standard_D8ds_v5,Standard_D16ds_v5,Standard_D32ds_v5,Standard_D48ds_v5,Standard_D64ds_v5,Standard_D96ds_v5,Standard_D2d_v5,Standard_D4d_v5,Standard_D8d_v5,Standard_D16d_v5,Standard_D32d_v5,Standard_D48d_v5,Standard_D64d_v5,Standard_D96d_v5,Standard_D2s_v5,Standard_D4s_v5,Standard_D8s_v5,Standard_D16s_v5,Standard_D32s_v5,Standard_D48s_v5,Standard_D64s_v5,Standard_D96s_v5,Standard_D2_v5,Standard_D4_v5,Standard_D8_v5,Standard_D16_v5,Standard_D32_v5,Standard_D48_v5,Standard_D64_v5,Standard_D96_v5,Standard_E2ds_v5,Standard_E4-2ds_v5,Standard_E4ds_v5,Standard_E8-2ds_v5,Standard_E8-4ds_v5,Standard_E8ds_v5,Standard_E16-4ds_v5,Standard_E16-8ds_v5,Standard_E16ds_v5,Standard_E20ds_v5,Standard_E32-8ds_v5,Standard_E32-16ds_v5,Standard_E32ds_v5,Standard_E48ds_v5,Standard_E64-16ds_v5,Standard_E64-32ds_v5,Standard_E64ds_v5,Standard_E96-24ds_v5,Standard_E96-48ds_v5,Standard_E96ds_v5,Standard_E104ids_v5,Standard_E2d_v5,Standard_E4d_v5,Standard_E8d_v5,Standard_E16d_v5,Standard_E20d_v5,Standard_E32d_v5,Standard_E48d_v5,Standard_E64d_v5,Standard_E96d_v5,Standard_E104id_v5,Standard_E2s_v5,Standard_E4-2s_v5,Standard_E4s_v5,Standard_E8-2s_v5,Standard_E8-4s_v5,Standard_E8s_v5,Standard_E16-4s_v5,Standard_E16-8s_v5,Standard_E16s_v5,Standard_E20s_v5,Standard_E32-8s_v5,Standard_E32-16s_v5,Standard_E32s_v5,Standard_E48s_v5,Standard_E64-16s_v5,Standard_E64-32s_v5,Standard_E64s_v5,Standard_E96-24s_v5,St
andard_E96-48s_v5,Standard_E96s_v5,Standard_E104is_v5,Standard_E2_v5,Standard_E4_v5,Standard_E8_v5,Standard_E16_v5,Standard_E20_v5,Standard_E32_v5,Standard_E48_v5,Standard_E64_v5,Standard_E96_v5,Standard_E104i_v5,Standard_E2bs_v5,Standard_E4bs_v5,Standard_E8bs_v5,Standard_E16bs_v5,Standard_E32bs_v5,Standard_E48bs_v5,Standard_E64bs_v5,Standard_E2bds_v5,Standard_E4bds_v5,Standard_E8bds_v5,Standard_E16bds_v5,Standard_E32bds_v5,Standard_E48bds_v5,Standard_E64bds_v5,Standard_D2a_v4,Standard_D4a_v4,Standard_D8a_v4,Standard_D16a_v4,Standard_D32a_v4,Standard_D48a_v4,Standard_D64a_v4,Standard_D96a_v4,Standard_D2as_v4,Standard_D4as_v4,Standard_D8as_v4,Standard_D16as_v4,Standard_D32as_v4,Standard_D48as_v4,Standard_D64as_v4,Standard_D96as_v4,Standard_E2a_v4,Standard_E4a_v4,Standard_E8a_v4,Standard_E16a_v4,Standard_E20a_v4,Standard_E32a_v4,Standard_E48a_v4,Standard_E64a_v4,Standard_E96a_v4,Standard_E2as_v4,Standard_E4-2as_v4,Standard_E4as_v4,Standard_E8-2as_v4,Standard_E8-4as_v4,Standard_E8as_v4,Standard_E16-4as_v4,Standard_E16-8as_v4,Standard_E16as_v4,Standard_E20as_v4,Standard_E32-8as_v4,Standard_E32-16as_v4,Standard_E32as_v4,Standard_E48as_v4,Standard_E64-16as_v4,Standard_E64-32as_v4,Standard_E64as_v4,Standard_E96-24as_v4,Standard_E96-48as_v4,Standard_E96as_v4,Standard_E96ias_v4,Standard_NC6s_v3,Standard_NC12s_v3,Standard_NC24rs_v3,Standard_NC24s_v3,Standard_NV6s_v2,Standard_NV12s_v2,Standard_NV24s_v2,Standard_NV12s_v3,Standard_NV24s_v3,Standard_NV48s_v3,Standard_H8,Standard_H8_Promo,Standard_H16,Standard_H16_Promo,Standard_H8m,Standard_H8m_Promo,Standard_H16m,Standard_H16m_Promo,Standard_H16r,Standard_H16r_Promo,Standard_H16mr,Standard_H16mr_Promo,Standard_M208ms_v2,Standard_M208s_v2,Standard_M416-208s_v2,Standard_M416s_v2,Standard_M416-208ms_v2,Standard_M416ms_v2,Standard_DC1s_v3,Standard_DC2s_v3,Standard_DC4s_v3,Standard_DC8s_v3,Standard_DC16s_v3,Standard_DC24s_v3,Standard_DC32s_v3,Standard_DC48s_v3,Standard_DC1ds_v3,Standard_DC2ds_v3,Standard_DC4ds_v3,Standard_DC8ds_v3,Standard_DC16ds_v3,Standard_DC24ds_v3,Standard_DC32ds_v3,Standard_DC48ds_v3,Standard_NC24ads_A100_v4,Standard_NC48ads_A100_v4,Standard_NC96ads_A100_v4,Standard_D2as_v5,Standard_D4as_v5,Standard_D8as_v5,Standard_D16as_v5,Standard_D32as_v5,Standard_D48as_v5,Standard_D64as_v5,Standard_D96as_v5,Standard_E2as_v5,Standard_E4-2as_v5,Standard_E4as_v5,Standard_E8-2as_v5,Standard_E8-4as_v5,Standard_E8as_v5,Standard_E16-4as_v5,Standard_E16-8as_v5,Standard_E16as_v5,Standard_E20as_v5,Standard_E32-8as_v5,Standard_E32-16as_v5,Standard_E32as_v5,Standard_E48as_v5,Standard_E64-16as_v5,Standard_E64-32as_v5,Standard_E64as_v5,Standard_E96-24as_v5,Standard_E96-48as_v5,Standard_E96as_v5,Standard_E112ias_v5,Standard_D2ads_v5,Standard_D4ads_v5,Standard_D8ads_v5,Standard_D16ads_v5,Standard_D32ads_v5,Standard_D48ads_v5,Standard_D64ads_v5,Standard_D96ads_v5,Standard_E2ads_v5,Standard_E4-2ads_v5,Standard_E4ads_v5,Standard_E8-2ads_v5,Standard_E8-4ads_v5,Standard_E8ads_v5,Standard_E16-4ads_v5,Standard_E16-8ads_v5,Standard_E16ads_v5,Standard_E20ads_v5,Standard_E32-8ads_v5,Standard_E32-16ads_v5,Standard_E32ads_v5,Standard_E48ads_v5,Standard_E64-16ads_v5,Standard_E64-32ads_v5,Standard_E64ads_v5,Standard_E96-24ads_v5,Standard_E96-48ads_v5,Standard_E96ads_v5,Standard_E112iads_v5,Standard_L8s_v3,Standard_L16s_v3,Standard_L32s_v3,Standard_L48s_v3,Standard_L64s_v3,Standard_L80s_v3.
        Find out more on the valid VM sizes in each region at https://aka.ms/azure-regionservices."
        Target="vmSize"'
      reason: MachineCreationFailed
      status: "True"
      type: MachineCreated
    metadata: {}

This in reference to https://issues.redhat.com/browse/OCPBUGS-5817

We need to block the upgrade from OCP 4.13 to OCP 4.14 because of the bugs listed in the issues referenced above.

  • Check in the vSphere CSI driver operator
  • Check node.status.volumesAttached / volumesInUse for vSphere in-line in-tree volumes
  • Check PVs for un-mounted in-tree vSphere volumes
  • The operator will add an admin-ack key to the admin-gates ConfigMap if there is such a volume (see the example acknowledgement below)
  • The operator will remove the admin-ack key when the upgrade becomes possible (the admin removed all in-tree PVs, upgraded to the right vSphere version, or removed in-line volumes from pods)
  • The admin ack will not remove other 4.13 / 4.14 requirements, such as removal of the community CSI driver or an upgrade from 6.7u3 to at least 7.0.3 - such a cluster is still not upgradeable even with the ack.
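For reference, acknowledging such a gate typically means patching the admin-acks ConfigMap in the openshift-config namespace; the key name below is a placeholder for whatever key this operator ends up setting:

$ oc -n openshift-config patch configmap admin-acks --type=merge \
    --patch '{"data":{"<admin-ack-key-set-by-the-operator>":"true"}}'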

Description of problem:

Pods in the openshift-marketplace namespace cause PodSecurityViolation alerts in a vanilla OpenShift cluster

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-04-203333

How reproducible:

100%

Steps to Reproduce:

1. install a fresh cluster
2. check the alerts in the console

Actual results:

PodSecurityViolation alert is present

Expected results:

No alerts

Additional info:

I'll provide a filtered version of the audit logs containing the violations

Not all information provided in the install-config gets passed through to assisted-service.

An example is that platform settings other than the VIPs are ignored. So are the "capabilities". There may be others - we need to do a thorough audit.

If the user supplies data that we then ignore, we should log a warning. However, we must not return an error, because this may prevent people using their existing install-configs with the agent install method.
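A minimal sketch of the intended behaviour, assuming logrus (which the installer already uses); the field names and wiring are illustrative, not the actual agent installer code:

package main

import "github.com/sirupsen/logrus"

// warnIgnoredField logs a warning (but never fails) when an install-config
// field is set that the agent-based flow does not use.
func warnIgnoredField(fieldName string, isSet bool) {
    if isSet {
        logrus.Warningf("install-config field %q is ignored by the agent-based install method", fieldName)
    }
}

func main() {
    // Illustrative checks only; a real audit would cover every ignored field.
    warnIgnoredField("capabilities", true)
    warnIgnoredField("platform settings other than the VIPs", true)
}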

This is a clone of issue OCPBUGS-4573. The following is the description of the original issue:

Description of problem:

The statsPort is not correctly set for the HostNetwork endpointPublishingStrategy. When we change the httpPort from 80 to 85 and the statsPort from 1936 to 1939 on the default router like here:
# oc get IngressController default -n openshift-ingress-operator
...
 clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      statsPort: 1939
    type: HostNetwork
...
status:
...  
endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      protocol: TCP
      statsPort: 1939
 
We can see that the router pods get restarted:
# oc get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5b96855754-2wnrp   1/1     Running   0          1m
router-default-5b96855754-9c724   1/1     Running   0          2m
The pods are configured correctly:
# oc get pod router-default-5b96855754-2wnrp -o yaml
...
spec:
  containers:
  - env:
    - name: ROUTER_SERVICE_HTTPS_PORT
      value: "443"
    - name: ROUTER_SERVICE_HTTP_PORT
      value: "85"
    - name: STATS_PORT
      value: "1939"
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz
        port: 1939
        scheme: HTTP
...
    ports:
    - containerPort: 85
      hostPort: 85
      name: http
      protocol: TCP
    - containerPort: 443
      hostPort: 443
      name: https
      protocol: TCP
    - containerPort: 1939
      hostPort: 1939
      name: metrics
      protocol: TCP
But the endpoint is incorrect:
# oc get ep router-internal-default -o yaml
...
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    creationTimestamp: "2022-12-02T13:34:48Z"
    labels:
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
    name: router-internal-default
    namespace: openshift-ingress
    resourceVersion: "23216275"
    uid: 50c00fc0-08e5-4a6a-a7eb-7501fa1a7ba6
  subsets:
  - addresses:
    - ip: 10.74.211.203
      nodeName: worker-0.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-2wnrp
        namespace: openshift-ingress
        uid: eda945b9-9061-4361-b11a-9d895fee0003
    - ip: 10.74.211.216
      nodeName: worker-1.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-9c724
        namespace: openshift-ingress
        uid: 97a04c3e-ddea-43b7-ac70-673279057929
    ports:
    - name: metrics
      port: 1936
      protocol: TCP
    - name: https
      port: 443
      protocol: TCP
    - name: http
      port: 85
      protocol: TCP
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""Notice that the https port is correctly set to 85, but the stats port is still set to 1936 and not to 1939. That is a problem as the metrics target endpoint is reported as down with an error message:    Get "https://10.74.211.203:1936/metrics": dial tcp 10.74.211.203:1936: connect: connection refusedWhen the EP is corrected and the ports are changed to:
  ports:
  - name: metrics
    port: 1939
    protocol: TCP
The metrics target endpoint is picked up correctly and the metrics are scraped as expected.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Set endpointPublishingStrategy to HostNetwork and modify the statsPort:

endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      protocol: TCP
      statsPort: 1939
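For example, the same change can be applied with a merge patch against the default IngressController (values taken from this report):

$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge \
    -p '{"spec":{"endpointPublishingStrategy":{"type":"HostNetwork","hostNetwork":{"httpPort":85,"httpsPort":443,"statsPort":1939}}}}'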

 

Actual results:

Stats are scraped from the standard port and not from the one specified.

Expected results:

The endpoint object is pointing to the specified port.

Additional info:

 

Description of problem:
Pipelines Repository support is Tech Preview; this is shown when searching for repositories or checking the details page.

But the Pipelines Repository tab (in the admin and dev perspectives) doesn't show this. Also, the "Add Git Repository" form page doesn't mention this.

Version-Release number of selected component (if applicable):
4.11 - 4.13 (master)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator
  2. Navigate to Pipelines > Repository tab
  3. Select Create > Repository

Actual results:
The Repository tab and the "Add Git Repository" form page don't show a Tech Preview badge.

Expected results:
The Repository tab and the "Add Git Repository" form page should show a Tech Preview badge.

Additional info:
Check how the Shipwright Builds show this Tech Preview badge for the tab.

With the CSISnapshot capability disabled, all CSI driver operators are Degraded. For example, the AWS EBS CSI driver operator during installation:

18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource
AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):
4.12.nightly

The reason is that cluster-csi-snapshot-controller-operator does not create VolumeSnapshotClass CRD, which AWS EBS CSI driver operator expects to exist.

CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.
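A minimal sketch of such a guard, assuming the apiextensions client-go clientset; the function name is illustrative and this is not the operator's actual code:

package operator

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
)

// shouldSyncVolumeSnapshotClass returns false when the VolumeSnapshotClass CRD
// is absent (e.g. the CSISnapshot capability is disabled), so the operator can
// skip that asset instead of going Degraded.
func shouldSyncVolumeSnapshotClass(ctx context.Context, c apiextclient.Interface) (bool, error) {
    _, err := c.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, "volumesnapshotclasses.snapshot.storage.k8s.io", metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        return false, nil
    }
    if err != nil {
        return false, fmt.Errorf("checking for the VolumeSnapshotClass CRD: %w", err)
    }
    return true, nil
}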

This is a clone of issue OCPBUGS-1453. The following is the description of the original issue:

Description of problem:

TargetDown alert fired while it shouldn't.
Prometheus endpoints are not always properly unregistered and the alert will therefore think that some Kube service endpoints are down

Version-Release number of selected component (if applicable):

The problem has always been there.

How reproducible:

Not reproducible.
Most of the time Prometheus endpoints are properly unregistered.
The aim here is to make the TargetDown Prometheus expression more resilient; this can be tested on past metrics data in which the unregistration issue was encountered.

Steps to Reproduce:

N/A

Actual results:

TargetDown alert triggered while Kube service endpoints are all up & running.

Expected results:

The TargetDown alert should not have been triggered.

This is a clone of issue OCPBUGS-6092. The following is the description of the original issue:

Description of problem:

While configuring a 4.12.0 dual-stack baremetal cluster, ovs-configuration.service fails
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: Attempt 10 to bring up connection ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + nmcli conn up ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[26588]: Error: Connection activation failed: No suitable device found for this connection (device eno1np0 not available because profile i
s not compatible with device (mismatching interface name)).
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + s=4
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + sleep 5
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + false
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + return 4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + handle_exit
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + e=4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: configure-ovs exited with error: 4'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: configure-ovs exited with error: 4

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

So far 100%

Steps to Reproduce:

1. Deploy a dual-stack baremetal cluster with bonded interfaces (configured with a MachineConfig and not NMState within install-config.yaml)
2. Run migration to second interface, part of machine config
      - contents:
          source: data:text/plain;charset=utf-8,bond0.117
        filesystem: root
        mode: 420
        path: /etc/ovnk/extra_bridge
3. Install operators:
* kubevirt-hyperconverged
* sriov-network-operator
* cluster-logging
* elasticsearch-operator
4. Start applying node-tunning profiles
5. During node reboots ovs-configuration service fails

Actual results:

The ovs-configuration service fails on some nodes, resulting in ovnkube-node-* pod failures
oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS          AGE
ovnkube-master-dvgx7   6/6     Running            8                 16h
ovnkube-master-vs7mp   6/6     Running            6                 16h
ovnkube-master-zrm4c   6/6     Running            6                 16h
ovnkube-node-2g8mb     4/5     CrashLoopBackOff   175 (3m48s ago)   16h
ovnkube-node-bfbcc     4/5     CrashLoopBackOff   176 (64s ago)     16h
ovnkube-node-cj6vf     5/5     Running            5                 16h
ovnkube-node-f92rm     5/5     Running            5                 16h
ovnkube-node-nmjpn     5/5     Running            5                 16h
ovnkube-node-pfv5z     4/5     CrashLoopBackOff   163 (4m53s ago)   15h
ovnkube-node-z5vf9     5/5     Running            10                15h

Expected results:

ovs-configuration service succeeds on all nodes

Additional info:


Description of problem:

The error returned by egress firewall creation is overridden by the status update error and is never returned.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create egress firewall with bad cidr
kind: EgressFirewall
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
  namespace: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 1.2.3.345/32 
2. Before fix: you should see the log "Creating *v1.EgressFirewall default/default took: 4.662942ms"
3. After fix: you should see the log "Failed to create *v1.EgressFirewall default/default, error: cannot create EgressFirewall Rule to destination 1.2.3.345/32 for namespace default: invalid CIDR address: 1.2.3.345/32"
4. These logs are mutually exclusive, check one of them is present and the other is not
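A runnable sketch of the error-handling pattern the fix needs: the original create/validation error must not be masked by the subsequent status update. Function names are illustrative, not the actual ovn-kubernetes code:

package main

import (
    "errors"
    "fmt"
)

// applyAndReport runs apply (e.g. creating the egress firewall rules) and then
// report (e.g. updating the EgressFirewall status), returning both errors so
// the validation error is never silently dropped.
func applyAndReport(apply, report func() error) error {
    applyErr := apply()
    reportErr := report()
    return errors.Join(applyErr, reportErr)
}

func main() {
    err := applyAndReport(
        func() error { return fmt.Errorf("invalid CIDR address: 1.2.3.345/32") },
        func() error { return nil },
    )
    fmt.Println(err) // prints the CIDR error instead of hiding it
}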

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:

Description of problem:

According to PR https://github.com/openshift/cluster-monitoring-operator/pull/1824, the startupProbe for both the UWM Prometheus and the platform Prometheus should allow 1 hour, but the startupProbe for the UWM Prometheus still allows only 15m after enabling UWM. The platform Prometheus does not have this issue; its startupProbe was increased to 1 hour.

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
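For reference, the effective startup window is roughly failureThreshold × periodSeconds: 60 × 15s = 15 minutes for the UWM Prometheus above, versus 240 × 15s = 60 minutes for the platform Prometheus.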

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour

Additional info:

Since the startupProbe for the platform Prometheus was increased to 1 hour and there is no similar bug for the UWM Prometheus, closing this as "won't fix" would be OK.

Description of problem:

Some upgrade CI jobs from 4.11.z to 4.12 nightly builds fail because the systemd unit machine-config-daemon-update-rpmostree-via-container fails

e.g. job https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-aws-ipi-proxy-p1/1579169944476585984

omg get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
worker  rendered-worker-6e18de1272fad7a5ca1529941e3ceaed  False    True      True      3             0                  0                    1                     3h53m
master  rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4  False    True      True      3             0                  0                    1                     3h53m 

check the affected node

omg get node/ip-10-0-57-74.us-east-2.compute.internal -o yaml|yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-0f6de21569b5b65c8","ifaddr":{"ipv4":"10.0.48.0/20"},"capacity":{"ipv4":14,"ipv6":15}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-01a34f6b5f2cd1e41"}'
machine.openshift.io/machine: openshift-machine-api/ci-op-kb95kxx9-2a438-r6z94-master-2
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-065664319cfbaee64277097d49a8a5a6
machineconfiguration.openshift.io/desiredConfig: rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/lastAppliedDrain: drain-rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/reason: 'error running systemd-run --unit machine-config-daemon-update-rpmostree-via-container
  --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged
  --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
  rpm-ostree ex deploy-from-self /run/host: Running as unit: machine-config-daemon-update-rpmostree-via-container.service


  Finished with result: exit-code


  Main processes terminated with: code=exited/status=125


  Service runtime: 2min 52ms


  CPU time consumed: 144ms


  : exit status 125'
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: 'true' 

check the MCD log on the affected node

omg get pod -n openshift-machine-config-operator  -o json | jq -r '.items[]|select(.spec.nodeName=="ip-10-0-57-74.us-east-2.compute.internal")|.metadata.name' | grep daemon
machine-config-daemon-znbvf

2022-10-09T22:12:58.797891917Z I1009 22:12:58.797821  179598 update.go:1917] Updating OS to layered image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
2022-10-09T22:12:58.797891917Z I1009 22:12:58.797846  179598 rpm-ostree.go:447] Running captured: rpm-ostree --version
2022-10-09T22:12:58.815829171Z I1009 22:12:58.815800  179598 update.go:2068] rpm-ostree is not new enough for layering; forcing an update via container
2022-10-09T22:12:58.817577513Z I1009 22:12:58.817555  179598 update.go:2053] Running: systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 rpm-ostree ex deploy-from-self /run/host 
...
2022-10-09T22:15:00.831959313Z E1009 22:15:00.831949  179598 writer.go:200] Marking Degraded due to: error running systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 rpm-ostree ex deploy-from-self /run/host: Running as unit: machine-config-daemon-update-rpmostree-via-container.service
2022-10-09T22:15:00.831959313Z Finished with result: exit-code
2022-10-09T22:15:00.831959313Z Main processes terminated with: code=exited/status=125
2022-10-09T22:15:00.831959313Z Service runtime: 2min 52ms
2022-10-09T22:15:00.831959313Z CPU time consumed: 144ms
2022-10-09T22:15:00.831959313Z : exit status 125

Version-Release number of selected component (if applicable):

4.12

Steps to Reproduce:

upgrade cluster from 4.11.8 to 4.12.0-0.nightly-2022-10-05-053337  

Actual results:

The upgrade fails because a node is degraded; the rpm-ostree update via container fails

Expected results:

upgrade can be completed successfully

Additional info:

must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-aws-ipi-proxy-p1/1579169944476585984/artifacts/aws-ipi-proxy-p1/gather-must-gather/artifacts/must-gather.tar

Other build logs of failed jobs

https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-aws-ipi-proxy-cco-manual-security-token-service-p1/1579200140067999744/build-log.txt

https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-azure-ipi-proxy-p1/1579094436883730432/build-log.txt

https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-azure-ipi-proxy-workers-rhcos-rhel8-p2/1578747158293647360/build-log.txt

This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:

Description of problem:

When the user switches to the Dev console, the Topology view is always blank in a Project that has a large number of components.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always occurs

Steps to Reproduce:

1. Create a project with at least 12 components (Apps, Operators, Knative Brokers)
2. Go to the Administrator perspective
3. Switch to the Developer perspective / Topology
4. No components displayed
5. Click on 'fit to screen'
6. All components appear

Actual results:

Topology renders with all controls but no components visible (see screenshot 1)

Expected results:

All components should be visible

Additional info:

 

Description of the problem:

assisted-installer-controller Job does not apply Additional Root CA Trust Bundle

https://github.com/openshift/assisted-installer/issues/513

How reproducible:

https://github.com/openshift/assisted-installer/issues/513

Steps to reproduce:

1. Create a cluster with a proxy and an additional certificate bundle

2. Install

Actual results:

The controller failed to reach the service because of a self-signed certificate

Expected results:

Installation succeeds

This is a clone of issue OCPBUGS-20333. The following is the description of the original issue:

This is a clone of issue OCPBUGS-19430. The following is the description of the original issue:

This is a clone of issue OCPBUGS-18772. The following is the description of the original issue:

MCO installs the resolve-prepender NetworkManager script on the nodes. In order to find out node details, it needs to pull baremetalRuntimeCfgImage. However, this image needs to be pulled just the first time; in follow-up attempts the script only needs to verify that the image is available.

This is not desirable in situations where the mirror / quay is unavailable or having a temporary problem - these kinds of issues should not prevent the node from starting kubelet. During certificate rotation testing I noticed that a node with a significant time skew won't start kubelet, as it tries to pull baremetalRuntimeCfgImage for kubelet to start - but the image is already on the node and doesn't need refreshing.
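A minimal sketch of the guard the script could use, with an illustrative variable name for the image reference:

if ! podman image exists "${BAREMETAL_RUNTIMECFG_IMAGE}"; then
  podman pull --authfile /var/lib/kubelet/config.json "${BAREMETAL_RUNTIMECFG_IMAGE}"
fi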

This is a clone of issue OCPBUGS-3440. The following is the description of the original issue:

Description of problem:

https://github.com/openshift/cluster-authentication-operator/pull/587 addresses an issue in which the auth operator goes degraded when the console capability is not enabled. The result is that the console publicAssetURL is not configured when the console is disabled. However, if the console capability is later enabled on the cluster, there is no logic in place to ensure the auth operator detects this and performs the configuration.

Manually restarting the auth operator will address this, but we should have a solution that handles it automatically.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install a cluster w/o the console cap
2. Inspect the auth configmap, see that assetPublicURL is empty
3. Enable the console capability, wait for console to start up
4. Inspect the auth configmap and see it is still empty

Actual results:

assetPublicURL does not get populated

Expected results:

assetPublicURL is populated once the console is enabled

Additional info:


Description of problem:

When the alert for the vSphere privilege check reported by vsphere-problem-detector is raised, we only get the very limited info below:

 

=======================================

Description

The vsphere-problem-detector monitors the health and configuration of OpenShift on VSphere. If problems are found which may prevent machine scaling, storage provisioning, and safe upgrades, the vsphere-problem-detector will raise alerts.

 

Summary

VSphere cluster health checks are failing

 

Message

VSphere cluster health checks are failing with CheckAccountPermissions

=======================================

 

  1. Please mention the vSphere privilege check in the Description; currently it only mentions "prevent machine scaling, storage provisioning, and safe upgrades".
  2. Could we at least add something like "Check the vsphere-problem-detector pod log in the openshift-cluster-storage-operator namespace to see the details" if we cannot list which privilege is missing.

(We could get the namespace/pod info from metric, but I think adding it in alert Description or Message should be more clear)
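For reference, the detail is currently only visible in the pod log, e.g. (assuming the default deployment name):

$ oc -n openshift-cluster-storage-operator logs deployment/vsphere-problem-detector-operator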

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-12-152748

 

How reproducible:

Always

 

Steps to Reproduce:

See description

Actual results:

Alert info is not so clear

 

Expected results:

Add more Alert info

Note: This issue is a duplicate of OCPBUGS-10238 intended to target the 4.12 version.

Description of problem:

Updates to the `.spec.updateStrategy.registryPoll.interval` fields for a default CatalogSource are reverted.

Version-Release number of selected component (if applicable):

4.12.5

How reproducible:

100%

Steps to Reproduce:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.5    True        False         86m     Cluster version is 4.12.5
$ oc get catalogsource  -n openshift-marketplace redhat-operators -o jsonpath='{.spec.updateStrategy.registryPoll.interval}'
10m
$ oc patch -n openshift-marketplace catalogsource/redhat-operators --type=merge -p '{"spec":{"updateStrategy":{"registryPoll":{"interval":"30m0s"}}}}'
catalogsource.operators.coreos.com/redhat-operators patched

Actual results:

$ oc get catalogsource  -n openshift-marketplace redhat-operators -o jsonpath='{.spec.updateStrategy.registryPoll.interval}' 
10m
$ oc logs -n openshift-marketplace deployment/marketplace-operator
time="2023-03-14T09:43:58Z" level=info msg="[defaults] Restoring CatalogSource redhat-operators"
time="2023-03-14T09:43:58Z" level=info msg="[defaults] CatalogSource redhat-operators is annotated and its spec is the same as the default spec"

Expected results:

In 4.12.3 the updated value remains:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.3    True        False         15d     Cluster version is 4.12.3
$ oc get catalogsource  -n openshift-marketplace redhat-operators -o jsonpath='{.spec.updateStrategy.registryPoll.interval}'
10m
$ oc patch -n openshift-marketplace catalogsource/redhat-operators --type=merge -p '{"spec":{"updateStrategy":{"registryPoll":{"interval":"30m0s"}}}}'
catalogsource.operators.coreos.com/redhat-operators patched
$ oc get catalogsource  -n openshift-marketplace redhat-operators -o jsonpath='{.spec.updateStrategy.registryPoll.interval}'
30m0s

Additional info:

 

Description of problem:

The PodDisruptionBudget gatherer gathered resources from all namespaces, and the limit for gathered PDBs was 5000. This task fixes these problems by gathering resources only from openshift namespaces and restricting the number of gathered resources to 100.
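A minimal sketch of that filtering, assuming the client-go policy/v1 typed client; the function name and record handling are illustrative, not the actual insights-operator code:

package gatherers

import (
    "context"
    "strings"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    policyclient "k8s.io/client-go/kubernetes/typed/policy/v1"
)

const maxGatheredPDBs = 100

// gatherPodDisruptionBudgets keeps only PDBs from openshift-* namespaces and
// caps the result at maxGatheredPDBs records.
func gatherPodDisruptionBudgets(ctx context.Context, cli policyclient.PolicyV1Interface) ([]policyv1.PodDisruptionBudget, error) {
    pdbs, err := cli.PodDisruptionBudgets("").List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }
    var gathered []policyv1.PodDisruptionBudget
    for _, pdb := range pdbs.Items {
        if !strings.HasPrefix(pdb.Namespace, "openshift-") {
            continue
        }
        gathered = append(gathered, pdb)
        if len(gathered) == maxGatheredPDBs {
            break
        }
    }
    return gathered, nil
}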

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Create a PDB in a namespace that doesn't start with the "openshift-" prefix and check whether it is collected.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3744. The following is the description of the original issue:

Description of problem:

Egress router pod creation on OpenShift 4.11 is failing with the error below.
~~~
Nov 15 21:51:29 pltocpwn03 hyperkube[3237]: E1115 21:51:29.467436    3237 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy_c965a287-28aa-47b6-9e79-0cc0e209fcf2_0(72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344): error adding pod stage-wfe-proxy_stage-wfe-proxy-ext-qrhjw to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw/c965a287-28aa-47b6-9e79-0cc0e209fcf2:openshift-sdn]: error adding container to network \\\"openshift-sdn\\\": CNI request failed with status 400: 'could not open netns \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": unknown FS magic on \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": 1021994\\n'\"" pod="stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw" podUID=c965a287-28aa-47b6-9e79-0cc0e209fcf2
~~~

I have checked the SDN pod log from the node where the egress router pod is failing, and I could see the error message below.

~~~
2022-11-15T21:51:29.283002590Z W1115 21:51:29.282954  181720 pod.go:296] CNI_ADD stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw failed: could not open netns "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": unknown FS magic on "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": 1021994
~~~

CRI-O is logging the event below, and looking at the log it seems the namespace has been created on the node.

~~~
Nov 15 21:51:29 pltocpwn03 crio[3150]: time="2022-11-15 21:51:29.307184956Z" level=info msg="Got pod network &{Name:stage-wfe-proxy-ext-qrhjw Namespace:stage-wfe-proxy ID:72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344 UID:c965a287-28aa-47b6-9e79-0cc0e209fcf2 NetNS:/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
~~~

Version-Release number of selected component (if applicable):

4.11.12

How reproducible:

Not Sure

Steps to Reproduce:

1.
2.
3.

Actual results:

The egress router pod fails to be created. A sample application can be created without any issue.

Expected results:

Egress router POD should get created

Additional info:

The egress router pod is created following the document below, and it does contain the pod.network.openshift.io/assign-macvlan: "true" annotation (a trimmed sketch follows the link).

https://docs.openshift.com/container-platform/4.11/networking/openshift_sdn/deploying-egress-router-layer3-redirection.html#nw-egress-router-pod_deploying-egress-router-layer3-redirection
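For reference, a trimmed sketch of such an egress router pod; the image references and EGRESS_* values are placeholders, and the linked document above remains the authoritative spec:

apiVersion: v1
kind: Pod
metadata:
  name: egress-router-example
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"
spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE
      value: <egress-router-source-ip>/24
    - name: EGRESS_GATEWAY
      value: <egress-gateway-ip>
    - name: EGRESS_DESTINATION
      value: <destination-ip>
    - name: EGRESS_ROUTER_MODE
      value: init
  containers:
  - name: egress-router-wait
    image: registry.redhat.io/openshift4/ose-pod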

Description of problem:

As a downstream consumer of the installer (as a library), I want to be able to choose whether or not the image gallery is used when creating machinesets on Azure so that I can achieve backwards compatibility with pre-4.12

Version-Release number of selected component (if applicable):

4.12+

How reproducible:

always

Steps to Reproduce:

1. Try to generate machinesets in pre-4.12 environment
2. Lament as the installer automatically uses image gallery regardless

Actual results:

Installer attempts to guess whether to use image gallery

Expected results:

I should be able to choose myself

Additional info:

 

The test results in sippy look really bad on our less common platforms, but still pretty unacceptable even on core clouds. It's reasonably often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.

As of Sep 13th:

  • several vsphere and openstack variant combos fail this test around 24-32% of the time
  • aws, amd64, ovn, upgrade, upgrade-micro, ha - fails 6% of the time
  • aws, amd64, ovn, upgrade, upgrade-minor, ha - fails 4% of the time
  • gcp, amd64, sdn, upgrade, upgrade-minor, ha - fails 8% of the time
  • globally across all jobs fails around 3% of the time.

Even on some major variant combos, a 4-8% failure rate is too high.
On the Sep 13 arch call (no etcd team present), Damien mentioned this might be an upstream alert that just isn't well suited for OpenShift's use cases; is this the case, and does it just need tuning?

Has the problem been getting worse?

I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12: AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11, where it looks consistent. GCP seems fine on 4.12. We do not have data for vSphere for some reason.

This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

At a glance: LeaseGrant, MemberList, Txn, Status, Range.

Broken out of TRT-401
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]

 

Description of problem:

This is a bug record to pin down dependency versions in CMO release 4.12 after the release-4.12 branch was detached from the master branch.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

N/A

Steps to Reproduce:

N/A

Actual results:

N/A

Expected results:

N/A

Additional info:

None.

Description of problem:

The Alertmanager silence create / edit form got a new "Negative matcher" option in 4.12 (see https://issues.redhat.com/browse/OCPBUGSM-47734). However, there is nothing to explain what this option means and it will likely not be obvious from the label alone unless you are already quite familiar with Alertmanager.

After discussion with the docs team, it was decided that adding some explanation in context in the UI would be much better than adding an explanation to the documentation. 
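For context (standard Alertmanager matcher semantics): a negative matcher inverts the match, so the silence applies to alerts whose label value does not equal the given value - or, for regex matchers, does not match it. For example, severity!=critical silences every alert except those with severity=critical.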

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Admin perspective
2. Go to Observe > Alerting > Silences page
3. Click on the Create button ("Negative matcher" option is shown with no explanation)

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4900. The following is the description of the original issue:

The test:

test=[sig-storage] Volume limits should verify that all nodes have volume limits [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]

Is hard failing on aws and gcp techpreview clusters:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.12/analysis?test=%5Bsig-storage%5D%20Volume%20limits%20should%20verify%20that%20all%20nodes%20have%20volume%20limits%20%5BSkipped%3ANoOptionalCapabilities%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D

The failure message is consistently:

fail [github.com/onsi/ginkgo/v2@v2.1.5-0.20220909190140-b488ab12695a/internal/suite.go:612]: Dec 15 09:07:51.278: Expected volume limits to be set
Ginkgo exit error 1: exit with code 1

Sample failure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.12-ocp-e2e-aws-ovn-arm64-techpreview/1603313676431921152

A fix for this will bring several jobs back to life, but they do span 4.12 and 4.13.

job=periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.12-ocp-e2e-aws-ovn-arm64-techpreview=all

Description of problem:

Deploying OpenShift 4.12 using IPI fails to deploy our masters.
This is the issue we encounter when we do a journalctl on the master:
Jan 23 18:21:50 master0-sahara bash[4859]: Copying config sha256:ecc0cdc6ecc65607d63a1847e235f4988c104b07e680c0eed8b2fc0e5c20d934
Jan 23 18:21:50 master0-sahara bash[4859]: Writing manifest to image destination
Jan 23 18:21:50 master0-sahara bash[4859]: Storing signatures
Jan 23 18:21:50 master0-sahara bash[4859]: time="2023-01-23T18:21:50Z" level=warning msg="Found incomplete layer \"e2e51ecd22dcbc318fb317f20dff685c6d54755d60a80b12ed290658864d45fd\", deleting it"
Jan 23 18:21:50 master0-sahara bash[4859]: Error: checking platform of image ecc0cdc6ecc65607d63a1847e235f4988c104b07e680c0eed8b2fc0e5c20d934: inspecting image: layer not known

Everything is working with 4.11

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Everytime

Steps to Reproduce:

1. Redeploy 4.12 using Openshift IPI

Actual results:

level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get \"https://api.orjfdciocp-sahara.otcdcslab.com:6443/apis/config.openshift.io/v1/clusteroperators\": dial tcp 172.16.8.45:6443: connect: no route to host", "level=error msg=Bootstrap failed to complete: timed out waiting for the condition", "level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.", "level=warning msg=The bootstrap machine is unable to resolve API and/or API-Int Server URLs", "level=info msg=Successfully resolved API_INT_URL api-int.orjfdciocp-sahara.otcdcslab.com", "level=info msg=Unable to reach API_INT_URL's https endpoint at https://172.16.8.45:6443/version", "level=info msg=It might be too early for the https://172.16.8.45:6443/version to be available.", "level=info msg=Bootstrap gather logs captured here \"/home/kni/clusterconfigs/log-bundle-20230124124420.tar.gz\""], "stdout": "", "stdout_lines": []}

Expected results:

success 

Additional info:

The workaround is to do a `podman system reset` on the failing master

Description of problem:

Scaling up more worker nodes, they are not added to the Load Balancer instances (backend pool); if the router pods move to the new worker nodes, co/ingress becomes degraded

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-23-204408

How reproducible:

100%

Steps to Reproduce:

1. ensure the fresh install cluster works well.
2. scale up worker nodes.
$ oc -n openshift-machine-api get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
hongli-1024-hnkrm-worker-us-east-2a   1         1         1       1           5h21m
hongli-1024-hnkrm-worker-us-east-2b   1         1         1       1           5h21m
hongli-1024-hnkrm-worker-us-east-2c   1         1         1       1           5h21m

$ oc -n openshift-machine-api scale machineset hongli-1024-hnkrm-worker-us-east-2a --replicas=2
machineset.machine.openshift.io/hongli-1024-hnkrm-worker-us-east-2a scaled

$ oc -n openshift-machine-api scale machineset hongli-1024-hnkrm-worker-us-east-2b --replicas=2
machineset.machine.openshift.io/hongli-1024-hnkrm-worker-us-east-2b scaled

(about 5 minutes later)
$ oc -n openshift-machine-api get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
hongli-1024-hnkrm-worker-us-east-2a   2         2         2       2           5h29m
hongli-1024-hnkrm-worker-us-east-2b   2         2         2       2           5h29m
hongli-1024-hnkrm-worker-us-east-2c   1         1         1       1           5h29m


3. delete the router pods so that new ones run on the new workers

$ oc get node
NAME                                         STATUS   ROLES                  AGE     VERSION
ip-10-0-128-45.us-east-2.compute.internal    Ready    worker                 71m     v1.25.2+4bd0702
ip-10-0-131-192.us-east-2.compute.internal   Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-139-51.us-east-2.compute.internal    Ready    worker                 6h29m   v1.25.2+4bd0702
ip-10-0-162-228.us-east-2.compute.internal   Ready    worker                 71m     v1.25.2+4bd0702
ip-10-0-172-216.us-east-2.compute.internal   Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-190-82.us-east-2.compute.internal    Ready    worker                 6h25m   v1.25.2+4bd0702
ip-10-0-196-26.us-east-2.compute.internal    Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-199-158.us-east-2.compute.internal   Ready    worker                 6h28m   v1.25.2+4bd0702

$ oc -n openshift-ingress get pod -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
router-default-86444dcd84-cm96l   1/1     Running   0          65m   10.130.2.7   ip-10-0-128-45.us-east-2.compute.internal    <none>           <none>
router-default-86444dcd84-vpnjz   1/1     Running   0          65m   10.131.2.7   ip-10-0-162-228.us-east-2.compute.internal   <none>           <none>


Actual results:

$ oc get co ingress console authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress          4.12.0-0.nightly-2022-10-23-204408   True        False         True       66m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
console          4.12.0-0.nightly-2022-10-23-204408   False       False         False      66m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-1024.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-1024.qe.devcluster.openshift.com": EOF
authentication   4.12.0-0.nightly-2022-10-23-204408   False       False         True       66m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-1024.qe.devcluster.openshift.com/healthz": EOF


Checked the Load Balancer on the AWS console and found that the newly created nodes are not added to the load balancer. See the snapshot attached.

Expected results:

The LB should add newly created instances automatically and ingress should work with the new workers.

Additional info:

1. This is also reproducible with a regular user-created LoadBalancer service.
2. If the LB service is created after adding the new nodes then it works well; we can see that all nodes are added to the LB on the AWS console.

 

This is a clone of issue OCPBUGS-3027. The following is the description of the original issue:

Description of problem:

When running the console in development mode per https://github.com/openshift/console#frontend-development, metrics do not load on the cluster overview, pods list page, pod details page (Metrics tab is missing), etc.

Samuel Padgett suspects the changes in https://github.com/openshift/console/commit/0bd839da219462ea585183de1c856fb60e9f96fb are related.

This is a clone of issue OCPBUGS-11473. The following is the description of the original issue:

This is a clone of issue OCPBUGS-160. The following is the description of the original issue:

Description of problem:

The NS autolabeler should adjust the PSS namespace labels such that a previously permitted workload (based on the SCCs it has access to) can still run.

The autolabeler requires the RoleBinding's .subjects[].namespace to be set when .subjects[].kind is ServiceAccount even though this is not required by the RBAC system to successfully bind the SA to a Role

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.7.0-0.ci-2021-05-21-142747
Server Version: 4.12.0-0.nightly-2022-08-15-150248
Kubernetes Version: v1.24.0+da80cd0

How reproducible: 100%

Steps to Reproduce:

---
apiVersion: v1
kind: Namespace
metadata:
  name: test

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mysa
  namespace: test

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: myrole
  namespace: test
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: myrb
  namespace: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: myrole
subjects:
- kind: ServiceAccount
  name: mysa
  #namespace: test  # This is required for the autolabeler

---
kind: Job
apiVersion: batch/v1
metadata:
  name: myjob
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: ubi
          image: registry.access.redhat.com/ubi8
          command: ["/bin/bash", "-c"]
          args: ["whoami; sleep infinity"]
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      serviceAccount: mysa
      terminationGracePeriodSeconds: 2

Actual results:

Applying the manifest above, the Job's pod will not start:

$ kubectl -n test describe job/myjob
...
Events:
  Type     Reason        Age   From            Message
  ----     ------        ----  ----            -------
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-zxcvv" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  20s   job-controller  Error creating: pods "myjob-fkb9x" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Warning  FailedCreate  10s   job-controller  Error creating: pods "myjob-5klpc" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "ubi" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "ubi" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "ubi" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "ubi" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Uncommenting the "namespace" field in the RoleBinding will allow it to start as the autolabeler will adjust the Namespace labels.

However, the namespace field isn't actually required by the RBAC system. Instead of using the autolabeler, the pod can be allowed to run by (w/o uncommenting the field):

$ kubectl label ns/test security.openshift.io/scc.podSecurityLabelSync=false
namespace/test labeled
$ kubectl label ns/test pod-security.kubernetes.io/enforce=privileged --overwrite
namespace/test labeled

 

We now see that the pod is running as root and has access to the privileged scc:

$ kubectl -n test get po -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.129.2.18/23"],"mac_address":"0a:58:0a:81:02:12","gateway_ips":["10.129.2.1"],"ip_address":"10.129.2.18/23","gateway_ip":"10.129.2.1"'}}
      k8s.v1.cni.cncf.io/network-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      k8s.v1.cni.cncf.io/networks-status: |-
        [{
            "name": "ovn-kubernetes",
            "interface": "eth0",
            "ips": [
                "10.129.2.18"
            ],
            "mac": "0a:58:0a:81:02:12",
            "default": true,
            "dns": {}
        }]
      openshift.io/scc: privileged
    creationTimestamp: "2022-08-16T13:08:24Z"
    generateName: myjob-
    labels:
      controller-uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
      job-name: myjob
    name: myjob-rwjmv
    namespace: test
    ownerReferences:
    - apiVersion: batch/v1
      blockOwnerDeletion: true
      controller: true
      kind: Job
      name: myjob
      uid: 1867dbe6-73b2-44ea-a324-45c9273107b8
    resourceVersion: "36418"
    uid: 39f18dea-31d4-4783-85b5-8ae6a8bec1f4
  spec:
    containers:
    - args:
      - whoami; sleep infinity
      command:
      - /bin/bash
      - -c
      image: registry.access.redhat.com/ubi8
      imagePullPolicy: Always
      name: ubi
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-6f2h6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: mysa-dockercfg-mvmtn
    nodeName: ip-10-0-140-172.ec2.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      runAsUser: 0
    serviceAccount: mysa
    serviceAccountName: mysa
    terminationGracePeriodSeconds: 2
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-6f2h6
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
        - configMap:
            items:
            - key: service-ca.crt
              path: service-ca.crt
            name: openshift-service-ca.crt
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:28Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2022-08-16T13:08:24Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://8fd1c3a5ee565a1089e4e6032bd04bceabb5ab3946c34a2bb55d3ee696baa007
      image: registry.access.redhat.com/ubi8:latest
      imageID: registry.access.redhat.com/ubi8@sha256:08e221b041a95e6840b208c618ae56c27e3429c3dad637ece01c9b471cc8fac6
      lastState: {}
      name: ubi
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2022-08-16T13:08:28Z"
    hostIP: 10.0.140.172
    phase: Running
    podIP: 10.129.2.18
    podIPs:
    - ip: 10.129.2.18
    qosClass: BestEffort
    startTime: "2022-08-16T13:08:24Z"
kind: List
metadata:
  resourceVersion: ""

 

$ kubectl -n test logs job/myjob
root

 

Expected results:

The autolabeler should properly follow the RoleBinding back to the SCC

 

Additional info:

Refer to the CIS RedHat OpenShift Container Platform Benchmark PDF: https://drive.google.com/file/d/12o6O-M2lqz__BgmtBrfeJu1GA2SJ352c/view
1.1.7 Ensure that the etcd pod specification file permissions are set to 600 or more restrictive (Manual)
======================================================================================================
As per CIS v1.3 PDF permissions should be 600 with the following statement:
"The pod specification file is created on control plane nodes at /etc/kubernetes/manifests/etcd-member.yaml with permissions 644. Verify that the permissions are 600 or more restrictive."
But when I ran the following command it was showing 644 permissions

for i in $(oc get pods -n openshift-etcd -l app=etcd -o name | grep etcd )
do
echo "check pod $i"
oc rsh -n openshift-etcd $i \
stat -c %a /etc/kubernetes/manifests/etcd-pod.yaml
done

This is a clone of issue OCPBUGS-5548. The following is the description of the original issue:

Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390

When creating a Deployment, DeploymentConfig, or Knative Service with the Pipeline option enabled, and then deleting it again with the "Delete other resources created by console" option enabled (only available on 4.13+ with the PR above), the automatically created Pipeline is not deleted.

When the user tries to create the same resource with a Pipeline again this fails with an error:

An error occurred
secrets "nodeinfo-generic-webhook-secret" already exists

Version-Release number of selected component (if applicable):
4.13

(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5547)

How reproducible:
Always

Steps to Reproduce:

  1. Install OpenShift Pipelines operator (tested with 1.8.2)
  2. Create a new project
  3. Navigate to Add > Import from git and create an application
  4. Case 1: In the topology select the new resource and delete it
  5. Case 2: In the topology select the application group and delete the complete app

Actual results:
Case 1: Delete resources:

  1. Deployment (tries it twice!) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Case 2: Delete application:

  1. Deployment (just once) $name
  2. Service $name
  3. Route $name
  4. ImageStream $name

Expected results:
Case 1: Delete resource:

  1. Delete Deployment $name should be called just once
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Case 2: Delete application:

  1. (Keep this deletion) Deployment $name
  2. (Keep this deletion) Service $name
  3. (Keep this deletion) Route $name
  4. (Keep this deletion) ImageStream $name
  5. Missing deletion of the Tekton Pipeline $name
  6. Missing deletion of the Tekton TriggerTemplate with generated name trigger-template-$name-$random
  7. Missing deletion of the Secret $name-generic-webhook-secret
  8. Missing deletion of the Secret $name-github-webhook-secret

Additional info:

This is a clone of issue OCPBUGS-5523. The following is the description of the original issue:

Description of problem:

catalog pod restarting frequently after one stack trace daily.

$ omc logs catalog-operator-f7477865d-x6frl -p
2023-01-04T13:05:15.175952229Z time="2023-01-04T13:05:15Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
2023-01-04T13:05:15.175952229Z fatal error: concurrent map read and map write
2023-01-04T13:05:15.178587884Z
2023-01-04T13:05:15.178674833Z goroutine 669 [running]:
2023-01-04T13:05:15.179284556Z runtime.throw({0x1efdc12, 0xc000580000})
2023-01-04T13:05:15.179458107Z 	/usr/lib/golang/src/runtime/panic.go:1198 +0x71 fp=0xc00559d098 sp=0xc00559d068 pc=0x43bcd1
2023-01-04T13:05:15.179707701Z runtime.mapaccess1_faststr(0x7f39283dd878, 0x10, {0xc000894c40, 0xf})
2023-01-04T13:05:15.179932520Z 	/usr/lib/golang/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc00559d100 sp=0xc00559d098 pc=0x418ca5
2023-01-04T13:05:15.180181245Z github.com/operator-framework/operator-lifecycle-manager/pkg/metrics.UpdateSubsSyncCounterStorage(0xc00545cfc0)
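The fatal error in the trace ("concurrent map read and map write") means the Go runtime detected an unsynchronized map being read and written from different goroutines and aborted the process. Below is a minimal, hypothetical sketch of the usual remedy, guarding a shared counter map with a sync.RWMutex; the names (subsCounter, RecordSync, SyncCount) are illustrative and are not the actual OLM metrics code.

package main

import (
    "fmt"
    "sync"
)

// subsCounter is a shared per-subscription sync counter; the mutex makes
// concurrent reads and writes safe (the race that crashes the pod above).
var (
    mu          sync.RWMutex
    subsCounter = map[string]int{}
)

// RecordSync increments the counter under the write lock.
func RecordSync(name string) {
    mu.Lock()
    defer mu.Unlock()
    subsCounter[name]++
}

// SyncCount reads the counter under the read lock, so concurrent readers
// never race with a writer.
func SyncCount(name string) int {
    mu.RLock()
    defer mu.RUnlock()
    return subsCounter[name]
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(2)
        go func() { defer wg.Done(); RecordSync("my-subscription") }()
        go func() { defer wg.Done(); _ = SyncCount("my-subscription") }()
    }
    wg.Wait()
    fmt.Println("syncs recorded:", SyncCount("my-subscription"))
}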

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Slack discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1673120541153639
Must-gather link: https://attachments.access.redhat.com/hydra/rest/cases/03396604/attachments/25f23643-2447-442b-ba26-4338b679b8cc?usePresignedUrl=true

 

Description of problem:

AWS tagging: when applying user-defined tags, you cannot add more than 10.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Configure userTags for the AWS platform with more than 8 tags.
2. Installer fails to add the tags, while AWS supports up to 50 tags.

Actual results:

Installer validation fails.

Expected results:

Installer should be able to add more than 8 tags.

Additional info:

 

This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:

Description of problem:

virtual media provisioning fails when iLO Ironic driver is used

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers
2.
3.

Actual results:

Provisioning fails with "An auth plugin is required to determine endpoint URL" error

Expected results:

Provisioning succeeds

Additional info:

Relevant log snippet:

3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de7578cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
 3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
 3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
 3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
 3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
 3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     result = f(*args, **kwargs)
 3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
 3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
 3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
 3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     return prepare_iso_image(inject_files=inject_files)
 3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
 3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     image_url = img_handler.publish_image(
 3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
 3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     swift_api = swift.SwiftAPI()
 3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
 3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     endpoint = keystone.get_endpoint('swift', session=session)

This is a clone of issue OCPBUGS-3018. The following is the description of the original issue:

Description of problem:

When running an overnight run in dev-scripts (COMPACT_IPV4) with repeated installs, I saw this panic in WaitForBootstrapComplete occur once.

level=debug msg=Agent Rest API Initialized
E1101 05:19:09.733309 1802865 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4086520?, 0x1d875810})
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00056fb00?})
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x4086520, 0x1d875810})
    /usr/local/go/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/agent.(*NodeZeroRestClient).getClusterID(0xc0001341e0)
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/rest.go:121 +0x53
github.com/openshift/installer/pkg/agent.(*Cluster).IsBootstrapComplete(0xc000134190)
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/cluster.go:183 +0x4fc
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete.func1()
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:31 +0x77
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x1d8fa901?)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001958c0?, {0x1a53c7a0, 0xc0011d4a50}, 0x1, 0xc0001958c0)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0009ab860?, 0x77359400, 0x0, 0xa?, 0x8?)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete({0x7ffd7fccb4e3?, 0x40d7e7?})
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:30 +0x1bc
github.com/openshift/installer/pkg/agent.WaitForInstallComplete({0x7ffd7fccb4e3?, 0x5?})
    /home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:73 +0x56
github.com/openshift/installer/cmd/openshift-install/agent.newWaitForInstallCompleteCmd.func1(0xc0003b6c80?, {0xc0004d67c0?, 0x2?, 0x2?})
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/agent/waitfor.go:73 +0x126
github.com/spf13/cobra.(*Command).execute(0xc0003b6c80, {0xc0004d6780, 0x2, 0x2})
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc0013b0a00)
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
    /home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
    /home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x33d3cd3]

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Occurred on the 12th run; all previous installs were successful.

Steps to Reproduce:

1. Set up dev-scripts for AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, no mirroring
2. Run 'make clean; make agent' in a loop
3. After repeated installs, got the failure

Actual results:

Panic in WaitForBootstrapComplete

Expected results:

No failure

Additional info:

It looks like clusterResult is used here even on failure, which causes the dereference - https://github.com/openshift/installer/blob/master/pkg/agent/rest.go#L121
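In other words, the response returned by the REST call is dereferenced even when the call itself failed, so the pointer can be nil. A minimal, hypothetical sketch of the pattern and the guard that avoids the panic; listClusters, clusterList, and getClusterID below are illustrative stand-ins, not the installer's actual code.

package main

import (
    "errors"
    "fmt"
)

// clusterList stands in for the response type returned by the REST client.
type clusterList struct {
    IDs []string
}

// listClusters is a hypothetical stand-in for the REST call, which can fail
// while the agent API is still coming up.
func listClusters() (*clusterList, error) {
    return nil, errors.New("agent API not reachable yet")
}

// getClusterID checks the error and a nil/empty result before dereferencing,
// which is the guard missing in the panic above.
func getClusterID() (string, error) {
    resp, err := listClusters()
    if err != nil {
        // Propagate the failure instead of dereferencing a possibly-nil response.
        return "", fmt.Errorf("cannot determine cluster ID: %w", err)
    }
    if resp == nil || len(resp.IDs) == 0 {
        return "", errors.New("cluster list is empty")
    }
    return resp.IDs[0], nil
}

func main() {
    if id, err := getClusterID(); err != nil {
        fmt.Println("error:", err)
    } else {
        fmt.Println("cluster ID:", id)
    }
}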

 

This is a clone of issue OCPBUGS-5306. The following is the description of the original issue:

Description of problem:

One old machine is stuck in Deleting and many cluster operators become degraded when doing master replacement on a cluster with the OVN network

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-01-02-175114

How reproducible:

Always, after several attempts

Steps to Reproduce:

1. Install a cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-01-02-175114   True        False         30m     Cluster version is 4.12.0-0.nightly-2023-01-02-175114
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      33m     
baremetal                                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
cloud-controller-manager                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      84m     
cloud-credential                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
cluster-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
cluster-autoscaler                         4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
config-operator                            4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
console                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      33m     
control-plane-machine-set                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      79m     
csi-snapshot-controller                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
dns                                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
etcd                                       4.12.0-0.nightly-2023-01-02-175114   True        False         False      79m     
image-registry                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
ingress                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
insights                                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      21m     
kube-apiserver                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-controller-manager                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-scheduler                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      77m     
kube-storage-version-migrator              4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
machine-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
machine-approver                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
machine-config                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
marketplace                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
monitoring                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      72m     
network                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      83m     
node-tuning                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      80m     
openshift-apiserver                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
openshift-controller-manager               4.12.0-0.nightly-2023-01-02-175114   True        False         False      76m     
openshift-samples                          4.12.0-0.nightly-2023-01-02-175114   True        False         False      22m     
operator-lifecycle-manager                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2023-01-02-175114   True        False         False      75m     
platform-operators-aggregated              4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
service-ca                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      81m     
storage                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      74m     
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   85m
huliu-aws4d2-fcks7-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   85m
huliu-aws4d2-fcks7-master-2                  Running   m6i.xlarge   us-east-2   us-east-2a   85m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running   m6i.xlarge   us-east-2   us-east-2a   80m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running   m6i.xlarge   us-east-2   us-east-2a   80m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running   m6i.xlarge   us-east-2   us-east-2b   80m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   86m

2. Edit controlplanemachineset, change instanceType to another value to trigger RollingUpdate
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE          TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Running        m6i.xlarge   us-east-2   us-east-2a   86m
huliu-aws4d2-fcks7-master-1                  Running        m6i.xlarge   us-east-2   us-east-2b   86m
huliu-aws4d2-fcks7-master-2                  Running        m6i.xlarge   us-east-2   us-east-2a   86m
huliu-aws4d2-fcks7-master-mbgz6-0            Provisioning   m5.xlarge    us-east-2   us-east-2a   5s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running        m6i.xlarge   us-east-2   us-east-2a   81m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running        m6i.xlarge   us-east-2   us-east-2a   81m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running        m6i.xlarge   us-east-2   us-east-2b   81m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-0                  Deleting   m6i.xlarge   us-east-2   us-east-2a   92m
huliu-aws4d2-fcks7-master-1                  Running    m6i.xlarge   us-east-2   us-east-2b   92m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   92m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   5m36s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   87m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   87m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   87m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE         TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Running       m6i.xlarge   us-east-2   us-east-2b   101m
huliu-aws4d2-fcks7-master-2                  Running       m6i.xlarge   us-east-2   us-east-2a   101m
huliu-aws4d2-fcks7-master-mbgz6-0            Running       m5.xlarge    us-east-2   us-east-2a   15m
huliu-aws4d2-fcks7-master-nbt9g-1            Provisioned   m5.xlarge    us-east-2   us-east-2b   3m1s
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running       m6i.xlarge   us-east-2   us-east-2a   96m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running       m6i.xlarge   us-east-2   us-east-2a   96m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running       m6i.xlarge   us-east-2   us-east-2b   96m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Deleting   m6i.xlarge   us-east-2   us-east-2b   149m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   149m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   62m
huliu-aws4d2-fcks7-master-nbt9g-1            Running    m5.xlarge    us-east-2   us-east-2b   50m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   144m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   144m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   144m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
huliu-aws4d2-fcks7-master-1                  Deleting   m6i.xlarge   us-east-2   us-east-2b   4h12m
huliu-aws4d2-fcks7-master-2                  Running    m6i.xlarge   us-east-2   us-east-2a   4h12m
huliu-aws4d2-fcks7-master-mbgz6-0            Running    m5.xlarge    us-east-2   us-east-2a   166m
huliu-aws4d2-fcks7-master-nbt9g-1            Running    m5.xlarge    us-east-2   us-east-2b   153m
huliu-aws4d2-fcks7-worker-us-east-2a-m279f   Running    m6i.xlarge   us-east-2   us-east-2a   4h7m
huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps   Running    m6i.xlarge   us-east-2   us-east-2a   4h7m
huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz   Running    m6i.xlarge   us-east-2   us-east-2b   4h7m

3. master-1 is stuck in Deleting, many cluster operators become degraded, and many pods cannot reach Running
liuhuali@Lius-MacBook-Pro huali-test % oc get co     
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       9s      APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-7b65bbc76b-mxl99 pod)...
baremetal                                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cloud-controller-manager                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h11m   
cloud-credential                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cluster-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
cluster-autoscaler                         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
config-operator                            4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
console                                    4.12.0-0.nightly-2023-01-02-175114   False       False         False      150m    RouteHealthAvailable: console route is not admitted
control-plane-machine-set                  4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h7m    Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h9m    CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
dns                                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
etcd                                       4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h7m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal...
image-registry                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
ingress                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
insights                                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h8m    
kube-apiserver                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h5m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal
kube-controller-manager                    4.12.0-0.nightly-2023-01-02-175114   True        False         True       4h5m    GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.19.115:9091: i/o timeout
kube-scheduler                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h5m    
kube-storage-version-migrator              4.12.0-0.nightly-2023-01-02-175114   True        False         False      162m    
machine-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h3m    
machine-approver                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
machine-config                             4.12.0-0.nightly-2023-01-02-175114   False       False         True       139m    Cluster not available for [{operator 4.12.0-0.nightly-2023-01-02-175114}]: error during waitForDeploymentRollout: [timed out waiting for the condition, deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m    
monitoring                                 4.12.0-0.nightly-2023-01-02-175114   False       True          True       144m    reconciling Prometheus Operator Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator: got 1 unavailable replicas
network                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h11m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
node-tuning                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h7m    
openshift-apiserver                        4.12.0-0.nightly-2023-01-02-175114   False       True          False      151m    APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager               4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h4m    
openshift-samples                          4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h10m   
operator-lifecycle-manager                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2023-01-02-175114   True        False         False      2m44s   
platform-operators-aggregated              4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m    
service-ca                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m    
storage                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h2m    AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test % 


liuhuali@Lius-MacBook-Pro huali-test % oc get pod --all-namespaces|grep -v Running
NAMESPACE                                          NAME                                                                       READY   STATUS              RESTARTS         AGE
openshift-apiserver                                apiserver-5cbdf985f9-85z4t                                                 0/2     Init:0/1            0                155m
openshift-authentication                           oauth-openshift-5c46d6658b-lkbjj                                           0/1     Pending             0                156m
openshift-cloud-credential-operator                pod-identity-webhook-77bf7c646d-4rtn8                                      0/1     ContainerCreating   0                156m
openshift-cluster-api                              capa-controller-manager-d484bc464-lhqbk                                    0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-controller-5668745dcb-jc7fm                             0/11    ContainerCreating   0                156m
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-operator-5d6b9fbd77-827vs                               0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      shared-resource-csi-driver-operator-866d897954-z77gz                       0/1     ContainerCreating   0                156m
openshift-cluster-csi-drivers                      shared-resource-csi-driver-webhook-d794748dc-kctkn                         0/1     ContainerCreating   0                156m
openshift-cluster-samples-operator                 cluster-samples-operator-754758b9d7-nbcc9                                  0/2     ContainerCreating   0                156m
openshift-cluster-storage-operator                 csi-snapshot-controller-6d9c448fdd-wdb7n                                   0/1     ContainerCreating   0                156m
openshift-cluster-storage-operator                 csi-snapshot-webhook-6966f555f8-cbdc7                                      0/1     ContainerCreating   0                156m
openshift-console-operator                         console-operator-7d8567876b-nxgpj                                          0/2     ContainerCreating   0                156m
openshift-console                                  console-855f66f4f8-q869k                                                   0/1     ContainerCreating   0                156m
openshift-console                                  downloads-7b645b6b98-7jqfw                                                 0/1     ContainerCreating   0                156m
openshift-controller-manager                       controller-manager-548c7f97fb-bl68p                                        0/1     Pending             0                156m
openshift-etcd                                     installer-13-ip-10-0-76-132.us-east-2.compute.internal                     0/1     ContainerCreating   0                9m39s
openshift-etcd                                     installer-3-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h13m
openshift-etcd                                     installer-4-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h12m
openshift-etcd                                     installer-5-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h7m
openshift-etcd                                     installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h1m
openshift-etcd                                     installer-8-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-10-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                160m
openshift-etcd                                     revision-pruner-10-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                160m
openshift-etcd                                     revision-pruner-11-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                159m
openshift-etcd                                     revision-pruner-11-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                159m
openshift-etcd                                     revision-pruner-11-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-12-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                156m
openshift-etcd                                     revision-pruner-13-ip-10-0-48-21.us-east-2.compute.internal                0/1     ContainerCreating   0                155m
openshift-etcd                                     revision-pruner-13-ip-10-0-63-159.us-east-2.compute.internal               0/1     Completed           0                155m
openshift-etcd                                     revision-pruner-13-ip-10-0-76-132.us-east-2.compute.internal               0/1     ContainerCreating   0                10m
openshift-etcd                                     revision-pruner-13-ip-10-0-79-159.us-east-2.compute.internal               0/1     Completed           0                155m
openshift-etcd                                     revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-etcd                                     revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                3h57m
openshift-etcd                                     revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-etcd                                     revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                166m
openshift-etcd                                     revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                166m
openshift-kube-apiserver                           installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h4m
openshift-kube-apiserver                           installer-7-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                168m
openshift-kube-apiserver                           installer-9-ip-10-0-76-132.us-east-2.compute.internal                      0/1     ContainerCreating   0                9m52s
openshift-kube-apiserver                           revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-apiserver                           revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                3h59m
openshift-kube-apiserver                           revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                168m
openshift-kube-apiserver                           revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                168m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                166m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                166m
openshift-kube-apiserver                           revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal                 0/1     ContainerCreating   0                155m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                155m
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                9m54s
openshift-kube-apiserver                           revision-pruner-9-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                155m
openshift-kube-controller-manager                  installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h11m
openshift-kube-controller-manager                  installer-7-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h7m
openshift-kube-controller-manager                  installer-8-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                169m
openshift-kube-controller-manager                  installer-8-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h4m
openshift-kube-controller-manager                  installer-8-ip-10-0-79-159.us-east-2.compute.internal                      0/1     Completed           0                156m
openshift-kube-controller-manager                  revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h13m
openshift-kube-controller-manager                  revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h10m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h5m
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                4m36s
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-kube-scheduler                           installer-6-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h11m
openshift-kube-scheduler                           installer-7-ip-10-0-48-21.us-east-2.compute.internal                       0/1     Completed           0                169m
openshift-kube-scheduler                           installer-7-ip-10-0-63-159.us-east-2.compute.internal                      0/1     Completed           0                4h10m
openshift-kube-scheduler                           installer-7-ip-10-0-79-159.us-east-2.compute.internal                      0/1     Completed           0                156m
openshift-kube-scheduler                           revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h13m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal                 0/1     Completed           0                169m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal                0/1     Completed           0                4h10m
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-76-132.us-east-2.compute.internal                0/1     ContainerCreating   0                4m36s
openshift-kube-scheduler                           revision-pruner-7-ip-10-0-79-159.us-east-2.compute.internal                0/1     Completed           0                156m
openshift-machine-config-operator                  machine-config-controller-55b4d497b6-p89lb                                 0/2     ContainerCreating   0                156m
openshift-marketplace                              qe-app-registry-w8gnc                                                      0/1     ContainerCreating   0                148m
openshift-monitoring                               prometheus-operator-776bd79f6d-vz7q5                                       0/2     ContainerCreating   0                156m
openshift-multus                                   multus-admission-controller-5f88d77b65-nzmj5                               0/2     ContainerCreating   0                156m
openshift-oauth-apiserver                          apiserver-7b65bbc76b-mxl99                                                 0/1     Init:0/1            0                154m
openshift-operator-lifecycle-manager               collect-profiles-27879975-fpvzk                                            0/1     Completed           0                3h21m
openshift-operator-lifecycle-manager               collect-profiles-27879990-86rk8                                            0/1     Completed           0                3h6m
openshift-operator-lifecycle-manager               collect-profiles-27880005-bscc4                                            0/1     Completed           0                171m
openshift-operator-lifecycle-manager               collect-profiles-27880170-s8cbj                                            0/1     ContainerCreating   0                4m37s
openshift-operator-lifecycle-manager               packageserver-6f8f8f9d54-4r96h                                             0/1     ContainerCreating   0                156m
openshift-ovn-kubernetes                           ovnkube-master-lr9pk                                                       3/6     CrashLoopBackOff    23 (46s ago)     156m
openshift-route-controller-manager                 route-controller-manager-747bf8684f-5vhwx                                  0/1     Pending             0                156m
liuhuali@Lius-MacBook-Pro huali-test % 

Actual results:

RollingUpdate cannot complete successfully

Expected results:

RollingUpdate should complete successfully

Additional info:

Must-gather: https://drive.google.com/file/d/1bvE1XUuZKLBGmq7OTXNVCNcFZkqbarab/view?usp=sharing

Must-gather of another cluster that hit the same issue (also this template ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci, with OVN network): https://drive.google.com/file/d/1CqAJlqk2wgnEuMo3lLaObk4Nbxi82y_A/view?usp=sharing

Must-gather of another cluster that hit the same issue (this template ipi-on-aws/versioned-installer-private_cluster-sts-usgov-ci, with OVN network):
https://drive.google.com/file/d/1tnKbeqJ18SCAlJkS80Rji3qMu3nvN_O8/view?usp=sharing

This template, ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci with OVN network, seems to hit this issue fairly often.

This is a clone of issue OCPBUGS-13168. The following is the description of the original issue:

Description of problem:

oc login --token=$token --server=https://api.dalh-dev-hs-2.05zb.p3.openshiftapps.com:443 --certificate-authority=ca.crt
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.

The referenced "ca.crt" comes from the Secret created when a Service Account is created.

Version-Release number of selected component (if applicable): 4.12.12

How reproducible: Always

Tracker bug for bootimage bump in 4.12. This bug should block bugs which need a bootimage bump to fix.

The previous tracker is OCPBUGS-561.

If the status of the hosts in assisted-installer changes from preparing-for-installation back to known (ready), that means it failed to generate the ignition configs needed to install, and installation will not proceed. When we see this, we should report a failure immediately from agent wait-for bootstrap-complete. Currently we just time out some time after reporting this log message:

level=info msg=Host master-2.ostest.test.metalkube.org: updated status from preparing-for-installation to known (Host is ready to be installed) 

To catch the case where the user runs the command after this failure has already happened, perhaps we should institute a relatively short timeout for installation to begin after all of the hosts are in the known state.
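A minimal, hypothetical sketch of what that fail-fast check could look like as a simple polling loop; hostStatuses and the status strings below are illustrative stand-ins, not the actual assisted-service client API.

package main

import (
    "errors"
    "fmt"
    "time"
)

var poll int

// hostStatuses is a hypothetical poll of per-host states from assisted-service.
// For the demo it simulates ignition generation failing: after a few polls the
// hosts drop back from preparing-for-installation to known.
func hostStatuses() map[string]string {
    poll++
    if poll < 3 {
        return map[string]string{"master-0": "preparing-for-installation", "master-1": "preparing-for-installation"}
    }
    return map[string]string{"master-0": "known", "master-1": "known"}
}

// waitForInstallStart fails immediately when a host falls back from
// preparing-for-installation to known, and otherwise only waits a short
// time for all hosts to start installing.
func waitForInstallStart(timeout time.Duration) error {
    wasPreparing := map[string]bool{}
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        statuses := hostStatuses()
        installing := 0
        for host, status := range statuses {
            switch status {
            case "preparing-for-installation":
                wasPreparing[host] = true
            case "known":
                if wasPreparing[host] {
                    return fmt.Errorf("host %s fell back from preparing-for-installation to known; ignition configs were not generated", host)
                }
            case "installing":
                installing++
            }
        }
        if installing == len(statuses) {
            return nil
        }
        time.Sleep(5 * time.Second)
    }
    return errors.New("timed out waiting for installation to begin")
}

func main() {
    if err := waitForInstallStart(2 * time.Minute); err != nil {
        fmt.Println("bootstrap wait failed:", err)
    }
}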

Our Prometheus alerts are inconsistent with both upstream and sometimes our own vendor folder. Let's do a clean update run before the next release is branched off.

Description of problem:

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When installing 1000+ SNOs via ACM/MCE via ZTP with GitOps, a small percentage of clusters never complete the install because the monitoring operator does not reconcile to Available.

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          16h     Unable to apply 4.11.0: the cluster operator monitoring has not yet successfully rolled out
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring             False       True          True       15h     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error. 

 

Version-Release number of selected component (if applicable):

  • Hub OCP and SNO OCP - 4.11.0
  • ACM - 2.6.0-DOWNSTREAM-2022-08-11-23-41-09  (FC5)

 

How reproducible:

  • 2 out of 23 failures out of 1728 installs
  • ~8% of the failures are because of this issue
  • failure rate of ~.1% of the total installs

 

Additional info:

 

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get po -n openshift-monitoring
NAME                                                     READY   STATUS              RESTARTS   AGE
alertmanager-main-0                                      0/6     ContainerCreating   0          15h
cluster-monitoring-operator-54dd78cc74-l5w24             2/2     Running             0          15h
kube-state-metrics-b6455c4dc-8hcfn                       3/3     Running             0          15h
node-exporter-k7899                                      2/2     Running             0          15h
openshift-state-metrics-7984888fbd-cl67v                 3/3     Running             0          15h
prometheus-adapter-785bf4f975-wgmnh                      1/1     Running             0          15h
prometheus-k8s-0                                         0/6     Init:0/1            0          15h
prometheus-operator-74d8754ff7-9zrgw                     2/2     Running             0          15h
prometheus-operator-admission-webhook-6665fb687d-c5jgv   1/1     Running             0          15h
thanos-querier-575496c665-jcc8l                          6/6     Running             0          15h 
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig describe po -n openshift-monitoring alertmanager-main-0
Name:                 alertmanager-main-0
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno01219/fc00:1001::8aa
Start Time:           Mon, 15 Aug 2022 23:53:39 +0000
Labels:               alertmanager=main
                      app.kubernetes.io/component=alert-router
                      app.kubernetes.io/instance=main
                      app.kubernetes.io/managed-by=prometheus-operator
                      app.kubernetes.io/name=alertmanager
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.24.0
                      controller-revision-hash=alertmanager-main-fcf8dd5fb
                      statefulset.kubernetes.io/pod-name=alertmanager-main-0
Annotations:          kubectl.kubernetes.io/default-container: alertmanager
                      openshift.io/scc: nonroot
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        StatefulSet/alertmanager-main
Containers:
  alertmanager:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:91308d35c1e56463f55c1aaa519ff4de7335d43b254c21abdb845fc8c72821a1
    Image ID:
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config/alertmanager.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=
      --web.listen-address=127.0.0.1:9093
      --web.external-url=https:/console-openshift-console.apps.sno01219.rdu2.scalelab.redhat.com/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     4m
      memory:  40Mi
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /alertmanager from alertmanager-main-db (rw)
      /etc/alertmanager/certs from tls-assets (ro)
      /etc/alertmanager/config from config-volume (rw)
      /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (ro)
      /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric from secret-alertmanager-kube-rbac-proxy-metric (ro)
      /etc/alertmanager/secrets/alertmanager-main-proxy from secret-alertmanager-main-proxy (ro)
      /etc/alertmanager/secrets/alertmanager-main-tls from secret-alertmanager-main-tls (ro)
      /etc/pki/ca-trust/extracted/pem/ from alertmanager-trusted-ca-bundle (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
  config-reloader:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=localhost:8080
      --reload-url=http://localhost:9093/-/reload
      --watched-dir=/etc/alertmanager/config
      --watched-dir=/etc/alertmanager/secrets/alertmanager-main-tls
      --watched-dir=/etc/alertmanager/secrets/alertmanager-main-proxy
      --watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy
      --watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_NAME:  alertmanager-main-0 (v1:metadata.name)
      SHARD:     -1
    Mounts:
      /etc/alertmanager/config from config-volume (ro)
      /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (ro)
      /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric from secret-alertmanager-kube-rbac-proxy-metric (ro)
      /etc/alertmanager/secrets/alertmanager-main-proxy from secret-alertmanager-main-proxy (ro)
      /etc/alertmanager/secrets/alertmanager-main-tls from secret-alertmanager-main-tls (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
  alertmanager-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:140f8947593d92e1517e50a201e83bdef8eb965b552a21d3caf346a250d0cf6e
    Image ID:
    Port:          9095/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9095
      -http-address=
      -email-domain=*
      -upstream=http://localhost:9093
      -openshift-sar=[{"resource": "namespaces", "verb": "get"}, {"resource": "alertmanagers", "resourceAPIGroup": "monitoring.coreos.com", "namespace": "openshift-monitoring", "verb": "patch", "resourceName": "non-existant"}]
      -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}, "/": {"resource":"alertmanagers", "group": "monitoring.coreos.com", "namespace": "openshift-monitoring", "verb": "patch", "name": "non-existant"}}
      -tls-cert=/etc/tls/private/tls.crt
      -tls-key=/etc/tls/private/tls.key
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret-file=/etc/proxy/secrets/session_secret
      -openshift-service-account=alertmanager-main
      -openshift-ca=/etc/pki/tls/cert.pem
      -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  20Mi
    Environment:
      HTTP_PROXY:
      HTTPS_PROXY:
      NO_PROXY:
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from alertmanager-trusted-ca-bundle (ro)
      /etc/proxy/secrets from secret-alertmanager-main-proxy (rw)
      /etc/tls/private from secret-alertmanager-main-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e
    Image ID:
    Port:          9092/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:9092
      --upstream=http://127.0.0.1:9096
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --logtostderr=true
      --tls-min-version=VersionTLS12
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (rw)
      /etc/tls/private from secret-alertmanager-main-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
  kube-rbac-proxy-metric:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e
    Image ID:
    Port:          9097/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:9097
      --upstream=http://127.0.0.1:9093
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --client-ca-file=/etc/tls/client/client-ca.crt
      --logtostderr=true
      --allow-paths=/metrics
      --tls-min-version=VersionTLS12
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy-metric (ro)
      /etc/tls/client from metrics-client-ca (ro)
      /etc/tls/private from secret-alertmanager-main-tls (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
  prom-label-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2550b2cbdf864515b1edacf43c25eb6b6f179713c1df34e51f6e9bba48d6430a
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --insecure-listen-address=127.0.0.1:9096
      --upstream=http://127.0.0.1:9093
      --label=namespace
      --error-on-replace
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-main-generated
    Optional:    false
  tls-assets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          alertmanager-main-tls-assets-0
    SecretOptionalName:  <nil>
  secret-alertmanager-main-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-main-tls
    Optional:    false
  secret-alertmanager-main-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-main-proxy
    Optional:    false
  secret-alertmanager-kube-rbac-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-kube-rbac-proxy
    Optional:    false
  secret-alertmanager-kube-rbac-proxy-metric:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alertmanager-kube-rbac-proxy-metric
    Optional:    false
  alertmanager-main-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  metrics-client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      metrics-client-ca
    Optional:  false
  alertmanager-trusted-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      alertmanager-trusted-ca-bundle-2rsonso43rc5p
    Optional:  true
  kube-api-access-hl77l:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  2m25s (x409 over 15h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-main-0_openshift-monitoring_1c367a83-24e3-4249-861a-a107a6beaee2_0(dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b): error adding pod openshift-monitoring_alertmanager-main-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-monitoring/alertmanager-main-0/1c367a83-24e3-4249-861a-a107a6beaee2:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-0 dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b] [openshift-monitoring/alertmanager-main-0 dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded                                                                                                                                                                                                                                                                             
 oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig describe po -n openshift-monitoring prometheus-k8s-0
Name:                 prometheus-k8s-0
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno01219/fc00:1001::8aa
Start Time:           Mon, 15 Aug 2022 23:53:39 +0000
Labels:               app.kubernetes.io/component=prometheus
                      app.kubernetes.io/instance=k8s
                      app.kubernetes.io/managed-by=prometheus-operator
                      app.kubernetes.io/name=prometheus
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=2.36.2
                      controller-revision-hash=prometheus-k8s-546b544f8b
                      operator.prometheus.io/name=k8s
                      operator.prometheus.io/shard=0
                      prometheus=k8s
                      statefulset.kubernetes.io/pod-name=prometheus-k8s-0
Annotations:          kubectl.kubernetes.io/default-container: prometheus
                      openshift.io/scc: nonroot
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        StatefulSet/prometheus-k8s
Init Containers:
  init-config-reloader:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --watch-interval=0
      --listen-address=:8080
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
Containers:
  prometheus:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7df53b796e81ba8301ba74d02317226329bd5752fd31c1b44d028e4832f21c3
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --storage.tsdb.retention.time=15d
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --web.enable-lifecycle
      --web.external-url=https:/console-openshift-console.apps.sno01219.rdu2.scalelab.redhat.com/monitoring
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        70m
      memory:     1Gi
    Liveness:     exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/healthy; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/healthy; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup:      exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=15s #success=1 #failure=60
    Environment:  <none>
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro)
      /etc/prometheus/configmaps/metrics-client-ca from configmap-metrics-client-ca (ro)
      /etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /etc/prometheus/secrets/kube-etcd-client-certs from secret-kube-etcd-client-certs (ro)
      /etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro)
      /etc/prometheus/secrets/metrics-client-certs from secret-metrics-client-certs (ro)
      /etc/prometheus/secrets/prometheus-k8s-proxy from secret-prometheus-k8s-proxy (ro)
      /etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro)
      /etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro)
      /etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
      /prometheus from prometheus-k8s-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
  config-reloader:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=localhost:8080
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_NAME:  prometheus-k8s-0 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
  thanos-sidecar:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36fc214537c763b3a3f0a9dc7a1bd4378a80428c31b2629df8786a9b09155e6d
    Image ID:
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://localhost:9090/
      --tsdb.path=/prometheus
      --http-address=127.0.0.1:10902
      --grpc-server-tls-cert=/etc/tls/grpc/server.crt
      --grpc-server-tls-key=/etc/tls/grpc/server.key
      --grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /etc/tls/grpc from secret-grpc-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
  prometheus-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:140f8947593d92e1517e50a201e83bdef8eb965b552a21d3caf346a250d0cf6e
    Image ID:
    Port:          9091/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9091
      -http-address=
      -email-domain=*
      -upstream=http://localhost:9090
      -openshift-service-account=prometheus-k8s
      -openshift-sar={"resource": "namespaces", "verb": "get"}
      -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
      -tls-cert=/etc/tls/private/tls.crt
      -tls-key=/etc/tls/private/tls.key
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret-file=/etc/proxy/secrets/session_secret
      -openshift-ca=/etc/pki/tls/cert.pem
      -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  20Mi
    Environment:
      HTTP_PROXY:
      HTTPS_PROXY:
      NO_PROXY:
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/proxy/secrets from secret-prometheus-k8s-proxy (rw)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
  kube-rbac-proxy:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e
    Image ID:
    Port:          9092/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:9092
      --upstream=http://127.0.0.1:9090
      --allow-paths=/metrics
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --client-ca-file=/etc/tls/client/client-ca.crt
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --logtostderr=true
      --tls-min-version=VersionTLS12
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
      /etc/tls/client from configmap-metrics-client-ca (ro)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
  kube-rbac-proxy-thanos:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e
    Image ID:
    Port:          10902/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=[$(POD_IP)]:10902
      --upstream=http://127.0.0.1:10902
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --client-ca-file=/etc/tls/client/client-ca.crt
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --allow-paths=/metrics
      --logtostderr=true
      --tls-min-version=VersionTLS12
      --client-ca-file=/etc/tls/client/client-ca.crt
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
      /etc/tls/client from metrics-client-ca (ro)
      /etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          prometheus-k8s-tls-assets-0
    SecretOptionalName:  <nil>
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  web-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-web-config
    Optional:    false
  secret-kube-etcd-client-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-etcd-client-certs
    Optional:    false
  secret-prometheus-k8s-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-tls
    Optional:    false
  secret-prometheus-k8s-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-proxy
    Optional:    false
  secret-prometheus-k8s-thanos-sidecar-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-thanos-sidecar-tls
    Optional:    false
  secret-kube-rbac-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-rbac-proxy
    Optional:    false
  secret-metrics-client-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-client-certs
    Optional:    false
  configmap-serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  configmap-kubelet-serving-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubelet-serving-ca-bundle
    Optional:  false
  configmap-metrics-client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      metrics-client-ca
    Optional:  false
  prometheus-k8s-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  metrics-client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      metrics-client-ca
    Optional:  false
  secret-grpc-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-grpc-tls-crdkohb1gb92n
    Optional:    false
  prometheus-trusted-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-trusted-ca-bundle-2rsonso43rc5p
    Optional:  true
  kube-api-access-85zlc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  4m19s (x409 over 15h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-0_openshift-monitoring_debda4d2-6914-4b36-92e0-78f68d539ab3_0(86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655): error adding pod openshift-monitoring_prometheus-k8s-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-monitoring/prometheus-k8s-0/debda4d2-6914-4b36-92e0-78f68d539ab3:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-0 86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655] [openshift-monitoring/prometheus-k8s-0 86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded

Both pods in error state seem to be waiting on this issue "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded"

In multinode deployments we can check the Node objects in the kube API; we can't really validate hosts that are not part of the cluster, only the one the controller is running on.

We should also validate the IP of the host the controller is running on.

If the IP has changed, log it.
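
A minimal sketch of that validation, assuming a client-go client and an expected IP the controller recorded earlier; the package, function, and parameter names are illustrative, not the actual controller code:

package validations

import (
	"context"
	"log"
	"net"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ValidateControllerNodeIP fetches the Node object for the host the
// controller runs on and logs a message if its InternalIP no longer
// matches the expected address.
func ValidateControllerNodeIP(ctx context.Context, client kubernetes.Interface, nodeName, expectedIP string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type != corev1.NodeInternalIP {
			continue
		}
		if !net.ParseIP(addr.Address).Equal(net.ParseIP(expectedIP)) {
			log.Printf("node %s IP changed: expected %s, found %s", nodeName, expectedIP, addr.Address)
		}
		return nil
	}
	log.Printf("node %s reports no InternalIP address", nodeName)
	return nil
}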

Description of problem:

The "[sig-arch] events should not repeat pathologically" test is frequently failing in 4.11 upgrade jobs. The error is too many readiness probe errors.

Version-Release number of selected component (if applicable):

 

How reproducible:

Flakey

Steps to Reproduce:

1.
2.
3.

Actual results:

: [sig-arch] events should not repeat pathologically 0s {  1 events happened too frequently

event happened 44 times, something is wrong: ns/openshift-monitoring pod/thanos-querier-d89745c9-xttvz node/ip-10-0-178-252.us-west-2.compute.internal - reason/ProbeError Readiness probe error: Get "https://10.128.2.13:9091/-/ready": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
body: 
}

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3069. The following is the description of the original issue:

Description of problem:

The cluster settings page shows an available upgrade. After the user chooses a target version, clicks "Upgrade", and waits for a long time, there is no info about the upgrade status.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-25-210451

How reproducible:

Always

Steps to Reproduce:

1. Log in to the console on a cluster with an available upgrade, select a target version in the available version list, then click "Update". Check the upgrade progress on the cluster settings page.
2. Check the upgrade info from the client with "oc adm upgrade".
3.

Actual results:

1. No information or upgrade progress is shown on the page.
2. The client shows that retrieving the target version failed.
$ oc adm upgrade 
Cluster version is 4.12.0-0.nightly-2022-10-25-210451
  ReleaseAccepted=False  
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.12.0-0.nightly-2022-10-27-053332" image="registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1" failure=The update cannot be verified: unable to verify sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1 against keyrings: verifier-public-key-redhat
Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.12
Recommended updates:  
  VERSION                            IMAGE
  4.12.0-0.nightly-2022-11-01-135441 registry.ci.openshift.org/ocp/release@sha256:f79d25c821a73496f4664a81a123925236d0c7818fd6122feb953bc64e91f5d0
  4.12.0-0.nightly-2022-10-31-232349 registry.ci.openshift.org/ocp/release@sha256:cb2d157805abc413394fc579776d3f4406b0a2c2ed03047b6f7958e6f3d92622
  4.12.0-0.nightly-2022-10-28-001703 registry.ci.openshift.org/ocp/release@sha256:c914c11492cf78fb819f4b617544cd299c3a12f400e106355be653c0013c2530
  4.12.0-0.nightly-2022-10-27-053332 registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1

Expected results:

1. The console page should also show this kind of message if retrieving the target payload failed, so that the user knows the actual result after trying to upgrade.

Additional info:

 

This is a clone of issue OCPBUGS-4411. The following is the description of the original issue:

Description of problem:

When IPv6 addresses and routes are manually configured on an IPv4 OCP cluster to create a dual-stack cluster, newly created pods stay in 'ContainerCreating' status.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Steps to Reproduce:

1. enable ipv6 in network.
# more patch_dual.yaml 
- op: add
  path: /spec/clusterNetwork/-
  value:
    cidr: fd01::/48
    hostPrefix: 64
- op: add
  path: /spec/serviceNetwork/-
  value: fd02::/112
# oc patch network.config.openshift.io cluster --type='json' --patch-file patch_dual.yaml
 
2. Configure ipv6 addresses and routes

PODS=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector=status.phase=Running --no-headers -o name)
i=10
for pod in $PODS; do
  oc exec -n openshift-cluster-node-tuning-operator $pod -- ip -6 addr add fd00:172:22::${i}/64 dev br-ex
  oc exec -n openshift-cluster-node-tuning-operator $pod -- ip -6 route add default via fd00:172:22::1 dev br-ex
  ((i=i+1))
done 

3. create pods and they will stay in ContainerCreating status.

4. If the IPv6 configuration is removed from the network config, newly created pods become ready.


Actual results:

Pods cannot reach the Running state.

Expected results:

Pods should become ready with both IPv4 and IPv6 addresses.

Additional info:

version:
# oc version
Client Version: 4.12.0-0.nightly-2022-11-30-182550
Kustomize Version: v4.5.7
Server Version: 4.12.0-0.nightly-2022-11-30-182550
Kubernetes Version: v1.25.2+5533733

Describe pods:
# oc describe pod iperf-rc-normal-qg6zd 
Name:             iperf-rc-normal-qg6zd
Namespace:        offload-testing
Priority:         0
Service Account:  default
Node:             openshift-qe-025.lab.eng.rdu2.redhat.com/192.168.111.54
Start Time:       Thu, 01 Dec 2022 21:35:28 -0500
Labels:           name=iperf-pods-normal
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.2.7/23","fd01:0:0:6::3/64"],"mac_address":"0a:58:0a:81:02:07","gateway_ips":["10.129.2.1","fd01:0:0:6:...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicationController/iperf-rc-normal
Containers:
  iperf:
    Container ID:   
    Image:          quay.io/openshifttest/iperf3@sha256:440c59251338e9fcf0a00d822878862038d3b2e2403c67c940c7781297953614
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  340Mi
    Requests:
      memory:     340Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4266b (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-4266b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  3m4s (x173 over 5h50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_iperf-rc-normal-qg6zd_offload-testing_18673f13-37b4-40ea-aa5d-85654dfa5c85_0(4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5): error adding pod offload-testing_iperf-rc-normal-qg6zd to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [offload-testing/iperf-rc-normal-qg6zd/18673f13-37b4-40ea-aa5d-85654dfa5c85:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[offload-testing/iperf-rc-normal-qg6zd 4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5] [offload-testing/iperf-rc-normal-qg6zd 4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:07 [10.129.2.7/23 fd01:0:0:6::3/64]
'

 

Description of problem:

The TestUnmanagedDNSToManagedDNSInternalIngressController E2E test is failing with the error:
{
unmanaged_dns_test.go:272: failed to verify connectivity with workload with reqURL http://10.0.128.7 using external client: timed out waiting for the condition  

How reproducible:

About 75% of the time.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

75%

Steps to Reproduce:

1. Run CI E2E tests on cluster-ingress-operator or 
make test-e2e TEST=TestUnmanagedDNSToManagedDNSInternalIngressController 

Actual results:

E2E test fails about 75% of the time

Expected results:

E2E should always pass

Additional info:

 

This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:

Description of problem:

It looks like the ODC doesn't register the KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on the KnativeServing and KnativeEventing CRs, but the flags look for the v1alpha1 version of those CRs: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks the ODC discovery.
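
A sketch of what the corrected flag registration could look like in console-extensions.json, assuming the extension keeps its current shape and only the CRD version moves to v1beta1 (illustrative, not the merged change):

[
  {
    "type": "console.flag/model",
    "properties": {
      "flag": "KNATIVE_SERVING",
      "model": { "group": "operator.knative.dev", "version": "v1beta1", "kind": "KnativeServing" }
    }
  },
  {
    "type": "console.flag/model",
    "properties": {
      "flag": "KNATIVE_EVENTING",
      "model": { "group": "operator.knative.dev", "version": "v1beta1", "kind": "KnativeEventing" }
    }
  }
]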

Version-Release number of selected component (if applicable):

Openshift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

 

Description of problem:

Installation fails on AWS because the installer manifests include an invalid ingresses.config.openshift.io/cluster manifest.

Version-Release number of selected component (if applicable):

4.12.

How reproducible:

Seems to be a consistent failure.

Steps to Reproduce:

1. Install a cluster on AWS without specifying lbType in the install-config.

Actual results:

The cluster bootstrap fails with the following error message:

"cluster-ingress-02-config.yml": failed to create ingresses.v1.config.openshift.io/cluster -n : Ingress.config.openshift.io "cluster" is invalid: spec.loadBalancer.platform.aws.type: Required value
 

Expected results:

Cluster bootstrap should succeed.

Additional info:

https://github.com/openshift/installer/pull/6478 introduced the problematic logic that sets spec.loadBalancer.platform.aws without setting spec.loadBalancer.platform.aws.type.
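
For comparison, a minimal cluster-ingress-02-config.yml sketch that would pass the validation quoted above, assuming the intended default is a Classic load balancer when lbType is omitted; the domain value is illustrative:

apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  name: cluster
spec:
  domain: apps.example.devcluster.openshift.com
  loadBalancer:
    platform:
      type: AWS
      aws:
        type: Classic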

 

cloud-controller-manager does not react to changes to infrastructure secrets (in the OpenStack case: clouds.yaml).
As a consequence, if credentials are rotated (and the old ones are rendered useless), load balancer creation and deletion will not succeed any more. Restarting the controller fixes the issue on a live cluster.

Logs show that it couldn't find the application credentials:

Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } EnsuringLoadBalancer: Ensuring load balancer
Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } SyncLoadBalancerFailed: Error syncing load balancer: failed to ensure load balancer: failed to get subnet to create load balancer for service e2e-test-openstack-q9jnk/udp-lb-default-svc: Unable to re-authenticate: Expected HTTP response code [200 204 300] when accessing [GET https://compute.rdo.mtl2.vexxhost.net/v2.1/0693e2bb538c42b79a49fe6d2e61b0fc/servers/fbeb21b8-05f0-4734-914e-926b6a6225f1/os-interface], but got 401 instead
{"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}: Resource not found: [POST https://identity.rdo.mtl2.vexxhost.net/v3/auth/tokens], error message: {"error":{"code":404,"message":"Could not find Application Credential: 1b78233956b34c6cbe5e1c95445972a4.","title":"Not Found"}}

OpenStack CI has been instrumented to restart CCM after credentials rotation, so that we silence this particular issue and avoid masking any other. That workaround must be reverted once this bug is fixed.

This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:

Description of problem:

https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Version-Release number of selected component (if applicable): 4.*

How reproducible: always

Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for api server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status

Actual results:

Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.

Expected results:

Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Additional info:

Description of problem:

When trying to enable hardware-backed management ports (e.g. virtual functions) on BF2 in NIC mode, or on any other MLX NICs (CX-6, CX-5), by setting the node_mgmt_port_netdev_flags flag to a VF in the CNO, the OVN-K node will crash.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Always

Steps to Reproduce:

Start by enabling OvS HWOL and setting sriovnetworknodepolicy
https://docs.openshift.com/container-platform/4.11/networking/hardware_networks/configuring-hardware-offloading.html
1. Scale down CNO: oc scale --replicas=0 deploy/network-operator -n openshift-network-operator
2. Make changes to OVN-K node: oc edit daemonsets ovnkube-node -n openshift-ovn-kubernetes
    a. Find "node_mgmt_port_netdev_flags=" and replace it with something like this:
          node_mgmt_port_netdev_flags=
          if [[ ${K8S_NODE} != *"master"* ]]; then
                node_mgmt_port_netdev_flags="--ovnkube-node-mgmt-port-netdev=ens1f0v0"
          fi
     b. Additionally you have to add the "node_mgmt_port_netdev_flags" to the " exec /usr/bin/ovnkube --init-node "${K8S_NODE}"" call in the same script, since it is missing there (see the sketch after this list).
3. Save the edit.
4. Observe OVN-K node on baremetal worker nodes.
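
A rough sketch of the adjusted invocation from step 2b, assuming the surrounding flag list in the shipped ovnkube-node script stays unchanged (abbreviated with an ellipsis):

exec /usr/bin/ovnkube --init-node "${K8S_NODE}" \
  ${node_mgmt_port_netdev_flags} \
  ...   # remaining flags from the original script, unchanged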

Actual results:

I0822 14:21:56.250285  496356 ovs.go:204] Exec(3): stderr: ""
I0822 14:21:56.250290  496356 node.go:310] Detected support for port binding with external IDs
I0822 14:21:56.250516  496356 management-port-dpu.go:181] Setup management port dpu host: ens1f0v0
F0822 14:21:56.250568  496356 ovnkube.go:133] failed to set management port name. file exists

Workaround is to go to the node and run this command: sudo ovs-vsctl del-port br-int ovn-k8s-mp0

Expected results:

There should not be any errors when changing node_mgmt_port_netdev_flags to a valid value.

Additional info:

Reported here: https://github.com/ovn-org/ovn-kubernetes/pull/3160
Discussed briefly here: https://issues.redhat.com/browse/OCPBUGS-4098
Fixed Upstream here: https://github.com/ovn-org/ovn-kubernetes/pull/3251

This is a clone of issue OCPBUGS-17457. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16019. The following is the description of the original issue:

Description of problem:

Hello, one of our customers had several cni-sysctl-allowlist-ds pods created (around 10,000 pods) in the openshift-multus namespace. That caused several issues in the cluster, as nodes were full of pods and ran out of IPs.

After deleting them, the situation has improved. But we want to know the root cause of this issue.

Searching in the network-operator pod logs, it seems that the customer faced some networking issues. After this issue, we can see that the cni-sysctl-allowlist pods started to be created.

Could we know why the cni-sysctl-allowlist-ds pods were created?

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988

This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D

Version-Release number of selected component (if applicable):

4.12,4.13

How reproducible:

Every time

Steps to Reproduce:

1. Setup dualstack KinD cluster
2. Create an egress firewall policy with the following spec (a full manifest sketch follows this list):
Spec:
  Egress:
    To:
      Cidr Selector:  0.0.0.0/0
    Type:             Deny
3. Create a pod and ping 1.1.1.1.
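
The same spec as a complete manifest, for reference; this is a sketch against the k8s.ovn.org/v1 EgressFirewall CRD, the namespace is illustrative, and OVN-Kubernetes expects the object to be named "default":

apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test-egress-fw
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0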

Actual results:

Egress policy does not block flows to external IP

Expected results:

Egress policy blocks flows to external IP

Additional info:

It seems that mixing IPv4 and IPv6 operands in ACL matches doesn't work.

We cache images by filename, which works when downloading from the Internet as the filename always includes the CoreOS version.

However, when extracting an image from the release payload, it always has the same name. Therefore, we will never update it to a newer image even when running different versions of the installer.

A possible solution:

  1. Check that the cached ISO's checksum matches the RHCOS metadata.
  2. If it doesn't, extract the expected checksum from the release payload and compare that to the cached ISO's checksum.
  3. If it still doesn't match, extract the ISO from the release payload.

An alternative might be to set the name of the cache file to something different. It's not clear how we'd guarantee a match between the release payload we've been given and the ISO unless the name was based on the release payload (which eliminates some of the point of the cache, since ordinarily most release payloads will point to a small number of images).
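
A minimal sketch of the proposed check, assuming hypothetical lookups for the RHCOS metadata checksum and the release-payload checksum; the function and parameter names are illustrative, not the installer's real API:

package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA256 returns the hex-encoded SHA-256 digest of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// ensureCachedISO re-extracts the image only when the cached copy matches
// neither the RHCOS metadata checksum nor the checksum taken from the
// release payload. metadataSum, payloadSum and extractFromPayload stand in
// for the installer's real lookups.
func ensureCachedISO(cachedPath, metadataSum, payloadSum string, extractFromPayload func(dest string) error) error {
	sum, err := fileSHA256(cachedPath)
	if err == nil && (sum == metadataSum || sum == payloadSum) {
		return nil // cached image is still valid
	}
	fmt.Printf("cached ISO at %s is missing or stale; extracting from release payload\n", cachedPath)
	return extractFromPayload(cachedPath)
}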

This is a clone of issue OCPBUGS-9956. The following is the description of the original issue:

Description of problem:

The PipelineRun default template name has been updated in the backend in Pipeline operator 1.10, so we need to update the name in the UI code as well.

 

https://github.com/openshift/console/blob/master/frontend/packages/pipelines-plugin/src/components/pac/const.ts#L9

 

This is a clone of issue OCPBUGS-14426. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14149. The following is the description of the original issue:

Description of problem:

Cannot list Kepler CSV

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install Kepler Community Operator
2. Create Kepler Instance
3. Console gets error and shows "Oops, something went wrong"

Actual results:

Console gets error and shows "Oops, something went wrong"

Expected results:

Should list Kepler Instance

Additional info:

 

This is a clone of issue OCPBUGS-6011. The following is the description of the original issue:

Description of problem:

The 4.12.0 openshift-client package has kubectl 1.24.1 bundled in it when it should have 1.25.x 

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

Very

Steps to Reproduce:

1. Download and unpack https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/stable/openshift-client-linux-4.12.0.tar.gz 
2. ./kubectl version

Actual results:

# ./kubectl version

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"1928ac4250660378a7d8c3430478dfe77977cb2a", GitTreeState:"clean", BuildDate:"2022-12-07T05:08:22Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4

Expected results:

kubectl version 1.25.x 

Additional info:

 

In order to support 4.12 there needs to be an entry for OS_IMAGES in images.env.template.

 

Note that the actual url isn't important, just that there is an entry for 4.12.

Description of problem:

Whereabouts reconciliation is not launched when the net-attach-def that references Whereabouts is defined inside a conflist.

How reproducible:

Always

Steps to Reproduce:

1. oc edit the networks object and create a net-attach-def that references whereabouts – in a conflist.

Actual results:

The reconciler is not launched.

Expected results:

The reconciler is launched.

DVO metrics have some sensitive data that isn't desired to be sent outside the cluster. For that, IO must remove this data from the metrics before saving it to the archive and uploading it to the pipeline.

Remove the name and namespace from DVO metrics before saving it to the IO archive.
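
A rough sketch of stripping those labels from the gathered metric lines before they are written to the archive; the Prometheus-text-format assumption and the helper names are mine, not the actual Insights Operator gatherer:

package anonymize

import (
	"regexp"
	"strings"
)

// sensitiveDVOLabels matches name="..." and namespace="..." label pairs
// (with an optional trailing comma) inside a metric's label set.
var sensitiveDVOLabels = regexp.MustCompile(`\b(name|namespace)="[^"]*",?`)

// ScrubDVOMetrics removes the name and namespace labels from every metric
// line while leaving # HELP / # TYPE comment lines untouched.
func ScrubDVOMetrics(metrics string) string {
	lines := strings.Split(metrics, "\n")
	for i, line := range lines {
		if strings.HasPrefix(line, "#") {
			continue
		}
		lines[i] = sensitiveDVOLabels.ReplaceAllString(line, "")
	}
	return strings.Join(lines, "\n")
}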

This is a clone of issue OCPBUGS-4700. The following is the description of the original issue:

Description of problem:

In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see an "Update blocked" pop-up talking about "...alert above the visualization...".  It is referencing a banner about "This cluster should not be updated to the next minor version...", but that banner is not displayed because hasPermissionsToUpdate is false, so canPerformUpgrade is false.

Version-Release number of selected component (if applicable):

4.12.0-rc.0. Likely more. I haven't traced it out.

How reproducible:

Always.

Steps to Reproduce:

1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:

$ oc get -o yaml clusterrole sudoer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2020-05-21T19:39:09Z"
  name: sudoer
  resourceVersion: "7715"
  uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9
rules:
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:admin
  resources:
  - systemusers
  - users
  verbs:
  - impersonate
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:masters
  resources:
  - groups
  - systemgroups
  verbs:
  - impersonate

3. View /settings/cluster

Actual results:

See the "Update blocked" pop-up talking about "...alert above the visualization...".

Expected results:

Something more internally consistent. E.g. having the referenced banner "...alert above the visualization..." show up, or not having the "Update blocked" pop-up reference the non-existent banner.

Description of problem:

INSIGHTOCP-1048 is a rule to check if Monitoring pods are using the NFS storage, which is not recommended in OpenShift.

Gathering Persistent Volumes for openshift-monitoring namespace.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Check if the cluster contains already Persistent Volume Claims on openshift-monitoring namespace.
2. If there are none, create this ConfigMap for cluster-monitoring-config. That will set up the default Prometheus PVCs.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 1Gi

3. Run insights Operator.
4. In the result archive, under the folder path /config/persistentvolumes/, there should be a file for each of the Persistent Volumes bound to the PVCs.
5. The file name should match the PV name and the file should contain the resource data (a gatherer sketch follows these steps).
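
A sketch of the gathering logic described in steps 4-5, using client-go; the package layout and function name are assumptions, not the operator's actual gatherer:

package gather

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// GatherMonitoringPVs collects every PersistentVolume bound to a PVC in
// openshift-monitoring and returns one JSON document per PV, keyed by the
// archive path it would be stored under.
func GatherMonitoringPVs(ctx context.Context, client kubernetes.Interface) (map[string][]byte, error) {
	pvcs, err := client.CoreV1().PersistentVolumeClaims("openshift-monitoring").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	records := map[string][]byte{}
	for _, pvc := range pvcs.Items {
		if pvc.Spec.VolumeName == "" {
			continue // not bound to a PV yet
		}
		pv, err := client.CoreV1().PersistentVolumes().Get(ctx, pvc.Spec.VolumeName, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		data, err := json.Marshal(pv)
		if err != nil {
			return nil, err
		}
		records[fmt.Sprintf("config/persistentvolumes/%s", pv.Name)] = data
	}
	return records, nil
}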

Actual results:

 

Expected results:

example of persistent volume file:
{
    "metadata": {
        "name": "pvc-99ffaeb3-8ff8-4137-a1fc-0bf72e7209a5",
        "uid": "17122aab-411b-4a71-ae35-c13caac23492",
        "resourceVersion": "20098",
        "creationTimestamp": "2023-02-20T14:44:30Z",
        "labels": {
            "topology.kubernetes.io/region": "us-west-2",
            "topology.kubernetes.io/zone": "us-west-2c"
        },
        "annotations": {
            "kubernetes.io/createdby": "aws-ebs-dynamic-provisioner",
            "pv.kubernetes.io/bound-by-controller": "yes",
            "pv.kubernetes.io/provisioned-by": "kubernetes.io/aws-ebs"
        },
        "finalizers": [
            "kubernetes.io/pv-protection"
        ]
    },
    "spec": {
        "capacity": {
            "storage": "20Gi"
        },
        "awsElasticBlockStore": {
            "volumeID": "aws://us-west-2c/vol-07ecf570b7adfedda",
            "fsType": "ext4"
        },
        "accessModes": [
            "ReadWriteOnce"
        ],
        "claimRef": {
            "kind": "PersistentVolumeClaim",
            "namespace": "openshift-monitoring",
            "name": "prometheus-data-prometheus-k8s-1",
            "uid": "99ffaeb3-8ff8-4137-a1fc-0bf72e7209a5",
            "apiVersion": "v1",
            "resourceVersion": "19914"
        },
        "persistentVolumeReclaimPolicy": "Delete",
        "storageClassName": "gp2",
        "volumeMode": "Filesystem",
        "nodeAffinity": {
            "required": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "topology.kubernetes.io/region",
                                "operator": "In",
                                "values": [
                                    "us-west-2"
                                ]
                            },
                            {
                                "key": "topology.kubernetes.io/zone",
                                "operator": "In",
                                "values": [
                                    "us-west-2c"
                                ]
                            }
                        ]
                    }
                ]
            }
        }
    },
    "status": {
        "phase": "Bound"
    }
}

Additional info:

 

This is a clone of issue OCPBUGS-13692. The following is the description of the original issue:

This is a clone of issue OCPBUGS-13549. The following is the description of the original issue:

Description of problem:

Incorrect AWS ARN [1] is used for GovCloud and AWS China regions, which will cause the command `ccoctl aws create-all` to fail:

Failed to create Identity provider: failed to apply public access policy to the bucket ci-op-bb5dgq54-77753-oidc: MalformedPolicy: Policy has invalid resource
	status code: 400, request id: VNBZ3NYDH6YXWFZ3, host id: pHF8v7C3vr9YJdD9HWamFmRbMaOPRbHSNIDaXUuUyrgy0gKCO9DDFU/Xy8ZPmY2LCjfLQnUDmtQ=

Correct AWS ARN prefix:
GovCloud (us-gov-east-1 and us-gov-west-1): arn:aws-us-gov
AWS China (cn-north-1 and cn-northwest-1): arn:aws-cn

[1] https://github.com/openshift/cloud-credential-operator/pull/526/files#diff-1909afc64595b92551779d9be99de733f8b694cfb6e599e49454b380afc58876R211
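
A small sketch of deriving the ARN partition prefix from the region, based on the mapping above; the package and function names are illustrative, not the code in the linked PRs:

package awsarn

import "strings"

// partitionForRegion returns the ARN partition prefix for a given AWS
// region, following the GovCloud and China mappings described above.
func partitionForRegion(region string) string {
	switch {
	case strings.HasPrefix(region, "us-gov-"):
		return "arn:aws-us-gov"
	case strings.HasPrefix(region, "cn-"):
		return "arn:aws-cn"
	default:
		return "arn:aws"
	}
}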


 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-05-11-024616

How reproducible:

Always
 

Steps to Reproduce:

1. Run command: `ccoctl aws create-all --name="${infra_name}" --region="${REGION}" --credentials-requests-dir="/tmp/credrequests" --output-dir="/tmp"` in a GovCloud region
2.
3.

Actual results:

Failed to create Identity provider
 

Expected results:

Create resources successfully.
 

Additional info:

Related PRs:
4.10: https://github.com/openshift/cloud-credential-operator/pull/531
4.11: https://github.com/openshift/cloud-credential-operator/pull/530
4.12: https://github.com/openshift/cloud-credential-operator/pull/529
4.13: https://github.com/openshift/cloud-credential-operator/pull/528
4.14: https://github.com/openshift/cloud-credential-operator/pull/526
 

Description of problem:

Disconnected IPI OCP 4.11.5 cluster install on baremetal fails when hostname of master nodes does not include "master"    

Version-Release number of selected component (if applicable): 4.11.5

How reproducible:  Perform disconnected IPI install of OCP 4.11.5 on bare metal with master nodes that do not contain the text "master"

Steps to Reproduce:

Perform disconnected IPI install of OCP 4.11.5 on bare metal with master nodes that do not contain the text "master"

Actual results: master nodes do not come up.

Expected results: master nodes should come up even though the text "master" is not in their hostname.

Additional info:

Disconnected IPI OCP 4.11.5 cluster install on baremetal fails when hostname of master nodes does not include "master"    

My customer reinstalled a new cluster using the fix here, but they have the exact same issue: the metal3 pod has an empty PROVISIONING_MACS value. Can we work together with them to understand why the new code fix https://github.com/openshift/cluster-baremetal-operator/commit/76bd6bc461b30a6a450f85a42e492a0933178aee is not working?

cat metal3-static-ip-set/metal3-static-ip-set/logs/current.log
2022-09-27T14:19:38.140662564Z + '[' -z 10.17.199.3/27 ']'
2022-09-27T14:19:38.140662564Z + '[' -z '' ']'
2022-09-27T14:19:38.140662564Z + '[' -n '' ']'
2022-09-27T14:19:38.140722345Z ERROR: Could not find suitable interface for "10.17.199.3/27"
2022-09-27T14:19:38.140726312Z + '[' -n '' ']'
2022-09-27T14:19:38.140726312Z + echo 'ERROR: Could not find suitable interface for "10.17.199.3/27"'
2022-09-27T14:19:38.140726312Z + exit 1

 

cat metal3-b9bf8d595-gv94k.yaml
...
initContainers:
- command:
  - /set-static-ip
  env:
  - name: PROVISIONING_IP
    value: 10.17.199.3/27
  - name: PROVISIONING_INTERFACE
  - name: PROVISIONING_MACS    # <------------------------- missing MACS
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f04793bd109ecba2dfe43be93dc990ac5299272482c150bd5f2eee0f80c983b
  imagePullPolicy: IfNotPresent
  name: metal3-static-ip-set
.... 
  • omc logs machine-api-controllers-6b9ffd96cd-grh6l -c nodelink-controller  -n openshift-machine-api
    2022-09-21T16:13:43.600517485Z I0921 16:13:43.600513       1 nodelink_controller.go:408] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca"
    2022-09-21T16:13:43.600521381Z I0921 16:13:43.600517       1 nodelink_controller.go:425] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" by ProviderID
    2022-09-21T16:13:43.600525225Z W0921 16:13:43.600521       1 nodelink_controller.go:427] Node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" has no providerID
    2022-09-21T16:13:43.600528917Z I0921 16:13:43.600524       1 nodelink_controller.go:448] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" by IP
    2022-09-21T16:13:43.600532711Z I0921 16:13:43.600529       1 nodelink_controller.go:453] Found internal IP for node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca": "10.17.192.33"
    2022-09-21T16:13:43.600551289Z I0921 16:13:43.600544       1 nodelink_controller.go:477] Matching machine not found for node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" with internal IP "10.17.192.33"

From @dtantsur WIP PR: https://github.com/openshift/cluster-baremetal-operator/pull/299

The customer is waiting for this fix. The previous code change doesn't fix the customer's situation.

Please refer to this slack thread :https://coreos.slack.com/archives/CFP6ST0A3/p1664215102459219

Description of problem:

Similar to OCPBUGS-11636, ccoctl needs to be updated to account for the S3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

These changes have rolled out to us-east-2 and the China regions as of today and will roll out to additional regions in the near future.

See OCPBUGS-11636 for additional information

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible in affected regions.

Steps to Reproduce:

1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.

Actual results:

./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
        status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=

Expected results:

"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.

Additional info:

 

Description of problem:

For example, "openshift-install explain installconfig.platform.gcp.publicDNSZone" tells "PublicDNSZone contains the zone ID and project where the Public DNS zone will be created", but in fact it's for specifying an existing zone where the Public DNS zone records will be put in.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-10-10-015203

How reproducible:

Always

Steps to Reproduce:

1. openshift-install explain installconfig.platform.gcp.publicDNSZone
2. openshift-install explain installconfig.platform.gcp.privateDNSZone
3.

Actual results:

For example, it tells "PublicDNSZone contains the zone ID and project where the Public DNS zone will be created."

Expected results:

It should be like "PublicDNSZone contains the zone ID and project where the Public DNS zone records will be created."

Additional info:

$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-10-015203
built from commit 02102a96b3f7c78337b32dcafe2e28be6fb67a0f
release image registry.ci.openshift.org/ocp/release@sha256:00806cf7faaa86981e73b478a72c1b7a838cd08b215f3a9ab9b278ae94d9a794
release architecture amd64
$ 
$ openshift-install explain installconfig.platform.gcp.publicDNSZone
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  PublicDNSZone Technology Preview. PublicDNSZone contains the zone ID and project where the Public DNS zone will be created.

FIELDS:
    id <string>
      ID Technology Preview. ID or name of the zone.
    project <string>   
      ProjectID Technology Preview When the ProjectID is provided, the zone will be created in this project. When the ProjectID is empty, the DNS zone with this ID will be created and managed in the Service Project (GCP.ProjectID).
$ 
$ openshift-install explain installconfig.platform.gcp.privateDNSZone
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  PrivateDNSZone Technology Preview. PrivateDNSZone contains the zone ID and project where the Private DNS zone will be created.

FIELDS:
    id <string>
      ID Technology Preview. ID or name of the zone.
    project <string>
      ProjectID Technology Preview When the ProjectID is provided, the zone will be created in this project. When the ProjectID is empty, the DNS zone with this ID will be created and managed in the Service Project (GCP.ProjectID).
$ 
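
For reference, here is a hypothetical install-config.yaml excerpt using these fields as they actually behave (records go into an existing zone); the zone IDs and project names below are placeholders:

platform:
  gcp:
    projectID: my-service-project
    region: us-central1
    publicDNSZone:
      id: example-public-zone       # existing public zone in which the records are created
      project: my-dns-project
    privateDNSZone:
      id: example-private-zone      # existing private zone in which the records are created
      project: my-dns-project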

Description of problem:

Alert actions are not triggering the modal from which the storage cluster can be expanded.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

1/1

Steps to Reproduce:

1. Fill up a storage cluster to 80%
2. Alert is seen in cluster dashboard.
3. Click the Add Capacity button

Actual results:

Modal is not launched.

Expected results:

Modal should be launched.

Additional info:

 

Description of problem:

Placeholder bug to backport common latency failures.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-15281. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14637. The following is the description of the original issue:

Description of problem:

An uninstall was started; however, it failed because the hosted-cluster-config-operator was unable to clean up the default ingresscontroller.

Version-Release number of selected component (if applicable):

4.12.18

How reproducible:

Unsure - though definitely not 100%

Steps to Reproduce:

1. Uninstall a HyperShift cluster

Actual results:

❯ k logs -n ocm-staging-2439occi66vhbj0pee3s4d5jpi4vpm54-mshen-dr2 hosted-cluster-config-operator-5ccdbfcc4c-9mxfk --tail 10 -f

{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Image registry is removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring ingress controllers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Load balancers are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Ensuring persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"There are no more persistent volumes. Nothing to cleanup.","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}
{"level":"info","ts":"2023-06-06T16:57:21Z","msg":"Persistent volumes are removed","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"3a8e4485-3d0a-41b7-b82c-ff0a7f0040e6"}

After manually connecting to the hosted cluster and deleting the ingresscontroller, the uninstall progressed and succeeded.

Expected results:

The hosted cluster can clean up the ingresscontrollers successfully and the uninstall progresses.

Additional info:

HyperShift dump: https://drive.google.com/file/d/1qqjkG4F_mSUCVMz3GbN-lEoqbshPvQcU/view?usp=sharing 

This is a clone of issue OCPBUGS-1427. The following is the description of the original issue:

Description of problem:

The jump looks worst on GCP, but on closer inspection Azure and AWS both jumped as well, just not as high.

Disruption data indicates that the image registry on GCP was averaging around 30-40 seconds of disruption during an upgrade, until Aug 27th when it jumped to 125-135 seconds and has remained there ever since.

We see similar spikes in ingress-to-console and ingress-to-oauth. NOTE: the image registry backend is also behind ingress, so all three are ingress-related disruption.

https://datastudio.google.com/s/uBC4zuBFdTE

These charts show the problem on Aug 27 for registry, ingress to console, and ingress to oauth.

sdn network type appears unaffected.

Something merged Aug 26-27 that caused a significant change for anything behind ingress using ovn on gcp.

This is a clone of issue OCPBUGS-1695. The following is the description of the original issue:

Update initial FCOS used in OKD to 36.20220906.3.2

This is a clone of issue OCPBUGS-881. The following is the description of the original issue:

Description of problem:

Creating an install-config file for vSphere IPI against 4.12.0-0.nightly-2022-09-02-194931 fails because apiVIP and ingressVIP are not in the machine CIDR.

$ ./openshift-install create install-config --dir ipi                
? Platform vsphere
? vCenter xxxxxxxx
? Username xxxxxxxx
? Password [? for help] ********************
INFO Connecting to xxxxxxxx
INFO Defaulting to only available datacenter: SDDC-Datacenter 
INFO Defaulting to only available cluster: Cluster-1 
INFO Defaulting to only available datastore: WorkloadDatastore 
? Network qe-segment
? Virtual IP Address for API 172.31.248.137
? Virtual IP Address for Ingress 172.31.248.141
? Base Domain qe.devcluster.openshift.com 
? Cluster Name jimavmc       
? Pull Secret [? for help] ****************************************************************************************************************************************************************************************
FATAL failed to fetch Install Config: failed to generate asset "Install Config": invalid install config: [platform.vsphere.apiVIPs: Invalid value: "172.31.248.137": IP expected to be in one of the machine networks: 10.0.0.0/16, platform.vsphere.ingressVIPs: Invalid value: "172.31.248.141": IP expected to be in one of the machine networks: 10.0.0.0/16] 

Because the user cannot define the CIDR for machineNetwork when creating the install-config file interactively, the default value 10.0.0.0/16 is used, so creating the install-config fails when the apiVIP and ingressVIP entered are outside the default machineNetwork.

The error is thrown from https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L655-L666; it seems a new validation function was introduced in PR https://github.com/openshift/installer/pull/5798.

The issue should also impact Nutanix platform.
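
A minimal sketch of an install-config.yaml workaround, assuming the machineNetwork CIDR is edited into the generated file so that the VIPs from the reproduction above fall inside it (only the relevant stanzas are shown; other required fields are omitted):

networking:
  machineNetwork:
  - cidr: 172.31.248.0/24          # must contain the API and Ingress VIPs below
platform:
  vsphere:
    apiVIPs:
    - 172.31.248.137
    ingressVIPs:
    - 172.31.248.141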
 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-02-194931

How reproducible:

Always

Steps to Reproduce:

1. create install-config.yaml file by running command "./openshift-install create install-config --dir ipi"
2. failed with above error
3.

Actual results:

Fails to create the install-config.yaml file.

Expected results:

Succeeds in creating the install-config.yaml file.

Additional info:

 

This is a clone of issue OCPBUGS-3476. The following is the description of the original issue:

Description of problem:

When we detect a refs/heads/branchname we should show the label as what we have now:

- Branch: branchname

And when we detect a refs/tags/tagname we should instead show the label as:

- Tag: tagname

I haven't implemented this in the CLI, but there is an old issue for it here: openshift-pipelines/pipelines-as-code#181

Version-Release number of selected component (if applicable):

4.11.z

How reproducible:

 

Steps to Reproduce:

1. Create a repository
2. Trigger the pipelineruns by a push or pull request event on GitHub

Actual results:

We do not show the tag name even when a tag is present instead of a branch.

Expected results:

We should show the tag if a tag is detected and the branch if a branch is detected.

Additional info:

https://github.com/openshift/console/pull/12247#issuecomment-1306879310

This is a clone of issue OCPBUGS-12153. The following is the description of the original issue:

Description of problem:

When HyperShift HostedClusters are created with "OLMCatalogPlacement" set to "guest" and if the desired release is pre-GA, the CatalogSource pods cannot pull their images due to using unreleased images.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Common

Steps to Reproduce:

1. Create a HyperShift 4.13 HostedCluster with spec.OLMCatalogPlacement = "guest"
2. See the openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackoff

Actual results:

openshift-marketplace/community-operator-* pods in the guest cluster in ImagePullBackoff

Expected results:

All CatalogSource pods to be running and to use n-1 images if pre-GA

Additional info:

 
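For context, a minimal, hypothetical HostedCluster excerpt showing the setting under which the ImagePullBackOff is observed (the API version and surrounding fields are assumptions; only the relevant field matters here):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  olmCatalogPlacement: guest       # catalogs run in the guest cluster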

This is a clone of issue OCPBUGS-13155. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10009. The following is the description of the original issue:

CNO should respect the `nodeSelector` setting in hostedcontrolplane:

https://github.com/openshift/hypershift/blob/5f903f2a48ef2abc3045584f646e92ac0f735fad/docs/content/how-to/distribute-hosted-cluster-workloads.md#topology

 

Affinity and tolerations support is handled here: https://issues.redhat.com/browse/OCPBUGS-8692
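
A rough sketch of the setting CNO is expected to respect, assuming it is exposed on the HostedCluster spec as the linked topology doc describes (the field path and label key below are assumptions, not confirmed here):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  nodeSelector:
    hypershift.openshift.io/control-plane: "true"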

Description of problem:

QE has one vSphere 6.7 U3 env where the privilege "InventoryService.Tagging.ObjectAttachable" does not exist, and the installer fails as below.

FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.vsphere.defaultDatastore: Internal error: privileges missing for vSphere vCenter Datastore: InventoryService.Tagging.ObjectAttachable

As vSphere 6.7 U3 is deprecated but not removed, it should still be supported; users may hit a similar issue on 6.7 U3 when doing a fresh install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-30-142847

How reproducible:

always

Steps to Reproduce:

1. Create a role for each vSphere object and assign the listed privileges to it based on the installation doc, then add a permission to each object with the created role and user
2. Install an IPI cluster on the vSphere platform as this user
3. The installer fails and complains about the missing privilege "InventoryService.Tagging.ObjectAttachable"

Actual results:

The installer fails and complains about the missing privilege "InventoryService.Tagging.ObjectAttachable".

Expected results:

Installer should succeed.

Additional info:

 

This is a clone of issue OCPBUGS-6902. The following is the description of the original issue:

The ipv6 upgrade job has been failing for months.

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-upgrade-ovn-ipv6

 

Looking at a few of the most recent runs, the failing test common to them all is

disruption_tests: [sig-network-edge] Verify DNS availability during and after upgrade success 

e.g. from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-upgrade-ovn-ipv6/1620652913376366592

{Feb  1 07:08:57.228: too many pods were waiting: ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-7s48w,ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-94rq4,ns/e2e-check-for-dns-availability-7828 pod/dns-test-ec5a79ee-0091-4081-ae69-fa6a4a6ed3ee-t9wnk

github.com/openshift/origin/test/e2e/upgrade/dns.(*UpgradeTest).validateDNSResults(0x8793c91?, 0xc005f646e0)
	github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:142 +0x2f4
github.com/openshift/origin/test/e2e/upgrade/dns.(*UpgradeTest).Test(0xc005f646e0?, 0x9407e78?, 0xcb34730?, 0x0?)
	github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:48 +0x4e
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc000c6f8c0, 0xc0005eebb8)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:201 +0x4a2
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
	k8s.io/kubernetes@v1.25.0/test/e2e/chaosmonkey/chaosmonkey.go:94 +0x6a
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes@v1.25.0/test/e2e/chaosmonkey/chaosmonkey.go:91 +0x8b  }

This is a clone of issue OCPBUGS-4190. The following is the description of the original issue:

Description of problem:

Two tests are perma-failing in metal-ipi upgrade tests:
[sig-imageregistry] Image registry remains available using new connections    39m27s
[sig-imageregistry] Image registry remains available using reused connections    39m27s

Version-Release number of selected component (if applicable):

4.12 / 4.13

How reproducible:

All CI runs

Steps to Reproduce:

1.
2.
3.

Actual results:

Nov 24 02:58:26.998: INFO: "[sig-imageregistry] Image registry remains available using reused connections": panic: runtime error: invalid memory address or nil pointer dereference

Expected results:

pass

Additional info:

 

The install_type field in telemetry data is not automatically set from the installer invoker value. Any values we wish to appear must be explicitly converted to the corresponding install_type value.

Currently this makes clusters installed with the agent-based installer (invoker agent-installer) invisible in telemetry.

In the ironic-rhcos-downloader container, the builder image is still based on rhel8; it should be rhel9 instead.

Description of problem:

To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Already merged in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1398
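
For illustration only, a generic (hypothetical) pod spec showing the Resources.Requests that the guard check expects managed static pod containers to declare:

apiVersion: v1
kind: Pod
metadata:
  name: example-static-pod
  namespace: openshift-kube-apiserver
spec:
  containers:
  - name: example-container
    image: example.registry/placeholder:latest
    resources:
      requests:                    # missing requests trigger the "does not have Resource.Requests" error
        cpu: 10m
        memory: 50Mi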

This is a clone of issue OCPBUGS-12272. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11057. The following is the description of the original issue:

Description of problem:
When importing a Serverless Service from a git repository, the topology shows an Open URL decorator even when the "Add Route" checkbox was unselected (it is selected by default).

The created kn Route makes the Service available within the cluster and the created URL looks like this: http://nodeinfo-private.serverless-test.svc.cluster.local

So the Service is NOT accidentally exposed. It's "just" that we link an internal route that will not be accessible to the user.

This might also happen for the Serverless functions import flow and the container image import flow.

Version-Release number of selected component (if applicable):
Tested older versions and could see this at least on 4.10+

How reproducible:
Always

Steps to Reproduce:

  1. Install the OpenShift Serverless operator and create the required kn Serving resource.
  2. Navigate to the Developer perspective > Add > Import from Git
  3. Enter a git repository (like https://gitlab.com/jerolimov/nodeinfo
  4. Unselect "Add Route" and press Create

Actual results:
The topology shows the new kn Service with a Open URL decorator on the top right corner.

The button is clickable but the target page could not be opened (as expected).

Expected results:
The topology should not show an Open URL decorator for "private" kn Routes.

The topology sidebar shows similar information; we should maybe replace the link there as well with a text + copy button?

A fix should be tested with Serverless functions as well as container images!

Additional info:
When the user unselects the "Add route" option an additional label is added to the kn Service. This label could also be added and removed later. When this label is specified the Open URL decorator should not be shown:

metadata:
  labels:
    networking.knative.dev/visibility: cluster-local

See also:

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/utils/create-knative-utils.ts#L108

https://github.com/openshift/console/blob/1f6e238b924f4a4337ef917a0eba8aadae161e9c/frontend/packages/knative-plugin/src/topology/components/decorators/getServiceRouteDecorator.tsx#L15-L21
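
Putting the pieces together, a hypothetical kn Service manifest equivalent to importing with "Add Route" unselected (the name and namespace are taken from the internal URL above; the container image is a placeholder):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: nodeinfo-private
  namespace: serverless-test
  labels:
    networking.knative.dev/visibility: cluster-local   # the label that should suppress the Open URL decorator
spec:
  template:
    spec:
      containers:
      - image: quay.io/example/nodeinfo:latest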

Description of problem:

This is an OCP clone of https://bugzilla.redhat.com/show_bug.cgi?id=2099794

In summary, NetworkManager reports the network as being up before the ipv6 address of the primary interface is ready and crio fails to bind to it.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

+++ This bug was initially created as a clone of Bug #2102098 +++

Created attachment 1893358
web-hook-error-in-developer-console

Description of problem:
On the OSD console, on a node page, select edit label from the action list; when saving the new label there is no response on the modal, and a webhook error can be seen in the developer console.

Version-Release number of selected component (if applicable):
4.10.18

How reproducible:
Always

Steps to Reproduce:
1.Login OSD console with cluster admin user.
2. On a node page, select edit label from the action list and save a new label.
3.

Actual results:
2. There is no response after clicking "Save". There is a 403 Forbidden error in the developer console.

Expected results:
2. An error message should be shown on the modal after clicking the "Save" button.

Additional info:

— Additional comment from jhadvig@redhat.com on 2022-07-08 07:52:35 UTC —

@Giao so the overall issue is that we are not rendering the error as part of the modal but rather closing the modal. I could reproduce this issue in 4.12 as well.

— Additional comment from jhadvig@redhat.com on 2022-07-18 10:52:11 UTC —

*** Bug 2108030 has been marked as a duplicate of this bug. ***

Description of problem:

The SQL-based index image created by an old opm fails to run in 4.12 even if the `privileged` permission is added to the namespace.

 

MacBook-Pro:~ jianzhang$ oc get pods
NAME                   READY   STATUS             RESTARTS     AGE
jian-operators-4g5ln   0/1     CrashLoopBackOff   1 (2s ago)   11s
MacBook-Pro:~ jianzhang$ oc logs jian-operators-4g5ln 
Error: open /etc/nsswitch.conf: permission denied 

 

PS: the SQL-based index created by the new opm version doesn't have this issue.

 

opm version
Version: version.Version{OpmVersion:"e41024eb3", GitCommit:"e41024eb37c721bc43e8b3df226dd30c0589aee7", BuildDate:"2022-08-16T01:50:17Z", GoOs:"darwin", GoArch:"amd64"} 

 

 

Version-Release number of selected component (if applicable):

OCP 4.12

 

MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-150248   True        False         3h25m   Cluster version is 4.12.0-0.nightly-2022-08-15-150248

 

How reproducible:

always

Steps to Reproduce:
1. Deploy OCP 4.12

2. Deploy a CatalogSource in the `openshift-marketplace` namespace.

 

MacBook-Pro:~ jianzhang$ oc get ns openshift-marketplace -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    capability.openshift.io/name: marketplace
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c16,c10
    openshift.io/sa.scc.supplemental-groups: 1000260000/10000
    openshift.io/sa.scc.uid-range: 1000260000/10000
    workload.openshift.io/allowed: management
  creationTimestamp: "2022-08-15T23:15:27Z"
  labels:
    kubernetes.io/metadata.name: openshift-marketplace
    olm.operatorgroup.uid/1b776321-2714-4c1f-95ba-2ddff49c4efe: ""
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: baseline
  name: openshift-marketplace
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: cd81594b-4f6c-46d6-9369-75deef542ec8
  resourceVersion: "8617"
  uid: 1c35352e-3636-4f2b-a3b1-c84ebc6681e0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active 

3. Check the CatalogSource pod status; it crashed.

 

 


MacBook-Pro:~ jianzhang$ oc get catalogsource -n openshift-marketplace jian-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-08-16T02:24:20Z"
  generation: 1
  name: jian-operators
  namespace: openshift-marketplace
  resourceVersion: "106145"
  uid: 6a75ecc9-7b88-4411-bcf5-e34618f9b3cd
spec:
  displayName: Jian Operators
  image: quay.io/olmqe/etcd-index:v1
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: jian-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-08-16T03:12:28Z"
    lastObservedState: TRANSIENT_FAILURE
  latestImageRegistryPoll: "2022-08-16T02:34:21Z"
  registryService:
    createdAt: "2022-08-16T02:24:20Z"
    port: "50051"
    protocol: grpc
    serviceName: jian-operators
    serviceNamespace: openshift-marketplace

MacBook-Pro:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                                              READY   STATUS             RESTARTS       AGE
28bb83ea022e9728d25570ab0adbe09a31d6a0a606917488e0ddb00f925mnfw   0/1     Completed          0              3h23m
7049ea48beb27a712fa506b76ad672be201ce5d3a6a93d627a0091e0fesvdlj   0/1     Completed          0              3h23m
certified-operators-ftt2n                                         1/1     Running            0              3h49m
community-operators-27dx9                                         1/1     Running            0              3h49m
jian-operators-5zq7d                                              0/1     CrashLoopBackOff   12 (71s ago)   38m
jian-operators-gpg4v                                              0/1     CrashLoopBackOff   14 (57s ago)   48m
marketplace-operator-9c8496b58-2jfmv                              1/1     Running            0              3h56m
qe-app-registry-rqrrv                                             1/1     Running            0              141m
redhat-marketplace-s6zrj                                          1/1     Running            0              3h49m
redhat-operators-54cqr                                            1/1     Running            0              3h49m

MacBook-Pro:~ jianzhang$ oc -n openshift-marketplace logs jian-operators-gpg4v 
Error: open /etc/nsswitch.conf: permission denied
Usage:
  opm registry serve [flags]


Flags:
  -d, --database string          relative path to sqlite db (default "bundles.db")
      --debug                    enable debug logging
  -h, --help                     help for serve
  -p, --port string              port number to serve on (default "50051")
      --skip-migrate             do  not attempt to migrate to the latest db revision when starting
  -t, --termination-log string   path to a container termination log file (default "/dev/termination-log")
      --timeout-seconds string   Timeout in seconds. This flag will be removed later. (default "infinite")


Global Flags:
      --skip-tls   skip TLS certificate verification for container image registries while pulling bundles or index 

 

4. Create a namespace with the `privileged` permission.

 

MacBook-Pro:~ jianzhang$ oc get ns debug -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c30,c10
    openshift.io/sa.scc.supplemental-groups: 1000890000/10000
    openshift.io/sa.scc.uid-range: 1000890000/10000
  creationTimestamp: "2022-08-16T02:46:41Z"
  labels:
    kubernetes.io/metadata.name: debug
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"
  name: debug
  resourceVersion: "95718"
  uid: bdf93839-6c42-4365-a65c-d9c0b9fe0504
spec:
  finalizers:
  - kubernetes
status:
  phase: Active 

 
5. Deploy a CatalogSource as in step 2 above. It still crashes.

 

 

MacBook-Pro:~ jianzhang$ oc get pods -n debug
NAME                   READY   STATUS             RESTARTS        AGE
jian-operators-4g5ln   0/1     CrashLoopBackOff   10 (114s ago)   28m
jian-operators-wn766   0/1     CrashLoopBackOff   8 (2m25s ago)   18m
MacBook-Pro:~ jianzhang$ oc -n debug logs jian-operators-wn766
Error: open /etc/nsswitch.conf: permission denied
Usage:
  opm registry serve [flags]


Flags:
  -d, --database string          relative path to sqlite db (default "bundles.db")
      --debug                    enable debug logging
  -h, --help                     help for serve
  -p, --port string              port number to serve on (default "50051")
      --skip-migrate             do  not attempt to migrate to the latest db revision when starting
  -t, --termination-log string   path to a container termination log file (default "/dev/termination-log")
      --timeout-seconds string   Timeout in seconds. This flag will be removed later. (default "infinite")


Global Flags:
      --skip-tls   skip TLS certificate verification for container image registries while pulling bundles or index 

 

 

Actual results:

The sql-based index image created by the old opm version cannot be run.

 

MacBook-Pro:~ jianzhang$ oc -n debug logs jian-operators-wn766
Error: open /etc/nsswitch.conf: permission denied

 

 

Expected results:

The old SQL-based index image runs well. Or we have a workaround for it.

 

Additional info:

I switched to another old SQL-based image and tried it, and got another permission issue.

 

MacBook-Pro:~ jianzhang$ oc get catalogsource
NAME             DISPLAY          TYPE   PUBLISHER   AGE
jian-operators   Jian Operators   grpc   Jian        37m
xia-operators    Xia Operators    grpc   Xia         101s
MacBook-Pro:~ jianzhang$ oc get catalogsource xia-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-08-16T03:22:38Z"
  generation: 1
  name: xia-operators
  namespace: debug
  resourceVersion: "110629"
  uid: 8be42e68-43be-4fd4-9b67-c74edc5e6353
spec:
  displayName: Xia Operators
  image: quay.io/olmqe/ditto-index:test-xzha-1
  priority: -100
  publisher: Xia
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: xia-operators.debug.svc:50051
    lastConnect: "2022-08-16T03:24:18Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2022-08-16T03:22:38Z"
    port: "50051"
    protocol: grpc
    serviceName: xia-operators
    serviceNamespace: debug

MacBook-Pro:~ jianzhang$ oc project
Using project "debug" on server "https://api.qe-daily-412-0816.ibmcloud.qe.devcluster.openshift.com:6443".
MacBook-Pro:~ jianzhang$ oc get pods
NAME                   READY   STATUS             RESTARTS         AGE
jian-operators-4g5ln   0/1     CrashLoopBackOff   11 (3m41s ago)   35m
jian-operators-wn766   0/1     CrashLoopBackOff   9 (4m13s ago)    25m
xia-operators-6wgjt    0/1     CrashLoopBackOff   1 (8s ago)       13s
MacBook-Pro:~ jianzhang$ oc logs xia-operators-6wgjt 
time="2022-08-16T03:22:43Z" level=warning msg="\x1b[1;33mDEPRECATION NOTICE:\nSqlite-based catalogs and their related subcommands are deprecated. Support for\nthem will be removed in a future release. Please migrate your catalog workflows\nto the new file-based catalog format.\x1b[0m"
Error: open ./db-609956243: permission denied
Usage:
  opm registry serve [flags]


Flags:
  -d, --database string          relative path to sqlite db (default "bundles.db")
      --debug                    enable debug logging

 

Even if that namespace is `privileged`.

MacBook-Pro:~ jianzhang$ oc get ns debug -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c30,c10
    openshift.io/sa.scc.supplemental-groups: 1000890000/10000
    openshift.io/sa.scc.uid-range: 1000890000/10000
  creationTimestamp: "2022-08-16T02:46:41Z"
  labels:
    kubernetes.io/metadata.name: debug
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"
  name: debug
  resourceVersion: "95718"
  uid: bdf93839-6c42-4365-a65c-d9c0b9fe0504
spec:
  finalizers:
  - kubernetes
status:
  phase: Active 

But both of them work well in the 4.11 cluster, as follows:

 

MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-08-15-152346   True        False         91m     Cluster version is 4.11.0-0.nightly-2022-08-15-152346
MacBook-Pro:~ jianzhang$ oc get catalogsource
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     106m
community-operators   Community Operators   grpc   Red Hat     106m
jian-operators        Jian Operators        grpc   Jian        48m
redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     106m
redhat-operators      Red Hat Operators     grpc   Red Hat     106m
xia-operators         Xia Operators         grpc   Xia         6s
MacBook-Pro:~ jianzhang$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-fsjc8              1/1     Running   0          107m
community-operators-9qvzt              1/1     Running   0          107m
jian-operators-n5s8c                   1/1     Running   0          48m
marketplace-operator-7b777f747-22rwq   1/1     Running   0          109m
redhat-marketplace-2mgrl               1/1     Running   0          107m
redhat-operators-72q6z                 1/1     Running   0          107m
xia-operators-ngq86                    1/1     Running   0          23s
MacBook-Pro:~ jianzhang$ oc get catalogsource jian-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-08-16T02:39:52Z"
  generation: 1
  name: jian-operators
  namespace: openshift-marketplace
  resourceVersion: "58565"
  uid: 481a6fbe-00a5-4af5-86f7-d7413c658db3
spec:
  displayName: Jian Operators
  image: quay.io/olmqe/etcd-index:v1
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: jian-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-08-16T02:44:45Z"
    lastObservedState: READY
  latestImageRegistryPoll: "2022-08-16T03:24:54Z"
  registryService:
    createdAt: "2022-08-16T02:39:52Z"
    port: "50051"
    protocol: grpc
    serviceName: jian-operators
    serviceNamespace: openshift-marketplace
MacBook-Pro:~ jianzhang$ oc get catalogsource xia-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2022-08-16T03:28:07Z"
  generation: 1
  name: xia-operators
  namespace: openshift-marketplace
  resourceVersion: "59886"
  uid: a270f665-ee0b-49a5-badb-d3394c7a9344
spec:
  displayName: Xia Operators
  image: quay.io/olmqe/ditto-index:test-xzha-1
  priority: -100
  publisher: Xia
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: xia-operators.openshift-marketplace.svc:50051
    lastConnect: "2022-08-16T03:28:27Z"
    lastObservedState: READY
  registryService:
    createdAt: "2022-08-16T03:28:07Z"
    port: "50051"
    protocol: grpc
    serviceName: xia-operators
    serviceNamespace: openshift-marketplace

MacBook-Pro:~ jianzhang$ oc get ns openshift-marketplace -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    capability.openshift.io/name: marketplace
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    openshift.io/node-selector: ""
    openshift.io/sa.scc.mcs: s0:c16,c5
    openshift.io/sa.scc.supplemental-groups: 1000250000/10000
    openshift.io/sa.scc.uid-range: 1000250000/10000
    workload.openshift.io/allowed: management
  creationTimestamp: "2022-08-16T01:38:10Z"
  labels:
    kubernetes.io/metadata.name: openshift-marketplace
    olm.operatorgroup.uid/24dae571-2843-445b-b09f-5a4631cb25ba: ""
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
  name: openshift-marketplace
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 470d072e-37d9-4203-bc5a-c675800d593c
  resourceVersion: "6981"
  uid: 554a5ceb-8343-46f4-ae69-af36ee45d7fe
spec:
  finalizers:
  - kubernetes
status:
  phase: Active 

This is a clone of issue OCPBUGS-16108. The following is the description of the original issue:

Description of problem:

The customer is facing console slowness when loading a workloads page with 300+ workloads.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1. Login to OCP console
2. Workloads -> Projects -> Project -> Deployment Configs (300+)
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

One multus case always fails in QE e2e testing. Using the same net-attach-def and pod configuration files, testing passed in 4.11 but fails in 4.12 and 4.13.

Version-Release number of selected component (if applicable):

4.12 and 4.13

How reproducible:

Every time

Steps to Reproduce:

[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/NetworkAttachmentDefinitions/runtimeconfig-def-ipandmac.yaml
networkattachmentdefinition.k8s.cni.cncf.io/runtimeconfig-def created
[weliang@weliang networking]$ oc get net-attach-def -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-01-03T16:33:03Z"
    generation: 1
    name: runtimeconfig-def
    namespace: test
    resourceVersion: "64139"
    uid: bb26c08f-adbf-477e-97ab-2aa7461e50c4
  spec:
    config: '{ "cniVersion": "0.3.1", "name": "runtimeconfig-def", "plugins": [{ "type":
      "macvlan", "capabilities": { "ips": true }, "mode": "bridge", "ipam": { "type":
      "static" } }, { "type": "tuning", "capabilities": { "mac": true } }] }'
kind: List
metadata:
  resourceVersion: ""
[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/Pods/runtimeconfig-pod-ipandmac.yaml
pod/runtimeconfig-pod created
[weliang@weliang networking]$ oc get pod
NAME                READY   STATUS              RESTARTS   AGE
runtimeconfig-pod   0/1     ContainerCreating   0          6s
[weliang@weliang networking]$ oc describe pod runtimeconfig-pod
Name:         runtimeconfig-pod
Namespace:    test
Priority:     0
Node:         weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal/10.0.128.4
Start Time:   Tue, 03 Jan 2023 11:33:45 -0500
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/networks: [ { "name": "runtimeconfig-def", "ips": [ "192.168.22.2/24" ], "mac": "CA:FE:C0:FF:EE:00" } ]
              openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  runtimeconfig-pod:
    Container ID:   
    Image:          quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k5zqd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-k5zqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               26s   default-scheduler  Successfully assigned test/runtimeconfig-pod to weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal
  Normal   AddedInterface          24s   multus             Add eth0 [10.128.2.115/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  23s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(cff792dbd07e8936d04aad31964bd7b626c19a90eb9d92a67736323a1a2303c4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
  Normal   AddedInterface          7s    multus             Add eth0 [10.128.2.116/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(d2456338fa65847d5dc744dea64972912c10b2a32d3450910b0b81cdc9159ca4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
[weliang@weliang networking]$ 
 

Actual results:

Pod is not running

Expected results:

Pod should be in running state

Additional info:

 

This is a clone of issue OCPBUGS-13811. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10816. The following is the description of the original issue:

Description of problem:

We have observed a situation where:
- A workload mounting multiple EBS volumes gets stuck in a Terminating state when it finishes.
- The node that the workload ran on eventually gets stuck draining, because it gets stuck on unmounting one of the volumes from that workload, despite no containers from the workload now running on the node.

What we observe via the node logs is that the volume seems to unmount successfully. Then it attempts to unmount a second time, unsuccessfully. This unmount attempt then repeats and holds up the node.

Specific examples from the node's logs to illustrate this will be included in a private comment. 

Version-Release number of selected component (if applicable):

4.11.5

How reproducible:

Has occurred on four separate nodes on one specific cluster, but the mechanism to reproduce it is not known.

Steps to Reproduce:

1.
2.
3.

Actual results:

A volume gets stuck unmounting, holding up removal of the node and completed deletion of the pod.

Expected results:

The volume should not get stuck unmounting.

Additional info:

 

This is a clone of issue OCPBUGS-19942. The following is the description of the original issue:

This is a clone of issue OCPBUGS-16735. The following is the description of the original issue:

Description of problem:

oc adm inspect generated files sometimes have a leading "---" and sometimes do not. This depends on the order of the objects collected. By itself this is not an issue.

However, this becomes an issue when combined with multiple invocations of oc adm inspect collecting data into the same directory, as must-gather does.

If an object is collected multiple times then the second time oc might overwrite the original file improperly and leave 4 bytes of the original content behind.

This happens when the "---\n" is not written in the second invocation, as this makes the content 4 bytes shorter and the original trailing 4 bytes are left intact in the file.

This garbage confuses YAML parsers.

Version-Release number of selected component (if applicable):

4.14 nightly as of Jul 25 and before

How reproducible:

Always

Steps to Reproduce:

Run oc adm inspect twice with different order of objects:

[msivak@x openshift-must-gather]$ oc adm inspect performanceprofile,machineconfigs,nodes --dest-dir=inspect.dual --all-namespaces
[msivak@x openshift-must-gather]$ oc adm inspect nodes --dest-dir=inspect.dual --all-namespaces


And then check the alphabetically first node yaml file - it will have garbage at the end of the file.

Actual results:

Garbage at the end of the file.

Expected results:

No garbage.

Additional info:

I believe this is caused by the lack of Truncate mode here https://github.com/openshift/oc/blob/master/pkg/cli/admin/inspect/writer.go#L54


Collecting data multiple times cannot be easily avoided when multiple collect scripts are combined with relatedObjects requested by operators.

Description of problem:

We're seeing frequent private DNS zone creation failures in Azure CI jobs in the last two days; the Azure CI jobs have been greatly affected.
https://search.ci.openshift.org/?search=error+creating%2Fupdating+Private+DNS+Zone+Virtual+network&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Such as the following error from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1566852244215697408

level=info msg=Consuming Openshift Manifests from target directory
level=info msg=Consuming Common Manifests from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating/updating Private DNS Zone Virtual network link "ci-op-1w80vs6f-7f65d-t2zlz-network-link" (Resource Group "ci-op-1w80vs6f-7f65d-t2zlz-rg"): privatedns.VirtualNetworkLinksClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ParentResourceNotFound" Message="Can not perform requested operation on nested resource. Parent resource 'ci-op-1w80vs6f-7f65d.ci2.azure.devcluster.openshift.com' not found."
level=error
level=error msg=  with module.dns.azureprivatedns_zone_virtual_network_link.network,
level=error msg=  on dns/dns.tf line 13, in resource "azureprivatedns_zone_virtual_network_link" "network":
level=error msg=  13: resource "azureprivatedns_zone_virtual_network_link" "network" 

Version-Release number of selected component (if applicable):

All OCP versions

How reproducible:

https://search.ci.openshift.org/chart?name=e2e-azure&search=error+creating%2Fupdating+Private+DNS+Zone&maxAge=24h&type=build-log
shows 26% of the failed Azure jobs are related to "error creating/updating Private DNS Zone" in the past day. 
3/5 of the failed Azure jobs are caused by this in QE’s CI today. 

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:

 
No Azure outage was reported from https://status.azure.com/en-us/status.
No private zone or DNS records quota exceeded was observed.   

Description of problem:

openshift-install does not detect releaseImage mismatches between cluster-image-set.yaml and registries.conf

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

1. Create ZTP inputs for image generation where registries.conf does not have any source matching the binary's releaseImage (the release image built into the binary can be obtained by running "openshift-install version"). You can also set this value in the ZTP manifest cluster-image-set.yaml.
2. Run openshift-install agent create image

Actual results:

Image is generated with no warnings

Expected results:

Image is generated with warning message - "The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value %s", releaseImagePath

 

Additional info:

 
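For reference, a hypothetical pair of matching excerpts (registry hosts, versions and names are placeholders) illustrating the consistency the warning should check for:

# cluster-image-set.yaml (ZTP manifest)
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: openshift-4.12
spec:
  releaseImage: quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64

# install-config.yaml: at least one source should match the release image repository
imageContentSources:
- mirrors:
  - mirror.example.com:5000/openshift-release-dev/ocp-release
  source: quay.io/openshift-release-dev/ocp-release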

 

This is a clone of issue OCPBUGS-15228. The following is the description of the original issue:

Description of problem:
When trying to import the Helm chart "httpd-imagestreams", the "Create Helm Release" page shows an info alert that the form isn't available because there isn't a schema for this Helm chart. But the YAML view is also not visible.

Info Alert:

Form view is disabled for this chart because the schema is not available

Version-Release number of selected component (if applicable):
4.9-4.14 (current master)

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the developer perspective
  2. Navigate to Add > Helm Chart
  3. Search and select "httpd-imagestreams", click the card and then Create to open the "Create Helm Release" page

Actual results:

  1. Form / YAML switch is disabled
  2. Info alert is shown: Form view is disabled for this chart because the schema is not available
  3. There is no YAML editor

Expected results:

  1. It's fine that the Form/ YAML switch is disabled
  2. Info alert is also fine
  3. YAML editor should be displayed

Additional info:
The chart yaml is available here and doesn't contain a schema (at the moment).

https://github.com/openshift-helm-charts/charts/blob/main/charts/redhat/redhat/httpd-imagestreams/0.0.1/src/Chart.yaml

libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.

This is a clone of issue OCPBUGS-3668. The following is the description of the original issue:

Description of problem:

Installer fails to install 4.12.0-rc.0 on VMware IPI with the script that worked with prior OCP versions.
The error happens during the Terraform prepare step when gathering information in the "Platform Provisioning Check". It looks like a permission issue, but we're using the vCenter administrator account. I double-checked and that account has all the necessary permissions.

Version-Release number of selected component (if applicable):

OCP installer 4.12.0-rc.0
VSphere & Vcenter 7.0.3 - no pending updates

How reproducible:

always - we observed this already in the nightlies, but wanted to wait for an RC to confirm

Steps to Reproduce:

1. Try to install using the openshift-install binary

Actual results:

Fails during the preparation step

Expected results:

Installs the cluster ;)

Additional info:

This runs in our CICD pipeline; let me know if you need access to the full run log:
https://gitlab.consulting.redhat.com/cblum/storage-ocs-lab/-/jobs/219304

This includes the install-config.yaml, all component versions and the full debug log output

Description of problem:

Seems ART is having trouble building OLM images: https://redhat-internal.slack.com/archives/CB95J6R4N/p1676531421724929

I've already fixed master: 
* https://github.com/openshift/cluster-policy-controller/pull/103
* https://github.com/openshift/cluster-policy-controller/pull/101

Need a bug to backport...

Version-Release number of selected component (if applicable):

4.12

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of the below bug for 4.12 since https://search.ci.openshift.org/?search=Operator+%5C%5C%22prometheus-operator%5C%5C%22+produces+more+watch+requests+than+expected&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job returns results for 4.12 as well.

Description of problem

4.11 CI is flaky because of test failures such as the following:

{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:449]: Expected
    <[]string | len:1, cap:1>: [
        "Operator \"prometheus-operator\" produces more watch requests than expected: watchrequestcount=197, upperbound=180, ratio=1.0944444444444446",
    ]
to be empty
Ginkgo exit error 1: exit with code 1}

This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/894/pull-ci-openshift-cluster-ingress-operator-release-4.11-e2e-aws-single-node/1642909565664104448. Search.ci has additional similar errors.

Version-Release number of selected component (if applicable)

I have seen these failures in 4.11 CI jobs.

How reproducible

Presently, search.ci shows the following stats for the past 7 days:

Found in 0.01% of runs (0.05% of failures) across 139252 total runs and 7354 jobs (16.18% failed)

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=Operator+%5C%5C%22prometheus-operator%5C%5C%22+produces+more+watch+requests+than+expected&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Actual results

CI fails.

Expected results

CI passes, or fails on some other test failure, and the failures don't show up in search.ci.

Additional info

This issue is not blocking my PR; the affected job, e2e-aws-single-node for cluster-dns-operator's release-4.11 branch, is optional. Please prioritize this bug report as you see fit.

Description of problem:

Automatic ART PRs to update the build config are failing. Needs manual intervention.

Description of problem:

See: https://issues.redhat.com/browse/CPSYN-143

tldr:  Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible w/ running restricted.

1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did)
2) we updated the opm binary so catalog images using a newer opm binary don't have to run privileged
3) we added a field to catalogsource that allows you to choose whether to run the pod privileged(legacy mode) or restricted.  The default is restricted.  We made that the default so that users running their own catalogs in their own NSes (which would be default PSA enforcing) would be able to be successful w/o needing their NS upgraded to privileged.

Unfortunately this means:
1) legacy catalog images (i.e. using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) legacy catalog images cannot be run in the openshift-marketplace NS since that NS does not allow privileged pods.  This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that NS to be in the global catalog).

Before 4.12 ships we need to:
1) remove the PSA restricted label on the openshift-marketplace NS
2) change the catalogsource securitycontextconfig mode default to use "legacy" as the default, not restricted.

This gives catalog authors another release to update to using a newer opm binary that can run restricted, or get their NSes explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work)

In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most NSes.
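
A minimal sketch of a CatalogSource that opts into the legacy (privileged) mode described above, assuming the securitycontextconfig field is exposed under the catalog pod configuration (the catalog name and image are placeholders):

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: legacy-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/legacy-index:latest
  grpcPodConfig:
    securityContextConfig: legacy   # run the catalog pod in legacy (privileged) mode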


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:

Description of problem:

The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, the same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2 minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has a 3-minute timeout. Additionally, the non-determinism and longer throttling make the UX worse, because actions done in the cluster may have their observable effect delayed.

Version-Release number of selected component (if applicable):

discovered in 4.10 -> 4.11 upgrade jobs

How reproducible:

The test seems to flake in ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs, though, which is a bit weird.

Steps to Reproduce:

Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:

$ ./check-cvo.py cvo.log && echo PASS || echo FAIL

Preferably, inspect the CVO logs of clusters that just underwent an upgrade (upgrades make the original problematic behavior more likely to surface).

Actual results:

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks:           5
Upgradeability check cache hits: 12
FAIL

Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.

Expected results:

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks:           12
Upgradeability check cache hits: 11
PASS

The actual numbers are not relevant (unless the upgradeability check count is zero, which means the test is not conclusive; the script warns about that), the lack of failure is.

Additional info:

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
...
I1227 06:50:59.023190       1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010       1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.

I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With a period of 15m, the test pass rate was 4 of 9. With a period of 1m, the test did not fail at all.

Some context in Slack thread

This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:

Description of problem:

Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default, and there is no way to disable it or reduce the log level.

https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3


Version-Release number of selected component (if applicable): none

How reproducible: Every time

Steps to Reproduce:

Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3

It is enabled by default, and there is no way to disable it or reduce the log level.

Actual results:

Please check case 03371411; the log file grew to 409 GB.

Expected results: Need a way to disable debug

Additional info: Case 03371411. A cluster must-gather and the log file can be found in the case.
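A minimal sketch of the kind of opt-out being requested follows. It is purely illustrative: the IRONIC_DEBUG variable and the template fragment are hypothetical, not the actual ironic-image change, and it assumes Python with Jinja2 installed.

import os
from jinja2 import Template

# Hypothetical knob: render the debug setting from an environment variable
# instead of hard-coding "debug = true" in ironic.conf.j2.
fragment = Template("[DEFAULT]\ndebug = {{ 'true' if debug else 'false' }}\n")
print(fragment.render(debug=os.environ.get("IRONIC_DEBUG", "false").lower() == "true"))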

Description of problem:

It is a disconnected cluster on AWS that uses STS, and there is an issue configuring Egress IP. Looking into the cloud-network-config-controller pod, it is trying to connect to the global STS service "https://sts.amazonaws.com/" when it should connect to the regional endpoint "https://sts.ap-southeast-1.amazonaws.com".

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a disconnected OCP cluster on AWS.
$ oc get netnamespace | grep egress
egress-ip-test                                     2689387    ["172.16.1.24"]
$ oc get hostsubnet
NAME                                              HOST                                              HOST IP        SUBNET          EGRESS CIDRS   EGRESS IPS
ip-172-16-1-151.ap-southeast-1.compute.internal   ip-172-16-1-151.ap-southeast-1.compute.internal   172.16.1.151   10.130.0.0/23                  
ip-172-16-1-53.ap-southeast-1.compute.internal    ip-172-16-1-53.ap-southeast-1.compute.internal    172.16.1.53    10.131.0.0/23                  ["172.16.1.24"]
ip-172-16-2-15.ap-southeast-1.compute.internal    ip-172-16-2-15.ap-southeast-1.compute.internal    172.16.2.15    10.128.0.0/23                  
ip-172-16-2-77.ap-southeast-1.compute.internal    ip-172-16-2-77.ap-southeast-1.compute.internal    172.16.2.77    10.128.2.0/23                  
ip-172-16-3-111.ap-southeast-1.compute.internal   ip-172-16-3-111.ap-southeast-1.compute.internal   172.16.3.111   10.129.0.0/23                  
ip-172-16-3-79.ap-southeast-1.compute.internal    ip-172-16-3-79.ap-southeast-1.compute.internal    172.16.3.79    10.129.2.0/23                  
$ oc logs sdn-controller-6m5kb -n openshift-sdn
I0922 04:09:53.348615       1 vnids.go:105] Allocated netid 2689387 for namespace "egress-ip-test"
E0922 04:24:00.682018       1 egressip.go:254] Ignoring invalid HostSubnet ip-172-16-1-53.ap-southeast-1.compute.internal (host: "ip-172-16-1-53.ap-southeast-1.compute.internal", ip: "172.16.1.53", subnet: "10.131.0.0/23"): related node object "ip-172-16-1-53.ap-southeast-1.compute.internal" has an incomplete annotation "cloud.network.openshift.io/egress-ipconfig", CloudEgressIPConfig: <nil>
 $ oc logs cloud-network-config-controller-5c7556db9f-x78bs -n openshift-cloud-network-config-controller

E0922 04:26:59.468726       1 controller.go:165] error syncing 'ip-172-16-2-77.ap-southeast-1.compute.internal': error retrieving the private IP configuration for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: error: cannot list ec2 instance for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout, requeuing in node workqueue
$ oc get Infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2022-09-22T03:28:15Z"
    generation: 1
    name: cluster
    resourceVersion: "598"
    uid: 994da301-2a96-43b7-b43b-4b7c18d4b716
  spec:
    cloudConfig:
      name: ""
    platformSpec:
      aws:
        serviceEndpoints:
        - name: sts
          url: https://sts.ap-southeast-1.amazonaws.com
        - name: ec2
          url: https://ec2.ap-southeast-1.amazonaws.com
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com
      type: AWS
  status:
    apiServerInternalURI: https://api-int.openshiftyy.ocpaws.sadiqueonline.com:6443
    apiServerURL: https://api.openshiftyy.ocpaws.sadiqueonline.com:6443
    controlPlaneTopology: HighlyAvailable
    etcdDiscoveryDomain: ""
    infrastructureName: openshiftyy-wfrpf
    infrastructureTopology: HighlyAvailable
    platform: AWS
    platformStatus:
      aws:
        region: ap-southeast-1
        serviceEndpoints:
        - name: ec2
          url: https://ec2.ap-southeast-1.amazonaws.com
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com
        - name: sts
          url: https://sts.ap-southeast-1.amazonaws.com
      type: AWS
kind: List
metadata:
  resourceVersion: ""
$ oc get secret aws-cloud-credentials -n openshift-machine-api -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-machine-api-aws-cloud-credentials
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credential-operator-iam-ro-creds -n openshift-cloud-credential-operator -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-credential-operator-cloud-creden
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-image-registry-installer-cloud-credent
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-ingress-operator -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-ingress-operator-cloud-credentials
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-cloud-network-config-controller -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-network-config-controller-cloud-
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret ebs-cloud-credentials -n openshift-cluster-csi-drivers -o json |jq -r .data.credentials |base64 -d
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cluster-csi-drivers-ebs-cloud-credenti
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 

 

Actual results:

Egress IP is not configured properly, and cloud-network-config-controller is trying to connect to the global STS service.

Expected results:

Egress IP should get configured, and cloud-network-config-controller should connect to the regional STS service instead of the global one.

Additional info:
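The controller itself is written in Go, but the expected endpoint resolution can be illustrated with a short Python/boto3 sketch (illustration only; the region is taken from the outputs above). The credentials files already set sts_regional_endpoints = regional, which corresponds to AWS_STS_REGIONAL_ENDPOINTS=regional in the SDKs; pinning the endpoint explicitly, as the Infrastructure serviceEndpoints do, has the same effect.

import os
import boto3

# Prefer the regional STS endpoint (mirrors sts_regional_endpoints = regional
# in the credentials files above).
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"
sts = boto3.client("sts", region_name="ap-southeast-1")
print(sts.meta.endpoint_url)  # expected: https://sts.ap-southeast-1.amazonaws.com

# Alternatively, pin the endpoint explicitly, as the Infrastructure
# serviceEndpoints above do.
sts_pinned = boto3.client(
    "sts",
    region_name="ap-southeast-1",
    endpoint_url="https://sts.ap-southeast-1.amazonaws.com",
)
print(sts_pinned.meta.endpoint_url)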

 

Description of problem:
The pod fails to mount the PVC using IBM Cloud VPC block storage.

Version-Release number of selected component (if applicable):

How reproducible:
The steps can be followed from this link:
https://cloud.ibm.com/docs/openshift?topic=openshift-vpc-block
The error occurs when the application pod tries to mount the PVC.

Steps to Reproduce:
Described above.

Actual results:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26m default-scheduler Successfully assigned default/test to a100-huge-m25p7-worker-3-with-secondary-xdwvl
Normal SuccessfulAttachVolume 26m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba"
Warning FailedMount 26m (x2 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: ffbb97b4-e4d0-4016-87a9-dc46f80c5478 , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
The file /dev/disk/by-id/virtio-0777-6872e22d-5c00-4 does not exist and no size was specified.
) , Action: Please check if there is any error in POD describe related with volume attach}
Warning FailedMount 22m kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[kube-api-access-6bgvj bs-pvc]: timed out waiting for the condition
Warning FailedMount 4m11s (x9 over 24m) kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[bs-pvc kube-api-access-6bgvj]: timed out waiting for the condition
Warning FailedMount 3m51s (x17 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: 1a12a7c5-3bd0-41cf-b8a9-90dd3224c2fb , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)

Expected results:
The pod should successfully mount the PVC

Additional info:
Had a debugging session with Sameer Shaikh and Arashad Ahamad from the IBM VPC block storage team. The conclusion is that the udevadm utility is missing in the IPI image used by the IBM Cloud VPC block storage CSI.

  1. oc exec -it ibm-vpc-block-csi-controller-5fbb46bdc6-k7kpf -n openshift-cluster-csi-drivers -c iks-vpc-block-driver bash
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    bash-4.4$ which udevadm
    which: no udevadm in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:

Some AWS CI jobs failed with the error: Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist


Errors from: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn/1566090383895564288


level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist
level=error msg= status code: 400, request id: 5223ac0c-77cb-4f29-adc8-4192e9fc3ef8
level=error
level=error msg=  with module.vpc.aws_nat_gateway.nat_gw[0],
level=error msg=  on vpc/vpc-public.tf line 85, in resource "aws_nat_gateway" "nat_gw":
level=error msg=  85: resource "aws_nat_gateway" "nat_gw" {
level=error
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1
level=error
level=error msg=Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist
level=error msg= status code: 400, request id: 5223ac0c-77cb-4f29-adc8-4192e9fc3ef8
level=error
level=error msg=  with module.vpc.aws_nat_gateway.nat_gw[0],
level=error msg=  on vpc/vpc-public.tf line 85, in resource "aws_nat_gateway" "nat_gw":
level=error msg=  85: resource "aws_nat_gateway" "nat_gw" {
level=error
level=error

Version-Release number of selected component (if applicable):

4.11 - 4.12

How reproducible:

Happens occasionally; matching failures can be found by searching CI logs via https://search.ci.openshift.org/?search=InvalidElasticIpID.NotFound&maxAge=48h&context=1&type=build-log&name=.*4%5C.12.*aws.*&excludeName=.*upgrade.*&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1. All kinds of AWS IPI installations can encounter this error.

Actual results:

Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist

Expected results:

Cluster install succeeds.

Additional info:

4.11 jobs with this error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64/1565061451272425472
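One plausible reading, not confirmed by this report, is an eventual-consistency race: the newly allocated EIP is not yet visible to the EC2 API when Terraform creates the NAT gateway. A hedged boto3 sketch of the kind of retry that would mask such a race (region, backoff values, and the helper name are illustrative):

import time
import boto3
from botocore.exceptions import ClientError

# Assumptions: credentials and region are configured; subnet_id and
# allocation_id come from the caller.
ec2 = boto3.client("ec2", region_name="us-east-1")

def create_nat_gateway_with_retry(subnet_id, allocation_id, attempts=6):
    for attempt in range(attempts):
        try:
            return ec2.create_nat_gateway(SubnetId=subnet_id, AllocationId=allocation_id)
        except ClientError as err:
            if err.response["Error"]["Code"] != "InvalidElasticIpID.NotFound":
                raise
            time.sleep(2 ** attempt)  # wait for the freshly allocated EIP to propagate
    raise RuntimeError(f"EIP {allocation_id} never became visible to the EC2 API")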

While investigating TRT-413, we discovered that many service monitors are configured to use bearer token authentication. Per this document https://github.com/deads2k/openshift-enhancements/blob/master/enhancements/monitoring/client-cert-scraping.md, we should try to use client certificate authentication for metrics scraping. This is to make sure metrics collection still works even when the apiserver is not available.

 

Currently, the following repos have been identified to be fixed:

 

ServiceMonitor Name Namespace PRs
cloud-credential-operator openshift-cloud-credential-operator https://github.com/openshift/cloud-credential-operator/pull/483
csi-driver-controller-monitor openshift-cluster-csi-drivers https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/103
  openshift-cluster-csi-drivers https://github.com/openshift/csi-driver-manila-operator/pull/153
  openshift-cluster-csi-drivers https://github.com/openshift/csi-driver-shared-resource-operator/pull/54
  openshift-cluster-csi-drivers https://github.com/openshift/gcp-filestore-csi-driver-operator/pull/6
  openshift-cluster-csi-drivers https://github.com/openshift/ovirt-csi-driver-operator/pull/102
     
cluster-machine-approver openshift-cluster-machine-approver https://github.com/openshift/cluster-machine-approver/pull/169
node-tuning-operator openshift-cluster-node-tuning-operator https://github.com/openshift/cluster-node-tuning-operator/pull/427
cluster-samples-operator openshift-cluster-samples-operator https://github.com/openshift/cluster-samples-operator/pull/464
cluster-storage-operator openshift-cluster-storage-operator https://github.com/openshift/cluster-storage-operator/pull/306
cluster-version-operator openshift-cluster-version https://github.com/openshift/cluster-version-operator/pull/816
config-operator openshift-config-operator https://github.com/openshift/cluster-config-operator/pull/259
console openshift-console https://github.com/openshift/console-operator/pull/668
console-operator openshift-console-operator https://github.com/openshift/console-operator/pull/668
dns-default openshift-dns Didn't find the source
dns-operator openshift-dns-operator https://github.com/openshift/cluster-dns-operator/pull/334
image-registry openshift-image-registry https://github.com/openshift/cluster-image-registry-operator/pull/796
image-registry-operator openshift-image-registry https://github.com/openshift/cluster-image-registry-operator/pull/796
router-default openshift-ingress Didn't find the source
ingress-operator openshift-ingress-operator https://github.com/openshift/cluster-ingress-operator/pull/816
kube-scheduler openshift-kube-scheduler https://github.com/openshift/cluster-kube-scheduler-operator/pull/434
cluster-autoscaler-operator openshift-machine-api https://github.com/openshift/cluster-autoscaler-operator/pull/249
machine-api-controllers openshift-machine-api https://github.com/openshift/machine-api-operator/pull/1054
machine-api-operator openshift-machine-api https://github.com/openshift/machine-api-operator/pull/1054
machine-config-controller openshift-machine-config-operator https://github.com/openshift/machine-config-operator/pull/3277
machine-config-daemon openshift-machine-config-operator https://github.com/openshift/machine-config-operator/pull/3277
marketplace-operator openshift-marketplace https://github.com/operator-framework/operator-marketplace/pull/482
cluster-monitoring-operator openshift-monitoring https://github.com/openshift/cluster-monitoring-operator/pull/1738
openshift-state-metrics openshift-monitoring https://github.com/openshift/cluster-monitoring-operator/pull/1738
prometheus-adapter openshift-monitoring https://github.com/openshift/cluster-monitoring-operator/pull/1738
monitor-multus-admission-controller openshift-multus https://github.com/openshift/cluster-network-operator/pull/1522
monitor-network openshift-multus https://github.com/openshift/cluster-network-operator/pull/1522
network-operator openshift-network-operator https://github.com/openshift/cluster-network-operator/pull/1522
catalog-operator openshift-operator-lifecycle-manager https://github.com/openshift/operator-framework-olm/pull/350
olm-operator openshift-operator-lifecycle-manager https://github.com/openshift/operator-framework-olm/pull/350
monitor-ovn-master-metrics openshift-ovn-kubernetes https://github.com/openshift/cluster-network-operator/pull/1522
monitor-ovn-node openshift-ovn-kubernetes https://github.com/openshift/cluster-network-operator/pull/1522
monitor-sdn openshift-sdn https://github.com/openshift/cluster-network-operator/pull/1522
monitor-sdn-controller openshift-sdn https://github.com/openshift/cluster-network-operator/pull/1522

 

Additionally, it was discovered that kube-rbac-proxy does not automatically reload the client CA certificate. That issue is addressed with https://issues.redhat.com/browse/TRT-464. Until the fix lands in OpenShift, some of the above changes (repositories that use kube-rbac-proxy) will not be effective.

 

For the repositories that are not using kube-rbac-proxy (e.g. storage operator), the above change can be merged and verified. 

 

How to verify

  1. Make sure the corresponding ServiceMonitor object contains certFile and keyFile.
  2. Make sure the ServiceMonitor does NOT have bearerTokenFile configured.
  3. With the ServiceMonitor configuration verified above, check Prometheus to make sure scraping for the corresponding namespace still works. A simple up{namespace="<namespace>"} query should be good enough (a minimal sketch of steps 1 and 2 follows below).
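A minimal sketch of steps 1 and 2 above, assuming an oc session against the cluster (the helper itself is hypothetical, not part of any repository listed above):

import json
import subprocess
import sys

def verify_servicemonitor(name, namespace):
    raw = subprocess.run(
        ["oc", "get", "servicemonitor", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    endpoints = json.loads(raw).get("spec", {}).get("endpoints", [])
    ok = True
    for endpoint in endpoints:
        tls = endpoint.get("tlsConfig", {})
        if not (tls.get("certFile") and tls.get("keyFile")):
            print(f"{namespace}/{name}: endpoint is missing tlsConfig.certFile/keyFile")
            ok = False
        if endpoint.get("bearerTokenFile"):
            print(f"{namespace}/{name}: endpoint still configures bearerTokenFile")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify_servicemonitor(sys.argv[1], sys.argv[2]) else 1)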

 

 

 

This is a clone of issue RHIBMCS-151. The following is the description of the original issue:

Error msg

type: 'Warning' reason: 'ResolutionFailed' constraints not satisfiable: @existing/ibm-common-services//ibm-namespace-scope-operator.v2.0.0 and @existing/ibm-common-services//ibm-namespace-scope-operator.v1.15.0 provide NamespaceScope (operator.ibm.com/v1), subscription ibm-namespace-scope-operator requires @existing/ibm-common-services//ibm-namespace-scope-operator.v2.0.0, subscription ibm-namespace-scope-operator exists, clusterserviceversion ibm-namespace-scope-operator.v1.15.0 exists and is not referenced by a subscription

 

The issue happens during upgrade, with and without a channel switch. It has been reported in several places:

https://ibm-cloudplatform.slack.com/archives/CM95C10RK/p1662557747140069

PrivateCloud-analytics/CPD-Quality#5548

 

Current status

Issue opened in OLM community https://github.com/operator-framework/operator-lifecycle-manager/issues/2201

 

bugzilla ticket https://bugzilla.redhat.com/show_bug.cgi?id=1980755

 

Knowledge Base from Red Hat https://access.redhat.com/solutions/6603001

 

It is a known OLM issue, and Bedrock also documents workarounds: https://www.ibm.com/docs/en/cpfs?topic=ii-olm-known-issue-updates-subscription-status-creates-csv-asynchronous

 

 

 

Usually, when the second error message ("not referenced by a subscription") happens, it requires us to re-install the operator.

https://www.ibm.com/docs/en/cpfs?topic=ii-olm-known-issue-updates-subscription-status-creates-csv-asynchronous

Alternatively, the mis-synchronisation may sometimes be rectified by restarting the catalog and olm operators (see the sketch below).
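A minimal sketch of that restart workaround, assuming oc access to the cluster and the default OLM namespace (illustrative only):

import subprocess

# Restart the catalog and OLM operator deployments; equivalent to running
# `oc -n openshift-operator-lifecycle-manager rollout restart deployment/...`.
for deployment in ("catalog-operator", "olm-operator"):
    subprocess.run(
        ["oc", "-n", "openshift-operator-lifecycle-manager",
         "rollout", "restart", f"deployment/{deployment}"],
        check=True,
    )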

 

 

 

We saw this in an OKD job:

https://github.com/openshift/machine-config-operator/pull/3358#issuecomment-1267532305

 

It's simple to reproduce; from, say, a current RHCOS 4.12, do:

 

[root@cosa-devsh ~]# podman run --privileged --pid=host --net=host --rm -v /:/run/host quay.io/fedora/fedora-coreos:testing-devel "rpm-ostree" "ex" "deploy-from-self" "/run/host"
NOTICE: Experimental commands are subject to change.
error: Writing content object: Setting xattrs: fsetxattr(security.selinux): Invalid argument
 

 

I've tried doing `--security-opt label=type:unconfined_t` which gives the same error (of course), but using `install_t` I get:

 

[root@cosa-devsh ~]# podman run --privileged --security-opt label=type:install_t --pid=host --net=host --rm -v /:/run/host quay.io/fedora/fedora-coreos:testing-devel "rpm-ostree" "ex" "deploy-from-self" "/run/host"
exec /usr/bin/rpm-ostree: permission denied
[root@cosa-devsh ~]#

 

I'm really tempted to just `setenforce 0` for the first OS update...

This is a clone of issue OCPBUGS-16164. The following is the description of the original issue:

LatencySensitive has been functionally equivalent to "" (Default) for several years. Code has forgotten that the featureset must be handled, and it's more efficacious to remove the featureset (with migration code) than to try to plug all the holes.

Description of problem:
Some minor strings in the Quick Start weren't translated, for example, the error message "No Quick Start found" when all Quick Starts are disabled, and the string "View Prerequisites" in the quick start content drawer.

Version-Release number of selected component (if applicable):
4.11-4.12, maybe earlier

How reproducible:
Always

Steps to Reproduce:
1. Switch the language and open a quick start in the drawer.
2. Or disable all quick starts and check the quick start page.

Actual results:
1. String "View Prerequisites" is not translated
2. Error page is empty

Expected results:
1. String "View Prerequisites" should be translated (at least in pseudo translation)
2. The error page should show an info message that no Quick Starts were found.

Additional info: