
4.10.61

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete

Changes from 4.9.59

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager. This creates limitations for adding new features and for sharing a codebase between RHEL Advisor and the OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features from the RHEL UI; the codebase is not a 1:1 clone.
As a customer of Insights, I will have the same or a very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be the same, and features introduced to Advisor will be automatically considered for all supported platforms.
As an OpenShift user, I will still see integrations of Insights Advisor within OpenShift Cluster Manager that show aggregated information for the customer account and a single-cluster view of Advisor data. These integrations will point to the new Insights Advisor for OpenShift app that will be tightly integrated into OpenShift Cluster Manager.

  • Note: The application will be reusing the codebase but will run as a separate app for OpenShift. There's no intent to merge RHEL and OpenShift workflows into a single app.

Goals

  • Q2CY21: Explore the possibility of unifying the codebase between RHEL Advisor and the OCM Insights Advisor tab. Identify architecture misalignments, create UI mockups to merge the two existing UIs.
  • Q3CY21: Integrate OpenShift into the Advisor codebase, stand up the Insights Advisor for OpenShift application, and change the integration in OpenShift Cluster Manager to point at the new app.
  • Q4CY21: Deliver the missing screens of Insights Advisor for OpenShift (Systems and Recommendations views).

Requirements

  • UX overview of UI elements in both UIs - Marie Doruskova
  • Architecture overview/misalignments for both UIs - Jan Zeleny [~fjansen]

Benefits

  • Feature parity between RHEL and OpenShift
  • Adopting new features developed by RHEL Advisor team quicker
  • Smaller maintenance cost

Questions to answer...

  • What are the possible deviations between OpenShift and RHEL?
  • How does the remediation workflow differ between OpenShift and RHEL?

Out of Scope

  • A single app that combines RHEL hosts and OpenShift clusters. The goal is still to differentiate between platforms and offer a view for a single platform only.
  • Direct/Supervised remediations and integration of remediations with Advanced Cluster Manager (as a Service)

Background, and strategic fit

  • Insights Advisor for OpenShift follows the goal to introduce multiple applications that add value for OpenShift customers under the Insights brand. The current UI and integration of Advisor into OpenShift Cluster Manager doesn't follow the pattern that other Insights for OpenShift applications can/will follow.

Documentation Considerations

  • OCM documentation is impacted; existing workflows described in OCM documentation will persist. The placement of the application within OCM will be different.

 

OCP WebConsole, in the main dashboard, has an Insights Advisor widget, which has been redirecting users to OCM. Due to the Insights Advisor tab decommission in OCM, the links should point to Advisor instead.

4.10 code freeze = 28 January (marking the task as urgent)

Problem Alignment

The Problem

Today, all individual configuration, for example routing configuration, is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request with an admin to add the required settings.

That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.

We would like to introduce a more self-service approach where individual teams can create their own configuration for their needs without the administrator's involvement.

Last but not least, since Monitoring is deployed as a core service of OpenShift, there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction concerns the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. The SRE team can't give users access to the centrally managed secret, due to security concerns, so users cannot add their own routing information.

High-Level Approach

Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.

Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.
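For illustration, the upstream Prometheus Operator's AlertmanagerConfig CRD already captures this idea of a namespaced subset of the Alertmanager configuration. The following is a minimal, hedged sketch of the kind of per-team routing such an API could accept; the team namespace, receiver name, Slack channel, and Secret name are assumptions made purely for illustration.

    # Hypothetical per-team routing configuration (illustrative names throughout).
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: team-a-routing          # assumed name
      namespace: team-a             # the team's own namespace
    spec:
      route:
        receiver: team-a-slack
        matchers:
          - name: severity
            value: critical
      receivers:
        - name: team-a-slack
          slackConfigs:
            - channel: "#team-a-alerts"
              apiURL:
                name: team-a-slack-webhook   # assumed Secret in the same namespace
                key: url

The operator merges such objects into the central Alertmanager configuration without teams touching the central configuration file, which matches the approach described above.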

Goal & Success

  • Allow users to deploy individual configurations that allow setting up Alertmanager for their needs without an administrator.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to control who can CRUD individual configuration so that I can make sure that no unknown third party can touch the central Alertmanager instance shipped within OpenShift Monitoring.
  • As a team owner, I want to deploy a routing configuration to push notifications for alerts to my system of choice.

Key Flows

Team A wants to send all their important notifications to a specific Slack channel.

  • Administrator gives Team A permission to create a new configuration CR in their individual namespace (an RBAC sketch follows below).
  • Team A creates a new configuration CR.
  • Team A configures what alerts should go into their Slack channel.
Open Questions & Key Decisions (optional)

  • Do we want to improve anything inside the developer console to allow configuration?
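Returning to the first step of the key flow above, the administrator's grant can be expressed with standard Kubernetes RBAC. The sketch below is hedged: it assumes the configuration CR is the Prometheus Operator's AlertmanagerConfig, and the namespace, role, and group names are illustrative.

    # Hypothetical RBAC giving Team A CRUD on AlertmanagerConfig objects in its own namespace.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: alertmanagerconfig-editor   # assumed name
      namespace: team-a
    rules:
      - apiGroups: ["monitoring.coreos.com"]
        resources: ["alertmanagerconfigs"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: team-a-alertmanagerconfig-editor
      namespace: team-a
    subjects:
      - kind: Group
        name: team-a                    # assumed group representing Team A
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: alertmanagerconfig-editor
      apiGroup: rbac.authorization.k8s.io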

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement | Notes | isMvp?
Secrets and ConfigMaps can get shared across namespaces | | YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them. 
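As an illustration of the direction described above, the Shared Resource CSI driver work covered later in this document introduces cluster-scoped SharedSecret objects. The sketch below is hedged: the API version reflects the tech preview API, and the share name and source namespace are assumptions.

    # Hypothetical SharedSecret exposing an entitlement Secret for cross-namespace consumption.
    apiVersion: sharedresource.openshift.io/v1alpha1
    kind: SharedSecret
    metadata:
      name: etc-pki-entitlement              # assumed share name
    spec:
      secretRef:
        name: etc-pki-entitlement            # assumed source Secret
        namespace: openshift-config-managed  # assumed source namespace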

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Deliver the Projected Resources CSI driver via the OpenShift Payload

Why is this important?

  • Projected resource shares will be a core feature of OpenShift. The share and CSI driver have multiple use cases that are important to users and cluster administrators.
  • The use of projected resources will be critical to distributing Simple Content Access (SCA) certificates to workloads, such as Deployments, DaemonSets, and OpenShift Builds.

Scenarios

As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.

As an application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every namespace.
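To make these scenarios concrete, the sketch below shows how a workload might consume such a share through the CSI driver using an inline (ephemeral) CSI volume. It is a hedged example: the driver name and volume attribute follow the tech preview documentation, the share and image names are illustrative, and the RBAC that lets the pod's service account use the share is omitted.

    # Hypothetical pod mounting a SharedSecret via the Shared Resource CSI driver.
    apiVersion: v1
    kind: Pod
    metadata:
      name: entitled-workload              # assumed name
      namespace: my-project
    spec:
      containers:
        - name: app
          image: registry.access.redhat.com/ubi8/ubi:latest
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: entitlements
              mountPath: /etc/pki/entitlement
              readOnly: true
      volumes:
        - name: entitlements
          csi:
            driver: csi.sharedresource.openshift.io
            readOnly: true                 # the driver requires read-only volumes
            volumeAttributes:
              sharedSecret: etc-pki-entitlement   # name of the SharedSecret to mount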

Acceptance Criteria

  • OCP conformance suite must ensure that the projected resource CSI driver is installed on every OpenShift deployment.
  • OCP build suite tests that projected resource CSI driver volumes can be added to builds, but only if builds support inline CSI volumes.
  • Release Technical Enablement - Docs and demos on how to create a Projected Resource share and add it as a volume to workloads. A special use case for adding RHEL entitlements to builds should be included.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature

Acceptance Criteria

  • Key metrics for the shared resource CSI driver are exported to Telemeter via the cluster monitoring operator.

Docs Impact

None - metrics exported to telemetry are not formally documented.

QE Impact

QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.

PX Impact

Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.

Notes

To implement, a Prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.
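A minimal sketch of the kind of recording rule described in the note above, written as a PrometheusRule for illustration; in practice the rule would be added to the cluster monitoring operator's generated rule set, and the source metric name used here is hypothetical.

    # Hypothetical recording rule; the source metric name is an assumption for illustration.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: shared-resource-csi-driver-telemetry   # assumed name
      namespace: openshift-monitoring
    spec:
      groups:
        - name: shared-resource-csi-driver.rules
          rules:
            - record: cluster:shared_resource_csi_mounts:sum
              expr: sum(openshift_csi_share_mount_requests_total)   # hypothetical metric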

User Story

As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster

Acceptance Criteria

  • Cluster storage operator uses image references to resolve the csi-driver-shared-resource-operator and all images needed to deploy the csi driver.
  • Shared resources CSI driver is installed when the cluster enables the CSIDriverSharedResources feature gate, OR
  • Shared resource CSI driver is installed when the cluster enables the TechPreviewNoUpgrade feature set
  • CI ensures that if the TechPreviewNoUpgrade feature set is enabled on the cluster, the shared resource CSI driver is deployed and functions correctly.

Docs Impact

Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)
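For reference, enabling the tech preview feature set is done through the cluster-scoped FeatureGate resource, roughly as below; note that TechPreviewNoUpgrade cannot be reverted and blocks cluster upgrades.

    # Enables the TechPreviewNoUpgrade feature set, which includes the shared resource CSI driver.
    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      name: cluster
    spec:
      featureSet: TechPreviewNoUpgrade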

Notes

Tasks:

  • Add the Share APIs (SharedSecret, SharedConfigMap) to openshift/api
  • Generate clients in openshift/client-go for Share APIs
  • Update the CSI driver name used in the enum for the ClusterCSIDriver custom resource.
  • Generate custom resource definitions and include it in the deployment YAMLs for the shared resource operator
  • Add YAML deployment manifests for the shared resource operator to the cluster storage operator (include necessary RBAC)
  • Ensure cluster storage operator has permission to create custom resource definitions
  • Enhance the cluster storage operator to install the shared resource CSI driver only when the cluster enables the CSIDriverSharedResources feature gate

Note that to be able to test all of this on any cloud provider, we need STOR-616 to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.

The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.

See https://issues.redhat.com/browse/BUILD-159?focusedCommentId=16360509&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16360509

Feature Overview

This Feature is a general "catch-all" for the time being. There are a number of existing priorities from Q1 that should be aligned with the existing priorities below; if not, assign them to this feature as needed.

Goals

In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?

 

 

(Optional) Use Cases

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

<What does success look like?>

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact?>

 <If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Questions

Question Outcome
   

 

Problem:

The console provides support UI for operators, which is dynamically enabled when the operator is installed by using feature flags keyed on the presence of CRDs. Operators have their own release cadence, separate from OpenShift, which makes alignment of UI to API difficult. As new features are released for the operator, the UI becomes out of sync with the APIs, and customers must wait until the following OpenShift release to get any new UI.

Goal:

  • Create an extensibility mechanism which allows Red Hat operators to build and package their own UI that extends the console.
  • Make console extensible in areas required to support the needs of contributing plugins.

Why is it important?

  • Allows an operator to maintain their own UI and release at their own cadence.
  • Alleviates the pressure on console to deliver UI features for multiple operators within a release.

Use cases:

  1. Serverless / Pipelines / Helm to contribute resource details pages, import flows, topology visuals etc...

Acceptance criteria:

  1. Red Hat operators can build their own UI, which is deployed alongside the operator, and extend the dev console.
    1. The objective is to get to a point where this is possible; however, code will not be moved to a separate repository, nor deployed by an operator.
  2. New extensions for the console that allow operators to extend the various areas of the console needed in order to provide the proper user experience.
  3. Enable operators to override the static built-in support and supply their own UI (a hedged ConsolePlugin manifest sketch follows after this list).
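As referenced in the acceptance criteria above, dynamic plugins register themselves with the console through a ConsolePlugin resource. The sketch below is hedged: the API version reflects the tech preview resource, and the plugin, service, and namespace names are illustrative.

    # Hypothetical ConsolePlugin registering an operator-delivered UI with the console.
    apiVersion: console.openshift.io/v1alpha1
    kind: ConsolePlugin
    metadata:
      name: my-operator-plugin            # assumed plugin name
    spec:
      displayName: "My Operator Plugin"
      service:
        name: my-operator-plugin          # assumed Service serving the plugin assets
        namespace: my-operator-namespace  # assumed namespace
        port: 9443
        basePath: /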

Dependencies (External/Internal):

Design Artifacts:

Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit

Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit

Enhancement proposal:
https://github.com/openshift/enhancements/pull/441

Exploration:

Note:

  • plugin framework covered by another epic
  • out of scope:
    • moving plugins to separate git repository

Description

As a developer, I want to be able to contribute a dynamic plugin extension and override the same extension contributed by static plugin.

Acceptance Criteria

  1. Should replace static plugin contribution of same name by dynamic plugin contribution

Additional Details:

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debuggability.
  3. Specific maintenance items (or investigations of maintenance items) should be placed into planning as peers to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs. quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use PatternFly 4 components.
    • Cluster admins control which plugins are enabled (a hedged example of the console operator configuration follows after this list).
    • Misbehaving plugins should not break the console.
    • Existing static plugins are not affected and will continue to work as expected.
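As a hedged sketch of the "cluster admins control which plugins are enabled" goal above: enabled plugins are listed by name on the console operator configuration, roughly as follows (the plugin name is illustrative).

    # Console operator configuration listing the dynamic plugins a cluster admin has enabled.
    apiVersion: operator.openshift.io/v1
    kind: Console
    metadata:
      name: cluster
    spec:
      plugins:
        - my-operator-plugin              # assumed plugin name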

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as PatternFly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement | Notes | isMvp?
UI to enable and disable plugins | | YES
Dynamic Plugin Framework in place | | YES
Testing Infra up and running | | YES
Docs and readme for creating and testing plugins | | YES
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Currently, webpack tree-shakes PatternFly and only includes the components used by the console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugins, which means we have to disable tree shaking for PatternFly. We should expose this as a separate bundle. This will allow browsers to cache more efficiently and only need to load the PatternFly bundle again when we upgrade PatternFly.

Open Questions

What parts of PatternFly do we consider core?

Acceptance Criteria

  • All PatternFly core components are exposed to dynamic plugins
  • PatternFly is exposed as a separate bundle that is not part of the main vendor bundle


Feature Overview

  • This Section: High-level description of the feature, i.e., Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

As a user, I want the ability to run a pod in debug mode.

This should be the equivalent of running:  oc debug pod

Acceptance Criteria for MVP

  • Build off of the crash-loop backoff popover from https://github.com/openshift/console/pull/7302 to include a description of what CrashLoopBackOff is, a link to view logs, a link to view events, and a link to debug (container-name) in terminal. If more than one container is crash-looping, list them individually.
  • Create a debug container page that includes breadcrumbs as well as the terminal to debug. Add an informational alert at the top to make it clear that this is a temporary Pod and that closing this page will delete the temporary pod.
  • Add debug in terminal as an action to the logs toolbar. Only enable the action when the CrashLoopBackOff status occurs for the selected container. Add a tooltip to explain when the action is disabled.

Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview

  • Connect OpenShift workloads to Google services with Google Workload Identity

Goals

  • Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
  • Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

  • Add support to CCO for installation and upgrade, using both UPI and IPI methods, with GCP workload identity.
  • Support installs and upgrades for connected and disconnected/restricted environments.
  • Support the use of Operators with GCP workload identity with minimal friction.
  • Support for HyperShift and non-HyperShift clusters.
  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Epic Goal

  • Complete the implementation for GCP  workload identity, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to ensure the following things in the OpenShift operators:

1) Make sure the operator uses v0.0.0-20210218202405-ba52d332ba99 or a later version of the golang.org/x/oauth2 module.

2) Mount the OIDC token in the operator pod; this needs to go in the deployment. We have done it for cluster-image-registry-operator here (a sketch follows after this list).

3) For workload identity to work, the GCP credentials that the operator pod uses should be of the external_account type (not service_account). The external_account credential type contains the path to the OIDC token, the URL of the service account to impersonate, and other details. This type of credential can be generated from the GCP console or programmatically (supported by ccoctl). The operator pod can then consume it from a kube secret. Make the appropriate code changes to the operators so that they can consume these new credentials.
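For item 2 above, a hedged sketch of the deployment change: a projected service account token volume, with a GCP-appropriate audience, mounted where the external_account credentials expect to find the OIDC token. The audience value and mount path are assumptions for illustration.

    # Hypothetical excerpt from an operator Deployment mounting a bound OIDC token.
    spec:
      template:
        spec:
          containers:
            - name: operator
              volumeMounts:
                - name: bound-sa-token
                  mountPath: /var/run/secrets/openshift/serviceaccount
                  readOnly: true
          volumes:
            - name: bound-sa-token
              projected:
                sources:
                  - serviceAccountToken:
                      audience: openshift        # assumed audience
                      expirationSeconds: 3600
                      path: token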

 

The following repos need one or more of the above changes.

Feature Overview (aka. Goal Summary)  

Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format, the Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change.

Goals (aka. expected user outcomes)

Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Prepare the Cluster Cloud Controller Manager Operator (CCCMO) component, introduced in 4.9, for GA

Why is this important?

  • We must ensure that the component is stable before we can declare the product GA

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Initial work was started here: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files

We need to isolate provider-specific code in respective packages and introduce an interface to leverage it (regular and bootstrap manifest rendering should be there at the moment).

DoD:

  • Introduce templating logic to replace existing substitution mixture
  • Isolate templating logic so that this is transparent to the core of the CCCMO
  • Improve testing of the substitution

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.

The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with PatternFly components, which use their own variables: $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent is to bring the console closer to a pure PatternFly structure and behavior, requiring fewer overrides and customizations.

Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:

  • cypress run is headless by default
  • cy.intercept URL matching is more strict
  • Uncaught exception and unhandled promise rejection checks are more strict

https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0

As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.

Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.

The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.

Acceptance Criteria

  • Uses up-to-date dependencies - not tied to the specific versions the OCP console uses
  • Includes its own dependencies - does not require adopters to include those dependencies
  • The dynamic demo plugin should be updated to use newer dependencies and use the plugin without a bunch of tweaks to tsconfig paths.

Currently

  • requires old dependencies 
    • ts-node 5.0.1 → 10.2.1

 

Epic Goal

  • Improve CI testing of the image registry components.

Why is this important?

  • The image registry, image API and the image pruner had a lot of tests removed during the transition to 4.0. This may make the platform less stable and/or slow down the team.

Scenarios

  1. ...

Acceptance Criteria

  • CI - tests should be more stable and have broader coverage

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.

In the image-registry, we have the packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with the better-supported library-go.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

  • image-registry uses k8s.io/api v1.23.z
  • image-registry uses latest openshift/api, openshift/library-go, openshift/client-go

User Story

As a developer using Jenkins to build my application
I want to use the base Jenkins agent image as a sidecar in my PodTemplate
So that I can use any s2i builder image in my Jenkins pipelines

Acceptance Criteria

  • Provide new Kubernetes Plugin Pod Templates which use the sidecar pattern for NodeJS and Maven.
  • Add documentation on how to use the new pod template in a Jenkinsfile (need to specify the container where the build occurs).
  • Add documentation on how developers can provide an inline pod template within a Jenkinsfile. Documentation should have the following formats:
    • New YAML declarative format
    • Deprecated Groovy format
  • Existing pipelines that use the default Kubernetes Plugin Pod Templates do not break.
  • End to end testing (for client or sync plugin) verifies that the new pod templates work.

QE Impact

QE will need to verify that the new pod templates can successfully execute a JenkinsPipeline build.

Docs Impact

Documentation needs to be updated to explain how to use the new template.

PX Impact

Unclear if we need new CEE/PX materials beyond doc updates.

Notes

We currently have built-in pod templates for NodeJS and Maven, which use specialized agent images with the NodeJS/Maven tooling baked in.
Blog post here outlines the process: https://developers.redhat.com/blog/2020/06/04/an-easier-way-to-create-custom-jenkins-containers/

The Groovy style of declaring in-line pod templates is deprecated in favor of a YAML-style format.

Existing documentation for the Jenkins pod templates: https://docs.openshift.com/container-platform/4.9/openshift_images/using_images/images-other-jenkins.html#images-other-jenkins-config-kubernetes_images-other-jenkins
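A hedged sketch of the YAML-style inline pod template with the sidecar pattern described above; in a Jenkinsfile this YAML would be supplied to the kubernetes agent block, and the build steps would run in the named builder container. The image references and label are illustrative.

    # Hypothetical inline pod template: base agent as the jnlp container plus an s2i builder sidecar.
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        app: jenkins-agent-nodejs          # assumed label
    spec:
      containers:
        - name: jnlp
          image: image-registry.openshift-image-registry.svc:5000/openshift/jenkins-agent-base:latest
        - name: nodejs
          image: image-registry.openshift-image-registry.svc:5000/openshift/nodejs:latest   # assumed builder image
          command: ["cat"]
          tty: true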

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • As a CFE team, we would like to enable query logging for all Prometheus read paths
  • As part of this, we would like to enable audit & query logging for Prometheus Adapter (aggregated server audit log), Prometheus (query log) and ThanosQuerier (query log)

Why is this important?

  • This would help all parties (customers, app-sres, CCX, the monitoring team, ...) to debug an overloaded Prometheus instance.

Scenarios

  1. When a customer faces high CPU consumption in any of the Prometheus instances, they can enable audit logging in Prometheus Adapter to see which component is calling the metrics API
  2. When a customer faces high CPU consumption in any of the Prometheus instances, they can enable query logging in all Prometheus instances (PM & UWM) and ThanosQuerier to see which query is frequently executed
  3. https://bugzilla.redhat.com/show_bug.cgi?id=1982302

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Prometheus Adapter audit logs must be enabled by default
  • Prometheus Adapter audit logs must be preserved after each CI run

Open questions::

  1. Should we enable ThanosRuler query logs?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insight into the requests made to prometheus-adapter. To have such information for an aggregated API, the best option would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.

Since this would greatly help in investigating prometheus-adapter Bugzillas in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.
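A hedged sketch of what the eventual configuration could look like in the cluster monitoring ConfigMap; the field names below follow the shape later documented for OpenShift monitoring, but should be treated as assumptions in the context of this card, since the upstream support had not yet merged when it was written.

    # Hypothetical cluster-monitoring-config enabling prometheus-adapter audit logs and Prometheus query logging.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        k8sPrometheusAdapter:
          audit:
            profile: Request               # assumed audit profile name
        prometheusK8s:
          queryLogFile: /tmp/promql.log    # assumed query log setting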

Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced in prometheus-adapter. So we will have to wait a bit before starting looking into this ticket.

 

DoD:

  • Allow OpenShift users to configure audit logs for prometheus-adapter
  • Integrate with must-gather
  • Document how to configure audit logs in the official OpenShift documentation
  • Upstream jsonnet patch that enables this feature through a configuration

The console needs to know the network type capabilities to show/hide some Network Policy form fields.

As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features against this document.

However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.

This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.

Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875
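A hedged sketch of the read side of that ConfigMap using plain RBAC; the namespace is an assumption, and the "writeable only by the SDN" half is omitted.

    # Hypothetical RBAC making the sdn-public ConfigMap readable by any authenticated user.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: sdn-public-reader
      namespace: openshift-network-operator   # assumed namespace for the ConfigMap
    rules:
      - apiGroups: [""]
        resources: ["configmaps"]
        resourceNames: ["sdn-public"]
        verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: sdn-public-reader
      namespace: openshift-network-operator
    subjects:
      - kind: Group
        name: system:authenticated
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: sdn-public-reader
      apiGroup: rbac.authorization.k8s.io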


We want to configure 'default' and 'allowed' values in the validation webhook for the Guest Accelerators field in GCPProviderSpec. Also revendor it to include the newly added Guest Accelerators field.

This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172 is merged.

DoD:

  • Make sure that validations return errors on issues with GPU configuration
  • Ensure the unit tests for the webhooks are updated

Description:

OpenShift on RHV is composed of the following subprojects, which the team maintains:

Each of those projects currently uses the generated oVirt API project go-ovirt.

This leads to a number of issues:

  1. Duplicated code between the subprojects: Since go-ovirt is a thin layer around the API, a lot of the code which interacts with oVirt is duplicated between the projects, which leads to all the classic duplication problems such as maintenance burden, lack of clear conventions, and so on.
  2. Bad error handling and unclear errors:
    1. Since go-ovirt is a thin layer, there is a lot of error handling and checking which needs to be done; a lot of the time it looks like a certain error should be ignored, so it is never checked, which could lead to unexpected situations.
    2. Since the errors returned from the oVirt Engine are sometimes unclear, when we return those errors to the users or log them it is hard to understand what the actual issue is.
  3. Lack of retries: Sometimes an operation can take some time due to a condition that needs to be met, or an operation can fail due to infrastructure issues. The go-ovirt library doesn't contain any retry logic, which means each client needs to implement its own retry logic; this is not done at the moment and will cause more duplicated code.
  4. Poor logging: The current go-ovirt library doesn't log anything, and all the logs come from the subprojects. This leads to:
    1. Inconsistent logging between the projects.
    2. Lack of logs.
  5. Almost no test coverage:
    1. It's very hard to mock and write tests with go-ovirt since there are so many calls, but it will be much easier to mock and write tests with go-ovirt-client.
    2. go-ovirt only has rudimentary tests.

Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!

The go-ovirt-client is a wrapper around the go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.

go-ovirt-client-log is a library to unify the logging logic between the projects, it is used by go-ovirt-client and should be used by all the sub-projects.

go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.

k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the oVirt credentials if they are changed.

We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.

Benefits for the eng:

  • Possible to write unit tests.
  • Easier to maintain since less code duplication - reduce the amount of code.
  • Test coverage exists on the ovirt-client as well.
  • No(Less) bugs regarding operations that needed a retry or polling logic.
  • Solves a number of existing bugs

Benefits for the customers:

  • Clearer error messages and logs.
  • Fewer bugs.

Acceptance criteria:

  1. No sub-project uses go-ovirt directly - at least 90% of the calls to go-ovirt should be migrated to go-ovirt-client.
  2. All sub-projects should use the corresponding go-ovirt-client-log for logging.
  3. The csi-driver and cluster provider use k8sOVirtCredentialsMonitor.
  4. CI tests are green for all components.

How to test:

  1. QE regression - make sure all flows are still working.
  2. Green CI on all jobs.
  3. Keep an eye out for log messages that might confuse customers.

Description:

  1. Identify all the communication between ovirt-csi-driver and the go-ovirt.
  2. Port all the logic to go-ovirt-client.
  3. Port all calls on ovirt-csi-driver to go-ovirt-client.

Acceptance:

ovirt-csi-driver uses go-ovirt-client for 95 percent of all oVirt-related logic.

T-shirt size: M

Goal:

Provide an easy and successful experience for front end developers to build and deploy their applications

Why is it important?

Currently, the front-end dev experience is not positive. It's much easier for them to use other platforms. Improving the front-end dev experience will enable us to gain more market share.

Use cases:

  1. Need to be able to override the npm command when using Node Builder Image
  2. Need to expose target port
  3. Need access to the URL to access my application

Although we provide the ability for 2 & 3 today, the current journey does not match the mental model of the front-end developer.

Acceptance criteria:

  1. When importing an app, I should be able to easily provide the npm build and run commands
  2. When opting in to create a route, the target port should be exposed without having to open any Advanced Options
  3. After importing my app, if a route is exposed, I should be able to access/copy that URL

Dependencies (External/Internal):

Design Artifacts:

Desired UX experience

  • enable the user to provide the "Build Command" when the Node Builder image is being used
  • enable the user to provide the "Run Command" when the Node Builder image is being used
  • expose the Target Port under "Create a route to the Application" rather than inside "Show advanced Routing options"
  • NEED TO FINALIZE HOW TO PROVIDE THE ROUTE TO EASILY COPY – Inline Notification maybe? As well as side panel?

Note:

Description

As a user, I want to have the option to add additional labels to a Route, as I could do in OCP 3. See RFE-622.

The additional labels should only be added to the route, not the service or other components. The advanced option "Labels" should not be touched; those labels are added to all components.

As a small addition, we should also always show the "Target port", since it also defines the Service port; to make this clearer, the "Target port" should be shown before the "Create a route to the Application" checkbox.
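To illustrate, the additional labels would end up only on the generated Route, roughly as below; the label keys, values, and resource names are illustrative.

    # Hypothetical Route carrying an additional route-only label; the Service keeps only the shared labels.
    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: my-app
      labels:
        app: my-app                  # shared label applied to all components
        environment: staging         # additional route-only label from the new option
    spec:
      to:
        kind: Service
        name: my-app
      port:
        targetPort: 8080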

Acceptance Criteria

The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:

  1. Move the option "Target port" before the checkbox "Create a route to the Application" and do not hide the "Target port" when the checkbox is disabled
  2. Add a new "Additional route labels" option, with a label input field to the "Advanced Routing options"
  3. Save (Import) and update (Edit) the labels to the Route resource. When editing a Deployment with a Route, the route labels should not show the shared labels.

Additional Details:

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with the dev team for epic automation
5. Create the automation scripts using Cypress
6. Implement CI for nightly builds
7. Execute scripts on a sprint basis

Why is it important?

To track the QE progress in one place, on the 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

There are different code spots which map the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".

We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files match the latest frontend code.

Code areas I found are marked with

      // TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Port all remaining Protractor tests to Cypress

Why is this important?

  • Protractor is very hard to debug when tests fail/flake
  • Once all protractor tests are ported we can remove all Protractor dependencies, scripts, and configuration files.
  • Cypress has better debugging, plug-ins, and reporting tools

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate: `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straightforward.

47) OAuth

   48) BasicAuth IDP
      ✔ creates a Basic Authentication IDP
      ✔ shows the BasicAuth IDP on the OAuth settings page

   49) GitHub IDP
      ✔ creates a GitHub IDP
      ✔ shows the GitHub IDP on the OAuth settings page

   50) GitLab IDP
      ✔ creates a GitLab IDP
      ✔ shows the GitLab IDP on the OAuth settings page

   51) Google IDP
      ✔ creates a Google IDP
      ✔ shows the Google IDP on the OAuth settings page

   52) Keystone IDP
      ✔ creates a Keystone IDP
      ✔ shows the Keystone IDP on the OAuth settings page

   53) LDAP IDP
      ✔ creates a LDAP IDP
      ✔ shows the LDAP IDP on the OAuth settings page

   54) OpenID IDP
      ✔ creates a OpenID IDP
      ✔ shows the OpenID IDP on the OAuth settings page

Acceptance Criteria

  • Protractor test ported to Cypress
  • Remove any unused legacy data-test-ids
  • Protractor test deleted, and no longer referenced in `frontend/integration-tests/protractor.conf.ts`

 

Background

As a follow up to OCPCLOUD-693, we need to, once all of the API definitions are present in openshift/api, migrate the existing code bases to use the new API locations.

 

This will include:

  • Machine API Operator
  • Cluster Machine Approver
  • Cluster API Provider AWS|Azure|GCP|IBM|Alibaba|OpenStack|Kubevirt
  • Cluster API actuator pkg
  • Installer
  • WMCO
  • MCO
  • Hive
  • Grep OpenShift for other references to our old APIs

Steps

  • Replace the Machine API imports with the new openshift/api MAPI locations (see the sketch below)
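A minimal sketch of the kind of import swap this involves, assuming the old types were imported from the machine-api-operator repository and the new ones live under openshift/api's machine/v1beta1 package (individual repositories may differ):

~~~
// Illustrative sketch only: exact paths vary by repository. The general
// pattern is to replace the machine-api-operator API import with the
// equivalent package published in openshift/api.
package main

import (
	"fmt"

	// Old location (to be removed):
	//   machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	// New location in openshift/api:
	machinev1 "github.com/openshift/api/machine/v1beta1"
)

func main() {
	m := machinev1.Machine{}
	m.Name = "example-machine" // hypothetical object, for illustration only
	fmt.Println(m.Name)
}
~~~

Because the type definitions are copied rather than changed, regular regression testing plus a compile check should surface any missed references.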

Stakeholders

  • Cluster Infra
  • Owners of the repos listed above

Definition of Done

  • The openshift/api definitions are used across components in the MAPI ecosystem
  • Docs
  • Generated docs for API types should now come from openshift/api
  • Testing
  • Regular regression testing should be sufficient; this is a copy-paste for the most part, and we expect the code won't compile if we break this

Problem:

Complete all the 4.9 epic feature automation user stories and merge them to the master branch.

Goal:

4.9 epics automation completion

Why is it important?

Tech debt should be completed

Use cases:

  1. <case>

Acceptance criteria:

Create the PRs for the 4.9 epic user story automation
Review them
Merge them to the 4.10 master branch and the 4.9 master branch

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.

Use Cases

  • Developer can see the list of Git repositories that are added to the namespace for pipeline-as-code execution
  • Developer can navigate from the Console to the Git repository on the Git provider
  • For each Git repository, developer can see the details of the last pipeline execution and the commit id that triggered it, with the possibility of navigating to the Git commit in the Git provider
  • Developer can see the list of PipelineRun executions related to a Git repository in chronological order and the commit id that triggered each

Acceptance Criteria

  1. As a user, looking at the Pipelines page in the Developer Console, I should be able to see a list of (a) Git repositories that are added to the namespace for PAC execution AND (b) all pipelines in the namespace
  2. As a user, I should be able to navigate to a details page of the git repo.
    1. This details page should provide access to (a) details of the git repo and (b) a list of pipeline runs.
    2. This PLR tab should show additional information beyond the typical PLR List view, including the SHA (commit id), commit message, branch & trigger type
  3. As a user, when looking at a Pipeline Run Details page, if it is associated with a git repo (PAC):
    1. Indicate that it's from a specific git repo rather than a PL resource
    2. Include the SHA (commit id), commit message, branch & trigger type
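A minimal sketch (not the console implementation) of reading the Repository custom resources that back this view, using a dynamic client; the group/version/resource and the "my-namespace" value are assumptions about the Pipelines-as-Code CRD:

~~~
// Hedged sketch: list Pipelines-as-Code Repository custom resources in a
// namespace with a dynamic client. The GVR below is an assumption.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Assumed GVR for the Pipelines-as-Code Repository CRD.
	gvr := schema.GroupVersionResource{
		Group:    "pipelinesascode.tekton.dev",
		Version:  "v1alpha1",
		Resource: "repositories",
	}
	repos, err := client.Resource(gvr).Namespace("my-namespace").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, repo := range repos.Items {
		// Each Repository points at a Git URL; the PLR history shown in the
		// details page is keyed off it.
		fmt.Println(repo.GetName())
	}
}
~~~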

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

This is a clone of issue OCPBUGS-5766. The following is the description of the original issue:

Description of problem:

Data race seen in unit tests:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1448/pull-ci-openshift-ovn-kubernetes-release-4.11-unit/1604898712423763968/artifacts/test/build-log.txt
 

Goal

We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).

It's possible for plugins to make requests to services exposed through routes today, but that has several problems:

  1. It requires that the service be exposed outside the cluster, which is not always desired.
  2. It requires the service support CORS headers for the console.
  3. There is no way to specify a CA file for the route if it's not trusted by the browser.
  4. Plugins will not have access to the user's access token on the client, which means that there is no simple way to handle auth.

Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.
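A minimal sketch of the proxying pattern described above, assuming a hypothetical Helm backend service, CA bundle path, token header, and route prefix; it is not the console's actual implementation:

~~~
// Hedged sketch: a backend reverse proxy to an in-cluster service, using a
// service CA bundle and optionally forwarding the user's bearer token. The
// service URL, CA path, header name, and route prefix are all hypothetical.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("/var/run/service-ca/ca.crt") // hypothetical CA bundle path
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("no CA certificates parsed")
	}

	// Hypothetical in-cluster service declared by a plugin.
	target, err := url.Parse("https://helm-backend.helm-namespace.svc:8443")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.Transport = &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}}

	http.HandleFunc("/api/proxy/helm/", func(w http.ResponseWriter, r *http.Request) {
		// Forward the logged-in user's token only if the plugin opted in.
		if token := r.Header.Get("X-Forwarded-Access-Token"); token != "" {
			r.Header.Set("Authorization", "Bearer "+token)
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
~~~

Because the proxy runs in the console backend, the service never has to be exposed outside the cluster and no CORS handling is needed on the service side.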

 

This work will apply only to single clusters.

 

Open Questions

  • What happens when a multitenant isolated network policy is configured on the cluster?

https://docs.openshift.com/container-platform/4.7/networking/network_policy/multitenant-network-policy.html

  • How do we (and can we?) support this for multi-cluster where console is running on a different hub cluster?
  • Do we need to auth for all requests?

Acceptance Criteria

  • Plugins can declare a service to proxy to in the ConsolePlugin resource
  • Plugins can specify a CA cert for the service
  • Console falls back to the service signing CA if none is specified
  • Plugins have a way of specifying whether the user's authentication token is included in requests through the service proxy
  • Dynamic plugin enhancement is updated with the implementation details
  • Support for server-sent events (SSE) for ACM
  • Add support, or a flag, if auth is needed for each request.

cc Ali Mobrem [~christianmvogt]

This is a clone of issue OCPBUGS-1354. The following is the description of the original issue:

This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

Description of problem:

The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the CU also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11)

How reproducible:

Attaching a NIC to a master node will trigger the issue

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
  { "address": "10.0.178.163", "type": "InternalIP" },
  { "address": "10.0.187.247", "type": "InternalIP" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }
]

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
etcd   4.8.11    True        False         True       31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:47:42Z",
  "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]",
  "reason": "EtcdCertSignerController_Error",
  "status": "True",
  "type": "Degraded"
}
~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"
{
  "lastTransitionTime": "2022-01-26T15:52:21Z",
  "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found",
  "reason": "AsExpected",
  "status": "False",
  "type": "Degraded"
}

~~~
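For reference, a hedged client-go equivalent of the workaround above; it assumes cluster-admin credentials in the default kubeconfig and deletes the etcd TLS secrets in openshift-etcd so the operator regenerates them with the new node IP included:

~~~
// Hedged sketch of the oc/awk/xargs workaround above, using client-go.
package main

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	secrets, err := client.CoreV1().Secrets("openshift-etcd").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range secrets.Items {
		// Same filter as the shell workaround: TLS secrets whose name starts with "etcd".
		if s.Type == corev1.SecretTypeTLS && strings.HasPrefix(s.Name, "etcd") {
			if err := client.CoreV1().Secrets("openshift-etcd").Delete(context.TODO(), s.Name, metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
			fmt.Println("deleted secret", s.Name)
		}
	}
}
~~~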

This is a clone of issue OCPBUGS-11035. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10314. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:

Description of problem:

Customer running a cluster with following config:
4.10.23
AWS/IPI
OVNKubernetes

Observed that in a namespace with networkpolicy rules enabled, including an allow-from-same-namespace policy, pods will have different behaviors when calling service IPs hosted in that same namespace.

Example:
Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>
Deployment2 with 1 pod hosting a service and route exists in same namespace
Pod A will unexpectedly stop being able to call service IP of deployment2; Pod B will never lose access to calling service IP of deployment2.

Pod A remains able to call out through br-ex interface, tag the ROUTE address, and reach deployment2 pod via haproxy (this never breaks)

Pod A remains able to reach the local gateway on the node

Host node for Pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.

Issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace.

I suspect that the issue has to do with the networkpolicy rules failing to stay updated on the pod object, and the pod needs to be 'refreshed' (a label addition or other update) to force the pod to 'remember' that it is allowed to call peers within the namespace.

Additional relevant data:
- pods affected throughout the cluster; no specific project/service/deployment/application
- pods ride on different nodes all the time (no one node affected)
- pods with fail condition are on same node with other pods without issue
- multiple namespaces see this problem
- all namespaces are using similar networkpolicy isolation and allow-from-same-namespace ruleset (which matches our documentation on syntax).



Version-Release number of selected component (if applicable):

4.10.23

How reproducible:

every time --> unclear what the trigger is that causes this; pods will be functional and several hours/days later, will stop being able to talk to peer services.

Steps to Reproduce:

1. deploy a pod with at least two replicas in a namespace with an allow-from-same-namespace network policy
2. deploy a different service and route (an example httpd instance) in the same namespace
3. observe that one of the two pods may fail to reach service IP after some time
4. apply annotation to pod and it is immediately able to reach services again.

Actual results:

pods intermittently fail to reach internal service addresses, but are able to be interacted with otherwise, and can reach upstream/external addresses including routes on cluster. 

Expected results:

pods should not lose access to service network peers. 

Additional info:

see next comments for relevant uploads/sosreports and inspects.
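For context, a minimal sketch of the allow-from-same-namespace policy shape referenced above, built with the Kubernetes Go types; the namespace and policy name are placeholders:

~~~
// Hedged sketch: an "allow from same namespace" NetworkPolicy. An empty pod
// selector in both the policy and the ingress "from" clause selects all pods
// in the policy's own namespace.
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	policy := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "allow-from-same-namespace", Namespace: "example"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in this namespace
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &metav1.LabelSelector{}, // any pod in the same namespace
				}},
			}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
		},
	}
	out, err := yaml.Marshal(policy)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
~~~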

This bug is a backport clone of [Bugzilla Bug 2081119](https://bugzilla.redhat.com/show_bug.cgi?id=2081119). The following is the description of the original bug:

Description of problem:

The `oc explain` output says the default overlaySize is 10G. The outdated description confused the customer about the default overlaySize. The description should be updated since the default was removed from the storage.conf implementation.

Version-Release number of selected component (if applicable):

How reproducible:

Actual results:

$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.logLevel
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: logLevel <string>

DESCRIPTION:
logLevel specifies the verbosity of the logs based on the level it is set
to. Options are fatal, panic, error, warn, info, and debug.
[qiwan@qiwan ~]$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.overlaySize
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: overlaySize <string>

DESCRIPTION:
overlaySize specifies the maximum size of a container image. This flag can
be used to set quota on the size of container images. (default: 10GB).

Expected results:

$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.logLevel
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: logLevel <string>

DESCRIPTION:
logLevel specifies the verbosity of the logs based on the level it is set
to. Options are fatal, panic, error, warn, info, and debug.
[qiwan@qiwan ~]$ oc explain containerRuntimeConfig.spec.containerRuntimeConfig.overlaySize
KIND: ContainerRuntimeConfig
VERSION: machineconfiguration.openshift.io/v1

FIELD: overlaySize <string>

DESCRIPTION:
overlaySize specifies the maximum size of a container image. This flag can
be used to set quota on the size of container images.
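Since the default was removed, clusters that need a specific size can set overlaySize explicitly. A minimal sketch of such a ContainerRuntimeConfig, built as a plain map (field names taken from the `oc explain` output above; the object name, pool-selector label, and "20G" value are assumptions):

~~~
// Hedged sketch: a ContainerRuntimeConfig that sets overlaySize explicitly.
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	crc := map[string]interface{}{
		"apiVersion": "machineconfiguration.openshift.io/v1",
		"kind":       "ContainerRuntimeConfig",
		"metadata":   map[string]interface{}{"name": "overlay-size"},
		"spec": map[string]interface{}{
			"machineConfigPoolSelector": map[string]interface{}{
				"matchLabels": map[string]interface{}{
					// hypothetical pool label; match your target MachineConfigPool
					"pools.operator.machineconfiguration.openshift.io/worker": "",
				},
			},
			"containerRuntimeConfig": map[string]interface{}{
				"overlaySize": "20G", // explicit value instead of relying on a default
			},
		},
	}
	out, err := yaml.Marshal(crc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
~~~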

Additional info:

+++ This bug was initially created as a clone of OCPBUGSM-46761 +++

Description of problem:
When using the admin console, under "Cluster Settings" and choosing Upstream Configuration, the "window" that appears has a dead link to documentation for how to create a local (disconnected) update server.

The link in the window is
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/updating_clusters/installing-update-service

Assume the right link should be something like:
https://docs.openshift.com/container-platform/4.10/updating/updating-restricted-network-cluster.html#update-restricted-network-cluster-update-service

Version-Release number of selected component (if applicable):

4.10.*

— Additional comment from plarsen@redhat.com on 2022-06-17 16:33:53 UTC —

Created attachment 1890947
Screenshot showing the "window" with the 404 link

— Additional comment from rhamilto@redhat.com on 2022-06-28 19:56:17 UTC —

Thank you, Peter, for wonderfully documenting the bug!

Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginx bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.
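Illustrative only (not the gophercloud fix): the failure mode is a client that refuses to parse a listing response whose Content-Type header was stripped by a proxy. A tolerant client could fall back to treating the body as a plain-text name list, roughly like this hypothetical helper:

~~~
// Hedged sketch: parse a Swift-style listing even when Content-Type is missing.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// parseContainerNames is a hypothetical helper: it accepts plain-text listings
// and, when Content-Type is missing, assumes plain text rather than failing
// with "Cannot extract names from response".
func parseContainerNames(resp *http.Response) ([]string, error) {
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	ct := resp.Header.Get("Content-Type")
	switch {
	case ct == "" || strings.Contains(ct, "text/plain"):
		return strings.Fields(string(body)), nil
	default:
		return nil, fmt.Errorf("cannot extract names from response with content-type: %q", ct)
	}
}

func main() {
	// Simulate a listing response whose Content-Type header was stripped.
	resp := &http.Response{
		StatusCode: 200,
		Header:     http.Header{},
		Body:       io.NopCloser(strings.NewReader("container-one\ncontainer-two\n")),
	}
	names, err := parseContainerNames(resp)
	fmt.Println(names, err)
}
~~~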

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

To avoid any potential bugs, the oVirt CSI driver should use the latest go-ovirt-client, preferably the tagged 1.0.0 version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4960. The following is the description of the original issue:

Description of problem:
This is a follow-up to OCPBUGSM-47202 (https://bugzilla.redhat.com/show_bug.cgi?id=2110570)

While OCPBUGSM-47202 fixes the issue specifically for Set Pod Count, many other actions aren't fixed. When the user updates a Deployment with one of these options and selects the action again, the old values are still shown.

Version-Release number of selected component (if applicable)
4.8-4.12 as well as master with the changes of OCPBUGSM-47202

How reproducible:
Always

Steps to Reproduce:

  1. Import a deployment
  2. Select the deployment to open the topology sidebar
  3. Click on actions and one of the 4 options to update the deployment with a modal
    1. Edit labels
    2. Edit annotations
    3. Edit update strategy
    4. Edit resource limits
  4. Click on the action again and check if the data in the modal reflects the changes from step 3

Actual results:
Old data (labels, annotations, etc.) was shown.

Expected results:
Latest data should be shown

Additional info:

Description of problem:

Currently, when installing OpenShift on OpenStack, the cluster name length is limited to 14 characters.
The customer wants to know whether it is possible to override this validation when installing OpenShift on OpenStack and create a cluster name that is longer than 14 characters.

Version : OCP 4.8.5 UPI Disconnected 
Environment : Openstack 16 

Issue:
The user reports that they are getting an error for an OCP cluster on OpenStack UPI where the name of the cluster is > 14 characters.

Error events :
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

 

Actual results:

Users are getting the error "cluster name is too long" when the cluster name contains more than 14 characters for OCP on OpenStack

Expected results:

The 14-character limit should be changed for the OCP cluster name on OpenStack
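For context, a minimal sketch of the kind of validation producing the error above (not the installer's actual code); the 14-character cap comes from the error message quoted in this report:

~~~
// Hedged sketch: validate an install-config cluster name against the
// 14-character limit referenced in the error message above.
package main

import "fmt"

const maxOpenStackClusterNameLen = 14

func validateClusterName(name string) error {
	if len(name) > maxOpenStackClusterNameLen {
		return fmt.Errorf("cluster name is too long, please restrict it to %d characters", maxOpenStackClusterNameLen)
	}
	return nil
}

func main() {
	fmt.Println(validateClusterName("sks-osp-inf-cpe-1-cbr1a")) // 23 characters -> error
	fmt.Println(validateClusterName("short-name"))              // ok
}
~~~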

Additional info:

 

This is a clone of issue OCPBUGS-2113. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:

Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached (attaching YAML manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

This is a clone of issue OCPBUGS-2250. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2099. The following is the description of the original issue:

Description of problem:

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 
  • CVE-2022-45047
  • CVE-2022-45379
  • CVE-2022-45381
  • CVE-2022-45380
  • CVE-2022-43403
  • CVE-2022-43409
  • CVE-2022-43408
  • CVE-2022-43407 
  • CVE-2022-43404
  • CVE-2022-43401
  • CVE-2022-43402
  • CVE-2022-43405
  • CVE-2022-43406
  • CVE-2022-42889
  • CVE-2022-25857
  • CVE-2022-36885
  • CVE-2022-36884
  • CVE-2022-36883
  • CVE-2022-34174
  • CVE-2022-30954 
  • CVE-2022-30953
  • CVE-2022-30946
  • CVE-2022-36882
  • CVE-2022-30948
  • CVE-2022-30945
  • CVE-2021-26291
  • CVE-2022-29047

This bug is a backport clone of [Bugzilla Bug 2117324](https://bugzilla.redhat.com/show_bug.cgi?id=2117324). The following is the description of the original bug:

+++ This bug was initially created as a clone of Bug #2101357 +++

Description of problem:

message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
+0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
+0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
+0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
+0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
+0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
+0x231\n"
reason: Error
startedAt: "2022-06-27T00:00:59Z"

Version-Release number of selected component (if applicable):
mac:~ jianzhang$ oc exec catalog-operator-66cb8fd8c5-j7vkx -- olm --version
OLM version: 0.19.0
git commit: 8c2bd46147a90d58e98de73d34fd79477769f11f
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-081133 True False 10h Cluster version is 4.11.0-0.nightly-2022-06-25-081133

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.11
2. Check OLM pods

Actual results:

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-66cb8fd8c5-j7vkx 1/1 Running 2 (8h ago) 10h
collect-profiles-27605340-wgsvf 0/1 Completed 0 42m
collect-profiles-27605355-ffgxd 0/1 Completed 0 27m
collect-profiles-27605370-w7ds7 0/1 Completed 0 12m
olm-operator-6cfd444b8f-r5q4t 1/1 Running 0 10h
package-server-manager-66589d4bf8-csr7j 1/1 Running 0 10h
packageserver-59977db6cf-nkn5w 1/1 Running 0 10h
packageserver-59977db6cf-nxbnx 1/1 Running 0 10h

mac:~ jianzhang$ oc get pods catalog-operator-66cb8fd8c5-j7vkx -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.26"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-26T23:12:45Z"
generateName: catalog-operator-66cb8fd8c5-
labels:
app: catalog-operator
pod-template-hash: 66cb8fd8c5
name: catalog-operator-66cb8fd8c5-j7vkx
namespace: openshift-operator-lifecycle-manager
ownerReferences:

  • apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-66cb8fd8c5
    uid: bcf173be-97bc-4152-8cec-d45f820e167c
    resourceVersion: "67395"
    uid: bb34c37b-b22d-4412-bbc0-1e57a1b2bd3a
    spec:
    containers:
  • args:
  • --namespace
  • openshift-marketplace
  • --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
  • --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:033bd57b54bb2e827b0ae95a03cb6f7ceeb65422e85de635185cfed88c17b2b4
  • --util-image
  • quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
  • --writeStatusName
  • operator-lifecycle-manager-catalog
  • --tls-cert
  • /srv-cert/tls.crt
  • --tls-key
  • /srv-cert/tls.key
  • --client-ca
  • /profile-collector-cert/tls.crt
    command:
  • /bin/catalog
    env:
  • name: RELEASE_VERSION
    value: 4.11.0-0.nightly-2022-06-25-081133
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    imagePullPolicy: IfNotPresent
    livenessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    name: catalog-operator
    ports:
  • containerPort: 8443
    name: metrics
    protocol: TCP
    readinessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    resources:
    requests:
    cpu: 10m
    memory: 80Mi
    securityContext:
    allowPrivilegeEscalation: false
    capabilities:
    drop:
  • ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
  • mountPath: /srv-cert
    name: srv-cert
    readOnly: true
  • mountPath: /profile-collector-cert
    name: profile-collector-cert
    readOnly: true
  • mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-7txv8
    readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-173-128.ap-southeast-1.compute.internal
    nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    preemptionPolicy: PreemptLowerPriority
    priority: 2000000000
    priorityClassName: system-cluster-critical
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
    level: s0:c20,c0
    seccompProfile:
    type: RuntimeDefault
    serviceAccount: olm-operator-serviceaccount
    serviceAccountName: olm-operator-serviceaccount
    terminationGracePeriodSeconds: 30
    tolerations:
  • effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  • effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  • effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  • effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
    volumes:
  • name: srv-cert
    secret:
    defaultMode: 420
    secretName: catalog-operator-serving-cert
  • name: profile-collector-cert
    secret:
    defaultMode: 420
    secretName: pprof-cert
  • name: kube-api-access-7txv8
    projected:
    defaultMode: 420
    sources:
  • serviceAccountToken:
    expirationSeconds: 3607
    path: token
  • configMap:
    items:
  • key: ca.crt
    path: ca.crt
    name: kube-root-ca.crt
  • downwardAPI:
    items:
  • fieldRef:
    apiVersion: v1
    fieldPath: metadata.namespace
    path: namespace
  • configMap:
    items:
  • key: service-ca.crt
    path: service-ca.crt
    name: openshift-service-ca.crt
    status:
    conditions:
  • lastProbeTime: null
    lastTransitionTime: "2022-06-26T23:16:59Z"
    status: "True"
    type: Initialized
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T01:19:09Z"
    status: "True"
    type: Ready
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T01:19:09Z"
    status: "True"
    type: ContainersReady
  • lastProbeTime: null
    lastTransitionTime: "2022-06-26T23:16:59Z"
    status: "True"
    type: PodScheduled
    containerStatuses:
  • containerID: cri-o://f3fce12480556b5ab279236c290acdf15e1bc850426078730dcfd333ecda6795
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d42dfe153a5bfa3a66e5e9d4664b4939aed376b4afbd9e098291c259daca2f8
    lastState:
    terminated:
    containerID: cri-o://5e3ce04f1cf538f679af008dcb62f9477131aae1772f527d6207fdc7bb247a55
    exitCode: 2
    finishedAt: "2022-06-27T01:19:08Z"
    message: "her.go:105 +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
    +0x130\n\ngoroutine 5545 [select, 7 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc00240a780,
    0xc003321a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
    +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0xc002efea80?,
    0xc0009cc7a0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
    +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
    +0x30a\n\ngoroutine 5678 [select, 3 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc000b70480,
    0xc0035d4500)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
    +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc002999e90?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
    +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
    +0x30a\n\ngoroutine 5836 [select, 1 minutes]:\ngolang.org/x/net/http2.(*clientStream).writeRequest(0xc003b00180,
    0xc003ff8a00)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1345
    +0x9c9\ngolang.org/x/net/http2.(*clientStream).doRequest(0x6e5326?, 0xc003a1c8d0?)\n\t/build/vendor/golang.org/x/net/http2/transport.go:1207
    +0x1e\ncreated by golang.org/x/net/http2.(*ClientConn).RoundTrip\n\t/build/vendor/golang.org/x/net/http2/transport.go:1136
    +0x30a\n\ngoroutine 5905 [chan receive, 1 minutes]:\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel.func1()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:51
    +0x85\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver.(*sourceInvalidator).GetValidChannel\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/source_registry.go:50
    +0x231\n"
    reason: Error
    startedAt: "2022-06-27T00:00:59Z"
    name: catalog-operator
    ready: true
    restartCount: 2
    started: true
    state:
    running:
    startedAt: "2022-06-27T01:19:08Z"
    hostIP: 10.0.173.128
    phase: Running
    podIP: 10.130.0.26
    podIPs:
  • ip: 10.130.0.26
    qosClass: Burstable
    startTime: "2022-06-26T23:16:59Z"

Expected results:
catalog-operator works well.

Additional info:
Operators can be subscribed successfully.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
jian learn learn qe-app-registry beta
openshift-logging cluster-logging cluster-logging qe-app-registry stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator qe-app-registry stable
mac:~ jianzhang$
mac:~ jianzhang$ oc get pods -n jian
NAME READY STATUS RESTARTS AGE
552b4660850a7fe1e1f142091eb5e4305f18af151727c56f70aa5dffc1dg8cg 0/1 Completed 0 54m
learn-operator-666b687bfb-7qppm 1/1 Running 0 54m
qe-app-registry-hbzxg 1/1 Running 0 58m
mac:~ jianzhang$ oc get csv -n jian
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.v5.5.0 OpenShift Elasticsearch Operator 5.5.0 Succeeded
learn-operator.v0.0.3 Learn Operator 0.0.3 learn-operator.v0.0.2 Succeeded

— Additional comment from jiazha@redhat.com on 2022-06-27 09:58:18 UTC —

Created attachment 1892927
olm must-gather

— Additional comment from jiazha@redhat.com on 2022-06-27 09:59:01 UTC —

Created attachment 1892928
marketplace project must-gather

— Additional comment from jiazha@redhat.com on 2022-06-28 02:05:39 UTC —

mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-132614 True False 145m Cluster version is 4.11.0-0.nightly-2022-06-25-132614

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-869fb4bd4d-lbhgj 1/1 Running 3 (9m25s ago) 170m
collect-profiles-27606330-4wg5r 0/1 Completed 0 33m
collect-profiles-27606345-lmk4q 0/1 Completed 0 18m
collect-profiles-27606360-mksv6 0/1 Completed 0 3m17s
olm-operator-5f485d9d5f-wczjc 1/1 Running 0 170m
package-server-manager-6cf996b4cc-79lrw 1/1 Running 2 (156m ago) 170m
packageserver-5f668f98d7-2vjdn 1/1 Running 0 165m
packageserver-5f668f98d7-mb2wc 1/1 Running 0 165m
mac:~ jianzhang$ oc get pods catalog-operator-869fb4bd4d-lbhgj -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.34"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:12Z"
generateName: catalog-operator-869fb4bd4d-
labels:
app: catalog-operator
pod-template-hash: 869fb4bd4d
name: catalog-operator-869fb4bd4d-lbhgj
namespace: openshift-operator-lifecycle-manager
ownerReferences:

  • apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-869fb4bd4d
    uid: 3a1a8cd3-2151-4650-a96b-c1951d461c67
    resourceVersion: "75671"
    uid: 2f06d663-0697-4e88-9e9a-f802c6641efe
    spec:
    containers:
  • args:
  • --namespace
  • openshift-marketplace
  • --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
  • --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
  • --util-image
  • quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
  • --writeStatusName
  • operator-lifecycle-manager-catalog
  • --tls-cert
  • /srv-cert/tls.crt
  • --tls-key
  • /srv-cert/tls.key
  • --client-ca
  • /profile-collector-cert/tls.crt
    command:
  • /bin/catalog
    env:
  • name: RELEASE_VERSION
    value: 4.11.0-0.nightly-2022-06-25-132614
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    name: catalog-operator
    ports:
  • containerPort: 8443
    name: metrics
    protocol: TCP
    readinessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8443
    scheme: HTTPS
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    resources:
    requests:
    cpu: 10m
    memory: 80Mi
    securityContext:
    allowPrivilegeEscalation: false
    capabilities:
    drop:
  • ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
  • mountPath: /srv-cert
    name: srv-cert
    readOnly: true
  • mountPath: /profile-collector-cert
    name: profile-collector-cert
    readOnly: true
  • mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-sdjns
    readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-190-130.ap-south-1.compute.internal
    nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
    preemptionPolicy: PreemptLowerPriority
    priority: 2000000000
    priorityClassName: system-cluster-critical
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
    level: s0:c20,c0
    seccompProfile:
    type: RuntimeDefault
    serviceAccount: olm-operator-serviceaccount
    serviceAccountName: olm-operator-serviceaccount
    terminationGracePeriodSeconds: 30
    tolerations:
  • effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  • effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  • effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  • effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
    volumes:
  • name: srv-cert
    secret:
    defaultMode: 420
    secretName: catalog-operator-serving-cert
  • name: profile-collector-cert
    secret:
    defaultMode: 420
    secretName: pprof-cert
  • name: kube-api-access-sdjns
    projected:
    defaultMode: 420
    sources:
  • serviceAccountToken:
    expirationSeconds: 3607
    path: token
  • configMap:
    items:
  • key: ca.crt
    path: ca.crt
    name: kube-root-ca.crt
  • downwardAPI:
    items:
  • fieldRef:
    apiVersion: v1
    fieldPath: metadata.namespace
    path: namespace
  • configMap:
    items:
  • key: service-ca.crt
    path: service-ca.crt
    name: openshift-service-ca.crt
    status:
    conditions:
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: Initialized
  • lastProbeTime: null
    lastTransitionTime: "2022-06-28T01:53:53Z"
    status: "True"
    type: Ready
  • lastProbeTime: null
    lastTransitionTime: "2022-06-28T01:53:53Z"
    status: "True"
    type: ContainersReady
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: PodScheduled
    containerStatuses:
  • containerID: cri-o://2a83d3ba503aa760d27aeef28bf03af84b9ff134ea76c6545251f3253507c22b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
    terminated:
    containerID: cri-o://2766bb3bad38c02f531915aac0321ffb18c7a993a7d84b27e8c7199935f7c858
    exitCode: 2
    finishedAt: "2022-06-28T01:53:52Z"
    message: "izer/streaming/streaming.go:77 +0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc002b12500)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
    +0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0036178c0)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
    +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
    +0x130\n\ngoroutine 3479 [sync.Cond.Wait]:\nsync.runtime_notifyListWait(0xc003a987c8,
    0x8)\n\t/usr/lib/golang/src/runtime/sema.go:513 +0x13d\nsync.(*Cond).Wait(0xc001f9bf10?)\n\t/usr/lib/golang/src/sync/cond.go:56
    +0x8c\ngolang.org/x/net/http2.(*pipe).Read(0xc003a987b0, {0xc0036e4001, 0xdff, 0xdff}

    )\n\t/build/vendor/golang.org/x/net/http2/pipe.go:76 +0xeb\ngolang.org/x/net/http2.transportResponseBody.Read(

    {0x0?}

    ,

    {0xc0036e4001?, 0x2?, 0x203511d?}

    )\n\t/build/vendor/golang.org/x/net/http2/transport.go:2407
    +0x85\nencoding/json.(*Decoder).refill(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:165
    +0x17f\nencoding/json.(*Decoder).readValue(0xc002fc0640)\n\t/usr/lib/golang/src/encoding/json/stream.go:140
    +0xbb\nencoding/json.(*Decoder).Decode(0xc002fc0640,

    {0x1d377c0, 0xc003523098}

    )\n\t/usr/lib/golang/src/encoding/json/stream.go:63
    +0x78\nk8s.io/apimachinery/pkg/util/framer.(*jsonFrameReader).Read(0xc003127770,

    {0xc0036dd500, 0x1000, 0x1500}

    )\n\t/build/vendor/k8s.io/apimachinery/pkg/util/framer/framer.go:152
    +0x19c\nk8s.io/apimachinery/pkg/runtime/serializer/streaming.(*decoder).Decode(0xc003502aa0,
    0xc001f9bf10?,

    {0x2366870, 0xc0044dca80}

    )\n\t/build/vendor/k8s.io/apimachinery/pkg/runtime/serializer/streaming/streaming.go:77
    +0xa7\nk8s.io/client-go/rest/watch.(*Decoder).Decode(0xc00059f700)\n\t/build/vendor/k8s.io/client-go/rest/watch/decoder.go:49
    +0x4f\nk8s.io/apimachinery/pkg/watch.(*StreamWatcher).receive(0xc0044dcd40)\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:105
    +0xe5\ncreated by k8s.io/apimachinery/pkg/watch.NewStreamWatcher\n\t/build/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go:76
    +0x130\n"
    reason: Error
    startedAt: "2022-06-28T01:06:59Z"
    name: catalog-operator
    ready: true
    restartCount: 3
    started: true
    state:
    running:
    startedAt: "2022-06-28T01:53:53Z"
    hostIP: 10.0.190.130
    phase: Running
    podIP: 10.130.0.34
    podIPs:

  • ip: 10.130.0.34
    qosClass: Burstable
    startTime: "2022-06-27T23:16:47Z"

— Additional comment from jiazha@redhat.com on 2022-06-28 02:09:23 UTC —

mac:~ jianzhang$ oc get pods package-server-manager-6cf996b4cc-79lrw -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.13"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-27T23:13:10Z"
generateName: package-server-manager-6cf996b4cc-
labels:
app: package-server-manager
pod-template-hash: 6cf996b4cc
name: package-server-manager-6cf996b4cc-79lrw
namespace: openshift-operator-lifecycle-manager
ownerReferences:

  • apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: package-server-manager-6cf996b4cc
    uid: 191cb627-a1f3-452b-aed2-336de7871004
    resourceVersion: "19056"
    uid: 63cb7200-32d6-4b0d-bfb2-ba725cda7fbd
    spec:
    containers:
  • args:
  • --name
  • $(PACKAGESERVER_NAME)
  • --namespace
  • $(PACKAGESERVER_NAMESPACE)
    command:
  • /bin/psm
  • start
    env:
  • name: PACKAGESERVER_NAME
    value: packageserver
  • name: PACKAGESERVER_IMAGE
    value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
  • name: PACKAGESERVER_NAMESPACE
    valueFrom:
    fieldRef:
    apiVersion: v1
    fieldPath: metadata.namespace
  • name: RELEASE_VERSION
    value: 4.11.0-0.nightly-2022-06-25-132614
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    name: package-server-manager
    readinessProbe:
    failureThreshold: 3
    httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
    resources:
    requests:
    cpu: 10m
    memory: 50Mi
    securityContext:
    allowPrivilegeEscalation: false
    capabilities:
    drop:
  • ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
  • mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-k9hh2
    readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-190-130.ap-south-1.compute.internal
    nodeSelector:
    kubernetes.io/os: linux
    node-role.kubern

— Additional comment from jiazha@redhat.com on 2022-06-28 02:10:02 UTC —

preemptionPolicy: PreemptLowerPriority
priority: 2000000000
priorityClassName: system-cluster-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
runAsNonRoot: true
runAsUser: 65534
seLinuxOptions:
level: s0:c20,c0
seccompProfile:
type: RuntimeDefault
serviceAccount: olm-operator-serviceaccount
serviceAccountName: olm-operator-serviceaccount
terminationGracePeriodSeconds: 30
tolerations:

  • effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  • effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  • effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  • effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
    volumes:
  • name: kube-api-access-k9hh2
    projected:
    defaultMode: 420
    sources:
  • serviceAccountToken:
    expirationSeconds: 3607
    path: token
  • configMap:
    items:
  • key: ca.crt
    path: ca.crt
    name: kube-root-ca.crt
  • downwardAPI:
    items:
  • fieldRef:
    apiVersion: v1
    fieldPath: metadata.namespace
    path: namespace
  • configMap:
    items:
  • key: service-ca.crt
    path: service-ca.crt
    name: openshift-service-ca.crt
    status:
    conditions:
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: Initialized
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:27:28Z"
    status: "True"
    type: Ready
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:27:28Z"
    status: "True"
    type: ContainersReady
  • lastProbeTime: null
    lastTransitionTime: "2022-06-27T23:16:47Z"
    status: "True"
    type: PodScheduled
    containerStatuses:
  • containerID: cri-o://2c56089fabbaa6d7067feb231750ad20422e299dc33dabc0cd19ceca5951ed3a
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
    terminated:
    containerID: cri-o://76cc5448de346ec636070a71868432c8f3aa213e16fe211c8fe18a3fee112d23
    exitCode: 1
    finishedAt: "2022-06-27T23:26:36Z"
    message: "d/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963629913e+09\tERROR\tFailed
    to get API Group-Resources\t {\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\": dial tcp 172.30.0.1:443: connect: connection refused\"}

    \nsigs.k8s.io/controller-runtime/pkg/cluster.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:322\nmain.run\n\t/build/cmd/package-server-manager/main.go:67\ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\n1.6563723963631017e+09\tERROR\tsetup\tfailed
    to setup manager instance\t

    {\"error\": \"Get \\\"https://172.30.0.1:443/api?timeout=32s\\\": dial tcp 172.30.0.1:443: connect: connection refused\"}

    \ngithub.com/spf13/cobra.(*Command).execute\n\t/build/vendor/github.com/spf13/cobra/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/build/vendor/github.com/spf13/cobra/command.go:974\ngithub.com/spf13/cobra.(*Command).Execute\n\t/build/vendor/github.com/spf13/cobra/command.go:902\nmain.main\n\t/build/cmd/package-server-manager/main.go:36\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250\nError:
    Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
    connect: connection refused\nencountered an error while executing the binary:
    Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443:
    connect: connection refused\n"
    reason: Error
    startedAt: "2022-06-27T23:26:11Z"
    name: package-server-manager
    ready: true
    restartCount: 2
    started: true
    state:
    running:
    startedAt: "2022-06-27T23:26:54Z"
    hostIP: 10.0.190.130
    phase: Running
    podIP: 10.130.0.13
    podIPs:

  • ip: 10.130.0.13
    qosClass: Burstable
    startTime: "2022-06-27T23:16:47Z"

— Additional comment from jiazha@redhat.com on 2022-06-29 08:43:51 UTC —

Observed the error restarts:

mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-28-160049 True False 5h57m Cluster version is 4.11.0-0.nightly-2022-06-28-160049

mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-7b88dddfbc-rsfhz 1/1 Running 6 (26m ago) 5h51m
collect-profiles-27608160-6m7r6 0/1 Completed 0 37m
collect-profiles-27608175-94n56 0/1 Completed 0 22m
collect-profiles-27608190-nbzcf 0/1 Completed 0 7m55s
olm-operator-5977ffb855-lgfn8 1/1 Running 0 9h
package-server-manager-75db6dcfc-hql4v 1/1 Running 0 9h
packageserver-5955fb79cd-9n56n 1/1 Running 0 9h
packageserver-5955fb79cd-xf6f6 1/1 Running 0 9h

mac:~ jianzhang$ oc get pods catalog-operator-7b88dddfbc-rsfhz -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.130.0.121"
],
"default": true,
"dns": {}
}]
openshift.io/scc: nonroot-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
creationTimestamp: "2022-06-29T02:46:23Z"
generateName: catalog-operator-7b88dddfbc-
labels:
app: catalog-operator
pod-template-hash: 7b88dddfbc
name: catalog-operator-7b88dddfbc-rsfhz
namespace: openshift-operator-lifecycle-manager
ownerReferences:

  • apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-7b88dddfbc
    uid: 4c796f4d-b2f2-4e0d-b2de-6e8ab5f45ce2
    resourceVersion: "278732"
    uid: 10e8c6e9-8f1c-4484-b934-c5208016c426
    spec:
    containers:
  • args:
  • --debug
  • --namespace
  • openshift-marketplace
    - --configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --opmImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:40552deca32531d2fe754f703613f24b514e27ffaa660b57d760f9cd984d9000
    - --util-image
    - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    - --writeStatusName
    - operator-lifecycle-manager-catalog
    - --tls-cert
    - /srv-cert/tls.crt
    - --tls-key
    - /srv-cert/tls.key
    - --client-ca
    - /profile-collector-cert/tls.crt
    command:
    - /bin/catalog
    env:
    - name: RELEASE_VERSION
      value: 4.11.0-0.nightly-2022-06-28-160049
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: catalog-operator
    ports:
    - containerPort: 8443
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 80Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /srv-cert
      name: srv-cert
      readOnly: true
    - mountPath: /profile-collector-cert
      name: profile-collector-cert
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cv642
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: olm-operator-serviceaccount-dockercfg-f26bf
  nodeName: ip-10-0-130-83.ap-south-1.compute.internal
  nodeSelector:
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    seLinuxOptions:
      level: s0:c20,c0
    seccompProfile:
      type: RuntimeDefault
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: srv-cert
    secret:
      defaultMode: 420
      secretName: catalog-operator-serving-cert
  - name: profile-collector-cert
    secret:
      defaultMode: 420
      secretName: pprof-cert
  - name: kube-api-access-cv642
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T02:46:23Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T08:11:56Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T08:11:56Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T02:46:23Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://64a99805cca11e9af66452d739a91259566a35a9e580bfac81dd7205f9272e18
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ac69cd6ac451019bc6060627a1cceb475d2b6b521803addb31b47e501d5e5935
    lastState:
      terminated:
        containerID: cri-o://938e2eb90465caf8ac99ff4405cfbb9a0f03b16598474f944cca44c3e70af9df
        exitCode: 2
        finishedAt: "2022-06-29T08:11:54Z"
        message: "ub.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128
          +0xc3 fp=0xc003c11180 sp=0xc003c11118 pc=0x1a36603\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc000a2f420,
          {0x237a238, 0xc0004ce4c0}, {0x236a448, 0xc004993ce0})\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77
          +0x535 fp=0xc003c11648 sp=0xc003c11180 pc=0x1a4f535\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc0001e3ad0,
          {0x237a238, 0xc0004ce4c0}, 0xc000cee360)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287
          +0x57c fp=0xc003c11f70 sp=0xc003c11648 pc=0x1a3ca7c\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x10000c0008fd6e0?,
          {0x237a238, 0xc0004ce4c0}, 0xc0004837b8?)\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231
          +0x45 fp=0xc003c11fb0 sp=0xc003c11f70 pc=0x1a3c4a5\ngithub.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
          +0x32 fp=0xc003c11fe0 sp=0xc003c11fb0 pc=0x1a3c152\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1571
          +0x1 fp=0xc003c11fe8 sp=0xc003c11fe0 pc=0x4719c1\ncreated by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start\n\t/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221
          +0x557\n"
        reason: Error
        startedAt: "2022-06-29T07:56:16Z"
    name: catalog-operator
    ready: true
    restartCount: 6
    started: true
    state:
      running:
        startedAt: "2022-06-29T08:11:55Z"
  hostIP: 10.0.130.83
  phase: Running
  podIP: 10.130.0.121
  podIPs:
  - ip: 10.130.0.121
  qosClass: Burstable
  startTime: "2022-06-29T02:46:23Z"

— Additional comment from jiazha@redhat.com on 2022-07-04 03:50:38 UTC —

Please ignore comments 4 and 5; they have nothing to do with this issue.

— Additional comment from jiazha@redhat.com on 2022-07-04 06:57:24 UTC —

Check the `previous` log.

mac:~ jianzhang$ oc logs catalog-operator-f8ddcb57b-j5rf2 --previous
time="2022-07-03T23:49:00Z" level=info msg="log level info"
...
...
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
time="2022-07-04T03:43:25Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
fatal error: concurrent map writes
fatal error: concurrent map writes

goroutine 559 [running]:
runtime.throw({0x2048546?, 0x0?})
	/usr/lib/golang/src/runtime/panic.go:992 +0x71 fp=0xc001f9c508 sp=0xc001f9c4d8 pc=0x43e9f1
runtime.mapassign_faststr(0x1d09880, 0xc0031847b0, {0x2079799, 0x2e})
	/usr/lib/golang/src/runtime/map_faststr.go:295 +0x38b fp=0xc001f9c570 sp=0xc001f9c508 pc=0x419b4b
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.Pod(0xc001f4a900, {0x203dd0f, 0xf}, {0xc00132ccc0, 0x38}, {0xc003582d50, 0x13}, 0xc00452c1e0, 0xc0031847b0, 0x5, ...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/reconciler.go:227 +0xbdf fp=0xc001f9cbb0 sp=0xc001f9c570 pc=0x1a475ff
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*grpcCatalogSourceDecorator).Pod(0xc001f9ce50, {0xc003582d50, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:125 +0xf9 fp=0xc001f9cc30 sp=0xc001f9cbb0 pc=0x1a42c99
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).currentPodsWithCorrectImageAndSpec(0xc001f9ce68?, {0xc001f4a900}, {0xc003582d50, 0x13})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:190 +0x198 fp=0xc001f9ce48 sp=0xc001f9cc30 pc=0x1a437b8
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler.(*GrpcRegistryReconciler).CheckRegistryServer(0xc000bcbf80?, 0x493b77?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go:453 +0x4c fp=0xc001f9ce88 sp=0xc001f9ce48 pc=0x1a45fcc
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).healthy(0x38ca8453?, 0xc001f4a900)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:196 +0x7e fp=0xc001f9ced0 sp=0xc001f9ce88 pc=0x1a4ae1e
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).health(0x1bc37c0?, 0xc003e7e7e0, 0x8?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:159 +0x2a fp=0xc001f9cf10 sp=0xc001f9ced0 pc=0x1a4ac8a
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).catalogHealth(0xc000a59a90, {0xc003356a50, 0x11})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:137 +0x387 fp=0xc001f9d040 sp=0xc001f9cf10 pc=0x1a4a827
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*catalogHealthReconciler).Reconcile(0xc000a59a90, {0x237a238, 0xc0003fe5c0}, {0x7f9f6e5b3900?, 0xc0050f64f0?})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/reconciler.go:82 +0x21e fp=0xc001f9d118 sp=0xc001f9d040 pc=0x1a4a1be
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate.ReconcilerChain.Reconcile({0xc0008e43c0?, 0x3, 0xc001f9d258?}, {0x237a238, 0xc0003fe5c0}, {0x7f9f6e5b3328?, 0xc0050f6490?})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubestate/kubestate.go:128 +0xc3 fp=0xc001f9d180 sp=0xc001f9d118 pc=0x1a36603
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription.(*subscriptionSyncer).Sync(0xc0004dfd50, {0x237a238, 0xc0003fe5c0}, {0x236a448, 0xc003cbb760})
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/subscription/syncer.go:77 +0x535 fp=0xc001f9d648 sp=0xc001f9d180 pc=0x1a4f535
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*QueueInformer).Sync(...)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer.go:35
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).processNextWorkItem(0xc00057a580, {0x237a238, 0xc0003fe5c0}, 0xc000954720)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:287 +0x57c fp=0xc001f9df70 sp=0xc001f9d648 pc=0x1a3ca7c
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).worker(0x0?, {0x237a238, 0xc0003fe5c0}, 0x0?)
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:231 +0x45 fp=0xc001f9dfb0 sp=0xc001f9df70 pc=0x1a3c4a5
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start.func3()
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x32 fp=0xc001f9dfe0 sp=0xc001f9dfb0 pc=0x1a3c152
runtime.goexit()
	/usr/lib/golang/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc001f9dfe8 sp=0xc001f9dfe0 pc=0x4719c1
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*operator).start
	/build/vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:221 +0x557

Seems like it failed at: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227

— Additional comment from agreene@redhat.com on 2022-07-05 16:01:22 UTC —

As Jian pointed out, the catalog operator is failing due to a concurrent write at https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/registry/reconciler/reconciler.go#L227.

This is happening because:

Line 227 in the reconciler.go directly mutates the catalogSource's annotations. The grpcCatalogSourceDecorator's Annotations function should be returning a copy of the annotations or it should be created with a deepcopy of the catalogSource to avoid mutating an object in the lister cache.

This doesn't seem to be a blocker, but we should get a fix in swiftly.
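As a rough illustration of the fix pattern described above (a minimal sketch assuming the decorator should hand out a copy rather than the lister-cached map; this is not the actual OLM patch), copying the annotations before mutating them avoids the concurrent map write:

// Minimal sketch, not the OLM code: return a fresh map so callers can add
// keys without writing into an object shared via the lister cache.
package main

import "fmt"

type catalogSource struct {
	Annotations map[string]string
}

// annotationsCopy returns a copy of the cached object's annotations, so
// mutating the result never touches the cached object that other goroutines
// may be reading concurrently.
func annotationsCopy(cs *catalogSource) map[string]string {
	out := make(map[string]string, len(cs.Annotations))
	for k, v := range cs.Annotations {
		out[k] = v
	}
	return out
}

func main() {
	cached := &catalogSource{Annotations: map[string]string{"olm.catalogSource": "redhat-operators"}}
	ann := annotationsCopy(cached)
	ann["example.com/hypothetical-key"] = "set-by-reconciler" // safe: the cache stays untouched
	fmt.Println(cached.Annotations, ann)
}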

— Additional comment from jiazha@redhat.com on 2022-07-13 05:02:04 UTC —

1, Create a cluster with the fixed PR via the Cluster-bot.
mac:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest True False 126m Cluster version is 4.11.0-0.ci.test-2022-07-13-022646-ci-ln-41fvni2-latest

2, Subscribe some operators.
mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable

mac:~ jianzhang$ oc get sub -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
default etcd etcd community-operators singlenamespace-alpha
openshift-logging cluster-logging cluster-logging redhat-operators stable
openshift-operators-redhat elasticsearch-operator elasticsearch-operator redhat-operators stable
mac:~ jianzhang$
mac:~ jianzhang$
mac:~ jianzhang$ oc get csv -n openshift-operators-redhat
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n openshift-logging
NAME DISPLAY VERSION REPLACES PHASE
cluster-logging.5.4.2 Red Hat OpenShift Logging 5.4.2 Succeeded
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
mac:~ jianzhang$ oc get csv -n default
NAME DISPLAY VERSION REPLACES PHASE
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
etcdoperator.v0.9.4 etcd 0.9.4 etcdoperator.v0.9.2 Succeeded

3, Check OLM catalog-operator pods status.
mac:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
catalog-operator-546db7cdf5-7pldg 1/1 Running 0 145m
collect-profiles-27628110-lr2nv 0/1 Completed 0 30m
collect-profiles-27628125-br8b8 0/1 Completed 0 15m
collect-profiles-27628140-m64gp 0/1 Completed 0 38s
olm-operator-754d7f6f56-26qhw 1/1 Running 0 145m
package-server-manager-77d5cbf696-v9w4p 1/1 Running 0 145m
packageserver-6884994d98-2smtw 1/1 Running 0 143m
packageserver-6884994d98-5d7jg 1/1 Running 0 143m

mac:~ jianzhang$ oc logs catalog-operator-546db7cdf5-7pldg --previous
Error from server (BadRequest): previous terminated container "catalog-operator" in pod "catalog-operator-546db7cdf5-7pldg" not found

No terminated container. catalog-operator works well. Verify it.

— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 22:50:04 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from jiazha@redhat.com on 2022-07-18 07:23:08 UTC —

Changed the status to VERIFIED based on comment 10.

This is a clone of issue OCPBUGS-5947. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4833. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4819. The following is the description of the original issue:

Description of problem:

The s2i/run script has a bug: /usr/libexec/s2i/run: line 578: [: too many arguments

Version-Release number of selected component (if applicable):

v4.10

How reproducible:

Start Jenkins from the container image using /usr/libexec/s2i/run while having a route that contains a certificate or key that includes special characters.

Steps to Reproduce:

1. create a route that contains a TLS certificate
2. start a pod using openshift4/ose-jenkins:v4.10.0
3. view the log

Actual results:

2022/12/12 17:30:33 [go-init] No pre-start command defined, skip
2022/12/12 17:30:33 [go-init] Main command launched : /usr/libexec/s2i/run
CONTAINER_MEMORY_IN_MB='12288', using /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/java and /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/javac
Administrative monitors that contact the update center will remain active
Migrating slave image configuration to current version tag ...
/usr/libexec/s2i/run: line 578: [: too many arguments
Using JENKINS_SERVICE_NAME=jenkins
Generating jenkins.model.JenkinsLocationConfiguration.xml using (/var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml.tpl) ...
Jenkins URL set to: https://bojenkinsdev.micron.com/ in file: /var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml
+ exec java -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+ParallelRefProcEnabled -XX:+UseG1GC -XX:+UseStringDeduplication -XX:HeapDumpPath=/var/log/jenkins/ '-Xlog:gc*=debug:file=/var/log/jenkins-engserv/gc-%t.log:utctime:filecount=2,filesize=100m' -Xms2g -Xmx8g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Djavax.net.ssl.trustStore=/var/lib/jenkins/ca-anchors-keystore -Dcom.redhat.fips=false -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes= -Duser.home=/var/lib/jenkins -Djavamelody.application-name=jenkins -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=false -XX:+AlwaysPreTouch -XX:ErrorFile=/var/log/jenkins-engserv -Dhudson.model.ParametersAction.keepUndefinedParameters=false -jar /usr/lib/jenkins/jenkins.war --prefix=/je...
Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true

Expected results:

not have the following error:
/usr/libexec/s2i/run: line 578: [: too many arguments

Additional info:

 

This is a clone of issue OCPBUGS-501. The following is the description of the original issue:

Description of problem: 

Version-Release number of selected component (if applicable): 4.10.16

How reproducible: Always

Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field

$ oc get apiserver cluster -o yaml
spec:
  audit:
    customRules:
    - group: system:authenticated:oauth
      profile: AllRequestBodies
    - group: system:authenticated
      profile: AllRequestBodies
    profile: Default

2. Allow the kube-apiserver pods to roll out a new revision.
3. Once the kube-apiserver pods are on the new revision, execute $ oc get dc

Actual results:

Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)

Expected results: The command "oc get dc" should display the deploymentconfig without any error.

Additional info:

We should jump to OVS 2.17 to (a) get performance improvements to ovsdb-server and (b) stop using the dead-end 2.16 stream, consolidating on 2.17, which is used by other layered products, fully supported by FDP, and proven in 4.11/4.12/4.13.

This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:

Description of problem:

Unit test failing 

=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
    --- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)


Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648 

Actual results:

unit tests fail

Expected results:

TestNewAppRunAll unit test should pass

Additional info:

 

This is a clone of issue OCPBUGS-2077. The following is the description of the original issue:

Description of problem:

The pipeline list page fetches all the pipelineruns to find the last pipeline run, which results in more load time. This performance issue needs to be addressed in all the pipelines list pages wherever applicable.

Version-Release number of selected component (if applicable):
4.9

How reproducible:
Always

Steps to Reproduce:
1. Create 10+ pipelines in a namespace
2. Create a larger number of pipelineruns under each pipeline
3. Navigate to the pipelines list page.

Actual results:
The pipelines list takes a long time to load.

Expected results:

The pipeline list should not take a long time to load.

Additional info:

Reduce the amount of data fetched to find the last pipelinerun; maybe use PartialMetadata to find the latest pipeline run and improve the performance.
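On the backend side, the PartialMetadata idea can be illustrated with client-go's metadata-only client. This is only a sketch (the console itself is a TypeScript frontend); the kubeconfig path, the "my-namespace" namespace, and the Tekton pipelineruns resource below are assumptions:

// Sketch: list only object metadata for pipelineruns and pick the newest one,
// instead of fetching the full objects.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	// The metadata client returns only ObjectMeta, not full PipelineRun objects.
	client, err := metadata.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1beta1", Resource: "pipelineruns"}
	list, err := client.Resource(gvr).Namespace("my-namespace").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Find the most recently created run by creationTimestamp alone.
	var latest *metav1.PartialObjectMetadata
	for i := range list.Items {
		item := &list.Items[i]
		if latest == nil || item.CreationTimestamp.Time.After(latest.CreationTimestamp.Time) {
			latest = item
		}
	}
	if latest != nil {
		fmt.Println("latest pipeline run:", latest.Name)
	}
}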

Description of problem:

Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

Input values in Instantiate Template disappear randomly.

Version-Release number of selected component (if applicable):

  • Customer ENV
      • OCP 4.10.9
      • Developer Console
      • Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
      • Internet-disconnected OCP cluster
  • quicklab test ENV
      • Developer Console
      • OCP 4.10.12
      • Chrome 100

How reproducible:

I reproduced this issue in the ocp410ovn shared cluster in quicklab.

Select Apache HTTP Server > input the name "test" in the Application Hostname box.
After several seconds, the value disappears in the web console.

Steps to Reproduce:

0. Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

1. Input values in the boxes of the template menu.

2. The values disappear several seconds later (~20s, or randomly).

3. Many users have experienced this issue.

  • The web browser versions of users experiencing this issue:
      • Customer: Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
      • My browser: Chrome 102.x

==> the browser version doesn't matter.

Actual results:

Input values in "Instantiate Template" disappear randomly.
Users can't use the Instantiate Template feature in the Dev console.

Expected results:
Input values remain in the web console and users can create the object via "Instantiate Template".

Additional info:

See how "Application Name" disappears in the video I attached.

This is a clone of issue OCPBUGS-4945. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4805. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:

Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env 
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted.

We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. It experienced an issue whereby worker nodes were impacted by a similar problem; however, this was because of a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES.

This caused existing worker nodes to go into a NotReady state after the upgrade, and additionally new nodes did not join the cluster as their kubelet would become impacted.

For clusterB, the conditions that led to the empty value are better understood.

However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start. 

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps  through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env then a default should be applied and/or kubelet able to continue running.

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2076646](https://bugzilla.redhat.com/show_bug.cgi?id=2076646). The following is the description of the original bug:

openshift-install destroy unable to delete PVC disks in GCP if cluster identifier is longer than 22 characters

Version:

$ openshift-install version
$ ./openshift-install 4.8.18
built from commit bd366e3cdcf892e1bddd841c702738f5254a0188
release image quay.io/openshift-release-dev/ocp-release@sha256:321aae3d3748c589bc2011062cee9fd14e106f258807dc2d84ced3f7461160ea

Platform: GCP

Installation Type: IPI

What happened?

When running the openshift-install destroy cluster command, it is observed that PVC disks are not deleted if the metadata.name is longer than 22 characters.

  1. Always at least include the `.openshift_install.log`

What did you expect to happen?

All resources should get deleted successfully with openshift-installer destroy command.

How to reproduce it (as minimally and precisely as possible)?

$ Setup IPI GCP cluster
$ Provide cluster name with 22 chars.
$ Use standard (default) storage class, create pvc and pv.
$ Once done, destroy the cluster
$ Check on the backend platform if the storage disk for PVC is getting deleted or not.

Anything else we need to know?

We deployed an OpenShift 4 cluster in GCP, the `.metadata.name` field in the install config was gcpuser-a.ocp.redhat. The installer adds a unique identifier to the name for the InfraID, in our case, it resulted in `gcpusc1-a-ops-xpaas-nkp6w`.

After the cluster was provisioned, we created a PVC. The corresponding Google cloud disk followed the name `gcpuser-a.ocp.redhat-nk-pvc-<UID>`. Because the disk name did not exactly match the InfraID, when we ran the openshift-install destroy for this cluster, none of the disks for PVCs were deleted.
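A hedged Go sketch of the two failure modes described above (illustrative only, not the installer's actual code): a hypothetical storagePrefix helper that avoids slicing past the end of a short cluster ID, and a check showing why a truncated 22-character prefix no longer matches a PVC disk named from the full metadata.name:

// Sketch under stated assumptions: clusterID and diskName are taken from the
// report above, and the helper name is hypothetical.
package main

import (
	"fmt"
	"strings"
)

// storagePrefix returns at most maxLen characters of clusterID without
// panicking when the ID is shorter than maxLen.
func storagePrefix(clusterID string, maxLen int) string {
	if len(clusterID) <= maxLen {
		return clusterID
	}
	return clusterID[:maxLen]
}

func main() {
	clusterID := "gcpusc1-a-ops-xpaas-nkp6w"           // InfraID from the report
	diskName := "gcpuser-a.ocp.redhat-nk-pvc-1234abcd" // PVC disk named from metadata.name (UID is made up)

	prefix := storagePrefix(clusterID, 22)
	// Prints false: the truncated prefix does not match the disk name,
	// so the destroy code never deletes the PVC disk.
	fmt.Println(prefix, strings.HasPrefix(diskName, prefix))
}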

This is a clone of issue OCPBUGS-4640. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4489. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:

Description of problem:

Prometheus continuously restarts due to slow WAL replay

Version-Release number of selected component (if applicable):

openshift - 4.11.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2019564](https://bugzilla.redhat.com/show_bug.cgi?id=2019564). The following is the description of the original bug:

Description of problem:
On OpenShift Dev Sandbox users are automatically deleted after a month. Also on other clusters we should clean up resources when users are deleted.

Version-Release number of selected component (if applicable):
4.8+

How reproducible:
Always

Steps to Reproduce:
1. Create a user
2. Use the console (this should automatically create 3 resources in the namespace "openshift-console-user-settings": a Role, a RoleBinding and a ConfigMap)
3. Delete the user

Actual results:
The three created resources in namespace "openshift-console-user-settings" stay forever

Expected results:
The three created resources in namespace "openshift-console-user-settings" should be automatically removed

Additional info:
None

We have created a fix in 4.12 that fetches instance type information from the Azure API instead of updating the lists. We feel that backporting that fix is too risky, but agreed to update the list in older versions.

Description of problem:

Add the following instance types to azure_instance_types list[1]:

  • Standard_D8s_v5
  • Standard_E8s_v5
  • Standard_E16s_v5

Version-Release number of selected component (if applicable):
OCP 4.8

Steps to Reproduce:
1. Migrate worker/infra nodes to above mentioned (missing) v5 instance types
2. "Failed to set autoscaling from zero annotations, instance type unknown"

Actual results:

  • "Failed to set autoscaling from zero annotations, instance type unknown"
  • New v5 instance types not officially tested/supported

Expected results:
The new instance types are available in the azure_instance_types list[1] and no errors/warnings are observed after migrating:

  • Standard_D8s_v5
  • Standard_E8s_v5
  • Standard_E16s_v5

Additional info:

The related v4 instance types are already available[1] - I suspect adding the mentioned v5 instance types is a minor update:

  • Standard_D8s_v4
  • Standard_E8s_v4
  • Standard_E16s_v4

1) azure_instance_types.go
https://github.com/openshift/cluster-api-provider-azure/blob/release-4.8/pkg/cloud/azure/actuators/machineset/azure_instance_types.go
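For illustration, the kind of entries assumed to be missing could look like the sketch below; the struct and field names are hypothetical stand-ins for the provider's real instance-type definition, but the vCPU/memory figures match Azure's published specs for these v5 sizes:

// Sketch only: a stand-in for the azure_instance_types.go map, not the
// provider's actual types.
package main

import "fmt"

type instanceType struct {
	VCPU     int64
	MemoryMb int64
	GPU      int64
}

var azureInstanceTypes = map[string]instanceType{
	"Standard_D8s_v5":  {VCPU: 8, MemoryMb: 32768, GPU: 0},
	"Standard_E8s_v5":  {VCPU: 8, MemoryMb: 65536, GPU: 0},
	"Standard_E16s_v5": {VCPU: 16, MemoryMb: 131072, GPU: 0},
}

func main() {
	for name, t := range azureInstanceTypes {
		fmt.Printf("%s: %d vCPU, %d MiB memory\n", name, t.VCPU, t.MemoryMb)
	}
}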

This is a clone of issue OCPBUGS-3049. The following is the description of the original issue:

This is a copy of Bugzilla bug 2117524 for backport to 4.11.z

Original Text:

Description of problem:
On routers configured with mTLS and a CRL defined in the CA with a CDP, a new CRL is downloaded only when the ingress-operator is restarted.

2022-07-20T23:36:26.943Z	INFO	operator.clientca_configmap_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/service-bdrc"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	certificate revocation list has expired	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:69	retrieving certificate revocation list	{"subject key identifier": "6aa909992e9890457b2a8de5659a44cab8e867a8"}
2022-07-20T23:36:26.943Z	INFO	operator.crl	crl/crl_configmap.go:169	retrieving CRL distribution point	{"distribution point": "http://crl.domain.com/der/CN=XXXX,OU=XXX,O=XXX,C=XXX"}

Version-Release number of selected component (if applicable):
4.9.33

How reproducible:
Enable mTLS with a CRL

Actual results:
The CRL is not downloaded when it expires
Clients get "SSL client certificate not trusted" errors while accessing resources

Expected results:
The ingress-operator should trigger a CRL download when approaching the expiration date so that the configmap is updated without manual action
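A minimal sketch of the expected behaviour, assuming the CRL bytes have already been fetched from the CDP (this is not the ingress-operator's implementation): parse the CRL's NextUpdate and schedule the next download a safety margin before it:

// Sketch: schedule a refresh callback before the CRL expires instead of
// waiting for an operator restart.
package main

import (
	"crypto/x509"
	"fmt"
	"time"
)

// scheduleCRLRefresh parses a DER-encoded CRL and returns a timer that fires
// the refresh callback a safety margin before NextUpdate.
func scheduleCRLRefresh(der []byte, margin time.Duration, refresh func()) (*time.Timer, error) {
	crl, err := x509.ParseRevocationList(der)
	if err != nil {
		return nil, err
	}
	wait := time.Until(crl.NextUpdate) - margin
	if wait < 0 {
		wait = 0 // already expired or about to expire: refresh immediately
	}
	fmt.Printf("next CRL refresh in %s (NextUpdate=%s)\n", wait, crl.NextUpdate)
	return time.AfterFunc(wait, refresh), nil
}

func main() {
	// The DER bytes would come from the CDP download; omitted in this sketch.
	_ = scheduleCRLRefresh
}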

This is a clone of issue OCPBUGS-3882. The following is the description of the original issue:

This bug is a backport clone of [Bugzilla Bug 2034883](https://bugzilla.redhat.com/show_bug.cgi?id=2034883). The following is the description of the original bug:

Description of problem:

Situation (starting point):

  • There is an ongoing change to the machine-config-daemon daemonset being applied by the machine-config-operator pod. It is waiting for the daemonset to roll out.
  • There are some nodes not ready, so daemonset rollout never ends and waiting on that ends in timeout error.

Problem:

  • The machine-config-operator pod stops trying to reconcile whenever it hits a timeout error while waiting for the machine-config-daemon rollout
  • This implies that the `spec.kubeAPIServerServingCAData` field of controllerconfig/machine-config-controller object is not updated when the kube-apiserver-operator updates kube-apiserver-to-kubelet-client-ca configmap.
  • Without that field updated, a kube-apiserver-to-kubelet-client-ca change is never rolled out to the nodes.
  • That ultimately leads to cluster-wide unavailability of "oc logs", "oc rsh" etc. commands when the kube-apiserver-operator starts using a client cert signed by the new kube-apiserver-to-kubelet-client-ca to access kubelet ports.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.7.21

Platform (AWS, VSphere, Metal, etc.): (not relevant)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Always if the said conditions are met.

Steps to Reproduce:
1. Have some nodes not ready
2. Force a change that requires machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this)
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by kube-apiserver-operator

Actual results:

New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca not deployed on nodes

Expected results:

kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca deployed to nodes.

Additional info:

In comments

Description of problem:

We need to include the `openshift_apps_deploymentconfigs_strategy_total` metrics to the IO archive file.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` inside of it

Actual results:

 

Expected results:

You should see the `openshift_apps_deploymentconfigs_strategy_total` at the `config/metrics` file.

Additional info:

 

Description of problem:

Running the discovery cache refresh every 10 minutes has a significant productivity impact on using kubectl against clusters with many CRDs, as running these unnecessary requests takes time.
The discovery cache doesn't really have to run every 10 minutes, as CRDs don't change that often. A lot of unnecessary load is created on clients and servers.

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

Cluster with a lot of CRDs

Steps to Reproduce:

https://github.com/kubernetes/kubernetes/issues/107130

Actual results:

Significant growth in kubectl request completion time every 10 minutes

Expected results:

Significant growth in kubectl request completion time only every 24 hours

Additional info:

https://github.com/kubernetes/kubernetes/issues/107130
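For illustration, client-go's on-disk cached discovery client already takes a TTL, so the change amounts to passing a longer value; the sketch below is an assumption-laden example (kubeconfig at the default location, a hypothetical /tmp cache directory), not the oc/kubectl patch itself:

// Sketch: build a cached discovery client with a 24h TTL instead of the
// 10-minute default used by kubectl.
package main

import (
	"fmt"
	"path/filepath"
	"time"

	"k8s.io/client-go/discovery/cached/disk"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cacheDir := filepath.Join("/tmp", "kube-cache") // hypothetical cache location
	dc, err := disk.NewCachedDiscoveryClientForConfig(
		config,
		filepath.Join(cacheDir, "discovery"),
		filepath.Join(cacheDir, "http"),
		24*time.Hour, // longer TTL: CRDs rarely change, so re-list far less often
	)
	if err != nil {
		panic(err)
	}
	groups, err := dc.ServerGroups()
	if err != nil {
		panic(err)
	}
	fmt.Println("API groups discovered:", len(groups.Groups))
}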

This is a clone of issue OCPBUGS-1099. The following is the description of the original issue:

This is a clone of issue OCPBUGS-224. The following is the description of the original issue:

Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf

# Generated by KNI resolv prepender NM dispatcher script
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx

OpenShift 4.8.29

# Generated by KNI resolv prepender NM dispatcher script
search sepia.lab.iad2.dc.paas.redhat.com
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx
    ~~~

ENV: OpenStack IAD2, IPI installation. Connected cluster.

Version-Release number of selected component (if applicable):
OCP v4.9.31

How reproducible:
Always

Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.

Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.

Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.

Additional info:

  • The customer was trying to deploy secure Kerberos on the CoreOS nodes and it failed when the IPA-client install command failed. This is when the customer noticed this unusual behavior. They did not manually update the resolv.conf file to include the $search domain. They instead added the script below to /etc/NetworkManager/dispatcher.d/ and restarted NetworkManager on the node to fix this issue, and installation was successful.
    ~~~
    #!/bin/bash

set -eo pipefail

DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \

grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \
tr '\n' ' ')"

>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"

sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp

mv /etc/resolv.tmp /etc/resolv.conf
~~~

  • The customer confirms that the $search domain has been missing since the cluster was freshly installed. They even confirmed with a fresh new cluster that it was missing.
  • The fresh cluster was initially installed at v4.9.31 but was updated afterward to v4.9.43 (the latest z-stream) to see if the updates fixed anything but it didn't make any difference. The cluster is currently running v4.9.43 and shows the $search domain missing in the /etc/resolv.conf file on all nodes.

This is a clone of issue OCPBUGS-7494. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6671. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3228. The following is the description of the original issue:

While starting a Pipelinerun using the UI, in the process of providing the values on "Start Pipeline", the IBM Power customer (Deepak Shetty from IBM) tried creating credentials under "Advanced options" with "Image Registry Credentials" (Authentication type). When the IBM customer verified the credentials from the Secrets tab (in Workloads), the secret was found in a broken state. A screenshot of the broken secret is attached.

The issue has been observed on OCP4.8, OCP4.9 and OCP4.10.

This is a clone of issue OCPBUGS-10943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag, either due to a missing or invalid VERSION_OVERRIDE (https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed to the build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

consistently in 4.12 nightlies only (CI builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

This is a clone of issue OCPBUGS-533. The following is the description of the original issue:

Description of problem:

The customer is using Azure AD as the OpenID provider with group synchronization from the provider.

The scenario is the following:

1) User A logs in.
   Groups are created with the membership.
   User A is a member of a group with admin rights and is cluster-admin.

2) User B logs in.
   Groups are updated with the membership.
   User B is also a member of the group with admin rights and is cluster-admin.

3) User A logs in again.
   Groups are identical to the former step.
   User A has no administration rights.

The group memberships are the same in steps 2 and 3.
The cluster role bindings of the groups have never changed.

The only way to give user A admin rights again is to delete the membership from the group and have user A log in again.

I have not managed to reproduce this using RH SSO, nor with Azure AD.

But my configuration is not exactly the same yet.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

 

This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:

Description of problem:
Currently we are running the VMware CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be using a known weak cipher, 3DES. We are attempting to upgrade or modify the operator to customize the available ciphers. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image, and we were trying to steer away from performing a custom install from scratch. Looking for any suggestions on mitigating the weak cipher in the kube-rbac-proxy under the VMware CSI Operator.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

For some reason, the LSP of a pod is not properly added to the port group where the ACL of a NetworkPolicy is applied. This results in the NetworkPolicy not being applied to the pod and communication not being possible.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always with a concrete pod at customer environment.

Steps to Reproduce:

(not known exactly yet)

Actual results:

LSP not in port group. ACL not applied. Netpol not in effect.

Expected results:

LSP in port group. ACL applied. Netpol in effect.

Additional info:

Details in private comments, as they involve sensitive data.

Deleting the pod does nothing, but it is possible that this has something to do with the pod being recreated with the same name (although the LSPs UUIDs are different in each incarnation).

This is a clone of issue OCPBUGS-723. The following is the description of the original issue:

Description of problem:
I have a customer who created a clusterquota for one of the namespaces; it was created, but the values were not reflected under limits and the namespace details were not displayed.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~

WORKAROUND: They recreated the clusterquota object (cached it off, deleted it, created a new one), after which it displayed values as expected.

In the past, they saw similar behavior on their test cluster; it was heavily utilized, the etcd DB was much larger in size (>2.5Gi), and it had many more objects (at that time, helm secrets were being cached for all deployments, keeping a history of 10, so etcd was being bombarded).

On this cluster the same "symptom" was noticed; however, etcd was nowhere near that size, nor was there a comparable number of etcd objects and/or cached helm secrets.

Version-Release number of selected component (if applicable): OCP 4.9

How reproducible: Occurred only twice (once in the test cluster and once in the current cluster)

Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty

Actual results: ClusterQuota does not display the values

Expected results: ClusterQuota should display the values

This is a request to backport the fix in OCPBUGS-1718 to OpenShift 4.10.

Description cloned from that bug:
Description of problem:

prometheus-k8s-0 ends in CrashLoopBackOff with level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests

Version-Release number of selected component (if applicable):

4.11.6

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0 

Actual results:

NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m

Expected results:

Running

Additional info:

Attaching must-gather.

The pod recovers successfully after deleting/re-creating.


[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"

This is a clone of issue OCPBUGS-2079. The following is the description of the original issue:

Description of problem:

The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected. 

Version-Release number of selected component (if applicable):

4.10.z, may exist on other OCP versions as well. 

How reproducible:

always

Steps to Reproduce:

1. Create a KubeletConfig on the node:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi


2. Check node allocatable storage with command: oc describe node |grep -C 5 ephemeral-storage

Actual results:

The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)  

Expected results:

The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default) 

Additional info:

The root cause might be that the process argument '--system-reserved=cpu=500m,memory=500Mi' overwrote the setting in /etc/kubernetes/kubelet.conf. One example:

root        6824       1 27 Sep30 ?        1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s


*USER STORY:*

 As a customer or OpenShift engineer, I want to see the user agent for anything calling from OpenShift -> vSphere to eliminate troubleshooting guesswork.

*DESCRIPTION:*

A question was raised in #forum-vmware where we identified that the user-agent may not be configured for all OpenShift components calling the vSphere API.

https://coreos.slack.com/archives/CH06KMDRV/p1627368902058800

*Required:*

Audit of OpenShift components calling to vSphere API to make sure user agent strings are set appropriately.

*Nice to have:*

How can this be prevented in the future? How can we minimize maintenance costs added by new PRs/bugs reported from this spike?

*ACCEPTANCE CRITERIA:*

New PRs or bug reports for each affected component.

Description of problem:

Whereabouts doesn't allow the use of network interface names that are not preceded by the prefix "net", see https://github.com/k8snetworkplumbingwg/whereabouts/issues/130.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Define two Pods, one with the interface name 'port1' and the other with 'net-port1':

test-ip-removal-port1:	
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "port1",
                    "namespace": "default"
                  }
                ]

test-ip-removal-net-port1:
              k8s.v1.cni.cncf.io/networks:
                [
                  {
                    "name": "test-sriovnd",
                    "interface": "net-port1",
                    "namespace": "default"
                  }
                ]

2. IP allocated in the IPPool:

kind: IPPool
...
spec:
  allocations:
    "16":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-port1
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1

3. When the ip-reconciler job is run, the allocation for the port with the interface name 'port1' is removed:

[13:29][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             11d

[13:29][]$ oc get ippools.whereabouts.cni.cncf.io -n openshift-multus   2001-1b70-820d-2610---64 -o yaml
apiVersion: whereabouts.cni.cncf.io/v1alpha1
kind: IPPool
metadata:
...
spec:
  allocations:
    "17":
      id: ...
      podref: test-ecoloma-1/test-ip-removal-net-port1
  range: 2001:1b70:820d:2610::/64

[13:30][]$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        9s              11d

 

Actual results:

The IP allocation for a network interface whose name doesn't have a 'net' prefix is removed by the ip-reconciler cronjob.

Expected results:

The IP allocation must not be removed, regardless of the interface name.

Additional info:

Upstream PR @ https://github.com/k8snetworkplumbingwg/whereabouts/pull/147 master PR @ https://github.com/openshift/whereabouts-cni/pull/94

Description of problem:

When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
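A hedged sketch of the idea behind a fix (not the actual whereabouts code): rather than testing every address inside an excluded subnet one by one, the search can jump straight past the excluded range. The helper and addresses below are illustrative only:

// Sketch: skip an entire excluded subnet in one step instead of iterating it
// address by address.
package main

import (
	"fmt"
	"math/big"
	"net"
)

// nextOutsideExclusions returns ip itself if it is not inside any excluded
// subnet, otherwise the first address after the excluded subnet containing it
// (network base + subnet size).
func nextOutsideExclusions(ip net.IP, exclusions []*net.IPNet) net.IP {
	for _, ex := range exclusions {
		if ex.Contains(ip) {
			ones, bits := ex.Mask.Size()
			base := new(big.Int).SetBytes(ex.IP)
			size := new(big.Int).Lsh(big.NewInt(1), uint(bits-ones))
			next := new(big.Int).Add(base, size)
			// Sketch only: assumes the excluded range does not end at the
			// very top of the address space.
			return net.IP(next.FillBytes(make([]byte, len(ex.IP))))
		}
	}
	return ip
}

func main() {
	_, ex, _ := net.ParseCIDR("fd43:01f1:3daa:0baa::/100") // exclude range from the NAD above
	candidate := net.ParseIP("fd43:1f1:3daa:baa::1")
	// Jumps directly past the /100, instead of walking ~2^28 addresses.
	fmt.Println(nextOutsideExclusions(candidate, []*net.IPNet{ex}))
}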

Version-Release number of selected component (if applicable):

 

How reproducible:

with big exclude ranges, 100%

Steps to Reproduce:

1. create network-attachment-definition with a large range:

$ cat <<EOF| oc apply -f -       
apiVersion: k8s.cni.cncf.io/v1                                            
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
         "type": "whereabouts",
         "range": "fd43:01f1:3daa:0baa::/64",
         "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
         "log_file": "/tmp/whereabouts.log",
         "log_level" : "debug"
      }
    }
EOF
2. create a pod with the network attached:

$ cat <<EOF|oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF

3. check pod status, event log and whereabouts logs after a while: 

$ oc get pods
NAME                        READY   STATUS              RESTARTS   AGE
pod-with-exclude-range      0/1     ContainerCreating   0          2m23s

$ oc get events
<...>
6m39s       Normal    Scheduled                                    pod/pod-with-exclude-range                   Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s       Normal    AddedInterface                               pod/pod-with-exclude-range                   Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s       Warning   FailedCreatePodSandBox                       pod/pod-with-exclude-range                   Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

$ oc debug node/<worker-node> -- tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS: {Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes: {KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath:PodName:pod-with-exclude-range PodNamespace:default} 
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID:f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff

Actual results:

Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Expected results:

additional network gets attached to the pod

Additional info:

+++ This bug was initially created as a clone of Bug #2102632 +++

Version:
4.11.0-0.nightly-2022-06-28-160049

$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-06-28-160049
built from commit 6daed68b9863a9b2ecebdf8a4056800aa5c60ad3
release image registry.ci.openshift.org/ocp/release@sha256:b79b1be6aa4f9f62c691c043e0911856cf1c11bb81c8ef94057752c6e5a8478a
release architecture amd64

Platform:
GCP

IPI (automated install with `openshift-install`).

What happened?
During the cluster uninstall I received:

E0630 13:17:58.830361 271713 runtime.go:78] Observed a panic: runtime.boundsError{x:22, y:21, signed:true, code:0x1} (runtime error: slice bounds out of range [:22] with length 21)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x41d43c0?, 0xc0010637e8})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: slice bounds out of range [:22] with length 21 [recovered]
panic: runtime error: slice bounds out of range [:22] with length 21

goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x18?})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x41d43c0, 0xc0010637e8})
/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).formatClusterIDForStorage(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:25
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageIDFilter(...)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:29
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).storageLabelOrClusterIDFilter(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:39 +0x1fe
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).listDisks(0xc0015dc900?)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:43 +0x1e
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyDisks(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/disk.go:116 +0x36
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).destroyCluster(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:174 +0x78e
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc000700000})
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:220 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x19f06638?, 0xc0000721c0?}, 0xc00047d888?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x19f06638, 0xc0000721c0}, 0xc8?, 0x1108485?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x19f06638, 0xc0000721c0}, 0x40d687?, 0x10?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:566 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x19f06670?, 0xc00008b8c0?)
/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:555 +0x46
github.com/openshift/installer/pkg/destroy/gcp.(*ClusterUninstaller).Run(0xc000f22540)
/go/src/github.com/openshift/installer/pkg/destroy/gcp/gcp.go:130 +0x519
main.runDestroyCmd({0x7fffe6a88d87, 0x9}, 0x0)
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:67 +0x92
main.newDestroyClusterCmd.func1(0xc000536780?, {0xc000906100?, 0x2?, 0x2?})
/go/src/github.com/openshift/installer/cmd/openshift-install/destroy.go:53 +0x7f
github.com/spf13/cobra.(*Command).execute(0xc000536780, {0xc0009060c0, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc00098db80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:60 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff

Anything else we need to know?

Uninstall with openshift-install binary from OCP 4.10.16 worked fine.

— Additional comment from jmencak@redhat.com on 2022-06-30 12:15:49 UTC —

Created attachment 1893636 [details] Install/uninstall directory tar ball.

Adding install/uninstall directory tar ball.

— Additional comment from padillon@redhat.com on 2022-07-01 14:11:22 UTC —

Can we get an install config for the failing destroy?

— Additional comment from padillon@redhat.com on 2022-07-01 14:13:30 UTC —

Sorry. I see the install config is in the attachment. I thought that was only the destroy log.

— Additional comment from padillon@redhat.com on 2022-07-01 14:28:10 UTC —

Marking this as blocker+. It looks like https://github.com/openshift/installer/pull/5976 introduced a regression when destroying disks. We should have a PR to fix up today.

— Additional comment from eparis@redhat.com on 2022-07-01 15:00:11 UTC —

This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+

— Additional comment from padillon@redhat.com on 2022-07-01 17:56:37 UTC —

For QE: This error would occur after installing and provisioning PV.

— Additional comment from aos-team-art-private@redhat.com on 2022-07-05 04:27:36 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from jmencak@redhat.com on 2022-07-07 08:38:59 UTC —

I still see the same issue with the latest nightly 4.11.0-0.nightly-2022-07-06-145812. Is the fix included there?

— Additional comment from padillon@redhat.com on 2022-07-07 12:51:25 UTC —

We didn't cherry-pick this fix into 4.11 so it is not in the nightlies. You should be able to check it against a master build. We will cherry-pick to 4.11 now.

$ oc adm release extract --tools registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-06-145812
$ tar -xvf openshift-install-linux-4.11.0-0.nightly-2022-07-06-145812.tar.gz
README.md
openshift-install
$ ./openshift-install version
./openshift-install 4.11.0-0.nightly-2022-07-06-145812
built from commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a
release image registry.ci.openshift.org/ocp/release@sha256:616c5fefa87d116dd2440c75d9832c462078d635ed155c8d6cd486dd09540184
release architecture amd64

$ git show b2e7be726e400022e71ef3b8bd01a2093e53bc5a
commit b2e7be726e400022e71ef3b8bd01a2093e53bc5a (upstream/release-4.11)
Merge: 6daed68b9 2426260d5
Author: openshift-ci[bot] <75433959+openshift-ci[bot]@users.noreply.github.com>
Date: Thu Jun 30 22:11:44 2022 +0000

Merge pull request #6060 from mike-nguyen/dnm_411_test

Bug 2093126: bump RHCOS 4.11 boot image metadata
 

Description of problem:

Insights Operator keeps restarting on OpenShift 4.10.18 in a disconnected environment

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10.18

How reproducible:

Install OpenShift 4.10.18 in a disconnected environment

Steps to Reproduce:

1. Download the global cluster pull secret to your local file system.
2. In a text editor, edit the .dockerconfigjson file that was downloaded.
3. Remove the cloud.openshift.com JSON entry (see the sketch below).
4. Observe that the Insights Operator keeps restarting.
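As a rough illustration of steps 1–3, the pull-secret edit can also be done non-interactively. This is a hedged sketch (the file names are arbitrary and it assumes `jq` is available), not necessarily the exact procedure the reporter used:

```
# Extract the current global pull secret to a local file.
oc get secret/pull-secret -n openshift-config \
  --template='{{index .data ".dockerconfigjson" | base64decode}}' > pull-secret.json

# Remove the cloud.openshift.com auth entry.
jq 'del(.auths["cloud.openshift.com"])' pull-secret.json > pull-secret-edited.json

# Push the edited pull secret back to the cluster.
oc set data secret/pull-secret -n openshift-config \
  --from-file=.dockerconfigjson=pull-secret-edited.json
```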

Actual results:

The Insights Operator keeps restarting

Expected results:

The Insights Operator should be stable

Additional info:

Disconnected Environment 

 

Tracker bug for bootimage bump in 4.10. This bug should block bugs which need a bootimage bump to fix.

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-7510. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7373. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.

This is particularly pronounced in SNO topologies, leading to installation failures. 

The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.

Description of problem:
The startup script for the ovnkube-master container starts an ovn-nbctl daemon that logs to the ovn-nbctl.log file [1]. However, apart from tailing it, nothing is done with that file. Concretely, it is never rotated.
Recently, one 4.8 cluster hit a situation where this file grew to ~24 GB. It was necessary to manually remove the file and restart ovnkube-master.
Version-Release number of selected component (if applicable):
4.8. It seems that 4.10 can hit this problem as well [2], but the latest master branch has changed so much that I am not sure whether it can still happen in 4.11.
How reproducible:
Always (but may take very long and requires no ovnkube-master container restart)
Steps to Reproduce:
1. Let /run/ovn/ovn-nbctl.log grow for long enough
Actual results:
File growing without control
Expected results:
Log file to be eventually rotated.
References:
[1] - https://github.com/openshift/cluster-network-operator/blob/release-4.8/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L813
[2] - https://github.com/openshift/cluster-network-operator/blob/release-4.10/bindata/network/ovn-kubernetes/ovnkube-master.yaml#L750
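Until proper rotation is in place, a possible manual mitigation is to truncate the file in place and restart the pod. This is only a sketch; the pod and container names below are assumptions based on the manifests referenced above:

```
# Pick the ovnkube-master pod whose /run/ovn/ovn-nbctl.log has grown (placeholder name).
POD=ovnkube-master-xxxxx

# Truncate the log in place so the running daemon keeps a valid file descriptor.
oc -n openshift-ovn-kubernetes exec "$POD" -c ovnkube-master -- \
  sh -c ': > /run/ovn/ovn-nbctl.log'

# Restart the pod so ovn-nbctl and its tail start from a clean file.
oc -n openshift-ovn-kubernetes delete pod "$POD"
```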

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7590. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7458. The following is the description of the original issue:

Description of problem:

- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times (see the sketch after this list) so that the upgrade is considered successful.
- This fix is temporary. After some time the issue appears again and the Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which are upgraded to 4.10.41.
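For reference, the temporary workaround mentioned above amounts to something like the following. This is a sketch only; the deployment name and namespace are assumptions based on a default user-workload-monitoring setup:

```
# Scale the user-workload Prometheus Operator down ...
oc -n openshift-user-workload-monitoring scale deployment prometheus-operator --replicas=0

# ... wait for thanos-ruler-user-workload to settle, then scale the operator back up.
oc -n openshift-user-workload-monitoring scale deployment prometheus-operator --replicas=1
```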

Version-Release number of selected component (if applicable):

 

How reproducible:

N/A, I wasn't able to reproduce the issue.

Steps to Reproduce:

 

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

All Nodes overview in console are showing "Something went wrong"

Version-Release number of selected component (if applicable):

4.10

How reproducible:

I have tested in a test lab and I am able to see node overviews.

Steps to Reproduce:

1.
2.
3.

Actual results:

In the console, under the Compute tab, the nodes overview shows the error "Oh no! Something went wrong".

Expected results:

It should show the node overview

Additional info:

It is happening for all users with admin roles 

Description of problem:

The openshift template cakephp-mysql-persistent has a typo in the apiVersion of its DeploymentConfig object.
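One way to check which apiVersion the shipped template emits for its DeploymentConfig is sketched below. This is only a verification aid; it assumes the sample template is installed in the openshift namespace and that `jq` is available:

```
# The DeploymentConfig object is expected to report apps.openshift.io/v1.
oc get template cakephp-mysql-persistent -n openshift -o json \
  | jq '.objects[] | select(.kind == "DeploymentConfig") | .apiVersion'
```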

Version-Release number of selected component (if applicable):

4.10

How reproducible:

This bug was resolved for OCP 4.11. Now there is a requirement to backport the fix to OCP 4.10.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Install failed if specify sts endpoint in the install-config.yaml file:
platform:
  aws:
    region: us-east-2
    serviceEndpoints:
    - name: sts
      url: https://sts.us-east-2.amazonaws.com

Errors:

level=error msg=Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region. 
level=error msg=  status code: 403, request id: f4e877fe-9e90-4cba-a455-2538d489a8d0
level=error
level=error msg=  on ../../../../../../../../tmp/openshift-install-cluster-1552617948/main.tf line 11, in provider "aws":
level=error msg=  11: provider "aws" {
level=error
level=error
level=error msg=Failed to read tfstate: open /tmp/openshift-install-cluster-1552617948/terraform.cluster.tfstate: no such file or directory

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

* Always

Steps to Reproduce:

1. Create an install-config.yaml, and specify sts endpoint 
2. Create an STS cluster 

Actual results:

Error: error configuring Terraform AWS Provider: error validating provider credentials: error calling sts:GetCallerIdentity: SignatureDoesNotMatch: Credential should be scoped to a valid region. 

Expected results:

No errors, create cluster successfully

Additional info:

No such issues on 4.8, 4.9, 4.11, 4.12

+++ This bug was initially created as a clone of Bug #2117811 +++

Description of problem:
We are currently unable to merge any pull requests to fix CVEs because of the use of the xmlstarlet command-line utility, which is not currently packaged for or available in RHEL.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Another bug will be opened to reenable this code once the xmlstarlet package is available in RHEL or we find an alternative fix.

This is a clone of issue OCPBUGS-11612. The following is the description of the original issue:

This is a clone of issue OCPBUGS-9955. The following is the description of the original issue:

Description of problem:

OCP cluster installation (SNO) using assisted installer running on ACM hub cluster. 
Hub cluster is OCP 4.10.33
ACM is 2.5.4

When a cluster fails to install we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining. 

To resolve the hang, we can remove the finalizers by editing both the secret pointed to by BareMetalHost.spec.bmc.credentialsName and the BareMetalHost CR. When these finalizers are removed, the namespace termination completes within a few seconds.
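A minimal sketch of that workaround, with placeholder names for the stuck namespace, BareMetalHost, and BMC secret:

```
# Clear the finalizers on the BMC secret and on the BareMetalHost CR.
oc -n <cluster-ns> patch secret <bmc-secret-name> --type=merge -p '{"metadata":{"finalizers":null}}'
oc -n <cluster-ns> patch baremetalhost <bmh-name> --type=merge -p '{"metadata":{"finalizers":null}}'

# The namespace termination should then complete within a few seconds.
oc get ns <cluster-ns>
```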

Version-Release number of selected component (if applicable):

OCP 4.10.33
ACM 2.5.4

How reproducible:

Always

Steps to Reproduce:

1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
  a. Invalid rootDeviceHint in BareMetalHost CR
  b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace

Actual results:

Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE    
cnfocto1   Terminating   17h

Expected results:

Cluster namespace (and all installation CRs in it) are successfully removed.

Additional info:

The installation CRs are applied to and removed from the hub cluster using argocd. The CRs have the following waves applied to them which affects the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2

 

Description of problem:

Unit-tests flaking on 4.10 PRs 

Version-Release number of selected component (if applicable):

 

How reproducible:

Sometimes

Steps to Reproduce:

1.
2.
3.

Actual results:

Unit-test job fails with the following error: 

[Fail] OVN Pod Operations during execution [It] should not deallocate in-use and previously freed completed pods IP 
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/pods_test.go:560
 Ran 153 of 153 Specs in 215.549 seconds
 FAIL! -- 152 Passed | 1 Failed | 0 Pending | 0 Skipped
 --- FAIL: TestClusterNode (216.05s)
 FAIL	github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn	228.843s
  {"component":"entrypoint","error":"wrapped process failed: exit status 2","file":"k8s.io/test-infra/prow/entrypoint/run.go:79","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-12-17T0

Expected results:

All tests pass

Additional info:

 

This bug is a backport clone of [Bugzilla Bug 2072040](https://bugzilla.redhat.com/show_bug.cgi?id=2072040). The following is the description of the original bug:

Description of problem:

configure-ovs resets the network configuration on boot, and while doing so waits for all devices to become unmanaged in NetworkManager. Currently the patch port between br-int and br-ex created by ovn-kubernetes is managed by NetworkManager, never becomes unmanaged, and causes an accumulated delay of 2 minutes on boot.

This patch port should never be managed by NetworkManager.
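As an illustration only (not the shipped fix), NetworkManager can be told to leave such interfaces unmanaged via a keyfile drop-in; the file path and match pattern here are assumptions:

```
# Mark OVS/OVN patch ports as unmanaged so configure-ovs does not wait on them.
cat <<'EOF' > /etc/NetworkManager/conf.d/99-ignore-ovn-patch-ports.conf
[keyfile]
unmanaged-devices=interface-name:patch-*
EOF

# Have NetworkManager re-read its configuration.
systemctl reload NetworkManager
```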

This is a clone of issue OCPBUGS-5258. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:

Description of problem:

It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.

Version-Release number of selected component (if applicable):

Openshift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

 

When using cluster scaling while querying a URL from a pod on the side, and running some custom watches on endpoints and nodes, we observed the following.

When the nodes scale down, for a few seconds before an event marks the node as Not Ready and before the dns-default endpoint is removed from the endpoints list, a DNS query can fail.

We wrote a simple watcher (see below for details) to log this and got the following events:

DNS lookup failure:

Tue Oct 18 12:33:23 UTC 2022 - Lookup success                                                                                                                                                                                                                                                                                                      
Tue Oct 18 12:33:28 UTC 2022 - DNS failure                                                                                                                                                                                                                                                                                                         
Tue Oct 18 12:33:41 UTC 2022 - Lookup success 

The node was not yet Not Ready and the endpoint was still in the list of endpoints at that time (ntrdy indicates a NotReadyEndpoint):

2022-10-18 12:33:21.712180649 +0000 UTC m=+1047.610174444 - ip-10-0-137-100.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:33:39.11806612 +0000 UTC m=+1065.016059955 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.525574893 +0000 UTC m=+1065.423568712 - dns-default rdy: 10.128.0.2 rdy: 10.128.10.4 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.131.8.4
2022-10-18 12:33:39.526424974 +0000 UTC m=+1065.424418833 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.130.8.4 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:33:39.528532869 +0000 UTC m=+1065.426526744 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.729859144 +0000 UTC m=+1065.627852917 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:39.936928994 +0000 UTC m=+1065.834922825 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.749587947 +0000 UTC m=+1070.647581767 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:44.952196646 +0000 UTC m=+1070.850190469 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.130.8.4 ntrdy: 10.131.8.4
2022-10-18 12:33:44.954865089 +0000 UTC m=+1070.852858965 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:45.159460169 +0000 UTC m=+1071.057454007 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.641412229 +0000 UTC m=+1074.539406059 - ip-10-0-188-175.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:48.846438064 +0000 UTC m=+1074.744431900 - ip-10-0-129-193.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:33:54.068542745 +0000 UTC m=+1079.966536563 - ip-10-0-150-205.ec2.internal -  MemoryPressure - Unknown, DiskPressure - Unknown, PIDPressure - Unknown, Ready - Unknown,
2022-10-18 12:34:31.752294563 +0000 UTC m=+1117.650288381 - ip-10-0-253-198.ec2.internal -  MemoryPressure - False, DiskPressure - False, PIDPressure - False, Ready - True,
2022-10-18 12:34:39.531848219 +0000 UTC m=+1125.429842032 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4 ntrdy: 10.131.8.4
2022-10-18 12:34:39.736866622 +0000 UTC m=+1125.634860439 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3 ntrdy: 10.128.10.4
2022-10-18 12:34:39.941934912 +0000 UTC m=+1125.839928742 - dns-default rdy: 10.128.0.2 rdy: 10.128.2.5 rdy: 10.129.0.2 rdy: 10.130.0.16 rdy: 10.131.0.3

So we can observe that the node goes into 'Unknown' at 12:33:39, and the endpoint goes into Not Ready soon after.
Not sure if this is a logic problem of draining a node or an issue with the autoscaler at this point in time, but it fixes itself at the next lookup 5 seconds later.

Detailed breakdown of how this was reproduced:
1. A cluster with autoscaling enabled is required.
2. Deploy a daemonset that attempts to use DNS / HTTP in a loop, e.g. the following Daemonset was used to test this:


apiVersion: apps/v1
kind: DaemonSet
metadata: 
  name: dns-tester
  labels: 
    app: dns-tester
spec: 
  selector: 
    matchLabels: 
      app: dns-tester
  template: 
    metadata: 
      labels: 
        app: dns-tester
    spec: 
      containers: 
      - name: dns-tester
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:72f2f7e906c321da6d6a00ce610780e8766e8432f7c553c5d03492f65fe5416c
        command: ["/bin/sh", "-c"]
        args: ['while true; do CURL=$(curl redhat.com 2>&1); if [[ "$CURL" == *"not resolve"* ]]; then echo `date` - "DNS failure"; else echo `date` - "Lookup success"; fi;  sleep 5; done']
        resources: 
          limits: 
            cpu: 100m
            memory: 200Mi 

3. Run the following go program against the same cluster (this is what watches the node and endpoint events for the dns-default endpoints):

package main

import (
	"context"
	"fmt"
	"path/filepath"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	_ "k8s.io/client-go/plugin/pkg/client/auth"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

const dnsNamespace = "openshift-dns"
const dnsEndpoint = "dns-default"

func nodeWatch(clientset *kubernetes.Clientset, waitGroup sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()

	var nodes = clientset.CoreV1().Nodes()
	watcher, err := nodes.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()

	for {
		event := <-ch
		node, ok := event.Object.(*corev1.Node)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to nodes")
		}
		fmt.Printf("%v - %s - ", time.Now(), node.Name)
		for _, condition := range node.Status.Conditions {
			fmt.Printf(" %v - %v,", condition.Type, condition.Status)
		}
		fmt.Println()
	}
}

func dnsWatch(clientset *kubernetes.Clientset, waitGroup sync.WaitGroup) {
	ctx := context.Background()
	defer waitGroup.Done()
	var api = clientset.CoreV1().Endpoints(dnsNamespace)
	watcher, err := api.Watch(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	ch := watcher.ResultChan()
	for {
		event := <-ch
		endpoints, ok := event.Object.(*corev1.Endpoints)
		if !ok {
			fmt.Printf("%v", event)
			panic("Could not cast to Endpoint")
		}
		fmt.Printf("%v - %v", time.Now(), endpoints.ObjectMeta.Name)
		for _, endpoint := range endpoints.Subsets {
			for _, address := range endpoint.Addresses {
				fmt.Printf(" rdy: %v", address.IP)
			}
			for _, address := range endpoint.NotReadyAddresses {
				fmt.Printf(" ntrdy: %v", address.IP)
			}
		}
		fmt.Println()
	}
}

func main() {
	// AUTHENTICATE
	var home = homedir.HomeDir()
	var kubeconfig = filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	wg := sync.WaitGroup{}
	wg.Add(2)
	go dnsWatch(clientset, wg)
	go nodeWatch(clientset, wg)
	wg.Wait()
}

4. Create simulated pressure on the nodes to force a scaleup - e.g. use the following deployment:

apiVersion: apps/v1
kind: Deployment
metadata: 
  name: resource-eater
spec: 
  replicas: 4
  selector: 
    matchLabels: 
      app: resource-eater
  template: 
    metadata: 
      labels: 
        app: resource-eater
    spec: 
      containers: 
      - name: resource-eater
        image: busybox:latest
        command: ["/bin/sh", "-c"]
        args: ["sleep 3600"]
        resources: 
          requests: 
            memory: "8Gi"
            cpu: "1000m"
      affinity: 
        podAntiAffinity: 
          requiredDuringSchedulingIgnoredDuringExecution: 
          - labelSelector: 
              matchExpressions: 
              - key: app
                operator: In
                values: 
                - store
            topologyKey: "kubernetes.io/hostname" 

5. Wait for the scale up to happen.
6. Delete the deployment that created the node pressure, so the scale down can happen (this can easily take 15 minutes).
7. Observe the events in the watcher program and the logs for the daemonset - this should show the same behavior as detailed above.

We found a few logged bugs that seemed related to this issue affecting clusters on 4.8 through 4.10. Those bugs are as follows:

https://issues.redhat.com/browse/OCPBUGS-647
https://issues.redhat.com/browse/OCPBUGS-488
https://bugzilla.redhat.com/show_bug.cgi?id=2061244

Using the above-mentioned steps, we have been able to reliably reproduce the issue of DNS failures during autoscale-down in 4.10 clusters.

This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:

Description of problem:

https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Version-Release number of selected component (if applicable): 4.*

How reproducible: always

Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for api server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status
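For step 1, the profile can be set with a merge patch and the rollout watched via the cluster operator; a sketch based on the documentation linked above:

```
# Switch the cluster-wide audit profile.
oc patch apiserver cluster --type=merge -p '{"spec":{"audit":{"profile":"WriteRequestBodies"}}}'

# Wait until the kube-apiserver operator finishes rolling the change out.
oc get clusteroperator kube-apiserver -w
```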

Actual results:

Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.

Expected results:

Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config

Additional info:

This is a clone of issue OCPBUGS-1410. The following is the description of the original issue:

Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.

Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry

Deploy git workload with devfile from topology page: A-04-TC01

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/11726/pull-ci-openshift-console-master-e2e-gcp-console/1547046629775773696/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/cypress/screenshots/e2e/add-flow-ci.feature/Deploy%20git%20workload%20with%20devfile%20from%20topology%20page%20A-04-TC01%20--%20after%20each%20hook%20(failed).png

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/11768/pull-ci-openshift-console-master-e2e-gcp-console/1547061730478133248

Version-Release number of selected component (if applicable):

How reproducible: Intermittent

Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01

Actual results:

Expected results: Show always pass

Additional info:

Description of problem:
Users on a fully-disconnected cluster could not see Devfiles in the developer catalog or import a Devfile. That's fine.

But the API calls /api/devfile/samples/ and /api/devfile/ take 30 seconds until they fail with a 504 Gateway Timeout error.

If possible they should fail immediately.

Version-Release number of selected component (if applicable):
This might happen since 4.8

Tested this yet only on 4.12.0-0.nightly-2022-09-07-112008

How reproducible:
Always

Steps to Reproduce:

  1. Start a disconnected cluster with a proxy
  2. Open the browser network inspector and filter for /api/devfile
  3. Switch to Developer perspective
    1. Navigate to Add > Developer Catalog (All Services) > Devfiles
    2. Or Add > Import from Git > and enter https://github.com/devfile-samples/devfile-sample-go-basic.git

Actual results:

  • Network call fails after 30 seconds
  • Import doesn't work

Expected results:

  • Network calls should fail immediately
  • We don't expect the import to work

Additional info:
The console Pod log contains this error:

E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This bug is a backport clone of [Bugzilla Bug 1983056](https://bugzilla.redhat.com/show_bug.cgi?id=1983056). The following is the description of the original bug:

Description of problem:

During the upgrade from 4.5.40 to 4.6.31 the CNI is restarting because it is unable to plug the provided VIF, as it is already being used by another Pod.

2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service [-] Error when processing addNetwork request. CNI Params:

{'CNI_IFNAME': 'eth0', 'CNI_NETNS': '/var/run/netns/0420f2a3-d2fe-40e6-86f0-9a38a17c933a', 'CNI_PATH': '/opt/multus/bin:/var/lib/cni/bin:/usr/libexec/cni', 'CNI_COMMAND': 'ADD', 'CNI_CONTAINERID': '73eee9240ae6bcfec8b539fa2b12c8e82f51f8a95f29aaaedc95e4e05f7cb734', 'CNI_ARGS': 'IgnoreUnknown=true;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_NAME=prometheus-k8s-0;K8S_POD_INFRA_CONTAINER_ID=73eee9240ae6bcfec8b539fa2b12c8e82f51f8a95f29aaaedc95e4e05f7cb734'}

: pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service Traceback (most recent call last):
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/daemon/service.py", line 82, in add
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service vif = self.plugin.add(params)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 75, in add
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service vifs = self._do_work(params, b_base.connect, timeout)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/plugins/k8s_cni_registry.py", line 184, in _do_work
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service container_id=params.CNI_CONTAINERID)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/base.py", line 156, in connect
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service driver.connect(vif, ifname, netns, container_id)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/kuryr_kubernetes/cni/binding/nested.py", line 126, in connect
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service iface.net_ns_fd = utils.convert_netns(netns)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/transactional.py", line 209, in _exit_
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service self.commit()
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 650, in commit
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service raise newif
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/ipdb/interfaces.py", line 589, in commit
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service self.nl.link('add', **request)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/iproute/linux.py", line 1163, in link
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service msg_flags=msg_flags)
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 373, in nlm_request
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service return tuple(self._genlm_request(*argv, **kwarg))
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 864, in nlm_request
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service callback=callback):
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 376, in get
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service return tuple(self._genlm_get(*argv, **kwarg))
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service File "/usr/lib/python3.6/site-packages/pyroute2/netlink/nlsocket.py", line 701, in get
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service raise msg['header']['error']
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')
2021-07-16 10:55:02.580 232 ERROR kuryr_kubernetes.cni.daemon.service
2021-07-16 10:55:02.585 232 INFO werkzeug [-] 127.0.0.1 - - [16/Jul/2021 10:55:02] "POST /addNetwork HTTP/1.1" 500 -
2021-07-16 10:55:02.656 251 INFO os_vif [-] Successfully unplugged vif VIFVlanNested(active=True,address=fa:16:3e:c1:cd:25,has_traffic_filtering=False,id=88bdb7f9-65e6-4c54-83d1-73341876da08,network=Network(cc5c0761-5f89-42b8-a4fc-0d829eba818d),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap88bdb7f9-65',vlan_id=2482)

The prometheus Pod is configured to use the same IP as the alertmanager Pod, and the alertmanager Pod is using an IP different from the one specified in its annotation:

[stack@undercloud-0 ~]$ oc get po prometheus-k8s-0 -n openshift-monitoring -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
openshift.io/scc: anyuid
openstack.org/kuryr-pod-label: '

{"app": "prometheus", "controller-revision-hash": "prometheus-k8s-5949f47544", "prometheus": "k8s", "statefulset.kubernetes.io/pod-name": "prometheus-k8s-0"}

'
openstack.org/kuryr-vif: '{"versioned_object.changes": ["default_vif"], "versioned_object.data":
{"additional_vifs": {}, "default_vif": {"versioned_object.changes": ["has_traffic_filtering",
"plugin", "active", "vif_name", "preserve_on_delete", "network", "id", "address",
"vlan_id"], "versioned_object.data": {"active": true, "address": "fa:16:3e:c1:cd:25",
"has_traffic_filtering": false, "id": "88bdb7f9-65e6-4c54-83d1-73341876da08",
"network": {"versioned_object.changes": ["mtu", "multi_host", "subnets", "label",
"id", "should_provide_bridge", "should_provide_vlan"], "versioned_object.data":
{"id": "cc5c0761-5f89-42b8-a4fc-0d829eba818d", "label": "ns/openshift-monitoring-net",
"mtu": 1442, "multi_host": false, "should_provide_bridge": false, "should_provide_vlan":
false, "subnets": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["ips", "gateway", "routes", "cidr",
"dns"], "versioned_object.data": {"cidr": "10.128.8.0/23", "dns": [], "gateway":
"10.128.8.1", "ips": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["address"], "versioned_object.data":

{"address": "10.128.9.175"}

, "versioned_object.name": "FixedIP", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}]}, "versioned_object.name": "FixedIPList",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"},
"routes": {"versioned_object.changes": ["objects"], "versioned_object.data":

{"objects": []}

, "versioned_object.name": "RouteList", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}}, "versioned_object.name": "Subnet",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"}]},
"versioned_object.name": "SubnetList", "versioned_object.namespace": "os_vif",
"versioned_object.version": "1.0"}}, "versioned_object.name": "Network", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.1"}, "plugin": "noop", "preserve_on_delete":
false, "vif_name": "tap88bdb7f9-65", "vlan_id": 2482}, "versioned_object.name":
"VIFVlanNested", "versioned_object.namespace": "os_vif", "versioned_object.version":
"1.0"}}, "versioned_object.name": "PodState", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}'
creationTimestamp: "2021-07-15T12:24:52Z"
generateName: prometheus-k8s-
labels:
app: prometheus
controller-revision-hash: prometheus-k8s-5949f47544
prometheus: k8s
statefulset.kubernetes.io/pod-name: prometheus-k8s-0
name: prometheus-k8s-0
namespace: openshift-monitoring
ownerReferences:

  • apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: prometheus-k8s
    uid: 08334f30-2552-499b-9245-e4f61fe92a76
    resourceVersion: "112100"
    selfLink: /api/v1/namespaces/openshift-monitoring/pods/prometheus-k8s-0
    uid: 1087ca8f-9f00-486a-8471-60956e9c27a4

[stack@undercloud-0 ~]$ oc get po -A -o wide |grep 10.128.9.175
openshift-monitoring alertmanager-main-2 5/5 Running 0 22h 10.128.9.175 ostest-f57bt-worker-vprrk <none> <none>
[stack@undercloud-0 ~]$ oc get po alertmanager-main-2 -n openshift-monitoring -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "kuryr",
"interface": "eth0",
"ips": [
"10.128.9.175"
],
"mac": "fa:16:3e:c1:cd:25",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "kuryr",
"interface": "eth0",
"ips": [
"10.128.9.175"
],
"mac": "fa:16:3e:c1:cd:25",
"default": true,
"dns": {}
}]
openshift.io/scc: anyuid
openstack.org/kuryr-pod-label: '

{"alertmanager": "main", "app": "alertmanager", "controller-revision-hash": "alertmanager-main-5548759bbd", "statefulset.kubernetes.io/pod-name": "alertmanager-main-2"}

'
openstack.org/kuryr-vif: '{"versioned_object.changes": ["default_vif"], "versioned_object.data":
{"additional_vifs": {}, "default_vif": {"versioned_object.changes": ["active",
"has_traffic_filtering", "network", "address", "id", "preserve_on_delete", "vlan_id",
"plugin", "vif_name"], "versioned_object.data": {"active": true, "address":
"fa:16:3e:77:a3:12", "has_traffic_filtering": false, "id": "f6dd52db-40e1-4339-a7e6-1e2bd2f6f772",
"network": {"versioned_object.changes": ["multi_host", "label", "should_provide_vlan",
"should_provide_bridge", "mtu", "id", "subnets"], "versioned_object.data": {"id":
"cc5c0761-5f89-42b8-a4fc-0d829eba818d", "label": "ns/openshift-monitoring-net",
"mtu": 1442, "multi_host": false, "should_provide_bridge": false, "should_provide_vlan":
false, "subnets": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["routes", "dns", "cidr", "gateway",
"ips"], "versioned_object.data": {"cidr": "10.128.8.0/23", "dns": [], "gateway":
"10.128.8.1", "ips": {"versioned_object.changes": ["objects"], "versioned_object.data":
{"objects": [{"versioned_object.changes": ["address"], "versioned_object.data":

{"address": "10.128.9.238"}

, "versioned_object.name": "FixedIP", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}]}, "versioned_object.name": "FixedIPList",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"},
"routes": {"versioned_object.changes": ["objects"], "versioned_object.data":

{"objects": []}

, "versioned_object.name": "RouteList", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}}, "versioned_object.name": "Subnet",
"versioned_object.namespace": "os_vif", "versioned_object.version": "1.0"}]},
"versioned_object.name": "SubnetList", "versioned_object.namespace": "os_vif",
"versioned_object.version": "1.0"}}, "versioned_object.name": "Network", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.1"}, "plugin": "noop", "preserve_on_delete":
false, "vif_name": "tapf6dd52db-40", "vlan_id": 3914}, "versioned_object.name":
"VIFVlanNested", "versioned_object.namespace": "os_vif", "versioned_object.version":
"1.0"}}, "versioned_object.name": "PodState", "versioned_object.namespace":
"os_vif", "versioned_object.version": "1.0"}'
creationTimestamp: "2021-07-15T12:23:41Z"
generateName: alertmanager-main-
labels:
alertmanager: main
app: alertmanager
controller-revision-hash: alertmanager-main-5548759bbd
statefulset.kubernetes.io/pod-name: alertmanager-main-2
name: alertmanager-main-2
namespace: openshift-monitoring

(shiftstack) [stack@undercloud-0 ~]$ openstack port list |grep 10.128.9.175

88bdb7f9-65e6-4c54-83d1-73341876da08   fa:16:3e:c1:cd:25 ip_address='10.128.9.175', subnet_id='a4ee6044-8ddd-4dbf-bcd3-22f95ec4ce16' ACTIVE

(shiftstack) [stack@undercloud-0 ~]$ openstack port list |grep 10.128.9.238

f6dd52db-40e1-4339-a7e6-1e2bd2f6f772   fa:16:3e:77:a3:12 ip_address='10.128.9.238', subnet_id='a4ee6044-8ddd-4dbf-bcd3-22f95ec4ce16' ACTIVE

(shiftstack) [stack@undercloud-0 ~]$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.6.31 True False False 22h
cloud-credential 4.6.31 True False False 26h
cluster-autoscaler 4.6.31 True False False 25h
config-operator 4.6.31 True False False 25h
console 4.6.31 True False False 22h
csi-snapshot-controller 4.6.31 True False False 25h
dns 4.5.40 True False False 25h
etcd 4.6.31 True False False 25h
image-registry 4.6.31 True False False 25h
ingress 4.6.31 True False False 22h
insights 4.6.31 True False False 25h
kube-apiserver 4.6.31 True False False 25h
kube-controller-manager 4.6.31 True False False 25h
kube-scheduler 4.6.31 True False False 25h
kube-storage-version-migrator 4.6.31 True False False 25h
machine-api 4.6.31 True False False 25h
machine-approver 4.6.31 True False False 25h
machine-config 4.5.40 True False False 23h
marketplace 4.6.31 True False False 22h
monitoring 4.5.40 False True True 22h
network 4.5.40 True True False 25h
node-tuning 4.6.31 True False False 22h
openshift-apiserver 4.6.31 True False False 25h
openshift-controller-manager 4.6.31 True False False 22h
openshift-samples 4.6.31 True False False 22h
operator-lifecycle-manager 4.6.31 True False False 25h
operator-lifecycle-manager-catalog 4.6.31 True False False 25h
operator-lifecycle-manager-packageserver 4.6.31 True False False 22h
service-ca 4.6.31 True False False 25h
storage 4.6.31 True False False 22h

(shiftstack) [stack@undercloud-0 ~]$ oc get po -A -o wide |grep 10.128.9.238 |wc -l
0

The same issue would be possible on 3.11 as it's also based on Annotations.

Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform release 16.1.6 GA
How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Description of problem:

When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a custom mcp:
~~~
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: custom
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/custom: "" 
EOF
~~~
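For comparison with the step above, associating at least one node with the pool makes it report real progress (a sketch; the node name is a placeholder):

```
# Label a node so it matches the pool's nodeSelector.
oc label node <node-name> node-role.kubernetes.io/custom=""

# The pool should now count at least one machine instead of sitting at 0%.
oc get mcp custom -o jsonpath='{.status.machineCount}{"\n"}'
```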

Actual results:

The mcp is visible from the "Administrator view > Cluster Settings > Details" page at 0% progress.

Expected results:

It shouldn't be stuck at 0% 

Additional info:

 

 

Currently, Telemeter is not equipped with a configurable request limit for the receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It is using the default limit defined in the code base; however, it seems this limit might not be suitable for our usage.

As a part of this ticket, we should:

1) Understand what the appropriate request-size limit is for our use cases.

2) Make the limit configurable in Telemeter via a flag.

3) Deploy the changes, initially to the staging environment, to enable our team to test it.

Description of problem:
Previously we had a bug opened for "Reduce buildah log level for default build log level [NEEDINFO]":

https://bugzilla.redhat.com/show_bug.cgi?id=1996883

We suggested that the customer use secrets; however, the customer confirmed that they are already using secrets as per:
https://docs.openshift.com/container-platform/4.8/cicd/builds/creating-build-inputs.html#builds-input-secrets-configmaps_creating-build-inputs

But they are not able to use the "--quiet" build argument since they are using an OpenShift S2I config:

strategy:
  type: Source
  sourceStrategy:
    from:
      kind: ImageStreamTag
      namespace: xxxxxxx
      name: 's2i-xxxxxx-xxxxxxx:v1.0.0'

Actual results:
Under this setting, the secret, as well as every other OpenShift secret, is printed.

Expected results:
Sensitive information (ENV) should not appear in build logs

Additional info:
Maybe pass the --quiet option via the BuildConfig for S2I?
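For reference, one existing knob for build log verbosity is the documented BUILD_LOGLEVEL environment variable on the build strategy. This sketch (placeholder BuildConfig name, and it assumes no env list exists yet on the strategy) is not presented as the fix this bug asks for:

```
# Set BUILD_LOGLEVEL=0 on the source strategy of a BuildConfig.
oc patch bc/<buildconfig-name> --type=json -p '[
  {"op": "add", "path": "/spec/strategy/sourceStrategy/env",
   "value": [{"name": "BUILD_LOGLEVEL", "value": "0"}]}
]'
```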

— Additional comment from taxu@redhat.com on 2022-07-01 03:42:46 UTC —

Hi Build team,

I see that the target release for this bug is 4.11.0

Please let us know where to add the LOG_LEVEL for S2I and/or how to pass the "--quiet" option once the fix is deployed.

Kind regards,

Tao Xu

— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 06:44:30 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from rdlugyhe@redhat.com on 2022-07-26 20:34:06 UTC —

Please approve the updated Doc Text.

— Additional comment from cdaley@redhat.com on 2022-07-26 20:49:05 UTC —

Approved

— Additional comment from jitsingh@redhat.com on 2022-07-27 04:48:18 UTC —

verified

— Additional comment from oarribas@redhat.com on 2022-08-15 12:11:23 UTC —

@Corey, I can see this BZ now verified for 4.12. Can this be backported to previous OCP versions?

— Additional comment from cdaley@redhat.com on 2022-08-15 13:57:02 UTC —

What version are you interested in getting it backported to?

— Additional comment from oarribas@redhat.com on 2022-08-15 15:46:10 UTC —

@Corey, I can see only 4.10 and 4.11 as "Full Support" in [1], so at least to those versions.

Description of problem:

We observed that a dual-stack cluster deployed with the AI GUI only fails.
This cluster uses DHCP for IPv4 and RA/RS autoconfiguration for IPv6.

It fails with an error in the ovnkube container:

```
I0906 07:45:43.044090   87450 gateway_init.go:261] Initializing Gateway Functionality
I0906 07:45:43.046398   87450 gateway_localnet.go:152] Node local addresses initialized to: map[10.131.31.214:{10.131.31.208 fffffff0} 10.255.0.2:{10.255.0.0 fffffe00} 127.0.0.1:{127.0.0.0 ff000000} 2001:1b74:480:613a:f6e9:d4ff:fef1:6f26:{2001:1b74:480:613a:: ffffffffffffffff0000000000000000} ::1:{::1 ffffffffffffffffffffffffffffffff} fd01:0:0:1::2:{fd01:0:0:1:: ffffffffffffffff0000000000000000} fe80::8ce9:b4ff:fe1a:1208:{fe80:: ffffffffffffffff0000000000000000} fe80::c8ef:ecff:fee3:64c7:{fe80:: ffffffffffffffff0000000000000000} fe80::f6e9:d4ff:fef1:6f26:{fe80:: ffffffffffffffff0000000000000000}]
I0906 07:45:43.047759   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
I0906 07:45:43.048045   87450 helper_linux.go:97] Found default gateway interface br-ex 10.131.31.209
I0906 07:45:43.048152   87450 helper_linux.go:71] Provided gateway interface "br-ex", found as index: 7
F0906 07:45:43.048318   87450 ovnkube.go:133] failed to get default gateway interface
```

On the node we observed that there is a multi-path default route entry:

```
default proto ra metric 48 pref medium
        nexthop via fe80::e2f6:2d01:ab14:ec71 dev br-ex weight 1
        nexthop via fe80::e2f6:2d01:ab11:c271 dev br-ex weight 1
```

I manually removed one of the entries (`ip route delete`) and then deleted the ovnkube-node pod. The installation then continued and the container works.

Every time there are multiple entries when ovnkube-node starts, it fails.
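A sketch of the manual workaround described above; the nexthop address is the one from this report and is only an example, and the exact `ip route` form may vary with how the kernel installed the multipath route:

```
# Inspect the IPv6 default route(s) on the node.
ip -6 route show default

# Remove one of the two nexthop entries, then delete the failing ovnkube-node pod
# so it retries with a single default gateway.
ip -6 route del default via fe80::e2f6:2d01:ab11:c271 dev br-ex
```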


Version-Release number of selected component (if applicable):

4.10.30

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

There might be a side issue: the node's interface takes time upon boot to get the IPv6 autoconfiguration; no RS packets seemed to be sent out (we observed zero on all routers).

This is a clone of issue OCPBUGS-1972. The following is the description of the original issue:

Description of problem:

The metal3 Pod of openshift-machine-api is in CrashLoopBackOff status.

Version-Release number of selected component (if applicable):

4.10.31

How reproducible:

Always reproduce in IPv6

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   - provisioningNetwork: disabled
   - machine network: only IPv6
   - disconnected installation
   
3. deploy the cluster 

Actual results:

It is not possible to add worker nodes because metal3 does not start normally.

Expected results:

metal3 starts normally in IPv6 environment

Additional info:

1. attached must-gather
https://drive.google.com/file/d/1GKxj3syROIMnURx_PYzOYhJdEuXBNXVW/view?usp=sharing

2. pod status
[kni@prov ~]$ oc get pod
NAME                                           READY   STATUS             RESTARTS          AGE
cluster-autoscaler-operator-6656bfd7b9-bt4j8   2/2     Running            0                 35h
cluster-baremetal-operator-6bbdd6758-rmxgq     2/2     Running            0                 35h
machine-api-controllers-55fb545b56-kl5sj       7/7     Running            0                 34h
machine-api-operator-845b6cf855-q7gdd          2/2     Running            0                 35h
metal3-574876cfdb-98fmz                        6/7     CrashLoopBackOff   179 (3m17s ago)   14h
metal3-image-cache-5mq2w                       1/1     Running            0                 14h
metal3-image-cache-nftpj                       1/1     Running            0                 14h
metal3-image-cache-t7whh                       1/1     Running            0                 14h
metal3-image-customization-68d4d6d99b-dbqgn    1/1     Running            0                 15h

[kni@prov ~]$ oc logs metal3-574876cfdb-98fmz -c metal3-httpd
AH00526: Syntax error on line 8 of /etc/httpd/conf.d/vmedia.conf:
The port number "2001:feed:101::102" is outside the appropriate range (i.e., 1..65535).

Description of problem:

[ovn] [ocp 4.10.z] Service `spec.externalTrafficPolicy` does not trigger rules update in ovnkube-node pod handlers on edit, even though it does successfully update the rules if deployed explicitly with that spec value set, or if you delete the handler pods for ovn (forces a refresh).

Version-Release number of selected component (if applicable):

observed in 4.10.32 and 4.10.40, tested on azure platform.

How reproducible:

every time

Steps to Reproduce:

1. Deploy a test pod with curlable resource in a test namespace

2. create a service from yaml exposing pod at internal clusterIP (example yaml provided by customer below)
~~~ 
apiVersion: v1
kind: Service
metadata:
  labels:
    run: test
  name: test
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: paas1
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster ##MODIFY THIS SPEC VALUE AND OBSERVE FAIL CONDITION
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    run: test
  sessionAffinity: None
  type: LoadBalancer
~~~

3. curl against service succeeds
4. edit the service to change `spec.externalTrafficPolicy: Local`
5. observe the externalIP does not change, but the healthz port updates
6. curls against the same externalIP:port time out indefinitely, with no response.

//workaround:

delete the service and redeploy it with the spec already set to `Local`, or delete the ovnkube-node pod serving the pod(s) to force a refresh of the local ruleset and allow traffic (subsequent curls succeed).
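For convenience, the step-4 edit can also be applied non-interactively; this is just a sketch using the Service from the example YAML above (the "test" namespace is an assumption):

~~~
# Apply the step-4 change to the example Service (namespace is an assumption):
oc -n test patch service test --type merge -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# Then repeat the step-3 curl against the same externalIP:port to observe the timeout.
~~~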

Actual results:

spec change appears to update properly in the database but does not send a notification to update the ovnkube-node pod handlers (or similar) to allow traffic through once the externalTrafficPolicy spec value is changed.

Expected results:

the spec change to the Service YAML should be reflected immediately in the DB AND the ovnkube-node handlers should be updated to match.

Additional info:

Attachments available and case number with specifics in next internal comment. 

Description of problem:

[4.10.z] Fix kubevirt-console tests

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The CMO e2e tests create a bunch of resources. These should be cleaned up on a successful run. However:

  • Some test failures leave the created resources behind, which have to be cleaned up before a re-run.
  • There have been developer reports that even successful runs don't tidy up everything.

In a CI context this is rarely a problem; however, it makes running the tests locally quite awkward, especially for repeated runs on the same cluster.

We should tag all resources created by the e2e tests with a label (app.kubernetes.io/created-by: cmo-e2e-test).
This will allow easy cleanup by deleting all resources with that label and will allow for checking proper clean-up.
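Once everything carries the label, cleanup could be as simple as the sketch below (the exact list of resource kinds created by the tests is an assumption):

~~~
# Delete everything the e2e tests created, assuming it all carries the label:
oc delete deployments,statefulsets,daemonsets,services,serviceaccounts,configmaps,secrets \
  --all-namespaces -l app.kubernetes.io/created-by=cmo-e2e-test
~~~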

DoD:
All e2e resources get properly tagged.
It is straightforward to ensure that future code changes don't skip adding this tag.

Description of problem:

View https://bugzilla.redhat.com/show_bug.cgi?id=2037513

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:
When creating an incomplete ClusterServiceVersion resource, the OLM details page crashes (on 4.11).

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: minimal-csv
  namespace: christoph
spec:
  apiservicedefinitions:
    owned:
      - group: A
        kind: A
        name: A
        version: v1
  customresourcedefinitions:
    owned:
      - kind: B
        name: B
        version: v1
  displayName: My minimal CSV
  install:
    strategy: ''

Version-Release number of selected component (if applicable):
Crashes on 4.8-4.11, works fine from 4.12 onwards.

How reproducible:
Always

Steps to Reproduce:
1. Apply the ClusterServiceVersion YAML from above
2. Open the Admin perspective > Installed Operators > Operator details page

Actual results:
Details page crashes on tab A and B.

Expected results:
Page should not crash

Additional info:
This is a follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=2084287

Description of problem:

Customer is facing an issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY, but it did not help. Note that because the console operator reverts changes quickly, testing this was quite awkward.

+++ This bug was initially created as a clone of Bug #2118717 +++
https://bugzilla.redhat.com/show_bug.cgi?id=2118717

Description of problem:
This BZ is a spin-off of BZ-2114945 so we can track possible issues where new TCP connections from pods fail to be established on the nodes, leaving pods unable to start or causing them to crash.

Version-Release number of selected component (if applicable):
OCP 4.10.24 with OVN-Kubernetes

How reproducible:
Periodically, and so far only at the customer.

— Additional comment from Andre Costa on 2022-08-10 16:30:00 UTC —

There are 3 must-gathers here, gathered during the issues and after the restart of the OVNK masters, which makes all these issues go away and lets pods establish connections immediately.

This must-gather was taken at 11 AM today when they received a report from one of the customers:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/3f820e22-3d1e-4c01-b56d-e479a6cc5578

Customer reported the issue again and this time we also got a sosreport and an inspect of the project.
In the pod they get errors like this (the same we saw on the call with them last week, where it seems no TCP connection entries are created at all; at first we thought it was DNS, but even with IPs used directly there were issues like this):
-----------------
mx-toni-dev toni-dev-build 0/1 Error 0 18m 10.195.80.253 demchdc5vvx <none> <none>

[z0003rbj-z07@stuart ~]$ oc logs toni-dev-build
time="2022-08-10T10:59:02Z" level=info msg="Start building app with registry type openshift"
time="2022-08-10T10:59:02Z" level=info msg="Adding ssl certificate /etc/ssl/certs/ca-bundle.crt"
time="2022-08-10T10:59:02Z" level=info msg="Certificate /etc/ssl/certs/ca-bundle.crt has been added successfully"
time="2022-08-10T10:59:02Z" level=info msg="Updating docker config with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Docker config has been updated with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Downloading MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042"
time="2022-08-10T10:59:32Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout"
-----------------------

This keeps happening if they continue to run the builds, which they did, creating the must-gather and sosreport:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/d10d9646-5437-4806-b34f-3acab4d83266

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/1e9e0575-66b0-4184-a3d4-e84fa245c9a2

And as we have seen so far, restarting the ovnk-master pods makes these connections work immediately again:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/e03ee9c6-2ce4-40d3-a484-84da61607de5

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/652ebe47-b71f-4d63-8499-cf989b3ebfc7

— Additional comment from Andre Costa on 2022-08-10 16:30:43 UTC —

— Additional comment from Tim Rozet on 2022-08-10 22:36:06 UTC —

Thanks for the must-gathers. From what Flavio and I saw while examining them, there is definitely a bug here in ovn-kube. The toni-dev-build pod is deleted/recreated multiple times, and during this time it moves to different nodes. However, due to a bug in OVNK, the existing port is updated with the new IP address and information as if it were moving to the new node, but it stays on the previous logical switch. For example, this is what happens:

1. The pod is originally assigned to node demchdc6zax. This node's cluster subnet is 10.195.79.0/24:

2022-08-09T09:22:57.078688727+00:00 stderr F I0809 09:22:57.078632 2239319 cni.go:248] [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1] ADD finished CNI request [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1], result "{\"interfaces\":[{\"name\":\"b1a4fb0be20ff71\",\"mac\":\"a6:68:38:ad:66:c8\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:4f:52\",\"sandbox\":\"/var/run/netns/e354d2d5-83cb-406f-a2d9-c5f3e786bae4\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.79.82/24\",\"gateway\":\"10.195.79.1\"}],\"dns\":{}}", err <nil>

2. Over time this pod is completed, deleted, recreated many times. Until eventually it lands on demchdc5vvx the next day:
2022-08-10T08:43:58.428759111Z I0810 08:43:58.428719 1837017 cni.go:248] [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a] ADD finished CNI request [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a], result "{\"interfaces\":[{\"name\":\"5d4f195cbda5269\",\"mac\":\"a2:22:6c:1d:43:c1\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:50:1e\",\"sandbox\":\"/var/run/netns/bae8f77a-b368-4b3a-86dc-df925330fa26\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.80.30/24\",\"gateway\":\"10.195.80.1\"}],\"dns\":{}}", err <nil>

3. Although it lands on a new node, in OVNK we update the old port (somehow the old port is not being removed) that is attached to the old switch:
[root@fedora ~]# ovn-nbctl list logical_switch_port c345cc07-8a89-4e70-beff-d8d9f4dac46a
_uuid : c345cc07-8a89-4e70-beff-d8d9f4dac46a
addresses : ["0a:58:0a:c3:50:1e 10.195.80.30"]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {namespace=mx-toni-dev, pod="true"}
ha_chassis_group : []
name : mx-toni-dev_toni-dev-build
options : {iface-id-ver="655cc043-53d5-4550-9c0a-74095a0fbde4", requested-chassis=demchdc5vvx}

parent_name : []
port_security : ["0a:58:0a:c3:50:1e 10.195.80.30"]
tag : []
tag_request : []
type : ""
up : false

[root@fedora ~]# ovn-nbctl lsp-list demchdc6zax | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
c345cc07-8a89-4e70-beff-d8d9f4dac46a (mx-toni-dev_toni-dev-build)
[root@fedora ~]# ovn-nbctl lsp-list demchdc5vvx | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
[root@fedora ~]#

This will cause the pod not to be able to send any traffic as its IP is in the wrong subnet for this switch.

4. Additionally the default node SNAT for this pod is in the right place:
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc6zax | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5vvx | grep 10.195.80.30
snat 139.25.144.25 10.195.80.30

5. But there is no egress IP reroute or SNAT entry for this pod:
Egress IP:
status:
items:

  • egressIP: 139.25.144.72
    node: demchdc5z6x
    name: egress-mx-toni-dev

[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 139.25.144.72
snat 139.25.144.72 10.195.77.156
snat 139.25.144.72 10.195.80.40
snat 139.25.144.72 10.195.76.65
snat 139.25.144.72 10.195.76.184
snat 139.25.144.72 10.195.80.42
snat 139.25.144.72 10.195.80.156

6. We see in the ovnkube-master logs that ovnk attempts to delete this pod, but it fails because we try to delete a logical switch port that is still bound to the wrong logical switch:
2022-08-10T09:07:32.033303027Z I0810 09:07:32.033270 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.80.30]}}] Timeout:<nil> Where:[where column _uuid == {4fd75aab-222c-4f17-9407-3764f9b6760b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:c345cc07-8a89-4e70-beff-d8d9f4dac46a}]}}] Timeout:<nil> Where:[where column _uuid == {f5073eaa-3f72-4ec2-94c3-3744d412864a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c345cc07-8a89-4e70-beff-d8d9f4dac46a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"

2022-08-10T09:07:32.037857343Z I0810 09:07:32.037163 1 pods_retry.go:57] [655cc043-53d5-4550-9c0a-74095a0fbde4/mx-toni-dev/toni-dev-build] teardown retry failed; will try again later

From the engineering side we need to try to reproduce this. Flavio and I think this toni-dev-build must be a stateful set running to completion. We need to investigate further to understand how we ended up trying to add the new pod without first deleting the old. Could be a bug related to completed pods logic or retry logic that was added in 4.10.

— Additional comment from Anurag saxena on 2022-08-11 04:52:37 UTC —

Thanks @trozet@redhat.com for the investigation. @huirwang@redhat.com @jechen@redhat.com, we can probably follow these steps to reproduce the issue internally.

— Additional comment from Andre Costa on 2022-08-11 15:09:25 UTC —

Hi all,

Many thanks for the great work here; the customer is relieved that we found something.

In the meantime, here is a new example that looks to be another case caused by this bug. We saw this live last week, and today the customer saw it again in another project:

- A similar build pod started by their operator:

1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.185]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vux
1h21m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.81.116/24] from ovn-kubernetes
1h21m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h21m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 622.636779ms
1h21m Normal Created pod/ngat-dev-build Created container mendix-build
1h21m Normal Started pod/ngat-dev-build Started container mendix-build
1h20m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.81.116]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {95072525-7d91-4f1f-9ad8-38a4171b978e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h2m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vvx
1h2m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.80.33/24] from ovn-kubernetes
1h2m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h2m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 599.594895ms
1h2m Normal Created pod/ngat-dev-build Created container mendix-build
1h2m Normal Started pod/ngat-dev-build Started container mendix-build
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
28m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
27m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
27m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
27m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 575.907145ms
27m Normal Created pod/ngat-dev-build Created container mendix-build
27m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
26m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
26m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
26m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 606.11715ms
26m Normal Created pod/ngat-dev-build Created container mendix-build
26m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
22m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
22m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
22m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
22m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 534.430683ms
22m Normal Created pod/ngat-dev-build Created container mendix-build
22m Normal Started pod/ngat-dev-build Started container mendix-build
21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)

  • Eventually the pod seems to start but then fails again with the same TCP timeout:

2022-08-11T08:26:24.264204783Z time="2022-08-11T08:26:24Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout"

  • Customer took a sosreport from the node where the pod terminated, in case it is worth checking:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/bd3f86b2-6d2a-44f7-bb0f-eec85630586d?usePresignedUrl=true

— Additional comment from Tim Rozet on 2022-08-11 21:19:12 UTC —

We were able to reproduce the issue locally. There are two potential paths that can cause a stale logical switch port to be re-used on the wrong node:
Scenario 1: pod is created, deleted, and recreated on another node extremely quickly. This plays out like this:
Events:
1. pod toni is created on node A
2. pod toni is deleted on node A
3. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node B
2. pod toni is deleted on node A (fails)

This happens because we grab the latest version of the pod to add in event 1, which by the time we grab it is actually the value from event 3. Then after processing event 1, we move to event 2 and when we delete the pod, we use what was given to us in the event, which is now incorrect to remove the pod from node A, because it actually is on node B. The fix is to ignore the value in the pod spec on deletion, and use what we store internally in our internal port cache. If there is no entry in the cache, we search OVN NBDB to find the right switch (this is an expensive operation so we want to avoid it when possible).

Flavio is working on a fix for this.

Scenario 2: pod is created, runs to completion, is deleted very quickly, and then is recreated on another node:
Events:
1. pod toni is created on node A
2. pod toni runs to completion
3. pod toni immediately is deleted
4. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node A
2. completion causes an update event, ovnk does not delete the pod (bug)
3. deletion event is processed, but since it is a completed pod, we ignore it (since we should have already deleted it in step 2)
4. pod toni is recreated on node B, ovnk sees there is already a switch port for toni on the wrong switch, and just updates that with the new information

This happens when a pod goes to completed, but is deleted very quickly. In step 2 during update event we try to grab the latest version of the pod, but it doesn't exist anymore since it was deleted. In this case we skip the update, instead of tearing down the pod. The delete code in step 3 assumes that if the pod is completed we must have already handled it in update, so the stale port stays around until an add later re-uses it. Patryk already has a fix for this:

https://github.com/ovn-org/ovn-kubernetes/pull/3071

There may still be more egress IP issues to investigate (tracked in other bugs) after fixing this, and we will look into those after fixing these fundamental pod issues.

— Additional comment from Tim Rozet on 2022-08-11 22:19:30 UTC —

Andre, re: comment 5. This looks like the same issue. If you have a must gather we can confirm. I think the sosreport from the worker node is not enough. The referential integrity violation occurs because we attempt to do these operations:
1. remove the logical switch port from the logical switch
2. delete the logical switch port

In this case:
1. we remove the logical switch port from the logical switch, but the switch port actually exists on a different switch (node) - this is a no-op
2. we try to delete the logical switch port - NBDB complains this is a violation and refuses to delete it, because another switch still holds a reference to this object

— Additional comment from Anurag saxena on 2022-08-12 21:16:56 UTC —

@rbrattai@redhat.com Can you help verify this and the backports? Feel free to reassign to someone else if needed while I am on PTO. Thanks!

The static authorizer feature has landed in upstream kube-rbac-proxy. Let's use it by configuring a static authorizer for all requests that hit a /metrics endpoint.
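As a rough sketch of what that could look like, assuming the upstream config-file format for the static authorizer (the field names and the service account below are assumptions, not the final CMO configuration):

~~~
# Hypothetical kube-rbac-proxy config file, passed via --config-file.
# Grants the scraping service account access to /metrics without a SubjectAccessReview.
authorization:
  static:
  - user:
      name: system:serviceaccount:openshift-monitoring:prometheus-k8s
    verb: get
    path: /metrics
    resourceRequest: false
~~~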

DoD:

  • Downstream kube-rbac-proxy is synced.
  • All CMO operands are configured with static authorization.
  • Bugzillas created for all non-monitoring components using kube-rbac-proxy for metrics authn/authz.

This is a clone of issue OCPBUGS-9986. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7445. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:

At some point in the mtu-migration development a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on behalf of it.

But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information while the procedure is still in progress, making it use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small" (blocking the procedure itself) or causing network disruption until the procedure is over.
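A simplified sketch of the flag-file behaviour described above (only the file path comes from this report; the real configure-ovs logic is more involved):

~~~
# Simplified sketch, not the actual configure-ovs code:
if [ -f /etc/cno/mtu-migration/config ]; then
  echo "MTU migration in progress: keep mtu-migration state"
else
  echo "no migration flag file: mtu-migration state gets cleaned up (the problematic path)"
fi
~~~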

However, this was not a problem for the CI job, as it doesn't use the migration procedure as documented, in order to save the limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress, which hides this issue.

This was probably not detected in QE either, for the same reason as in CI.

Description of problem:

  An intra-namespace allow NetworkPolicy doesn't work after applying ingress & egress deny-all NetworkPolicies.

Version-Release number of selected component (if applicable):

  OpenShift 4.10.12

How reproducible:

Always

Steps to Reproduce:
  1. Define a deny-all network policy for egress and ingress in a namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

2. Define the following network policy to allow the traffic between the pods in the namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
  - to:
    - podSelector: {}
  ingress:
  - from:
    - podSelector: {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress 

3. Test the connectivity between two pods from the namespace.
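A minimal way to run the step-3 check; the pod names, namespace, and port here are assumptions:

~~~
# Assumed pod names/port; any two pods in the namespace will do:
oc -n <namespace> exec pod-a -- curl -sS -m 5 http://<pod-b-ip>:8080/
~~~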

Actual results:

   The connectivity is not allowed

Expected results:

  The connectivity should be allowed between pods from the same namespace.

Additional info:

  After performing a test and analyzing SDN flows for the namespace: 

sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376 
 cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
 cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]

We see that the following rule doesn't match because `reg1` hasn't been defined:

 cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30 

 

This is a clone of issue OCPBUGS-1346. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1226. The following is the description of the original issue:

We added server groups for control plane and computes as part of OSASINFRA-2570, except for UPI, which only creates a server group for the control plane.

We need to update the UPI scripts to create a server group for computes, to be consistent with IPI and to have the instructions at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.
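A sketch of the kind of addition the UPI scripts would need; the server group name pattern and policy below mirror what IPI does and are assumptions here:

~~~
# Assumed naming/policy, shown only to illustrate the missing UPI step:
openstack server group create --policy soft-anti-affinity "${INFRA_ID}-worker"
~~~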

Related to OCPCLOUD-1135.

Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988

This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D

Version-Release number of selected component (if applicable):

4.12,4.13

How reproducible:

Every time

Steps to Reproduce:

1. Setup dualstack KinD cluster
2. Create an egress firewall policy with this spec (a full manifest sketch follows the steps):
Spec:
  Egress:
    To:
      Cidr Selector:  0.0.0.0/0
    Type:             Deny
3. create a pod and ping to 1.1.1.1
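A full-manifest sketch of the step-2 spec (the namespace is an assumption; for OVN-Kubernetes the EgressFirewall in a namespace must be named "default"):

~~~
cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: test
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF
~~~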

Actual results:

Egress policy does not block flows to external IP

Expected results:

Egress policy blocks flows to external IP

Additional info:

It seems that mixing IPv4 and IPv6 operands in ACL matches doesn't work.

This is a clone of issue OCPBUGS-10225. The following is the description of the original issue:

Description of problem:
Pipeline Repository (Pipeline-as-code) list never shows an Event type.

Version-Release number of selected component (if applicable):
4.9+

How reproducible:
Always

Steps to Reproduce:

  1. Install Pipelines Operator and setup a Pipeline-as-code repository
  2. Trigger an event and a build

Actual results:
Pipeline Repository list shows a column Event type but no value.

Expected results:
Pipeline Repository list should show the Event type from the matching Pipeline Run.

Similar to the Pipeline Run Details page based on the label.

Additional info:
The list page packages/pipelines-plugin/src/components/repository/list-page/RepositoryRow.tsx renders obj.metadata.namespace as event type.

I believe we should show the Pipeline Run event type instead. packages/pipelines-plugin/src/components/repository/RepositoryLinkList.tsx uses

{plrLabels[RepositoryLabels[RepositoryFields.EVENT_TYPE]]}

to render it.

Also, the Pipeline Repository details page tried to render the Branch and Event type from the Repository resource. My research says these properties don't exist on the Repository resource. The code should be removed from the Repository details page.

Description of problem:

Customer is facing this problem in OCP 4.10.40:
https://github.com/coredns/coredns/issues/5593

 Its root cause seems to be this bug in k8s code:

https://github.com/kubernetes/kubernetes/issues/109115
https://github.com/kubernetes/kubernetes/pull/109137

The issue seems to be fixed in OpenShift 4.11, but the customer can't update at this moment.

Can this fix be backported to OpenShift 4.10?

Version-Release number of selected component (if applicable):

4.10.40

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

 

  • CVE-2022-36882
  • CVE-2022-29047
  • CVE-2022-30945
  • CVE-2022-30946
  • CVE-2022-30948
  • CVE-2022-30952
  • CVE-2022-30953
  • CVE-2022-30954
  • CVE-2022-34174
  • CVE-2022-36883
  • CVE-2022-36884
  • CVE-2022-36885
  • CVE-2022-34177
  • CVE-2022-34176
  • CVE-2022-36881

This bug is a backport clone of [Bugzilla Bug 2056519](https://bugzilla.redhat.com/show_bug.cgi?id=2056519). The following is the description of the original bug:

Version: 4.9

Platform:
azure

Please specify:

  • IPI

What happened?
Issue: Customer reports being unable to install an IPI private OpenShift cluster in Azure. They have an organization policy which does not allow them to create storage accounts with public access; public access should be disallowed.

What did you expect to happen?
Installer completes successfully.

Description of problem:

Sometimes we see VMs fail to power on when they land on a host that does not have enough resources. The current power-on does not retry or leverage DRS to power on the node on a suitable host.

https://github.com/vmware/govmomi/issues/1026

Our code is still making calls to PowerOnVM_Task which, according to the vsphere docs, is deprecated and we should use PowerOnMultiVM_Task instead.

PowerOnVM_Task does not return a DRS ClusterRecommendation, no vmotion nor host power operations will be done as part of a DRS-facilitated power on. To have DRS consider such operations use PowerOnMultiVM_Task.

https://vdc-download.vmware.com/vmwb-repository/dcr-public/b50dcbbf-051d-4204-a3e7-e1b618c1e384/538cf2ec-b34f-4bae-a332-3820ef9e7773/vim.VirtualMachine.html#powerOn:
As of vSphere API 5.1, use of this method with vCenter Server is deprecated; use PowerOnMultiVM_Task instead.

Version-Release number of selected component (if applicable):
4.8.x

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Sometimes power-on fails, requiring manual intervention.

Expected results:
PowerOn should use DRS to ensure it's always successful.

Additional info:

OCPBUGS-1251 landed an admin-ack gate in 4.11.z to help admins prepare for Kubernetes 1.25 API removals which are coming in OpenShift 4.12. Poking around in a 4.12.0-ec.2 cluster where APIRemovedInNextReleaseInUse is firing:

$ oc --as system:admin adm must-gather -- /usr/bin/gather_audit_logs
$ zgrep -h v1beta1/poddisruptionbudget must-gather.local.1378724704026451055/quay*/audit_logs/kube-apiserver/*.log.gz | jq -r '.verb + " " + (.user | .username + " " + (.extra["authentication.kubernetes.io/pod-name"] | tostring))' | sort | uniq -c
parse error: Invalid numeric literal at line 29, column 6
     28 watch system:serviceaccount:openshift-machine-api:cluster-autoscaler ["cluster-autoscaler-default-5cf997b8d6-ptgg7"]

Finding the source for that container:

$ oc --as system:admin -n openshift-machine-api get -o json pod cluster-autoscaler-default-5cf997b8d6-ptgg7 | jq -r '.status.containerStatuses[].image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f81ab7ce0c851ba5e5169bba717cb54716ce5457cbe89d159c97a5c25fd820ed | grep github
             SOURCE_GIT_URL=https://github.com/openshift/kubernetes-autoscaler
             io.openshift.build.commit.url=https://github.com/openshift/kubernetes-autoscaler/commit/1dac0311b9842958ec630273428b74703d51c1c9
             io.openshift.build.source-location=https://github.com/openshift/kubernetes-autoscaler

Poking about in the source:

$ git clone --depth 30 --branch master https://github.com/openshift/kubernetes-autoscaler.git
$ cd kubernetes-autoscaler
$ find . -name vendor
./addon-resizer/vendor
./cluster-autoscaler/vendor
./vertical-pod-autoscaler/e2e/vendor
./vertical-pod-autoscaler/vendor

Lots of vendoring. I haven't checked to see how new the client code is in the various vendor packages. But the main issue seems to be the v1beta1 in:

$ git grep policy cluster-autoscaler/core cluster-autoscaler/utils | grep policy.*v1beta1
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:   policyv1beta1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/actuator_test.go:                                   eviction := createAction.GetObject().(*policyv1beta1.Eviction)
cluster-autoscaler/core/scaledown/actuation/drain.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/actuation/drain_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/legacy.go:     policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/legacy/wrapper.go:    policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/scaledown/scaledown.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/core/static_autoscaler_test.go:      policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain.go:        policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/drain/drain_test.go:   policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: policyv1 "k8s.io/api/policy/v1beta1"
cluster-autoscaler/utils/kubernetes/listers.go: v1policylister "k8s.io/client-go/listers/policy/v1beta1"

The main change from v1beta1 to v1 involves spec.selector; I dunno if that's relevant to the autoscaler use-case or not.

Do we run autoscaler CI? I was poking around a bit, but did not find a 4.12 periodic exercising the autoscaler that might have turned up this alert and issue.

 As mentioned in [1], the cluster monitoring operator doesn't define the relatedObjects field in the ClusterOperator manifest which is initially deployed by CVO [2].
If the CMO pod fails to start, the must-gather might miss information from the monitoring namespace. Note that once CMO runs, it will update the initial ClusterOperator object with the proper information [3].

[1] http://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_06-clusteroperator.yaml
[3] https://github.com/openshift/cluster-monitoring-operator/blob/a6bc9824035ceb8dbfe7c53cf0c138bfb2ec5643/pkg/client/status_reporter.go#L49-L63
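For illustration, a relatedObjects stanza in the initially-deployed ClusterOperator manifest could look like the sketch below; the exact set of objects to list is an assumption, only the field layout follows the ClusterOperator API:

~~~
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: monitoring
status:
  relatedObjects:
  - group: ""
    resource: namespaces
    name: openshift-monitoring
~~~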

This is a clone of issue OCPBUGS-3174. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3117. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3084. The following is the description of the original issue:

Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603

Long log lines get corrupted when using '--timestamps' by the Kubelet.

The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turned on, the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.

https://github.com/kubernetes/kubernetes/blob/f892ab1bd7fd97f1fcc2e296e85fdb8e3e8fb82d/pkg/kubelet/kuberuntime/logs/logs.go#L325

apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'

kubectl logs logs --timestamps

This is a clone of issue OCPBUGS-1523. The following is the description of the original issue:

Description of problem:
In a completely disconnected cluster, the dev catalog takes too much time to load.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A completely disconnected cluster
2. In the Add page, go to the All services page
3.

Actual results:
Takes too much time to load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

Description of problem:

When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart. 

The restart happens because of this panic in the ovnkube-master pods:

2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-08-09-114621

How reproducible:

Steps to Reproduce:
1. Run node-density on a 120 node cluster

Actual results:

Spike observed in pod-latency graph ~22s

Expected results:

Steady pod-latency graph ~4s

Additional info:

This is a clone of issue OCPBUGS-6907. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:

Description of problem:

When the cluster is configured with a proxy, the Swift client in the image registry operator does not use the proxy to authenticate with OpenStack, so it is unable to reach the OpenStack API. This issue became evident since support was recently added to not fall back to Cinder when Swift is available [1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that the platform Prometheus runs two replicas whose data is not replicated between them, so two queries sent at the same time to prometheus-adapter might yield different results, since the underlying PromQL queries executed by prometheus-adapter might land on different Prometheus servers. The consequence is that we end up with inconsistent data across multiple autoscaling requests.

This can be easily tested by running:

$ while true ; do date; oc adm top pod -n openshift-monitoring  prometheus-k8s-0 ; echo; sleep 1 ;done 

Mon Jul 26 03:55:07 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:08 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

Mon Jul 26 03:55:09 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:10 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus, because it will deduplicate the metrics from both instances and serve one consistent result based on the data that it gets from the Prometheuses.

DoD:

  • Use thanos-querier as a backend for prometheus-adapter
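As a sketch, the change mostly amounts to pointing prometheus-adapter's upstream URL at thanos-querier; the flag comes from upstream k8s-prometheus-adapter, and the service host/port below are assumptions:

~~~
# Hypothetical fragment of the prometheus-adapter Deployment args:
spec:
  containers:
  - name: prometheus-adapter
    args:
    - --prometheus-url=https://thanos-querier.openshift-monitoring.svc:9091
~~~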

This is a clone of issue OCPBUGS-10671. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10514. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:

Description of problem:

When changing channels it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails` but when the channel is changed to a 4.11 channel multiple new risks require evaluation and the evaluation of new risks is throttled at one every 10 minutes. This means if there are three new risks it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended because most will not wait 30 minutes, they expect immediate feedback.
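For context, the channel switch and the resulting risk evaluation can be observed from the CLI; this is only a sketch (the --include-not-recommended flag is assumed to be available in the oc client in use):

~~~
# Switch channels, then watch how long it takes for conditional paths to appear:
oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.11"}}'
oc adm upgrade --include-not-recommended
~~~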

Version-Release number of selected component (if applicable):

4.10.z, 4.11.z, 4.12, 4.13

How reproducible:

100% 

Steps to Reproduce:

1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
3. 

Actual results:

Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them

Expected results:

Risks are computed in a timely manner for an interactive UX, lets say < 10s

Additional info:

This was intentional in the design: we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate that we'd have a long-standing pile of risks, or realize how confusing the user experience would be.

We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always`, avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one, it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward a relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.

Description of problem:
We are seeing that a customer's upgrade cannot kick off because availableUpdates is null in the ClusterVersion CR
Version-Release number of the following components:

How reproducible:
sometime
Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:

Description of problem:

When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update.

However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:

Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.

Steps to Reproduce:

1.
2.
3.

Actual results:

Port still exists in OVN, IP remains allocated for a deleted pod.

Expected results:

IP should be freed, port should be removed from OVN.

Additional info:

Similar bug for completed pods with "err: the range is full" fixed in 4.10.21 https://bugzilla.redhat.com/show_bug.cgi?id=2091157

This is a clone of issue OCPBUGS-5078. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5019. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:

Description of problem: This is a follow-up to OCPBUGS-3933.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Acceptance criteria:

  • All tests (including e2e) pass
  • No regressions are introduced
  • openshift/api points to a recent commit on the master branch

This is a clone of issue OCPBUGS-1104. The following is the description of the original issue:

Description of problem:

In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 is upgraded to 4.9, the packageserver stays stuck at v0.17.0, which is the version in OCP 4.8, and v0.18.3, the version in OCP 4.9, does not roll out.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OCP 4.8

2. Upgrade to OCP 4.9 

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2022-08-31-160214   True        True          50m     Working towards 4.9.47: 619 of 738 done (83% complete)

$ oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.47    True        False         4m26s   Cluster version is 4.9.47
 

Actual results:

Check packageserver CSV. It's in v0.17.0 

$ oc get csv
NAME            DISPLAY          VERSION   REPLACES   PHASE
packageserver   Package Server   0.17.0               Succeeded

Expected results:

packageserver CSV is at 0.18.3 

Additional info:

packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12

packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8

 

copy of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2053622

Description of problem:

PodDisruptionBudgetAtLimit Warning alert when CR replica count is zero.

Version-Release number of selected component (if applicable):
4.7

How reproducible: Every time

Steps to Reproduce:
1. oc new-project test

2. oc new-app httpd

3. oc create -f pdb

$ cat pdb.yaml

~~~
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      deployment: httpd
~~~

$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
my-pdb N/A 0 0 3h27m

4. oc scale deployment httpd --replicas=0

5. Wait for some time; the alert will be triggered in the console.

Actual results: unexpected warning alert

Expected results: As we are intentionally scaling the replicas down, it should not generate an alert.

Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.

This is a clone of issue OCPBUGS-7800. The following is the description of the original issue:

This is a clone of issue OCPBUGS-266. The following is the description of the original issue:

Description of problem: I am working with a customer who uses the web console.  From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console.  This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead.  What we really need is a third column in the Project Access tab that says whether a resource is a user or group.

 

Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well

How reproducible: Every time.  My customer is running on ROSA, but I have determined this issue to be general to OpenShift.

Steps to Reproduce:

From the oc cli, I create a group and add a user to it.

$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME                                     USERS
cluster-admins                           
dedicated-admins                         admin
techlead   admin
I create a new namespace so that I can assign a group project level access:

$ oc new-project my-namespace

$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access.  I verified the rolebinding named 'edit' is bound to a group named 'techlead'.

$ oc get rolebinding
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      15m
admin-dedicated-admins                                            ClusterRole/admin                      15m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      15m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   15m
edit                                                              ClusterRole/edit                       2m18s
system:deployers                                                  ClusterRole/system:deployer            15m
system:image-builders                                             ClusterRole/system:image-builder       15m
system:image-pullers                                              ClusterRole/system:image-puller        15m

$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:16:56Z"
  name: edit
  namespace: my-namespace
  resourceVersion: "108357"
  uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: techlead

Now, from the same Project Access tab in the web console, I added the developer with role "View". From this web console, it is unclear whether developer and techlead are users or groups.

Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.

$ oc get rolebinding                                                                      
NAME                                                              ROLE                                   AGE
admin                                                             ClusterRole/admin                      17m
admin-dedicated-admins                                            ClusterRole/admin                      17m
admin-system:serviceaccounts:dedicated-admin                      ClusterRole/admin                      17m
dedicated-admins-project-dedicated-admins                         ClusterRole/dedicated-admins-project   17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin   ClusterRole/dedicated-admins-project   17m
edit                                                              ClusterRole/edit                       4m25s
developer-view-c15b720facbc8deb     ClusterRole/view                       90s
system:deployers                                                  ClusterRole/system:deployer            17m
system:image-builders                                             ClusterRole/system:image-builder       17m
system:image-pullers                                              ClusterRole/system:image-puller        17m
$ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: "2022-08-15T14:19:51Z"
  name: developer-view-c15b720facbc8deb
  namespace: my-namespace
  resourceVersion: "113298"
  uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: developer

So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups.  This is in essence our ask for this RFE.

 

Actual results:

Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them.  Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.

 

Expected results:

Should have the ability to add groups and differentiate between users and groups.  Ideally, we're looking at a third column for user or group.

 

Additional info:

Description of problem:

The self-provisioner role was removed from the system:authenticated:oauth group as per https://access.redhat.com/solutions/4040541 to stop users from being able to create new projects, but the customer has found this is only partially working. In the cluster web UI's Administrator view the "Create Project" button is not available, but after switching to the Developer view a default user can still create a project.

Version-Release number of selected component (if applicable):

 

How reproducible:

Follow https://access.redhat.com/solutions/1529893

Steps to Reproduce:

1. oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
2. log back in as user and switch between admin/Dev view
3. User still has link showing in Dev console

Actual results:

Create new project link still exists

Expected results:

Create new project link should be removed, similar to Admin Console

 

Additional info:

Although the link still exists, the user gets a correct permission denied message.
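
For reference, a hedged sketch of the CLI steps commonly paired with the linked solution to keep the role from being restored (paraphrased from the OpenShift documentation on disabling project self-provisioning, not verified against the KCS article); note this does not by itself remove the Developer-view link that this bug is about:

$ oc patch clusterrolebinding.rbac self-provisioners -p '{"subjects": null}'
$ oc annotate clusterrolebinding.rbac self-provisioners 'rbac.authorization.kubernetes.io/autoupdate=false' --overwrite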

Description of problem:

+++ This bug was initially created as a clone of Bug #2118514 +++

Various CI steps use the upi-installer container for its access to the AWS CLI tools,
among other things. However, most of those steps also
curl yq directly from GitHub. We can save ourselves some headaches
when GitHub is down by embedding the binary in the image already.

Whenever GitHub has issues or throttles us, the yq download fails with a hash mismatch. The mismatch happens because GitHub is probably returning an error page, although our scripts hide it.
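
A hedged sketch of what baking the binary into the image amounts to at build time; the release version and checksum below are placeholders, not the exact pin used by the upi-installer image:

$ curl -sSfL -o /usr/local/bin/yq "https://github.com/mikefarah/yq/releases/download/v<PINNED_VERSION>/yq_linux_amd64"
$ echo "<EXPECTED_SHA256>  /usr/local/bin/yq" | sha256sum -c -
$ chmod +x /usr/local/bin/yq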

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-4137. The following is the description of the original issue:

This is a clone of issue OCPBUGS-3824. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:

Description of problem:

Liveness probes of the ipsec pods fail on large clusters. Currently the command executed in the ipsec container is
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with the "ipsec/status" subcommand. In clusters with a high node count this command returns a list covering all node daemons in the cluster, so as the node count rises, the completion time of the command rises too.

This makes the main command

ovs-appctl -t ovs-monitor-ipsec

hang until the subcommand is finished.

As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container (here: https://github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml), the 60-second liveness timeout of the container probe starts to be insufficient as the node count grows. This resulted in a cluster with 170+ nodes having 15+ ipsec pods in a CrashLoopBackOff state.

Version-Release number of selected component (if applicable):

Openshift Container Platform 4.10 but i think the same will be visible to other versions too.

How reproducible:

Not reproduced locally: an extremely high amount of resources is needed, and there is little point since the root cause has already been identified.

Steps to Reproduce:

1. Install an Openshift cluster with IPSEC enabled
2. Scale to 170+ nodes or more
3. Notice that the ipsec pods will start getting in a Crashloopbackoff state with failed Liveness/Readiness probes.

Actual results:

IPsec pods are stuck in a CrashLoopBackOff state

Expected results:

IPsec pods work normally

Additional info:

As a workaround, the CVO and CNO operators were scaled to 0 replicas so the liveness probe timeout could be raised to 600 seconds, which recovered the cluster.
As a next step the customer will try to reduce the node count, restore the default liveness timeout value, and bring the operators back to see if the cluster stabilizes.
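
For illustration, a hedged sketch of the kind of probe stanza that was tuned as part of the workaround; the command matches the one quoted above, but the timing values are illustrative and the real stanza (reconciled by CNO) lives in the linked ipsec.yaml manifest:

~~~
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
  periodSeconds: 60
  timeoutSeconds: 600   # raised from the default as a temporary workaround
~~~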

 

This is a clone of issue OCPBUGS-4579. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4504. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:

Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:

  "scheduling": {
    "automaticRestart": false,
    "onHostMaintenance": "MIGRATE",
    "preemptible": false,
    "provisioningModel": "STANDARD"
  },

From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we document:

If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".

But the implementing code does not actually float the setting. Seems like a regression here, which is part of 4.10:

$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api

But that's not where the 4.9 and earlier code is located:

$ git branch -a | grep origin/release
  remotes/origin/release-4.10
  remotes/origin/release-4.11
  remotes/origin/release-4.12
  remotes/origin/release-4.13

Hunting for 4.9 code:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
  gcp-machine-controllers                        https://github.com/openshift/cluster-api-provider-gcp                       c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
  gcp-pd-csi-driver                              https://github.com/openshift/gcp-pd-csi-driver                              48d49f7f9ef96a7a42a789e3304ead53f266f475
  gcp-pd-csi-driver-operator                     https://github.com/openshift/gcp-pd-csi-driver-operator                     d8a891de5ae9cf552d7d012ebe61c2abd395386e

So looking there:

$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9  | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json:        "automaticRestart": {

Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
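
A minimal sketch, not the provider's actual reconcile code, of floating a restart policy through to the GCP scheduling block, assuming the google.golang.org/api/compute/v1 types; the restartPolicy parameter is a hypothetical stand-in for whatever field the provider spec exposes:

~~~
package main

import (
	compute "google.golang.org/api/compute/v1"
)

// schedulingFor maps a provider-level restart policy onto the GCP API field.
// An empty policy leaves AutomaticRestart nil so GCP applies its own default
// (currently true), instead of silently pinning it to false.
func schedulingFor(restartPolicy string) *compute.Scheduling {
	s := &compute.Scheduling{OnHostMaintenance: "MIGRATE"}
	switch restartPolicy {
	case "Always":
		t := true
		s.AutomaticRestart = &t
	case "Never":
		f := false
		s.AutomaticRestart = &f
	}
	return s
}
~~~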

This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:

Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. load topology
  2. if it loads successfully, keep trying  until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

This is a clone of issue OCPBUGS-5926. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4486. The following is the description of the original issue:

This is a clone of issue OCPBUGS-95. The following is the description of the original issue:

In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.

 

The bug is 100% reproducible.

Steps for reproducing the issue are:

1. Install a cluster with OpenShiftSDN network plugin.

2. Configure egressip for a project.

3. Install NMstate operator.

4. Create a NodeNetworkConfigurationPolicy.

5. Identify on which node the egressIP is present.

6. Restart the nmstate-handler pod running on the identified node.

7. Verify that the egressIP is no more present.

Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.

This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.

The expectation is that NMstate operator should not interfere with SDN configuration.
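
For step 4, a hedged example of a simple NodeNetworkConfigurationPolicy (interface name and settings are illustrative, not taken from the customer environment; the apiVersion may be nmstate.io/v1beta1 on older operator releases):

~~~
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: worker-eth1-policy   # illustrative name
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: eth1
      type: ethernet
      state: up
      ipv4:
        dhcp: true
        enabled: true
~~~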

Description of problem:

Building the installer image started to fail on ppc64le

Version-Release number of selected component (if applicable):

4.10.0

How reproducible:

Always on the brew builders

Steps to Reproduce:

1.
2.
3.

Actual results:

2022-11-14 11:55:00,819 - atomic_reactor.tasks.binary_container_build - INFO - + go build -mod=vendor -ldflags ' -X github.com/openshift/installer/pkg/version.Raw=v4.10.0 -X github.com/openshift/installer/pkg/version.Commit=4e8922c14379dce4845f362ac3a83ff80f1dc655 -X github.com/openshift/installer/pkg/version.defaultArch=ppc64le -s -w' -tags ' release' -o bin/openshift-install ./cmd/openshift-install
2022-11-14 11:58:01,556 - atomic_reactor.tasks.binary_container_build - INFO - # github.com/openshift/installer/cmd/openshift-install
2022-11-14 11:58:01,556 - atomic_reactor.tasks.binary_container_build - INFO - github.com/aliyun/alibaba-cloud-sdk-go/services/cms.(*Client).DescribeMonitoringAgentHostsWithChan.func1: direct call too far: runtime.duffzero+1f0-tramp0 -20000e8

Expected results:

Successful build

Additional info:

This could potentially be a golang issue. After some preliminary investigation, we tried setting `-linkmode=external` in the ldflags which seems to allow the build to finish.
So while we investigate the real cause of the issue, changing the link mode will serve as a workaround to unblock the build pipeline.
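
A hedged sketch of the workaround invocation, i.e. the same build command from the log above with the external link mode added to the ldflags:

$ go build -mod=vendor -ldflags '-linkmode=external -X github.com/openshift/installer/pkg/version.Raw=v4.10.0 -X github.com/openshift/installer/pkg/version.Commit=4e8922c14379dce4845f362ac3a83ff80f1dc655 -X github.com/openshift/installer/pkg/version.defaultArch=ppc64le -s -w' -tags 'release' -o bin/openshift-install ./cmd/openshift-install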

Description of problem:

The ironic-conductor container log shows that it obtains the shared lock on the iDRAC, but the installation does not proceed past that point.

Provision Node
 ㄴ bootstrap
    ㄴ ironic-conductor    --- pending --> iDRAC8 (Baremetal master1,2,3)

Version-Release number of selected component (if applicable):

4.10.31

How reproducible:

1) baremetal inventory
   - Dell R630 firmware 2.75.75
   - iDRAC8

2) nodes
provisioning node, master 1,2,3

3) network : static IP(no DHCP) and bond/vlan
             disconnected installation

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   
3. deploy the cluster

Actual results:

Ironic establishes a session with the iDRAC, but the installation does not proceed.

Expected results:

Ironic mounts the ISO on the baremetal host via the iDRAC and CoreOS boots.

Additional info:

## curl <idrac api>
{
    "error": {
        "@Message.ExtendedInfo": [
            {
                "Message": "The request failed due to an internal service error.  The service is still operational.",
                "MessageArgs": [],
                "MessageArgs@odata.count": 0,
                "MessageId": "Base.1.2.InternalError",
                "RelatedProperties": [],
                "RelatedProperties@odata.count": 0,
                "Resolution": "Resubmit the request.  If the problem persists, consider resetting the service.",
                "Severity": "Critical"
            }
        ],
        "code": "Base.1.2.GeneralError",
        "message": "A general error has occurred. See ExtendedInfo for more information"
    }
}

Description of problem:

See https://bugzilla.redhat.com/show_bug.cgi?id=2104275

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-7885. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7617. The following is the description of the original issue:

Description of problem:

Azure Disk volume is taking time to attach/detach
Version-Release number of selected component (if applicable):

Openshift ARO 4.10.30
How reproducible:

While scaling a StatefulSet down and up, the pods take a long time to attach and detach volumes from the nodes.

Reviewed the must-gather and test output; I will share my findings in the comments.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Description of problem:
During a fresh installation on a BareMetal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the "alertmanagers" are not in a ready state (5/6).

Logs from the alertmanager:

level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"

alertmanager-main pods are failing to start due to a startup probe timeout; it seems related to BZ 2037073.
We tried to manually increase the startup probe timers, but it was not possible.

Version-Release number of selected component (if applicable):
OCP 4.10.10

How reproducible:
OCP IPI baremetal install on HPE ProLiant BL460c Gen10; the customer tried several times to redeploy, always with the same outcome.

Actual results:
CMO is not being deployed

Expected results:
CMO deploys without errors

Additional info:

  • CU is deploying OCP 4.10 IPI on a baremetal disconnected cluster
  • cluster is 3 nodes with masters schedulable
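
For context, a hedged, generic sketch of a startup probe stanza of the kind being tuned on the alertmanager container (values are illustrative and not the exact stanza the Cluster Monitoring Operator renders; CMO reconciles the statefulset, so manual edits like this do not persist):

~~~
startupProbe:
  httpGet:
    path: /-/ready
    port: web
  periodSeconds: 10
  failureThreshold: 40   # raising this was the attempted workaround
~~~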

 

This is a clone of issue OCPBUGS-7830. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7729. The following is the description of the original issue:

Description of problem:

Etcd's liveness probe should be removed.

Version-Release number of selected component (if applicable):

4.11

Additional info:

When the master hosts hit CPU load, the etcd liveness probes can fail, causing a cascading restart loop for etcd and kube-apiserver. Because of this loop, load on the masters stays high as the API server and controllers restart over and over again.

There is no reason for etcd to have a liveness probe; we removed this probe in 3.11 due to issues like this.

This bug is a backport clone of [Bugzilla Bug 2075091](https://bugzilla.redhat.com/show_bug.cgi?id=2075091). The following is the description of the original bug:

Symptom Detection.Undiagnosed panic detected in pod

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Symptom%20Detection.Undiagnosed%20panic%20detected%20in%20pod

This problem seemed existing before. But number of cases surged and caused two nightly payloads to be rejected:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-150057
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.11.0-0.nightly/release/4.11.0-0.nightly-2022-04-12-185124

After that, it mysteriously disappeared.

Here is a specific case:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968

Message from the test case:

{ pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log.gz:E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)}

Panic trace from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1513895844351315968/artifacts/e2e-azure-ovn-upgrade/gather-extra/artifacts/pods/openshift-monitoring_kube-state-metrics-67c5b7c7c6-88vxn_kube-state-metrics_previous.log:

E0412 15:52:33.358619 1 runtime.go:78] Observed a panic: runtime.boundsError{x:4, y:4, signed:true, code:0x0} (runtime error: index out of range [4] with length 4)
goroutine 77 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1741840, 0xc000b635f0})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x7d
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ac9740})
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1741840, 0xc000b635f0})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
k8s.io/kube-state-metrics/v2/internal/store.createPodContainerInfoFamilyGenerator.func1(0xc003422c00)
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:134 +0x375
k8s.io/kube-state-metrics/v2/internal/store.wrapPodFunc.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/internal/store/pod.go:1386 +0x5a
k8s.io/kube-state-metrics/v2/pkg/metric_generator.(*FamilyGenerator).Generate(...)
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:67
k8s.io/kube-state-metrics/v2/pkg/metric_generator.ComposeMetricGenFuncs.func1({0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc0000c13c0, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:72 +0xd4
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0xc003422c00, {0x1804880, 0xc003422c00})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:87 +0x25
k8s.io/client-go/tools/cache.(*Reflector).watchHandler(0xc000192fc0, {0x0, 0x0, 0x26cdee0}, {0x1a373f8, 0xc0011c24c0}, 0xc000623d60, 0xc0005ff380, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:506 +0xa55
k8s.io/client-go/tools/cache.(*Reflector).ListAndWatch(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:429 +0x696
k8s.io/client-go/tools/cache.(*Reflector).Run.func1()
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:221 +0x26
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02ffada1d0)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00036a2c0, {0x1a1daa0, 0xc000386e60}, 0x1, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/client-go/tools/cache.(*Reflector).Run(0xc000192fc0, 0xc0002cc480)
	/go/src/k8s.io/kube-state-metrics/vendor/k8s.io/client-go/tools/cache/reflector.go:220 +0x1f8
created by k8s.io/kube-state-metrics/v2/internal/store.(*Builder).startReflector
	/go/src/k8s.io/kube-state-metrics/internal/store/builder.go:508 +0x2c8
panic: runtime error: index out of range [4] with length 4 [recovered]
	panic: runtime error: index out of range [4] with length 4

It points to https://github.com/openshift/kube-state-metrics/blob/6efa87f858ee53028fd2de40941b61c09e9ee049/internal/store/pod.go#L134 where the len of p.Status.ContainerStatuses and p.Spec.Containers seems to diverge.

Unfortunately the condition is ephemeral and the condition that caused the panic does not exist in the must-gather data.

The ask is to safeguard the code to avoid the panic and log useful debugging info to track down offenders.
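
A minimal sketch, assuming the k8s.io/api/core/v1 types, of the kind of bounds guard being asked for (not the actual kube-state-metrics patch):

~~~
package main

import (
	"log"

	corev1 "k8s.io/api/core/v1"
)

// containerMetrics walks the spec containers and only dereferences a status
// entry when one exists at that index, logging the mismatch instead of panicking.
func containerMetrics(p *corev1.Pod) {
	for i, c := range p.Spec.Containers {
		if i >= len(p.Status.ContainerStatuses) {
			log.Printf("pod %s/%s: container %q has no status yet (statuses=%d, containers=%d)",
				p.Namespace, p.Name, c.Name, len(p.Status.ContainerStatuses), len(p.Spec.Containers))
			continue
		}
		cs := p.Status.ContainerStatuses[i]
		_ = cs // generate the metric from c and cs here
	}
}
~~~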

This is a clone of issue OCPBUGS-2508. The following is the description of the original issue:

Description of problem:

Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"

Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-10-14-023020

How reproducible:

Always

Steps to Reproduce:

1. Install 4.10 within provider networks (in primary or secondary interface)

Actual results:

Installation failure:
4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out

Expected results:

Successful installation

Additional info:

Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.

 

This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:

The vSphere CSI cloud.conf lists the single datacenter from the platform workspace config, but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918) there may be more than one datacenter.

This issue results in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:

0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found  

The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.

$ oc get cm vsphere-csi-config -o yaml  -n openshift-cluster-csi-drivers | grep datacenters
    datacenters = "IBMCloud" 
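
For comparison, a hedged sketch of what a multi-datacenter aware config would need to carry; the vCenter hostname is a placeholder and the datacenter names are taken from the example above:

~~~
[Global]
cluster-id = "rbost-zonal-ghxp2"

[VirtualCenter "vcenter.example.com"]
datacenters = "IBMCloud, datacenter-2"
~~~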

 

console-operator codebase contains a lot of inline manifests. Instead we should put those manifests into a `/bindata` folder, from which they will be read and then updated per purpose.
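
A minimal sketch, assuming Go 1.16+ embed (the asset path and helper name are illustrative), of serving manifests from a bindata-style folder instead of inline string literals:

~~~
package bindata

import (
	"embed"
	"fmt"
)

//go:embed bindata
var assets embed.FS

// MustAsset returns the raw manifest bytes for a path under bindata/,
// panicking if it is missing, mirroring the usual generated-bindata contract.
func MustAsset(name string) []byte {
	b, err := assets.ReadFile(name)
	if err != nil {
		panic(fmt.Sprintf("missing bindata asset %q: %v", name, err))
	}
	return b
}
~~~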

Description of problem:

Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Jenkins and Plugin versions need to be updated to mitigate pending CVEs

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

kube_daemonset_updated_number_scheduled got renamed to kube_daemonset_status_updated_number_scheduled in 4.9, but the definition of KubeDaemonSetRolloutStuck wasn't updated until 4.11: https://github.com/openshift/cluster-monitoring-operator/commit/35ffa690cec23d6a708aafea36a2d8b77f8a8556

Version-Release number of selected component (if applicable):

How reproducible:

Always

We suggest backporting the fix at least to 4.10, as the stale expression affects both customers' and our ability to detect and troubleshoot certain issues, from the single-cluster view as well as from fleet-wide longer-term trends.
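
A hedged, simplified sketch of the renamed metric in an alert expression; this is not the exact upstream KubeDaemonSetRolloutStuck rule, which carries additional clauses:

~~~
- alert: KubeDaemonSetRolloutStuck
  expr: |
    kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}
      !=
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
  for: 30m
  labels:
    severity: warning
~~~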

This is a clone of issue OCPBUGS-3116. The following is the description of the original issue:

Description of problem:
-----------------------
On a dual-stack baremetal IPI cluster, the following error message is present in the ovnkube logs:

oc logs -n openshift-ovn-kubernetes ovnkube-node-rvggh -c ovnkube-node
...

E0810 02:12:46.343460 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:16.347603 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:13:46.351108 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:16.355047 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:14:46.358950 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:15:13.313945 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Service total 9 items received
E0810 02:15:16.362737 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:15:46.366490 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:16:16.369963 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:16:24.306561 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Endpoints total 560 items received
E0810 02:16:46.373482 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:16.377497 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:17:46.380726 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:15.325871 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 50 items received
E0810 02:18:16.384732 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
I0810 02:18:38.299738 353971 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 9 items received
E0810 02:18:46.388162 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1
E0810 02:19:16.391669 353971 node_linux.go:593] Failed to dump flows for flow sync, stderr: "ovs-ofctl: br-ext is not a bridge or a socket\n", error: exit status 1

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
OCP-4.10.26

ovn-2021-21.12.0-58.el8fdp.x86_64
ovn-2021-host-21.12.0-58.el8fdp.x86_64
ovn-2021-central-21.12.0-58.el8fdp.x86_64
ovn-2021-vtep-21.12.0-58.el8fdp.x86_64

How reproducible:
-----------------
so far spotted on 2 different clusters

Steps to Reproduce:
-------------------
1. Deploy a dual-stack baremetal IPI cluster with the OVNKubernetes hybrid overlay network (add the following to the cluster config before running the install):

defaultNetwork:
  type: OVNKubernetes
  ovnKubernetesConfig:
    hybridOverlayConfig:
      hybridClusterNetwork: []

Actual results:
---------------
Error message in logs

Expected results:
-----------------
No error message in logs

Additional info:
----------------
Baremetal dualstack setup with 3 masters and 4 workers, bonding configured for baremetal network on masters and workers

This is a clone of issue OCPBUGS-5876. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5761. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5458. The following is the description of the original issue:

reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479

Description of problem:

Hey guys, I have a openshift cluster that was upgraded to version 4.9.58 from version 4.8. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping. and it gives the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-7409. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios; this is known and can be mitigated by simply restarting the CEO.

similar BZ 2093819 with stuck controllers was fixed slightly different in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

 

This fix was contributed by lance5890, thanks a lot!

 

This is a clone of issue OCPBUGS-11182. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8497. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8442. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8437. The following is the description of the original issue:

Description of problem:

Jenkins images based on rhel8 are wrongly labeled as rhel7

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1.Check the OS version.
$ podman run -it --rm --entrypoint bash quay.io/openshift/origin-jenkins-agent-base:latest
bash-4.4# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.6 (Ootpa)

2.Check the image labels.
$ podman inspect quay.io/openshift/origin-jenkins-agent-base:latest | grep rhel7
                    "com.redhat.component": "jenkins-slave-base-rhel7-container",
                    "name": "openshift4/jenkins-slave-base-rhel7",
               "com.redhat.component": "jenkins-slave-base-rhel7-container",
               "name": "openshift4/jenkins-slave-base-rhel7",

Actual results:

OS is rhel8, but labels are rhel7

Expected results:

Labels should match the OS version

Additional info:

 

Description of problem:

The storageclass "thin-csi" is created by vsphere-CSI-Driver-Operator, after deleting it manually, it should be re-created immediately. 

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

Always

Steps to Reproduce:

1. Check storageclass in running cluster, thin-csi is present:
$ oc get sc 
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate              false                  41m
thin-csi         csi.vsphere.vmware.com         Delete          WaitForFirstConsumer   true                   38m
2. Delete thin-csi storageclass:
$ oc delete sc thin-csi
storageclass.storage.k8s.io "thin-csi" deleted
3. Check storageclass again, thin-csi is not present:
$ oc get sc
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate           false                  50m
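
For reference, a sketch of the storageclass the operator is expected to re-create, reconstructed from the `oc get sc` output above (the operator-rendered object may carry additional parameters and annotations):

~~~
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin-csi
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
~~~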