4.9.52

Changes from 4.8.57

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview

  • This Section: High-level description of the feature, i.e. an executive summary
  • Note: A Feature is a capability or a well-defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams and multiple releases.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem Alignment

The Problem

Customers typically run more than one cluster and/or applications deployed across different regions. In such a hybrid cloud environment, aggregating metrics is a key requirement so that admins and application owners do not have to drop into individual clusters to troubleshoot specific problems. And since Red Hat does not offer a standalone metrics aggregation service, customers have started to use existing, home-grown technologies based on, for example, InfluxDB or Kafka to achieve that.

In summary:

  • OpenShift Monitoring is optimized for short-term retention only.
  • Red Hat does not offer a central metrics aggregation service yet.
  • Customers use existing, home-grown technologies to distribute information across other stakeholders in their company.

High-Level Approach

Expose Prometheus remote-write configuration via our OpenShift Monitoring (Cluster and User Workload) ConfigMap to allow customers to push time-series data to a remote location.

Please note that we do not plan to support specific third-party “receivers” with this solution. Customers will be responsible for ensuring that an appropriate receiving component that implements the “remote-write” API is up and running. Here is a list of possible “receiver” plugins.

Goal & Success

  • Introduce some “ease of use” features to configure certain parts for remote-write to decrease possible misconfigurations.
  • Allow customers to push metrics off the cluster to allow aggregation use cases and more options for our partners to integrate into OpenShift - e.g. to allow long-term retention or security/analytics scenarios.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to configure remote-write for both the OOTB infrastructure bundle and the user workload stack, so that time-series data will be available on the system of my choice.
  • As an OpenShift administrator, I want to easily build an allow list of metrics that should be pushed externally.

Key Flows

User configures one of the available ConfigMaps to allow node_cpu_seconds_total to be written into a remote Thanos system.

  • Administrator opens the cluster-monitoring-config ConfigMap.
  • They add a new field to configure remote write.
  • They add the node_cpu_seconds_total metric to the allow list.
  • They add the remote URL for the Thanos receiver.
  • They add a Secret to configure authentication against the remote service.
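
The flow above can be sketched in a few lines. This is a minimal Python sketch, not the Cluster Monitoring Operator's implementation: it only builds the data such a configuration might contain. The field names (`remoteWrite`, `url`, `writeRelabelConfigs`, `basicAuth`) are assumed to follow the Prometheus Operator remote-write spec, the `prometheusK8s` nesting is an assumed layout, and the URL and Secret name are hypothetical.

```python
import json
import re


def build_remote_write(url, allowed_metrics, secret_name):
    """Build one remote-write entry that forwards only allow-listed metrics."""
    # A "keep" relabel rule on __name__ drops every series whose metric name
    # does not match the regex, which is one way to express an allow list.
    allow_regex = "|".join(re.escape(name) for name in sorted(allowed_metrics))
    return {
        "url": url,
        "writeRelabelConfigs": [
            {"sourceLabels": ["__name__"], "regex": allow_regex, "action": "keep"}
        ],
        # Credentials are referenced from a Secret rather than written inline.
        "basicAuth": {
            "username": {"name": secret_name, "key": "username"},
            "password": {"name": secret_name, "key": "password"},
        },
    }


def cluster_monitoring_config(entry):
    """Nest the entry under the cluster Prometheus section (assumed layout)."""
    return {"prometheusK8s": {"remoteWrite": [entry]}}


entry = build_remote_write(
    url="https://thanos-receive.example.com/api/v1/receive",  # hypothetical URL
    allowed_metrics={"node_cpu_seconds_total"},
    secret_name="remote-write-credentials",  # hypothetical Secret name
)
print(json.dumps(cluster_monitoring_config(entry), indent=2))
```

Generating the relabel rule from a plain list of metric names is one possible "ease of use" layer over raw relabeling configuration, which is the trade-off raised under Open questions.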

Additional resources

Remote write allows time-series data to be replicated to a remote location. This is important for several scenarios, such as using "remote-write enabled" systems (e.g. InfluxDB) for long-term storage and historical analysis, and aggregating metrics across multiple clusters.

Currently, remote-write is in an experimental stage in Prometheus[1], but chances are high that it will be stable some time this year. Furthermore, we already use remote-write extensively for Telemetry, and ACM will use it in the near future. With that in mind, we think that we are in a good position to move what we already have[2] from dev preview to at least tech preview.

Acceptance criteria

  • mTLS support (important for positioning Red Hat's Advanced Cluster Manager (ACM), as they will need it for pushing metrics from OpenShift clusters into their central management solution backed by Observatorium).
  • Default configurations coming from Red Hat (such as Telemetry and ACM) should not be overridden. ACM for example may inject their configuration automatically post installation (mechanism to be discussed).

Non-goals

  • Configuration isolation for cluster and user workload monitoring ConfigMap to allow separating remote-write configuration per "tenant" or "user".
  • Remote write for Thanos Ruler (this isn't supported yet, see https://github.com/thanos-io/thanos/issues/1724).

Open questions

  • Do we want to expose a different API to make configuring an allow list easier for everyone rather than exposing relabeling configuration directly? Reason is that we want to avoid validating "syntax" requests in a BZ or internal.

Documentation

  • New section inside the configuration chapter that describes how to set up remote-write, with an example of what it looks like for a standard remote-write implementation, for both CMO and UWM.
  • A small note about implications on setting up remote-write to the overall Prometheus cluster.
  • How to configure security/auth (e.g. (m)TLS).
  • The API.
  • Tuning.
  • Proxy configuration (if not supported, then we need a statement).

Other resources

[1] https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec - "If specified, the remote_write spec. This is an experimental feature, it may change in any upcoming release in a breaking way." The experimental flag was removed.

We'll want to give users the option to add remote_write configs to both cluster monitoring and UWM.
AC:

  • decide what features we want to give users
  • decide what API we want to expose to users, i.e. basic rw config, low-level relabel-config, or high-level-streamlined API
  • implement the API

https://issues.redhat.com/browse/MON-1069?focusedCommentId=16252560&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16252560

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a cluster administrator,

I want OpenShift to include a recent CoreDNS version,

so that I have the latest available performance and security fixes.

 

We should strive to follow upstream CoreDNS releases by bumping openshift/coredns with every OpenShift 4.y release, so that OpenShift benefits from upstream performance and security fixes, and so that we avoid large version-number jumps when an urgently needed change necessitates bumping CoreDNS to the latest upstream release. This bump should happen as early as possible in the OpenShift release cycle, so as to maximize soak time.

 

For OpenShift 4.9, this means bumping from CoreDNS 1.8.1 to 1.8.3, or possibly a later release should one ship before we do the bump.

 

Note that CoreDNS upstream does not maintain release branches (that is, once CoreDNS 1.9 is released, there will be no further 1.8.z releases), so we may be better off updating to 1.9 as soon as it is released, rather than staying on the 1.8 series, which would then be unmaintained.

 

We may consider bumping CoreDNS again during the OpenShift 4.9 release cycle if upstream ships additional releases during the 4.9 development cycle. However, we will need to weigh the risks and available remaining soak time in the release schedule before doing so, should that contingency arise.

 

Feature Overview

As an OpenShift administrator, I would like a solution that allows me to upgrade from one EUS version to another with very few steps and only minimal disruption to application workloads, while still allowing new application services to be deployed.

Goals

4.8

  • Spike, Design, and Scope
  • Begin foundational development if possible

4.9

  • Foundational items delivered and back ported as necessary

4.10

  • Remaining delivery artifacts complete
  • Documentation and enablement complete
  • Full testing complete

Requirements

Functional requirements break down into the following prioritized list:

 

  1. Make serial upgrades safe
    1. Prevent upgrades before the core components are ready (version skewing, incompatible APIs)
    2. Prevent upgrades before operators are ready
      1. Ensure Operators have a way to express max version
      2. Ensure OLM policy is clear on what happens if max version is not specified
    3. Make back pressure items (reasons you cannot upgrade) clear to administrators along with the actions to resolve
    4. CI MUST be running with test automation
    5. Note: Forcing an upgrade is still possible
  2. Make updates faster
    1. Optimize where possible to increase speed of upgrade for core components (SDN/DaemonSets)
  3. Reduce the amount of workload disruption
    1. Workload disruption is not just reboots; it is any disruption to workloads during the upgrade, of which a reboot is likely the worst-case scenario. This may also include things like rescheduling of workloads.
    2. We will not change the model of how components are deployed, changes to the host still require a reboot
    3. Discover and document any necessary guidelines to reduce the number of items that are developed which would cause a reboot between EUS releases where possible (4.8, 4.9).  
    4. As a stretch goal, discover if it is possible to reduce the reboots between 4.6 and 4.7 
  4. Should take into consideration clusters with RHEL workers
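
The "max version" requirement in item 1 can be illustrated with a small check. This is a minimal Python sketch under stated assumptions, not the OLM or CVO implementation: operators are modeled as a plain mapping from a hypothetical name to a declared maximum OpenShift version, and a missing maximum is treated as "no limit", which is exactly the policy question the requirement asks OLM to make explicit.

```python
def parse_minor(version):
    """Return (major, minor) from a version string such as '4.9' or '4.9.52'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def blocking_operators(target_version, operators):
    """List operators whose declared maximum version is below the target.

    `operators` maps an operator name to its declared max OpenShift version,
    or None when no maximum is declared (assumed here to mean "no limit").
    """
    target = parse_minor(target_version)
    return sorted(
        name
        for name, max_version in operators.items()
        if max_version is not None and parse_minor(max_version) < target
    )


# Hypothetical operators and their declared maximum versions.
operators = {"example-a": "4.8", "example-b": "4.9", "example-c": None}
print(blocking_operators("4.9.52", operators))  # ['example-a']
```

Returning the names of the blocking operators, rather than a bare yes/no, matches the requirement that back pressure items be made clear to administrators.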

 

Non-Functional Requirements

Requirement | Notes | isMvp?
Release Technical Enablement | Provide necessary release enablement details and documents. | YES
Documentation | This is a requirement for ALL end user facing features | YES

Questions to answer…

Out of Scope

  • It is not intended to support version skews that fall outside the upstream version skew policy
  • It is not intended to eliminate all reboots
  • It is not intended to skip releases at this time

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

EUS to EUS Focus Area Discussion: https://docs.google.com/document/d/17I1Wd7-R1wRxmboyv1jUFHFkqQcBTorJccdGi1ZqjQE/edit?usp=sharing

EUS Feature: https://issues.redhat.com/browse/OCPPLAN-5484

Epic Goal

  • Ensure the user experience for upgrades in console supports EUS -> EUS upgrades.

Why is this important?

  • This is a product-wide initiative.

Scenarios

  1. The console cluster settings page should inform administrators of upgrade requirements prior to the first upgrade step.
  2. The console cluster settings page should report problems during an upgrade.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. CVO - Sufficient APIs (ClusterVersion, Alerts) for console to show requirements before an upgrade and problems during an upgrade to an administrator.

Previous Work (Optional):

Open questions:

  1. We have an R&D story to investigate what the console experience should be and what APIs might be necessary.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Use case
As an Admin, one of my operators says it can't be upgraded. An action is required, as I will be unable to upgrade to a .y minor release until I fix the problem.
 
Possible Design Solution 
Create a message saying you can upgrade to .z patch releases even when one of your cluster operators says it's not upgradeable.

Ideally, the message string on the condition explains what the admin needs to resolve, and that until they resolve the issue they can only update within their current z stream.
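
The rule described here (patch updates stay available while minor updates are blocked) can be written down as a small predicate. This is a minimal Python sketch for illustration only; the `upgradeable` flag stands in for the aggregated state of the cluster operators' conditions, and the function names are hypothetical.

```python
def update_kind(current, target):
    """Classify an update as 'patch' (same x.y) or 'minor' (different x.y)."""
    cur = [int(part) for part in current.split(".")[:2]]
    tgt = [int(part) for part in target.split(".")[:2]]
    return "patch" if cur == tgt else "minor"


def update_allowed(current, target, upgradeable):
    """Patch updates are always offered; minor updates need upgradeable=True."""
    return upgradeable or update_kind(current, target) == "patch"


print(update_allowed("4.8.57", "4.8.58", upgradeable=False))  # True
print(update_allowed("4.8.57", "4.9.52", upgradeable=False))  # False
```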

 
Questions
Need to do a little R&D to find out when this happens and what happens when you're in this state.

Designs (WIP)
Doc: https://docs.google.com/document/d/1iUZlHbv5nTYtb7Cq4rn_bYPqD4Jtie59xIogxN-2Eyc/edit#heading=h.5eoflxvaj1m4

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Epic Goal

  • Enable Image Registry to use Azure Blob Storage from AzureStackCloud

Why is this important?

  • While certifying Azure Stack Hub as an OCP provider, we need to ensure all the required components for UPI/IPI deployments are ready to be used

Scenarios

  1. Create an OCP cluster in Azure Stack Hub and use the Internal Registry with Azure Blob Storage from AzureStackCloud

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story: As an OpenShift admin, I want the internal registry of the cluster to use storage from Azure Stack Hub so that I can run a fully supported OpenShift environment on that infrastructure provider.

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Networking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough that they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use Patternfly 4 components.
    • Cluster admins control what plugins are enabled.
    • Misbehaving plugins should not break console.
    • Existing static plugins are not affected and will continue to work as expected.

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement | Notes | isMvp?
UI to enable and disable plugins | | YES
Dynamic Plugin Framework in place | | YES
Testing Infra up and running | | YES
Docs and read me for creating and testing Plugins | | YES
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 
 Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to support localization of dynamic plugins. The current proposal is to have one i18n namespace per dynamic plugin with a fixed name: `${plugin-name}-plugin`. Since console will know the list of plugins on startup, it can add these namespaces to the i18next config.

The console backend will need to implement an endpoint at the i18next load path. The endpoint will see if the namespace matches the known plugin namespaces. If so, it will proxy to the plugin. Otherwise it will serve the static file from the local filesystem.
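
The routing decision described above can be sketched as follows. This is a minimal Python sketch, not the console backend's code: it only decides what to do with a request and does no I/O. The namespace pattern comes from the proposal above; the function names and the `locales/<language>/<namespace>.json` path layout are assumptions made for illustration.

```python
def plugin_namespace(plugin_name):
    """Fixed i18n namespace for a dynamic plugin, per the proposal above."""
    return f"{plugin_name}-plugin"


def resolve_locale_request(namespace, language, plugin_names):
    """Decide how a request for an i18next resource should be served.

    Returns ("proxy", plugin_name) when the namespace belongs to a known
    dynamic plugin, otherwise ("static", path) for a local file.
    """
    known = {plugin_namespace(name): name for name in plugin_names}
    if namespace in known:
        return ("proxy", known[namespace])
    # Hypothetical on-disk layout for the console's own translation files.
    return ("static", f"locales/{language}/{namespace}.json")


plugins = ["kubevirt", "storage"]  # hypothetical plugin names known at startup
print(resolve_locale_request("kubevirt-plugin", "en", plugins))
print(resolve_locale_request("public", "en", plugins))
```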

The dynamic plugins enhancement describes a `disable-plugins` query parameter for disabling specific console plugins.

  • ?disable-plugins or ?disable-plugins= prevents loading of any dynamic plugins (disable all)
  • ?disable-plugins=foo,bar prevents loading of dynamic plugins named foo or bar (disable selectively)

This has no effect on static plugins, which are built into the Console application.
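
The two cases of the `disable-plugins` parameter can be expressed as a short parsing function. This is a minimal Python sketch of the semantics described above, not the console's actual implementation (which runs in the browser); the function name and plugin names are hypothetical.

```python
from urllib.parse import parse_qs


def plugins_to_load(query, available):
    """Return the dynamic plugins that should still load for a query string."""
    # keep_blank_values=True keeps "?disable-plugins" and "?disable-plugins="
    # in the result, both with an empty string as the value.
    params = parse_qs(query.lstrip("?"), keep_blank_values=True)
    if "disable-plugins" not in params:
        return list(available)
    value = params["disable-plugins"][-1]
    if value == "":
        return []  # disable all dynamic plugins
    disabled = {name.strip() for name in value.split(",")}
    return [name for name in available if name not in disabled]


available = ["foo", "bar", "baz"]
print(plugins_to_load("?disable-plugins", available))          # []
print(plugins_to_load("?disable-plugins=foo,bar", available))  # ['baz']
```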

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md#error-handling

We need a UI for enabling and disabling dynamic plugins. The plugins will be discovered either through a custom resource or an annotation on the operator CSV. The enabled plugins will be persisted through the operator config (consoles.operator.openshift.io).

This story tracks enabling and disabling the plugin during operator install through Cluster Settings. This is needed in the future if a plugin is installed outside of an OLM operator.

UX design: https://github.com/openshift/openshift-origin-design/pull/536 

Feature Overview

  • This Section: High-level description of the feature, i.e. an executive summary
  • Note: A Feature is a capability or a well-defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams and multiple releases.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Goal
By default the Cluster Utilization card should not include metrics from `master` nodes in its queries for CPU, Memory, Filesystem, Network, and Pod count.

A new filter option should allow users to toggle to a combined view, which is what the Cluster Utilization card shows today. The combined view is mostly useful on small clusters where masters are schedulable for user workloads.

Assets

  • Marvel with two scenarios:
    • Windows nodes exist
    • Windows nodes do not exist

Background

As discussed in this thread, the `kube_node_role` metric, available since 4.3, should allow us to filter the card's PromQL queries so that they do not include master node metrics.

This filtered view would likely make the card's data more useful for users who aren't running their workloads on masters, like OpenShift Dedicated users.

As noted by some folks during design discussions, this filter isn't perfect, and wouldn't filter out the data from "Infra" nodes that users may have set up using labels/taints. Until we determine a good way to provide more advanced filtering, this basic "Include masters" checkbox is still more flexible than what the card offers today.
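
One way such a filter could look is sketched below. This is a minimal Python sketch that only builds a PromQL string; it is not the query the console actually uses. It assumes that `kube_node_role` carries `node` and `role` labels and that the `instance` label on node-level metrics equals the node name; both assumptions should be checked against the cluster's metrics before relying on the result.

```python
def without_masters(expr, include_masters=False):
    """Wrap a node-level PromQL expression so that master nodes are excluded.

    `unless on (instance)` drops every series on the left-hand side whose
    instance label matches a series on the right-hand side.
    """
    if include_masters:
        return expr
    # Copy the assumed `node` label into `instance` so the two sides can match.
    masters = (
        'label_replace(kube_node_role{role="master"}, '
        '"instance", "$1", "node", "(.*)")'
    )
    return f"({expr}) unless on (instance) {masters}"


print(without_masters("node_memory_MemTotal_bytes"))
print(without_masters("node_memory_MemTotal_bytes", include_masters=True))
```

The `include_masters` argument corresponds to the basic "Include masters" checkbox: when it is set, the original query is returned unchanged.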

Requirements

  • When Windows nodes exist in the cluster:
    • Node type filter will be added to the Cluster Utilization card that lists all node types available
    • It will be pre-filtered to only show Worker nodes
    • The filter will be single select and will display the selected item in the toggle.
  • When Windows nodes do not exist in the cluster:
    • Node type filter will be added to the Cluster Utilization card that lists node types available, plus an "all types" item.
    • It will be pre-filtered to only show Worker nodes
    • The filter will be multi-select
    • The badge in the toggle will update as more items are selected
    • If "all nodes" is selected, the other items will automatically become deselected, and the badge will update to "All".

Goal

Currently we show system projects within the list view of the Projects page. As stated in https://issues.redhat.com/browse/RFE-185, many projects are considered system projects and are not important to the user. The setting should be remembered across sessions, but it is something users should be able to toggle directly from the list.

Design assets

Design doc

Marvel

Requirements

  • The user should be able to hide/show system projects within the project list page (and namespace list page)
  • The user should be able to hide/show system projects from the project selector
  • The same capability should work from the project list page in the developer perspective

In OpenShift, reserved namespaces are `default`, `openshift`, and those that start with `openshift-`, `kubernetes-`, or `kube-`.
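
The reserved-namespace rule above maps directly onto a small predicate. This is a minimal Python sketch of the rule as stated in this card, not the console's implementation; the function names and the sample project names are hypothetical.

```python
RESERVED_NAMES = {"default", "openshift"}
RESERVED_PREFIXES = ("openshift-", "kubernetes-", "kube-")


def is_system_namespace(name):
    """True for `default`, `openshift`, and names with a reserved prefix."""
    return name in RESERVED_NAMES or name.startswith(RESERVED_PREFIXES)


def visible_projects(projects, show_system):
    """Apply the hide/show system projects toggle to a list of project names."""
    return [p for p in projects if show_system or not is_system_namespace(p)]


projects = ["default", "kube-system", "openshift-monitoring", "my-app"]
print(visible_projects(projects, show_system=False))  # ['my-app']
```

Note that a project such as `openshifty` is not treated as a system project, because the prefix rule requires the trailing hyphen.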

Edge case scenarios

  • If the user filters out system projects from the projects or namespaces list view, then filters and there are no results, an empty state will be surfaced with ability to clear filters. (see design assets)
  • If the user has hidden system projects from the project selector and has favorited or defaulted system projects in the project selector, those favorited or defaulted system projects will NOT appear in the project selector list. (see design assets)
  • If the user has hidden system projects from the project selector, then navigates to some resource page where a system project is selected, the system project name will still appear in the project selector toggle but not within the list of projects in the selector. (see design assets)

As an admin, I want to be able to access the node logs from the node details page in order to troubleshoot what is going on with the node.

We should support getting node logs for different units for node journal logs and evaluate the other CLI flags.

We currently have a gap with the CLI:

  •   oc adm node-logs [-l LABELS] [NODE...] [flags]
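
For reference, the CLI invocation the console would need to match can be assembled as below. This is a minimal Python sketch that only builds the argument list and does not run `oc`; it assumes the `-u` flag selects a journal unit, and the node name and unit are hypothetical examples.

```python
def node_logs_command(nodes=(), label_selector=None, unit=None):
    """Build the argument list for an `oc adm node-logs` invocation."""
    cmd = ["oc", "adm", "node-logs"]
    if label_selector:
        cmd += ["-l", label_selector]  # select nodes by label
    if unit:
        cmd += ["-u", unit]  # assumed flag: restrict output to one journal unit
    cmd += list(nodes)
    return cmd


print(" ".join(node_logs_command(["worker-0"], unit="kubelet")))
```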

We need to investigate whether the k8s API supports WebSockets for streaming node logs.

Feature Overview

OpenShift console supports new features and elevated experience for Operator Lifecycle Manager (OLM) Operators and Cluster Operators.

Goal:

OCP Console improves the controls and visibility for managing vendor-provided software in customers’ infrastructure and making these solutions available for customers' internal users.

 

To achieve this, 

  • Operator Lifecycle Manager (OLM) teams have been introducing new features aiming towards simplification and ease of use for both developers and cluster admins.
  • On the Cluster Operators side, the console iteratively improves visibility into the resources associated with the Operators to improve the overall management experience.

We want to make sure OLM’s and Cluster Operators' new features are exposed in the console so admin console users can benefit from them.

Benefits:

  • Cluster admin/Operator consumers:
    • Able to see, learn about, and interact with OLM-managed and/or Cluster Operator-associated resources in the OpenShift console.

Requirements

Requirement | Notes | isMvp?
OCP console supports the latest OLM APIs and features | This is a requirement for ALL features. | YES
OCP console improves visibility to Cluster Operators related resources and features. | This is a requirement for ALL features. | YES
     

 


Epic Goal

  • OCP console helps developers focus on and more easily create Operand/CR instances on the creation form page.
  • OCP console helps cluster admins better see and understand the Operator installation status on the OperatorHub page.

Why is this important?

  • OperatorHub page currently shows an Operator as Installed as long as a Subscription object exists for that operator in the current namespace, which can be misleading because the installation could be stalled or require additional interactions from the user (e.g. "manual upgrade approval") in order to complete the installation.
  • Some Operator-managed services use advanced JSONSchema validation properties (`allOf`, `anyOf`, `oneOf`, and `additionalProperties`) in their CRD validation schema, but the current form generator in the console skips them, so those fields are missing from the creation form.

Scenarios

  1. As a user of OperatorHub, I'd like an improved "status display" for Operators that were installed previously, so I can better understand whether those Operators were actually installed successfully or require additional actions from me to complete the installation.
  2. As a user of the OCP console, I'd like the Operand/CR creation form to cover advanced JSONSchema validation properties so I can create a CR instance solely with the form view.

Acceptance Criteria

  • Console improves the visibility of Operator installation status on OperatorHub page
  • Console operand creation form adds support for the `allOf`, `anyOf`, `oneOf`, and `additionalProperties` JSONSchema validation keywords so the creation form UI renders those properties/fields instead of skipping them.
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...
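
As an illustration of the JSONSchema criterion above, here is a minimal Python sketch that walks a validation schema and reports which fields rely on `allOf`, `anyOf`, `oneOf`, or `additionalProperties`, that is, the fields a form generator would miss if it skipped those keywords. The function name, the `spec` path prefix, and the sample schema are hypothetical; this is not the console's form generator.

```python
from typing import Any, Dict, List

COMPOSITE_KEYWORDS = ("allOf", "anyOf", "oneOf")


def find_advanced_fields(schema: Dict[str, Any], path: str = "spec") -> List[str]:
    """Return paths of fields that use the advanced validation keywords."""
    found: List[str] = []
    for keyword in COMPOSITE_KEYWORDS:
        branches = schema.get(keyword)
        if isinstance(branches, list):
            found.append(f"{path} ({keyword})")
            # Each branch is itself a schema and may nest further keywords.
            for branch in branches:
                if isinstance(branch, dict):
                    found.extend(find_advanced_fields(branch, path))
    extra = schema.get("additionalProperties")
    if isinstance(extra, dict):
        found.append(f"{path} (additionalProperties)")
        found.extend(find_advanced_fields(extra, f"{path}.*"))
    for name, sub in schema.get("properties", {}).items():
        if isinstance(sub, dict):
            found.extend(find_advanced_fields(sub, f"{path}.{name}"))
    return found


# Hypothetical CRD fragment used only to exercise the walker.
sample_schema = {
    "type": "object",
    "properties": {
        "storage": {"oneOf": [{"required": ["pvc"]}, {"required": ["emptyDir"]}]},
        "labels": {"type": "object", "additionalProperties": {"type": "string"}},
        "replicas": {"type": "integer"},
    },
}
advanced = find_advanced_fields(sample_schema)
```

With the sample schema, `advanced` lists `spec.storage` (for `oneOf`) and `spec.labels` (for `additionalProperties`), while the plain `replicas` field is not reported.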

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OLM is adding a property to the CSV to signal that the operator should clean up the operand on operator uninstall. See https://github.com/operator-framework/enhancements/pull/46

Console will need to add a checkbox to the UI to ask the user if the operand should be cleaned up (with strong warnings about what this means). On delete, console should set the `spec.cleanup` property on the CSV to indicate whether cleanup should happen.

Additionally, console needs to be able to show proper status for CSVs that are terminating in the UI so it's clear the operator is being deleted and cleanup is in progress. If there are errors with cleanup, those should be surfaced back through the UI.

Depends on OLM-1733

cc Ali Mobrem Tony Wu Daniel Messer Peter Kreuser

User Story

As a user of OperatorHub, I'd like an improved "status display" for Operators that were installed previously, so I can better understand whether those Operators were actually installed successfully or require additional actions from me to complete the installation.

Desired Outcome

Improve visibility of Operator installation status on OperatorHub page

Why is this important?

OperatorHub page currently shows an Operator as Installed as long as a Subscription object exists for that operator in the current namespace.

This can be misleading because the installation could be stalled or require additional interactions from the user (e.g. "manual upgrade approval") in order to complete the installation.

The console could potentially show an "in-between" or "requires attention" indication for Operators that are in these states, along with links to the actual "Installed Operators" page for more details.
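
A minimal Python sketch of such a status mapping follows. It assumes the console can read the phase of the related install plan and ClusterServiceVersion; the function name, the phase strings (`RequiresApproval`, `Failed`, `Succeeded`), and the returned labels are illustrative assumptions, not the console's actual logic.

```python
from typing import Optional


def operator_hub_status(
    has_subscription: bool,
    install_plan_phase: Optional[str],
    csv_phase: Optional[str],
) -> str:
    """Map installation state to a display label (hypothetical helper)."""
    if not has_subscription:
        return "Not installed"
    # A pending manual approval means the user must act to continue.
    if install_plan_phase == "RequiresApproval":
        return "Requires attention"
    # A failure on either object means the installation is stalled.
    if install_plan_phase == "Failed" or csv_phase == "Failed":
        return "Requires attention"
    if csv_phase == "Succeeded":
        return "Installed"
    # A Subscription exists but nothing has succeeded or failed yet.
    return "Installing"
```

The key design point is that the existence of a Subscription alone never yields "Installed"; only a successful installation does.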

Related Info:

1. BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1899359
2. RFE: https://issues.redhat.com/browse/RFE-1691

Key Objective
Provide our customers with a single, simplified user experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and scales from managing the fleet to deep diving into a single cluster.
Why do customers want this?

  1. Single interface to accomplish their tasks
  2. Consistent UX and patterns
  3. Easily accessible: One URL, one set of credentials

Why do we want this?

  • Shared code -  improve the velocity of both teams and most importantly ensure consistency of the experience at the code level
  • Pre-built PF4 components
  • Accessibility & i18n
  • Remove barriers for enabling ACM

Phase 1 Goal: Get something to market (OCP 4.8, ACM 2.3)
Phase 1 -> OCP deploys ACM Hub Operator -> ACM Perspective becomes available -> User can switch between ACM multi-cluster view and local OCP Console -> No SSO; the user has to log in twice

Phase 2 Goal: Productization of the united Console (OCP 4.9, ACM 2.4)

  1. Enable user to quickly change context from fleet view to single cluster view
    1. Add Cluster selector with “All Clusters” option. “All Clusters” = ACM
    2. Shared SSO across the fleet
    3. Hub OCP Console can connect to remote clusters API
    4. When ACM is installed, the user starts from the fleet overview, aka “All Clusters”
  2. Share UX between views
    1. ACM Search -> resource list across fleet -> resource details that are consistent with single cluster details view
    2. Add Cluster List to OCP -> Create Cluster

Phase 2  Use Cases:

  1. As a user, I want to be able to quickly switch context from the Fleet view (ACM) to any spoke cluster Console view, all from the same web browser tab.
    1. ACM Hub Operator deployed to OCP -> Cluster picker becomes available, with “All Clusters” option = ACM -> Single cluster user will get perspective picker (Admin, Dev) -> User needs the ability to quickly change context to single cluster -> All clusters should be linked via shared SSO
  2. As a user, I should be able to drill down into resources in the ACM view and get the OCP resource details page
    1. ACM Hub Operator deployed to OCP -> User searches for pods from the ACM view (“All Clusters”) -> Single pod is selected -> OCP pod detail page

We need to coordinate with the ACM team so that the masthead looks the same when switching between contexts. This might require us to consume a common masthead component in OCP console.

The ACM team will need to honor our custom branding configuration so that the logo does not change when switching contexts.

Known differences:

  • Branding customization
  • Console link CRDs
  • Global notifications
  • Import button
  • Notification drawer
  • Language preferences
  • Search link (ACM only)
  • Web terminal (ACM only)

Open questions:

  • How do we handle alerts in the notification drawer across cluster contexts?

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
Feature Overview

  • Connect OpenShift workloads to Google services with Google Workload Identity

Goals

  • Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
  • Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

  • Add support to CCO for the Installation and Upgrade using both UPI and IPI methods with GCP workload identity.
  • Support installs and upgrades for connected and disconnected/restricted environments.
  • Support the use of Operators with GCP workload identity with minimal friction.
  • Support for HyperShift and non-HyperShift clusters.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Epic Goal

  • Complete the implementation for GCP workload identity, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Investigate if this will work for OpenShift components similar to how we implemented STS.

Can we distribute credentials in a fashion that is transparent to the callers (as to whether it is a normal service account or a short-lived token), like we did for AWS?

What changes would be required for operators?

Can ccoctl do the heavy lifting as we did for AWS?

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement Notes isMvp?
Secrets and ConfigMaps can get shared across namespaces   YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model from the node-based (RHEL subscription manager) entitlement model. To provide an experience sufficiently similar to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces. This removes the need for cluster admins to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Allow ConfigMaps and Secrets (resources) to be mounted as volumes in a build

Why is this important?

  • Secrets and ConfigMaps can be added to builds as "source" code that can leak into the resulting container image
  • When using sensitive credentials in a build, accessing secrets as a mounted volume ensures that these credentials are not present in the resulting container image.

Scenarios

  1. Access private artifact repositories (Artifactory, JFrog, Maven)
  2. Download RHEL packages in a build

Acceptance Criteria

  • A build can mount a Secret or ConfigMap as a volume.
  • Content in the Secret or ConfigMap is not present in the resulting container image.

Dependencies (internal and external)

  1. Buildah - support mounting of volumes when building with a Dockerfile

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Reduce the compute footprint of the OpenShift platform and associated RH-provided components to a single physical core on the Intel Sapphire Rapids platform for vDU deployments on Single Node OpenShift.

Goals

  • Reduce CaaS platform compute needs so that it can fit within a single physical core with Hyperthreading enabled. (i.e. 2 CPUs)
  • Ensure existing DU Profile components fit within reduced compute budget.
  • Ensure existing ZTP, TALM, Observability and ACM functionality is not affected.
  • Ensure largest partner vDU can run on Single Core OCP.

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES
 
Provide a mechanism to tune the platform to use only one physical core. Users need to be able to tune different platforms. YES
Allow for full zero touch provisioning of a node with the minimal core budget configuration.   Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. YES 
Platform meets all MVP KPIs   YES

(Optional) Use Cases

  • Main success scenario: A telecommunications provider uses ZTP to provision a vDU workload on a Single Node OpenShift instance running on an Intel Sapphire Rapids platform. The SNO is managed by an ACM instance and its lifecycle is managed by TALM.

Questions to answer...

  • N/A

Out of Scope

  • Core budget reduction on the Remote Worker Node deployment model.

Background, and strategic fit

Assumptions

  • The amount of compute power available for RAN workloads translates directly into the volume of cell coverage that a Far Edge node can support.
  • Telecommunications providers want to maximize the cell coverage on Far Edge nodes.
  • To provide as much compute power as possible the OpenShift platform must use as little compute power as possible.
  • As newer generations of servers are deployed at the Far Edge and the core count increases, no additional cores will be given to the platform for basic operation; all additional resources will be given to the workloads.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
    • Administrators must know how to tune their Far Edge nodes to make them as computationally efficient as possible.
  • Does this feature have doc impact?
    • Possibly, there should be documentation describing how to tune the Far Edge node such that the platform uses as little compute power as possible.
  • New Content, Updates to existing content, Release Note, or No Doc Impact
    • Probably updates to existing content
  • What concepts do customers need to understand to be successful in [action]?
    • Performance Addon Operator, tuned, MCO, Performance Profile Creator
  • How do we expect customers will use the feature? For what purpose(s)?
    • Customers will use the Performance Profile Creator to tune their Far Edge nodes. They will use RHACM (ZTP) to provision a Far Edge Single-Node OpenShift deployment with the appropriate Performance Profile.
  • What reference material might a customer want/need to complete [action]?
    • Performance Addon Operator, Performance Profile Creator
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
    • N/A
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
    • Likely updates to existing content / unsure

Goals

  • Expose a mechanism to allow the Monitoring stack to be more of a "collect and forward" stack instead of a full E2E Monitoring solution.
  • Expose a corresponding configuration to allow sending alerts to a remote Alertmanager in case a local Alertmanager is not needed.
    • Support proxy environments, including proxy settings supplied through environment variables.
  • The overall goal is to fit all platform components into 1 core (2 HTs, 2 CPUs) for Single Node OpenShift deployments. The monitoring stack is one of the largest CPU consumers on Single Node OpenShift, consuming ~200 millicores at steady state, primarily in Prometheus and the node exporter. This epic tracks optimizations to the monitoring stack to reduce this usage as much as possible. Two items to be explored:
    • Reducing the scrape frequency (that is, lengthening the scrape interval)
    • Reducing the number of series to be scraped 

Non-Goals

  • Switching off all Monitoring components.
  • Reducing metrics from any component not owned by the Monitoring team.

Motivation

Currently, OpenShift Monitoring is a full E2E solution for monitoring infrastructure and workloads locally inside a single cluster. It comes with everything that an SRE needs, from configuring metrics scraping to configuring where alerts go.

With deployment models like Single Node OpenShift and/or resource-restricted environments, we now face the challenge that many of these functions are already available centrally or are not necessary due to the nature of a specific cluster (e.g. Far Edge). Therefore, you don't need to deploy components that expose these functions.

Also, Grafana is not FIPS compliant, because it uses PBKDF2 from x/crypto to derive a 32 byte key from a secret and salt, which is then used as the encryption key. Quoting https://bugzilla.redhat.com/show_bug.cgi?id=1931408#c10: "it may be a problem to sell Openshift into govt agencies if grafana is a required component."

Alternatives

We could make the Monitoring stack as it is completely optional and provide a more "agent-like" component. Unfortunately, that would probably take much more time and in the end just reproduce what we already have, only with fewer components. It would also not reduce the number of samples scraped, which has the most impact on CPU usage.

Acceptance Criteria

  • Verify that all alerts fire against a remote Alertmanager when a user configures that option.
  • Verify that Alertmanager is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
  • Verify that if you have a local Alertmanager deployed and a user decides to use a remote Alertmanager, the Monitoring stack sends alerts to both destinations.
  • Verify that Grafana is not deployed when a user configures that option in the cluster-monitoring-operator configmap.
  • Verify that Prometheus fires alerts against an external Alertmanager in proxy environments, both with (1) proxy settings configured inside CMO and (2) cluster-wide proxy settings supplied through ENV.

Risk and Assumptions

Documentation Considerations

  • Any additions to our ConfigMap API and their possible values.

Open Questions

  • If we set a URL for a remote Alertmanager, how do we handle authentication?
  • Configuration of remote Alertmanagers would support whatever Prometheus supports (basic auth, client TLS auth and bearer token)

Additional Notes

Use cases like single-node deployments (e.g. far-edge) don't need to deploy a local Alertmanager cluster because alerts are centralized at the core (e.g. hub cluster). Running Alertmanager locally takes resources from user workloads and adds management overhead. Cluster admins should be able to opt out of deploying Alertmanager as a day-2 operation.

DoD

  • Alertmanager isn't deployed when switched off in the CMO configmap.
  • The OCP console handles the situation gracefully when Alertmanager isn't installed, informing users that it can't manage the local Alertmanager configuration and resources.
    • Silences page
    • Alertmanager config editor

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

 

Monitoring needs to be reliable and is very useful when trying to debug clusters in an already degraded state. We want to ensure that metrics scraping can always work if the scraper can reach the target, even if the kube-apiserver is unavailable or unreachable. To do this, we will combine a local authorizer (already merged in many binaries and the rbac-proxy) and client-cert based authentication to have a fully local authentication and authorization path for scraper targets.

If networking (or part of networking) is down and a scraper target cannot reach the kube-apiserver to verify a token and a subjectaccessreview, then the metrics scraper can be rejected. The subjectaccessreview (authorization) is already largely addressed, but service account tokens are still used for scraping targets. Tokens require an external network call that we can avoid by using client certificates. Gathering metrics, especially client metrics, from partially functional clusters helps narrow the search area between kube-apiserver, etcd, kubelet, and SDN considerably.

In addition, this will significantly reduce the load on the kube-apiserver. We have observed in the CI cluster that token and subject access reviews are a significant percentage of all kube-apiserver traffic.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User story:

As cluster-policy-controller I automatically approve cert signing requests issued by monitoring.

DoD:

  • cert signing requests issued by the cluster-monitoring-operator service account are approved automatically.

Implementation hints: leverage approving logic implemented in https://github.com/openshift/library-go/pull/1083.

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

ListPage is still JSX; we should convert it to TSX, add proper types, and make sure the rest of the code passes the correct props.

resources.js contains functions to work with the k8s API (CRUD). It would be good to convert it to TS and add proper types. We will want to expose these functions in some form to dynamic plugins too, so proper types are a must.

The Table component is currently a class component; we want to convert it to a function component.

There are also many properties with the `any` type; we want to reduce those and be stricter.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Rebase OpenShift components to k8s v1.22
  • Rebase Jenkins and plugins to latest long term support versions

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.22 release - expected August 4th 2021

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

Rebase samples operator to k8s 1.22

Acceptance Criteria

  • Samples operator deploys with k8s 1.22 libraries
  • Core components continue to function (CI tests pass, including build suite).

Docs Impact

None

Notes

Problem:

This epic is mainly focused on tracking the dev console QE activities for the 4.9 release

Goal:

1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
   a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using Cypress
6. Implement CI for nightly builds
7. Execute scripts on a sprint basis

Why is it important?

To track the QE progress in one place

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

  1. External [For reviewing existing gherkin scripts]
  2. Internal [Planned tasks for 4.9 release]

Design Artifacts:

Exploration:

Note:

Description

Automation of Application grouping under display options
As a user,

Acceptance Criteria

  1. Display dropdown
  2. Consumption mode

Additional Details:

While executing the script "yarn run gherkin-lint", an error is displayed due to ","

Scenario length increased to 20 to avoid errors in a couple of quick start scenarios.

Max scenarios increased to 20 in feature files because a few features need this.

Update the OWNERS file in all plugin folders, as Gaja and Praveen have left the org.

Description

Automate the quick starts - quick-start-devperspective.feature

 

Acceptance Criteria

  1. Execute the test scenarios manually
  2. Execute them on the Chrome browser
  3. Remove the @to-do tags once it is done
  4. Execute them on a remote cluster

Additional Details:

Description

Topology chart view automation
As a user,

Acceptance Criteria

  1. Empty state of Topology
  2. Topology with workloads
  3. Filters in chart view

Additional Details:

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
   a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using Cypress
6. Implement CI for nightly builds
7. Execute scripts on a sprint basis

Why is it important?

To track the QE progress in one place, on the 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Epic Goal

  • Teach the cluster-version operator how to remove in-cluster objects.

Why is this important?

  • OpenShift releases can remove components.  For example, two service-catalog operators were removed in 4.5.  Teaching the CVO how to remove manifests will allow us to clean up these resources without leaving cleanup jobs and associated RBAC behind.  We may also benefit from it in some rollback cases, where the presence of a new-in 4.(y+1) object makes the application of 4.y difficult.

Scenarios

  1. Born-before-4.5 clusters currently have dangling resources from cleaning up the service-catalog operators (like the openshift-service-catalog-removed namespace in this job).  This enhancement would provide a mechanism for removing them, along with any other cruft we accumulate that lacks an in-cluster operator.
  2. We can add removal manifests to 4.(y-1) before growing a new component in 4.y, to make 4.y->4.(y-1) rollback more convenient for customers.

Acceptance Criteria

  • CVO CI - MUST be running successfully with automated e2e-operator tests covering this functionality.
  • Release Technical Enablement - Not required.  This is a purely OCP-internal change.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. Enhancement from OTA-279 has been accepted.  Remaining work is just implementing the accepted proposal.

Open questions:

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

This is a clone of issue OCPBUGS-2800. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2113. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:

Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached (attaching yaml manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

Description of problem:
All default catalogsources in 4.11 are built using file-based catalogsource. Those catalogsources fail to deploy successfully in a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check for grpc health before activating the liveness/readiness probes.
Version-Release number of selected component (if applicable):

4.11

How reproducible:

Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
Actual results:
The pods fail due to liveness/readiness probe failure: openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s
Expected results:
The registry pods for default catalogsources should be up and running.
Additional info:
See Slack thread for more information:
https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Note: This bug is for backporting process. The 4.10.z BZ is https://bugzilla.redhat.com/show_bug.cgi?id=2115874

Manually creating for 4.9 backport

Description of problem:
For a brief window while the openshift-router binary is starting up, it ignores shutdown signals (SIGTERMs) and will never shut down.

This becomes a larger issue when K8S sends a graceful shutdown while the router is starting up and subsequently waits for the terminationGracePeriodSeconds specified in the router deployment, which is 1 hour.

This becomes even more of an issue with
https://github.com/openshift/cluster-ingress-operator/pull/724
which makes the ingress controller wait for all pods before deleting itself. So if these pods are stuck in Terminating for an hour, then the ingress controller will be stuck in Terminating for an hour.

OpenShift release version:

Cluster Platform:

How reproducible:
You can start/stop the router pod quickly to get it stuck in an hour-long Terminating state.

Steps to Reproduce (in detail):
1. Create a YAML file with the following content:

apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: IngressController
  metadata:
    name: loadbalancer
    namespace: openshift-ingress-operator
  spec:
    replicas: 1
    routeSelector:
      matchLabels:
        type: loadbalancer
    endpointPublishingStrategy:
      type: LoadBalancerService
    nodePlacement:
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker: ""
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

2. Run the following command:

oc apply -f <YAML_FILE>.yaml && while ! oc get pod -n openshift-ingress | grep -q router-loadbalancer; do echo "Waiting"; done; oc delete pod -n openshift-ingress $(oc get pod -n openshift-ingress --no-headers | grep router-loadbalancer | awk '{print $1}');

It is considered a failure if it hangs for more than 45 seconds. You can press Ctrl-C after it deletes the pod and run "oc get pods -n openshift-ingress" to see that it is stuck in a Terminating state with an AGE longer than 45 seconds.

The pod will take 1 hour to terminate, but you can always clean up by force deleting it.
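
The 45-second criterion can also be checked with a bounded wait instead of watching the terminal. The following is a minimal sketch and not part of the original report: wait_until and pod_is_gone are hypothetical helper names, and <ROUTER_POD> stands for the name of the pod that was deleted.

```bash
# Hypothetical helper: run a command once per second until it succeeds
# or the timeout (in seconds) expires. Returns 1 on timeout.
wait_until() {
  local timeout=$1
  shift
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    if ((SECONDS >= deadline)); then
      return 1
    fi
    sleep 1
  done
}

# Hypothetical helper: succeeds once "oc get pod" can no longer find the pod.
# Any failure of "oc get" is treated as "gone" in this sketch.
pod_is_gone() {
  ! oc get pod -n openshift-ingress "$1" >/dev/null 2>&1
}

# Example (not executed here):
#   wait_until 45 pod_is_gone <ROUTER_POD> || echo "pod still present after 45s"
```

Polling with a deadline keeps the check from hanging for the full hour when the bug is present.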

Actual results:
Pod takes 1 hour to be deleted.

Expected results:
Pod should be deleted in about 45 seconds.

Impact of the problem:
Router pods hang in terminating for 1 hour and that will affect user experience.

Additional info:
 

Description of problem:

When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. Create a custom mcp:
~~~
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: custom
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/custom: "" 
EOF
~~~

Actual results:

The mcp is visible from "Administrator view > Cluster Settings > Details" at 0% progress

Expected results:

It shouldn't be stuck at 0% 

Additional info:

 

Description of problem:

Need to backport rotated logs collection to 4.9, since it's crucial for debugging.

See https://bugzilla.redhat.com/show_bug.cgi?id=2103910, https://issues.redhat.com/browse/SDN-3520

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This BUG tracks a backport of Bugzilla 2040376 (https://bugzilla.redhat.com/show_bug.cgi?id=2040376) to release 4.9. 

Description of problem:

Original bug description:
AWS instance m6i.xlarge is supported [1], but it does not appear on ec2_instance_types.go [2] and the following errors show up when it is used (hidden information marked with "XXX"):

$ oc logs -l k8s-app=controller -n openshift-machine-api -c machine-controller
[...]
I0112 14:51:10.117191       1 controller.go:59] controllers/MachineSet "msg"="Reconciling" "machineset"="XXX" "namespace"="openshift-machine-api"
E0112 14:51:10.117604       1 controller.go:115] Unable to set scale from zero annotations: unknown instance type: %sm6i.xlarge
E0112 14:51:10.117622       1 controller.go:116] Autoscaling from zero will not work. To fix this, manually populate machine annotations for your instance type: %v[machine.openshift.io/vCPU machine.openshift.io/memoryMb machine.openshift.io/GPU]
I0112 14:51:10.117792       1 recorder.go:104] controller-runtime/manager/events "msg"="Warning"  "message"="Failed to set autoscaling from zero annotations, instance type unknown" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"XXX","uid":"XXX","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"121131434"} "reason"="FailedUpdate"

Version-Release number of selected component (if applicable):

4.9

How reproducible:

Always

Steps to Reproduce:

1. Create a MachineSet with m6i.xlarge instances.
2. Check the logs of machine-controller container.
3. 

Actual results:

Error messages show up, although there is no impact on the cluster.

Expected results:

No error messages because it is a supported instance.

Additional info:

[1] https://docs.openshift.com/container-platform/4.9/installing/installing_aws/installing-aws-customizations.html#installation-supported-aws-machine-types_installing-aws-customizations
[2] https://github.com/openshift-cherrypick-robot/cluster-api-provider-aws/blob/2f3f7442ef525c7eaa2177d1903cebf219a2c3fc/pkg/actuators/machineset/ec2_instance_types.go

Description of problem:

The Helm chart README file is written in Chinese, and its content is displayed as garbled text in the developer perspective while configuring the Helm chart.

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.8.20; the same behavior was also found on 4.10.16

How reproducible:

 

Steps to Reproduce:

1. Create a custom HelmChartRepository that contains a Helm chart with a README.md file written in Chinese.

2. Then try to install the Helm chart from Developer Catalog > Helm Charts. The README file contents are displayed as garbled text.

Actual results:

The Helm chart README file is written in Chinese, and its content is displayed as garbled text in the developer perspective while configuring the Helm chart.

Expected results:

Chinese characters in the README file should be displayed correctly.

Additional info:

 

Description of problem:

Various disruption_tests fail on API availability during upgrade jobs in 4.9:
disruption_tests: [sig-api-machinery] Kubernetes APIs remain available for new connections
disruption_tests: [sig-api-machinery] OpenShift APIs remain available for new connections
disruption_tests: [sig-api-machinery] OAuth APIs remain available for new connections 
disruption_tests: [sig-api-machinery] Kubernetes APIs remain available with reused connections
disruption_tests: [sig-api-machinery] OpenShift APIs remain available with reused connections
disruption_tests: [sig-api-machinery] OAuth APIs remain available with reused connections

This is very similar to: https://issues.redhat.com/browse/OCPBUGS-1052 and might need a tolerance threshold increase.

Version-Release number of selected component (if applicable):

4.9

How reproducible:

about 40% of the time from: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade

Additional info:

sippy tracker - https://sippy.dptools.openshift.org/sippy-ng/release/4.9/streams/amd64/nightly/testfailures
+++ This bug was initially created as a clone of Bug #2093454 +++

Description of problem:
There is a logic error in the haproxy template code: the "accept-proxy" specifier is not applied to both the IPv4 and IPv6 haproxy interfaces when BOTH IPv4 and IPv6 are enabled.

The "accept-proxy" specifier is added when the ENV variable ROUTER_USE_PROXY_PROTOCOL is true.

OpenShift release version:
4.11

Cluster Platform:
All

How reproducible:
Always

Steps to Reproduce (in detail):
1. Enable IPv4 and IPv6 via ROUTER_IP_V4_V6_MODE="v4v6" on router deployment
2. Set ROUTER_USE_PROXY_PROTOCOL to true on router deployment
3. RSH into router and confirm that "accept-proxy" is on both "bind :<PORT>" and "bind :::<PORT>" lines for "frontend public" and "frontend public_ssl"
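
The check in step 3 can be scripted rather than read by eye. This is a rough sketch under stated assumptions: binds_missing_accept_proxy is a hypothetical helper name, it reads a rendered haproxy configuration on stdin, and it only recognizes a handful of section keywords when deciding where a frontend ends.

```bash
# Hypothetical helper: print every "bind" line inside "frontend public" or
# "frontend public_ssl" that does not carry the accept-proxy specifier.
# Empty output means both the IPv4 and IPv6 binds are covered.
binds_missing_accept_proxy() {
  awk '
    # Entering a frontend: remember whether it is one we care about.
    $1 == "frontend" { in_scope = ($2 == "public" || $2 == "public_ssl"); next }
    # Any other section header ends the current frontend (partial keyword list).
    $1 == "backend" || $1 == "listen" || $1 == "defaults" || $1 == "global" { in_scope = 0; next }
    # Report in-scope bind lines that lack accept-proxy.
    in_scope && $1 == "bind" && $0 !~ /accept-proxy/ { print }
  '
}
```

One possible invocation, with the pod name and the configuration path left as placeholders, is `oc rsh -n openshift-ingress <ROUTER_POD> cat <HAPROXY_CONFIG_PATH> | binds_missing_accept_proxy`.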


Actual results:
"accept-proxy" is only on "bind :::<PORT>" and missing from "bind :<PORT>"

Expected results:
"accept-proxy" should be on both "bind :<PORT>" and "bind :::<PORT>"

Impact of the problem:
Can't have a dual stack IPv4 and IPv6 configuration with "accept-proxy" on both stacks.

This is a clone of issue OCPBUGS-1678. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

This issue is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

#Description of problem:

Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

Input values in Instantiate Template disappear randomly.

#Version-Release number of selected component (if applicable):

  • Customer ENV
      • OCP4.10.9
      • Developer Console
      • Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
      • Internet Disconnected OCP cluster
  • quicklab test ENV
      • Developer Console
      • OCP4.10.12
      • Chrome 100

#How reproducible:

I reproduced this issue in ocp410ovn shared cluster in the quicklab

Select Apache HTTP Server > Input name "test" in Application Hostname box
After several seconds, the value has disappeared in the web console.

#Steps to Reproduce:

0. Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

1. Input values in the box of template menu.

2. The values disappear several seconds later. (20s~ or randomly)

3. Many users have experienced this issue.

  • The web browser version of users experiencing this issue:
      • Customer: Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
      • My browser: Chrome 102.x

==> the browser version doesn't matter.

#Actual results:

Input values in "Instantiate Template" disappear randomly.
Users can't use the Instantiate Template feature in the Dev console.

#Expected results:
Input values remain in the web console and users can create the object with "Instantiate Template".

#Additional info:

See "Application Name" has disappeared in the video I attached.

+++ This bug was initially created as a clone of Bug #2060329 +++

Description of problem:
As a user, I was stopped from using the developer perspective when switching into a namespace with a lot of workloads (Deployments, Pods, etc.)

This is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2006395

We recommend the following safety precautions against a slow or crashing topology, even while we continue to work on performance improvements that allow more workloads to be rendered.

At the moment we expect that a topology with around 100 nodes can be displayed. This may also depend on the node types, the browser used, the computing power of the PC, and how often the workload conditions change.

Recommended safety guard:
The topology graph (maybe the list as well) should check how many nodes are fetched and will be rendered.

1. We need to evaluate if we could make this decision based on the shown graph nodes and edges or the number of underlying resources.

For example, is it required to count each Pod in a Deployment or not?

2. Based on a threshold (~ 100?) the topology graph should skip the rendering.

3. We should show a 'warning page' instead, which explains that the topology could not handle this amount of X nodes at the moment.

4. This page could have an option to "Show topology anyway" so that users who don't have issues here can still use the topology.


— Additional comment from aos-team-art-private@redhat.com on 2022-05-09 19:42:54 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-05-10 04:50:55 UTC —

Bugfix included in accepted release 4.11.0-0.nightly-2022-05-09-224745
Bug will not be automatically moved to VERIFIED for the following reasons:

PR openshift/console#11334 not approved by QA contact
This bug must now be manually moved to VERIFIED by spathak@redhat.com

Description of problem:

+++ This bug was initially created as a clone of https://issues.redhat.com//browse/OCPBUGS-784

Various CI steps use the upi-installer container for its access to the
aws cli tools among other things. However, most of those steps also
curl yq directly from GitHub. We can save ourselves some headaches
when GitHub is down by just embedding the binary in the image already.

Whenever GitHub has issues or throttles us, the yq download fails with a hash mismatch error. The mismatch is probably because GitHub returns an error page instead of the binary, although our scripts hide it.
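
For steps that still have to download a binary, the hidden error page can be surfaced before the hash is compared. The sketch below is illustrative only and is not taken from the CI scripts: verify_sha256 and fetch_verified are hypothetical names, and the URL, destination and expected digest are supplied by the caller.

```bash
# Hypothetical helper: succeed only when the file's SHA-256 digest
# equals the expected value.
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [[ "$actual" != "$expected" ]]; then
    echo "checksum mismatch for $file: got $actual, want $expected" >&2
    return 1
  fi
}

# Hypothetical helper: download a URL and verify it.
# --fail makes curl exit non-zero on an HTTP error status instead of
# saving the error page under the binary's name.
fetch_verified() {
  local url=$1 dest=$2 expected=$3
  curl --fail --silent --show-error --location --output "$dest" "$url" || {
    echo "download failed: $url" >&2
    return 1
  }
  verify_sha256 "$dest" "$expected"
}
```

With this split, a throttled or failed request is reported as a download failure, and a hash mismatch is only reported when a file was actually received.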

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

The QuickStart content shows a shadow above and/or below the content when the user can scroll into that direction. This feature is missing now.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open a quick start
  2. Reduce the window so that the content of the quick start is scrollable.

Actual results:

No shadow when the user can scroll the content in that direction.

Expected results:

A shadow when the user can scroll the content in that direction.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

4.9 (tested on master 2860e58114b4f811ac2ebf6ce34dd99263920e17)

Additional info:

Quick start is now extracted into PatternFly

Description of problem:

The user expects a shadow on the form footer when the add page content is longer than the viewport. This was shown in 4.8.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Switch to developer perspective
  2. Add page
  3. Import from Git for example

Actual results:

No shadow at the top of the form footer when the content view is scrollable.

Expected results:

A shadow at the top of the form footer when the content view is scrollable.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Happens on a cluster (4.9.0-0.nightly-2021-07-07-021823)
and local development (4.9 master, tested with 0588bc0f0b838ae448a68f35c5424f9bbfc65bc9)

Additional info:

None

This is a clone of issue OCPBUGS-1828. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1786. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

OCPBUGS-1678 is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

Description of problem:

Customer is facing issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY, but it did not help. Note that because the console operator reverts changes quickly, testing this was difficult.

+++ This bug was initially created as a clone of Bug #2104386 +++

+++ This bug was initially created as a clone of Bug #2101157 +++

Description of problem:
Customer is struggling to install OpenShift with a `no such connection profile.` error displayed in the `configure-ovs.sh` logs.

The displayed connection only contains the first half of the connection name.

Version-Release number of selected component (if applicable):
OpenShift 4.10.20

How reproducible:
Every time a connection name containing multiple words is used.

Steps to Reproduce:
1. Attempt to install OpenShift using IPI with Nodes containing default connections using names containing spaces

Actual results:
Failure shown in the below logs.

Expected results:
OpenShift installs correctly.

Additional info:
The following logs can be seen throughout the opened case:
~~~
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + local conn=System
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: ++ nmcli -g GENERAL.STATE conn show System
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: Error: System - no such connection profile.
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + local active_state=
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + '[' '' '!=' activated ']'
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + for i in {1..10}
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + echo 'Attempt 1 to bring up connection System'
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: Attempt 1 to bring up connection System
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + nmcli conn up System
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: Error: unknown connection 'System'.
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + s=10
Jun 24 17:41:53 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + sleep 5
Jun 24 17:41:58 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + for i in {1..10}
Jun 24 17:41:58 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + echo 'Attempt 2 to bring up connection System'
Jun 24 17:41:58 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: Attempt 2 to bring up connection System
Jun 24 17:41:58 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: + nmcli conn up System
Jun 24 17:41:58 slabnode2332.sprintlab735.netact.nsn-rdnet.net configure-ovs.sh[3472]: Error: unknown connection 'System'.
~~~

Following this we can see that the expected connection to be activated should not be "System" but "System ens3f0", etc.
~~~
[core@slabnode2332 ~]$ nmcli -g NAME c
bond0
System ens3f0
System ens3f1
Wired Connection
~~~

Reviewing the code we can see the loop that performs this iteration:
https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/configure-ovs-network.yaml#L677

Testing the execution flow, we can see below that ' ' is used as a separator:
~~~
$ nmcli -g NAME c | grep System
System eth0

$ for connection in $(nmcli -g NAME c | grep -- "System") ; do echo $connection ; done
System
eth0
~~~

To handle multi-word connection names, something similar to the following should be used:
~~~
$ TMP_IFS=$IFS
$ IFS=$'\n'   # ANSI-C quoting: split on newlines only

$ for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX"); do
    activate_nm_conn "$connection"
done

$ IFS=$TMP_IFS
~~~
~~~
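
An alternative that avoids changing IFS for the rest of the script is to read the names line by line. The trace in the later comments shows the script using this shape (IFS=, read -r connection, connections+=("$connection")). The sketch below only illustrates that pattern and is not the merged code; collect_connections and CONNECTIONS are hypothetical names.

```bash
# Hypothetical helper: read connection names from stdin, one per line, and
# store those ending in the given suffix in the CONNECTIONS array.
# Names with embedded spaces (e.g. "System ens3f0") stay a single element.
CONNECTIONS=()
collect_connections() {
  local suffix=$1 connection
  CONNECTIONS=()
  # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
  # The extra test handles a final line without a trailing newline.
  while IFS= read -r connection || [[ -n "$connection" ]]; do
    if [[ "$connection" == *"$suffix" ]]; then
      CONNECTIONS+=("$connection")
    fi
  done
  return 0
}
```

Feeding the function through process substitution, for example `collect_connections "$MANAGED_NM_CONN_SUFFIX" < <(nmcli -g NAME c)`, keeps the loop in the current shell. Piping into it would run the loop in a subshell, and the array would be empty afterwards.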

— Additional comment from mwasher@redhat.com on 2022-06-26 06:01:12 UTC —

This was incorrectly tagged with OpenShift 4.6 but should be 4.10. Also this is not directly related to OpenShift SDN/OVNK but with the scaffolding for OVS configuration.

— Additional comment from rravaiol@redhat.com on 2022-06-27 13:36:45 UTC —

Hi Aurko, don't hesitate to reach out to Andreas if needed.

— Additional comment from aos-team-art-private@redhat.com on 2022-07-04 23:11:22 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-07-06 03:45:47 UTC —

Bugfix included in accepted release 4.12.0-0.nightly-2022-07-05-083442
Bug will not be automatically moved to VERIFIED for the following reasons:

  • PR openshift/machine-config-operator#3214 not approved by QA contact

This bug must now be manually moved to VERIFIED by rbrattai@redhat.com

— Additional comment from rbrattai@redhat.com on 2022-07-08 11:43:48 UTC —

PR 3227 Failed on UPI vSphere static-ip kargs active-backup

bond0 MAC != br-ex MAC

http://file.rdu.redhat.com/~rbrattai/logs/PR3227-ovs-configuration.log

— Additional comment from rbrattai@redhat.com on 2022-07-08 12:47:58 UTC —

Failed due to bash subshell issue https://github.com/openshift/machine-config-operator/pull/3227#discussion_r916774605

— Additional comment from rbrattai@redhat.com on 2022-07-25 22:13:04 UTC —

Tested on 4.11.0-0.ci.test-2022-07-25-103020-ci-ln-slzt3k2-latest

libvirt IPI DHCP RHCOS active_backup

Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com ovs-vsctl[2248]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + connections=()
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[2249]: ++ nmcli -g NAME c
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired Connection == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired connection bond0 == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired connection enp5s0 == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired connection enp6s0 == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired connection enp5s0-slave-ovs-clone == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + connections+=("$connection")
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ Wired connection enp6s0-slave-ovs-clone == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + connections+=("$connection")
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ br-ex == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ ovs-if-br-ex == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ ovs-if-phys0 == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ ovs-port-br-ex == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + [[ ovs-port-phys0 == *\s\l\a\v\e\o\v\s-\c\l\o\n\e ]]
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + IFS=
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + read -r connection
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + connections+=(ovs-if-phys0 ovs-if-br-ex)
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + '[' -f /etc/ovnk/extra_bridge ']'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + activate_nm_connections 'Wired connection enp5s0-slave-ovs-clone' 'Wired connection enp6s0-slave-ovs-clone' ovs-if-phys0 ovs-if-br-ex
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + connections=("$@")
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + local connections
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + for conn in "${connections[@]}"
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[2254]: ++ nmcli -g connection.slave-type connection show 'Wired connection enp5s0-slave-ovs-clone'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + local slave_type=bond
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + '[' bond = team ']'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + '[' bond = bond ']'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + nmcli c mod 'Wired connection enp5s0-slave-ovs-clone' connection.autoconnect yes
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + for conn in "${connections[@]}"
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[2262]: ++ nmcli -g connection.slave-type connection show 'Wired connection enp6s0-slave-ovs-clone'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + local slave_type=bond
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + '[' bond = team ']'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + '[' bond = bond ']'
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + nmcli c mod 'Wired connection enp6s0-slave-ovs-clone' connection.autoconnect yes
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[1950]: + for conn in "${connections[@]}"
Jul 25 17:07:45 master-0-2.0.qe.lab.redhat.com configure-ovs.sh[2270]: ++ nmcli -g connection.slave-type connection show ovs-if-phys0

vSphere UPI DHCP RHEL

Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 10:05:27 xrskc-rhel-0 ovs-vsctl[2071]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections=()
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[2072]: ++ nmcli -g NAME c
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test bond0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens192 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens224 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens256 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ ovs-if-br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ ovs-if-phys0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ ovs-port-br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ ovs-port-phys0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens192-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections+=("$connection")
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens224-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections+=("$connection")
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + [[ test ens256-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections+=("$connection")
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + IFS=
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + read -r connection
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections+=(ovs-if-phys0 ovs-if-br-ex)
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' -f /etc/ovnk/extra_bridge ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + activate_nm_connections 'test ens192-slave-ovs-clone' 'test ens224-slave-ovs-clone' 'test ens256-slave-ovs-clone' ovs-if-phys0 ovs-if-br-ex
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + connections=("$@")
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + local connections
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + for conn in "${connections[@]}"
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[2077]: ++ nmcli -g connection.slave-type connection show 'test ens192-slave-ovs-clone'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + local slave_type=bond
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = team ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = bond ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + nmcli c mod 'test ens192-slave-ovs-clone' connection.autoconnect yes
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + for conn in "${connections[@]}"
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[2085]: ++ nmcli -g connection.slave-type connection show 'test ens224-slave-ovs-clone'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + local slave_type=bond
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = team ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = bond ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + nmcli c mod 'test ens224-slave-ovs-clone' connection.autoconnect yes
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + for conn in "${connections[@]}"
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[2093]: ++ nmcli -g connection.slave-type connection show 'test ens256-slave-ovs-clone'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + local slave_type=bond
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = team ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + '[' bond = bond ']'
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + nmcli c mod 'test ens256-slave-ovs-clone' connection.autoconnect yes
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[1627]: + for conn in "${connections[@]}"
Jul 25 10:05:27 xrskc-rhel-0 configure-ovs.sh[2101]: ++ nmcli -g connection.slave-type connection show ovs-if-phys0

vSphere UPI static-ip RHEL

Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 17:02:58 xrskc-rhel-1 ovs-vsctl[2039]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=30 --if-exists del-br br0
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections=()
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[2040]: ++ nmcli -g NAME c
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test bond0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens192 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens224 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens256 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ ovs-if-br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ ovs-if-phys0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ ovs-port-br-ex == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ ovs-port-phys0 == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens192-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections+=("$connection")
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens224-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections+=("$connection")
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + [[ test ens256-slave-ovs-clone == *-\s\l\a\v\e-\o\v\s-\c\l\o\n\e ]]
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections+=("$connection")
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + IFS=
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + read -r connection
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections+=(ovs-if-phys0 ovs-if-br-ex)
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' -f /etc/ovnk/extra_bridge ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + activate_nm_connections 'test ens192-slave-ovs-clone' 'test ens224-slave-ovs-clone' 'test ens256-slave-ovs-clone' ovs-if-phys0 ovs-if-br-ex
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + connections=("$@")
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + local connections
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + for conn in "${connections[@]}"
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[2045]: ++ nmcli -g connection.slave-type connection show 'test ens192-slave-ovs-clone'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + local slave_type=bond
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = team ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = bond ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + nmcli c mod 'test ens192-slave-ovs-clone' connection.autoconnect yes
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + for conn in "${connections[@]}"
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[2053]: ++ nmcli -g connection.slave-type connection show 'test ens224-slave-ovs-clone'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + local slave_type=bond
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = team ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = bond ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + nmcli c mod 'test ens224-slave-ovs-clone' connection.autoconnect yes
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + for conn in "${connections[@]}"
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[2061]: ++ nmcli -g connection.slave-type connection show 'test ens256-slave-ovs-clone'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + local slave_type=bond
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = team ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + '[' bond = bond ']'
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + nmcli c mod 'test ens256-slave-ovs-clone' connection.autoconnect yes
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[1593]: + for conn in "${connections[@]}"
Jul 25 17:02:58 xrskc-rhel-1 configure-ovs.sh[2069]: ++ nmcli -g connection.slave-type connection show ovs-if-phys0
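
The xtrace output above suggests how the script selects connections: every NetworkManager connection whose name ends in `-slave-ovs-clone` is collected, then `ovs-if-phys0` and `ovs-if-br-ex` are appended. A minimal Python sketch of that selection, reconstructed from the trace rather than taken from `configure-ovs.sh` itself. The function name `select_connections` is hypothetical, and the input list stands in for the output of `nmcli -g NAME c`:

```python
from fnmatch import fnmatchcase

# Suffix observed in the trace; cloned slave profiles end with it.
CLONE_PATTERN = "*-slave-ovs-clone"


def select_connections(names):
    """Return the connections to activate, in the order seen in the trace."""
    # Keep only the cloned slave profiles, preserving the input order.
    selected = [name for name in names if fnmatchcase(name, CLONE_PATTERN)]
    # The trace then appends the physical interface and bridge profiles.
    selected += ["ovs-if-phys0", "ovs-if-br-ex"]
    return selected


# Connection names taken from the trace; note that some contain spaces.
names = [
    "test bond0", "test ens192", "br-ex", "ovs-if-br-ex", "ovs-if-phys0",
    "test ens192-slave-ovs-clone", "test ens224-slave-ovs-clone",
]
print(select_connections(names))
```

Names containing spaces, such as `test ens192-slave-ovs-clone`, match the pattern as a whole string, which is the case the trace exercises.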

— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-15 07:53:50 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from rbrattai@redhat.com on 2022-08-15 17:25:50 UTC —

Fix in 4.11.0-0.nightly-2022-08-15-074436

Description of problem:
Each time a new election changes the ovn-master leader, the egressfirewall rules are written again to the OVN Northbound database. As a result, the egressfirewall rules grow indefinitely in the NBDB, and problems with rule priorities appear.

Version-Release number of selected component (if applicable):
UPI Baremetal OCP 4.9.37 and 4.9.46

How reproducible:
Every time

Steps to Reproduce:
1. create an egressfirewall on the namespace
2. restart the ovnkube-master active pod
3. check the nbdb for duplicate egressfirewall rules
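
Step 3 amounts to looking for ACL rows that repeat. A minimal sketch of that check in Python, assuming the ACL rows have already been exported from the Northbound database as `(priority, match, action)` tuples. The export step, the function name `find_duplicate_rules`, and the sample rows are all hypothetical:

```python
from collections import defaultdict


def find_duplicate_rules(rows):
    """Group rows by (match, action) and return rules seen more than once.

    The result maps each duplicated rule to the list of priorities it
    appears with, so changed priorities show up alongside the duplicates.
    """
    by_rule = defaultdict(list)
    for priority, match, action in rows:
        by_rule[(match, action)].append(priority)
    return {rule: prios for rule, prios in by_rule.items() if len(prios) > 1}


# Hypothetical sample rows: the first rule appears twice with
# different priorities, as described in the actual results.
rows = [
    (10000, "ip4.dst == 1.2.3.4", "allow"),
    (9999, "ip4.dst == 0.0.0.0/0", "drop"),
    (9998, "ip4.dst == 1.2.3.4", "allow"),
]
for rule, priorities in find_duplicate_rules(rows).items():
    print(rule, priorities)
```

Running the same check before and after restarting the active ovnkube-master pod would show whether the rule count grows.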

Actual results:
duplicate entries and priorities changed for the rules

Expected results:
there shouldn't be any duplicate entries

Additional info:

Description of problem:

Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
The following KCS article details the steps to add a proxy.
The steps have been verified on 4.7 but do not work on 4.8, 4.9 or 4.10.

https://access.redhat.com/solutions/6172402

When testing on 4.8, 4.9 and 4.10, the proxy settings were also nested under `telemeterClient`, which triggered a telemeter restart, but the proxy settings do not get set in the deployment as they do in 4.7.

Actual results:

On 4.8, 4.9 and 4.10, setting the proxy without nesting it under `telemeterClient` does not trigger a restart of the telemeter pod.

Expected results:

I think the proxy settings should be nested under `telemeterClient`, and they should then set the environment variables in the deployment.

Additional info:

Description of problem:

.NET builder image is not getting detected

Prerequisites (if any, like setup, operators/versions):

Install the OpenShift Pipelines Operator

Steps to Reproduce

  1. Follow steps of the test case

Actual results:

When the Git URL of a .NET project is provided, the builder image is not detected

Expected results:

When the Git URL of a .NET project is provided, the builder image should be detected automatically

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Additional info:

Manual backport for 4.9.z

Description of problem:

With an `IngressController` whose `endpointPublishingStrategy` is set to `Private`, a manually created `kubernetes` Service of type `NodePort` in the `openshift-ingress` namespace that follows the `IngressController` naming convention is removed when the `ingress-operator` is restarted.

$ oc get ingresscontroller -n openshift-ingress-operator example-service-testing -o json
{
    "apiVersion": "operator.openshift.io/v1",
    "kind": "IngressController",
    "metadata": {
        "creationTimestamp": "2022-02-14T10:34:35Z",
        "finalizers": [
            "ingresscontroller.operator.openshift.io/finalizer-ingresscontroller"
        ],
        "generation": 2,
        "name": "example-service-testing",
        "namespace": "openshift-ingress-operator",
        "resourceVersion": "19329705",
        "uid": "ffc9f14d-63ad-43bb-8a56-5e590cda9b38"
    },
    "spec": {
        "clientTLS": {
            "clientCA": {
                "name": ""
            },
            "clientCertificatePolicy": ""
        },
        "domain": "apps.example.com",
        "endpointPublishingStrategy": {
            "type": "Private"
        },
        "httpEmptyRequestsPolicy": "Respond",
        "httpErrorCodePages": {
            "name": ""
        },
        "tuningOptions": {},
        "unsupportedConfigOverrides": null
    },
    "status": {
        "availableReplicas": 2,
        "conditions": [
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "reason": "Valid", "status": "True", "type": "Admitted" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "status": "True", "type": "PodsScheduled" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "The deployment has Available status condition set to True", "reason": "DeploymentAvailable", "status": "True", "type": "DeploymentAvailable" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "Minimum replicas requirement is met", "reason": "DeploymentMinimumReplicasMet", "status": "True", "type": "DeploymentReplicasMinAvailable" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "All replicas are available", "reason": "DeploymentReplicasAvailable", "status": "True", "type": "DeploymentReplicasAllAvailable" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "message": "The configured endpoint publishing strategy does not include a managed load balancer", "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer", "status": "False", "type": "LoadBalancerManaged" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "message": "The endpoint publishing strategy doesn't support DNS management.", "reason": "UnsupportedEndpointPublishingStrategy", "status": "False", "type": "DNSManaged" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "status": "True", "type": "Available" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "status": "False", "type": "Degraded" }
        ],
        "domain": "apps.example.com",
        "endpointPublishingStrategy": {
            "type": "Private"
        },
        "observedGeneration": 2,
        "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=example-service-testing",
        "tlsProfile": {
            "ciphers": [
                "ECDHE-ECDSA-AES128-GCM-SHA256",
                "ECDHE-RSA-AES128-GCM-SHA256",
                "ECDHE-ECDSA-AES256-GCM-SHA384",
                "ECDHE-RSA-AES256-GCM-SHA384",
                "ECDHE-ECDSA-CHACHA20-POLY1305",
                "ECDHE-RSA-CHACHA20-POLY1305",
                "DHE-RSA-AES128-GCM-SHA256",
                "DHE-RSA-AES256-GCM-SHA384",
                "TLS_AES_128_GCM_SHA256",
                "TLS_AES_256_GCM_SHA384",
                "TLS_CHACHA20_POLY1305_SHA256"
            ],
            "minTLSVersion": "VersionTLS12"
        }
    }
}

$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 87m

After `IngressController` creation, it all looks as expected and for the Private `IngressController` we can see `router-internal-example-service-testing` Service.

$ oc create svc nodeport router-example-service-testing --tcp=80
service/router-example-service-testing created

$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 88m
router-example-service-testing NodePort 172.30.2.39 <none> 80:31874/TCP 3s

Now we create a `kubernetes` Service of type NodePort with the same naming scheme as the one created by the `IngressController`. So far so good, with no impact on functionality.

$ oc get pod -n openshift-ingress-operator
NAME READY STATUS RESTARTS AGE
ingress-operator-7d56fd784c-plwpj 2/2 Running 0 78m

$ oc delete pod ingress-operator-7d56fd784c-plwpj -n openshift-ingress-operator
pod "ingress-operator-7d56fd784c-plwpj" deleted

$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 88m
router-example-service-testing NodePort 172.30.2.39 <none> 80:31874/TCP 53s

$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 89m

$ oc logs ingress-operator-7d56fd784c-7g48r -n openshift-ingress-operator -c ingress-operator
2022-02-14T12:03:26.624Z INFO operator.main ingress-operator/start.go:63 using operator namespace {"namespace": "openshift-ingress-operator"}
I0214 12:03:27.675884 1 request.go:668] Waited for 1.02063447s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps.openshift.io/v1?timeout=32s
2022-02-14T12:03:29.284Z INFO operator.main ingress-operator/start.go:63 registering Prometheus metrics for canary_controller
2022-02-14T12:03:29.284Z INFO operator.main ingress-operator/start.go:63 registering Prometheus metrics for ingress_controller
[...]
2022-02-14T12:03:33.119Z INFO operator.dns dns/controller.go:535 using region from operator config {"region name": "us-west-1"}
2022-02-14T12:03:33.417Z INFO operator.ingress_controller controller/controller.go:298 reconciling {"request": "openshift-ingress-operator/example-service-testing"}
2022-02-14T12:03:33.509Z INFO operator.ingress_controller ingress/load_balancer_service.go:190 deleted load balancer service {"namespace": "openshift-ingress", "name": "router-example-service-testing"}
[...]

When restarting the `ingress-operator` pod, we can see that shortly afterwards the manually created `kubernetes` Service of type NodePort is removed. Looking through the code, this appears related to https://bugzilla.redhat.com/show_bug.cgi?id=1914127, but that change should only target `kubernetes` Services of type LoadBalancer. We can clearly see that this happens for every `kubernetes` Service type if the name matches the predefined `IngressController` naming scheme.

As the `kubernetes` Services don't have any owner reference to the `IngressController`-created services, it's unexpected that those are being removed, and this should be fixed.

OpenShift release version:

  • OpenShift Container Platform 4.9.15

Cluster Platform:

  • AWS, but likely on other platforms as well

How reproducible:

  • Always

Steps to Reproduce (in detail):
1. See the steps in the problem description

Actual results:

`kubernetes` services of any type and without owner reference to the `IngressController` are being removed by the `IngressController` if they have a specific naming scheme.

Expected results:

`kubernetes` services without an `IngressController` reference should never be touched, modified, or removed by it, as they may be required for 3rd-party integration or similar.
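
The expected result can be phrased as an ownership check: delete a Service only when it carries an owner reference to the expected owner, not merely when its name matches the naming scheme. A minimal Python sketch, assuming Service objects as parsed JSON dictionaries with the standard `metadata.ownerReferences` field. The helpers `is_owned_by` and `should_delete` and the sample UID are hypothetical, and this is not the operator's actual code, which may identify its services differently:

```python
def is_owned_by(service, owner_uid):
    """True if the Service lists owner_uid among its ownerReferences."""
    refs = service.get("metadata", {}).get("ownerReferences") or []
    return any(ref.get("uid") == owner_uid for ref in refs)


def should_delete(service, expected_name, owner_uid):
    """Delete only when both the name and the ownership match."""
    name = service.get("metadata", {}).get("name")
    return name == expected_name and is_owned_by(service, owner_uid)


# A manually created Service: matching name, no owner reference.
manual = {"metadata": {"name": "router-example-service-testing"}}
# A Service carrying an owner reference (UID is a made-up placeholder).
owned = {
    "metadata": {
        "name": "router-example-service-testing",
        "ownerReferences": [{"kind": "Deployment", "uid": "owner-uid-1"}],
    }
}
print(should_delete(manual, "router-example-service-testing", "owner-uid-1"))
print(should_delete(owned, "router-example-service-testing", "owner-uid-1"))
```

With a check of this kind, the manually created NodePort Service from the reproduction steps would be left alone because it has no owner reference.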

Impact of the problem:

3rd-party implementation broken after updating to OpenShift Container Platform 4.8, as some helper services were removed unexpectedly.

Additional info:

Check https://bugzilla.redhat.com/show_bug.cgi?id=1914127 as this seems to be the change that introduced that behavior. However, that change seems specific to `kubernetes` Services of type LoadBalancer, so we are wondering why other services are in scope as well.

This is a clone of issue OCPBUGS-249. The following is the description of the original issue:

+++ This bug was initially created as a clone of Bug #2070318 +++

Description of problem:
In an OCP VRRP deployment (using OCP cluster networking), we have an additional data interface which is configured along with the regular management interface in each control node. In some deployments, the kubernetes address 172.30.0.1:443 is NAT'ed to the data interface instead of the management interface (10.40.1.4:6443 vs 10.30.1.4:6443, as we configure the bootstrap node), even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 failed. After 10-15 minutes, OCP magically fixes it and NATs correctly to 10.30.1.4:6443.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Provision an OCP cluster using cluster networking for DNS & Load Balancer instead of an external DNS & Load Balancer. Provision the host with 1 management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest to create a pod which will trigger communication with kube-apiserver.

2. Start cluster installation.

3. Check the custom pod log in the cluster while the first 2 master nodes are installing to see the GET operation to kube-apiserver time out. Check the nft table and follow the IP chains to see that the kubernetes service IP address was NAT'ed to the data IP address instead of the management IP. This does not happen every time; we have seen a 50:50 chance.

Actual results:
After 10-15 minutes OCP will correct that by itself.

Expected results:
Wrong natting should not happen.

Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)

— Additional comment from bnemec@redhat.com on 2022-03-30 20:00:25 UTC —

This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.

— Additional comment from lpbinh@gmail.com on 2022-03-31 08:09:11 UTC —

Hi Ben,

We were deploying Contrail CNI with OCP. However, this issue happens very early in the deployment, right after the bootstrap node is started, and there's no SDN/CNI there yet.

— Additional comment from bnemec@redhat.com on 2022-03-31 15:26:23 UTC —

Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.

— Additional comment from trozet@redhat.com on 2022-04-04 15:22:21 UTC —

Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.

— Additional comment from lpbinh@gmail.com on 2022-04-05 16:45:13 UTC —

All nodes have two interfaces:

eth0: 10.30.1.0/24
eth1: 10.40.1.0/24

machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1

The kubeapi service ip is 172.30.0.1:443

all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)

To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.

Initially kubeapi is only active on the bootstrap node so there should be a nat rule like

172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)

However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)

The rule is configured on the controller nodes and leads to asymmetrical routing: the controller sends a packet from machineNetwork (10.30.1.x) to 172.30.0.1, which is translated and forwarded to 10.40.1.10. That node then tries to reply on the 10.40.1.0 network, which fails because the request came from the 10.30.1.0 network.

So, we want to understand why the openshift installer picks the 10.40.1.x ip address rather than the 10.30.1.x ip for the nat rule. What's the mechanism for getting the ip when the system has multiple interfaces with ips configured?

Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
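
The expected selection can be expressed as keeping only addresses that fall inside machineNetwork. A minimal sketch with Python's `ipaddress` module, using the addresses quoted above. The function name `pick_node_ip` is hypothetical, and this does not describe how the installer actually chooses the address:

```python
import ipaddress


def pick_node_ip(addresses, machine_network):
    """Return the first address inside machine_network, or None."""
    net = ipaddress.ip_network(machine_network)
    for addr in addresses:
        if ipaddress.ip_address(addr) in net:
            return addr
    return None


# The bootstrap node's eth1 and eth0 addresses from the comment above,
# listed with the data address first to show that order does not matter.
print(pick_node_ip(["10.40.1.10", "10.30.1.10"], "10.30.1.0/24"))
```

Filtering on the subnet makes the result independent of the order in which the interfaces are enumerated.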

— Additional comment from smerrow@redhat.com on 2022-04-13 13:55:04 UTC —

Note from Juniper regarding requested SOS report:

In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ

  • Making note private to hide partner link

— Additional comment from smerrow@redhat.com on 2022-04-21 12:24:33 UTC —

Can we please get an update on this BZ?

Do let us know if there is any other information needed.

— Additional comment from trozet@redhat.com on 2022-04-21 14:06:00 UTC —

Can you please provide another link to the sosreport? Looks like the link is dead.

— Additional comment from smerrow@redhat.com on 2022-04-21 19:01:39 UTC —

See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ

— Additional comment from trozet@redhat.com on 2022-04-21 20:57:24 UTC —

Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:

[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME READY STATUS RESTARTS AGE
openshift-kube-proxy-kmm2p 2/2 Running 0 19h
openshift-kube-proxy-m2dz7 2/2 Running 0 16h
openshift-kube-proxy-s9p9g 2/2 Running 1 19h
openshift-kube-proxy-skrcv 2/2 Running 0 19h
openshift-kube-proxy-z4kjj 2/2 Running 0 19h

I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service determines the node IP by looking at which interfaces have a default route and which of those has the lowest metric. Do both of your interfaces have default routes? If so, are they using DHCP, and is it perhaps random which route gets installed with the lower metric?
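The selection rule described above (only interfaces with a default route are considered, and the lowest metric wins) can be sketched as follows. This is a simplified model for illustration, not the actual nodeip-configuration implementation; the `Route` type and `pick_node_ip` helper are hypothetical, and the addresses and metrics are made-up values in the subnets mentioned in this bug.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    """Hypothetical, minimal view of one routing-table entry."""
    interface: str
    source_ip: str
    destination: str  # "default" or a CIDR string
    metric: int


def pick_node_ip(routes: List[Route]) -> Optional[str]:
    """Return the source IP of the default route with the lowest metric."""
    defaults = [r for r in routes if r.destination == "default"]
    if not defaults:
        return None
    # min() keeps the first entry on ties, so equal metrics make the result
    # depend on the order in which routes were installed.
    return min(defaults, key=lambda r: r.metric).source_ip


routes = [
    Route("eth0", "10.30.1.7", "default", 100),
    Route("eth1", "10.40.1.7", "default", 200),
    Route("eth1", "10.40.1.7", "10.40.1.0/24", 0),  # not a default route, ignored
]
print(pick_node_ip(routes))  # 10.30.1.7
```

If DHCP hands out metrics in a different order on some boots, the same function returns the eth1 address instead, which is the kind of nondeterminism the question above is probing for.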

— Additional comment from
trozet@redhat.com
on 2022-04-21 21:13:15 UTC —

Correction: standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr, so it is the correct default behavior for kube-proxy to be deployed here.

— Additional comment from
lpbinh@gmail.com
on 2022-04-25 04:05:14 UTC —

Hi Tim,

You are right, kube-proxy is deployed by default and we don't change that behavior.

There is only 1 default route, configured for the management interface (10.30.1.x). We used to have a default route with a higher metric for the data/vrrp interface (10.40.1.x). As said, we no longer have a default route for the second interface, but we still encounter the issue pretty often.

— Additional comment from
trozet@redhat.com
on 2022-04-25 14:24:05 UTC —

Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.

— Additional comment from
trozet@redhat.com
on 2022-04-25 16:12:04 UTC —

Actually Ben reminded me that the invalid endpoint is the bootstrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap node's IP address in the machine network)

vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap node's eth1 IP address)

So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, moving back to Ben.

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:33:45 UTC —

Created attachment 1875023 [details]sosreport

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:34:59 UTC —

Created attachment 1875024 [details]sosreport-part2

Hi Tim,

We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.

I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
The resulting sosreport is 22MB, which exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab). If you can't join them, we will need to put the collected sosreport on a shared drive like we did with the must-gather data.

Here are some notes about the cluster:

First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.

[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
ocp-binhle-8dvald-ctrl-1 Ready master 14m v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2 Ready master 22m v1.21.8+ed4d8fd

We see that the wrong NAT rule was installed at the beginning and then corrected later:

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 3 bytes 180 jump KUBE-SEP-VZ2X7DROOLWBXBJ4
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443
  }
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM
  }
}
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
  chain KUBE-SEP-X33IBTDFOZRR6ONM {
    ip saddr 10.30.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443
  }
}
sh-4.4#
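
The check done by eye in the output above (is the DNAT target of the 172.30.0.1:443 service inside the machine network?) can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, the regular expression is written against the `dnat to <ip>:<port>` text seen in this transcript, and 10.30.1.0/24 is the machine network named later in this thread.

```python
import ipaddress
import re
from typing import List, Tuple

# Matches the "dnat to <ipv4>:<port>" action as it appears in the KUBE-SEP-* chains above.
DNAT_RE = re.compile(r"dnat to (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dnat_targets(nft_output: str) -> List[Tuple[str, int]]:
    """Return every (ip, port) DNAT target found in `nft list chain` output."""
    return [(ip, int(port)) for ip, port in DNAT_RE.findall(nft_output)]


def targets_outside(nft_output: str, machine_cidr: str) -> List[Tuple[str, int]]:
    """Return the DNAT targets whose IP is not inside the machine network."""
    network = ipaddress.ip_network(machine_cidr)
    return [
        (ip, port)
        for ip, port in dnat_targets(nft_output)
        if ipaddress.ip_address(ip) not in network
    ]


EARLY = "meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443"
LATER = "meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443"

print(targets_outside(EARLY, "10.30.1.0/24"))  # [('10.40.1.7', 6443)]
print(targets_outside(LATER, "10.30.1.0/24"))  # []
```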

— Additional comment from
lpbinh@gmail.com
on 2022-05-12 17:46:51 UTC —

@
trozet@redhat.com
May we have an update on the fix, or the plan for the fix? Thank you.

— Additional comment from
lpbinh@gmail.com
on 2022-05-18 21:27:45 UTC —

Created support Case 03223143.

— Additional comment from
vkochuku@redhat.com
on 2022-05-31 16:09:47 UTC —

Hello Team,

Any update on this?

Thanks,
Vinu K

— Additional comment from
smerrow@redhat.com
on 2022-05-31 17:28:54 UTC —

This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.

I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.

Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.

[1]
https://access.redhat.com/support/cases/#/case/03223143
— Additional comment from
vpickard@redhat.com
on 2022-06-02 22:14:23 UTC —

@
bnemec@redhat.com
Tim mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14
that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?

— Additional comment from
bnemec@redhat.com
on 2022-06-03 18:15:17 UTC —

Sorry, I missed that this came back to me.

(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.

This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?

I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.

— Additional comment from
rurena@redhat.com
on 2022-06-06 14:36:54 UTC —

(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.

I spoke to the CU; they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.

— Additional comment from
bnemec@redhat.com
on 2022-06-06 16:19:37 UTC —

Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.

— Additional comment from
lpbinh@gmail.com
on 2022-06-06 16:38:56 UTC —

Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895

In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):

We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”

We found that there is no platform type "OpenStack" available in [https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29], so we set "baremetal" as the platform type on our computes. That is why you are seeing baremetal as the platform type.

Thank you

— Additional comment from
ercohen@redhat.com
on 2022-06-08 08:00:18 UTC —

Hey, first, you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet on the bootstrap node.

I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack Vms?
You set the platform to Baremetal where?
Did you set user-managed-networking?

Some more info: when using the OpenStack platform you should install the cluster with user-managed-networking.
That is what the failing validation is for.

— Additional comment from
bnemec@redhat.com
on 2022-06-08 14:56:53 UTC —

Moving to the assisted-installer component for further investigation.

— Additional comment from
lpbinh@gmail.com
on 2022-06-09 07:37:54 UTC —

@Eran Cohen:

Please see my response inline.

You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.

You are trying to form a cluster from OpenStack Vms?
--> Yes.

You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.

Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.

— Additional comment from
itsoiref@redhat.com
on 2022-06-09 08:17:23 UTC —

@lpbinh@gmail.com can you please share the assisted-installer logs, which you can download once the cluster has either failed or finished installing?
They will help us see the full picture.

— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —

OK, as noted before, when using the OpenStack platform you should install the cluster with user-managed-networking (set to true).
Can you explain how you worked around this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'

To be honest, I'm surprised that the installation completed successfully.

@oamizur@redhat.com I thought installing on OpenStack VMs with the baremetal platform (user-managed-networking=false) would always fail?

— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —

@itsoiref@redhat.com: I will reproduce and collect the logs. Are they supposed to be included in the provided must-gather?
@ercohen@redhat.com:

  • user-managed-networking is set to true when we use an external load balancer and DNS server. For VRRP we use OpenShift's internal LB and DNS server, hence it is set to false, following the doc.
  • As explained, OpenShift returns the platform type as 'none' for OpenStack (https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29), therefore we set the platform type to 'baremetal' in the cluster object for provisioning the cluster using OpenStack VMs.

— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —

@lpbinh@gmail.com you will have a download_logs link in the UI. Those logs are not part of the must-gather.

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —

Created attachment 1889993 [details]cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, the issue was not resolved by OpenShift itself: the wrong NAT rule remained and the cluster deployment eventually failed.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:56:06 UTC —

Created attachment 1889994 [details]cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Please find the cluster log attached per your request. In this deployment the wrong NAT rule was not automatically resolved by OpenShift, hence the deployment eventually failed.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443
  }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
itsoiref@redhat.com
on 2022-06-15 15:59:22 UTC —

@lpbinh@gmail.com just for the record, we don't support the baremetal platform for OCP on OpenStack; that's why the validation is failing.

— Additional comment from
lpbinh@gmail.com
on 2022-06-15 17:47:39 UTC —

@itsoiref@redhat.com as explained, it's just a workaround on our side to make OCP work in our lab. From my understanding, from the OCP perspective the deployment is on baremetal only and is not related to OpenStack (please correct me if I am wrong).

We have done thousands of OCP cluster deployments in our automation so far; if that were why validation is failing, it should fail every time. However, it only occurs occasionally, when nodes have 2 interfaces and use the OCP internal DNS and load balancer, and it sometimes resolves by itself and sometimes does not.

— Additional comment from
itsoiref@redhat.com
on 2022-06-19 17:00:01 UTC —

For now I assume that this endpoint is causing the issue:
{
  "apiVersion": "v1",
  "kind": "Endpoints",
  "metadata": {
    "creationTimestamp": "2022-06-14T17:31:10Z",
    "labels": {
      "endpointslice.kubernetes.io/skip-mirror": "true"
    },
    "name": "kubernetes",
    "namespace": "default",
    "resourceVersion": "265",
    "uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
  },
  "subsets": [
    {
      "addresses": [
        { "ip": "10.40.1.7" }
      ],
      "ports": [
        {
          "name": "https",
          "port": 6443,
          "protocol": "TCP"
        }
      ]
    }
  ]
},
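
A minimal sketch of how such an Endpoints object could be checked against the machine network, assuming it has already been fetched as JSON. The helper names are hypothetical, the sample is trimmed to the fields the check needs, and 10.30.1.0/24 is the machine network mentioned earlier in this thread.

```python
import ipaddress
import json
from typing import List


def endpoint_ips(endpoints: dict) -> List[str]:
    """Collect every address listed under subsets[].addresses[].ip."""
    return [
        address["ip"]
        for subset in endpoints.get("subsets", [])
        for address in subset.get("addresses", [])
    ]


def ips_outside(endpoints: dict, machine_cidr: str) -> List[str]:
    """Return endpoint IPs that do not belong to the machine network."""
    network = ipaddress.ip_network(machine_cidr)
    return [ip for ip in endpoint_ips(endpoints) if ipaddress.ip_address(ip) not in network]


# Trimmed copy of the object quoted above.
SAMPLE = json.loads("""
{
  "kind": "Endpoints",
  "metadata": {"name": "kubernetes", "namespace": "default"},
  "subsets": [
    {
      "addresses": [{"ip": "10.40.1.7"}],
      "ports": [{"name": "https", "port": 6443, "protocol": "TCP"}]
    }
  ]
}
""")

print(ips_outside(SAMPLE, "10.30.1.0/24"))  # ['10.40.1.7']
```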

— Additional comment from
itsoiref@redhat.com
on 2022-06-21 17:03:51 UTC —

The issue is that the kube-api service advertises the wrong IP. It does so because kubelet picks one of the node's IPs arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.

— Additional comment from
lpbinh@gmail.com
on 2022-06-22 16:07:29 UTC —

@itsoiref@redhat.com how do you perform OCP deployments on setups that have multiple interfaces if kubelet is left to choose an interface arbitrarily, instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure on systems with multiple interfaces would be high.

— Additional comment from
dhellard@redhat.com
on 2022-06-24 16:32:26 UTC —

I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go live dates. We need clear information about the status and its possible resolution."

— Additional comment from
itsoiref@redhat.com
on 2022-06-26 07:28:44 UTC —

I have sent an image with a possible fix to Juniper and am waiting for their feedback. Once they confirm it works for them, we will proceed with the PRs.

— Additional comment from
pratshar@redhat.com
on 2022-06-30 13:26:26 UTC —

=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —

//EMT note//

Update from our consultant Manuel Martinez Briceno -

====
On 28th June, 2022, the last feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an estimated time to finish, but we will be tracking this closely and will share any news.
====

Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat

This is a clone of issue OCPBUGS-1523. The following is the description of the original issue:

Description of problem:
In a completely disconnected cluster, the dev catalog takes too long to load.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.

Actual results:
Taking too much time to load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

Description of problem:

The topology uses a PatternFly toolbar component to render its toolbar. It looks like the latest version uses a grid and flex layout, which adds additional spacing below the toolbar.

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open developer console
  2. Check topology toolbar design

Actual results:

There is a spacing below the toolbar (Display options, Filter by resource, Filter by name), see attached screenshot.

Expected results:

No additional spacing below the toolbar. (See screenshot of 4.8)

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

4.9 master, tested with commit 3c6537eec4f5c165cf214c4100bddeccc104ed44

Additional info:

Maybe this is a patternfly issue, or we should check if the second div in the toolbar should not be rendered. See attached screenshot.

This is a clone of issue OCPBUGS-1758. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1354. The following is the description of the original issue:

This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

Description of problem:

The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the CU also verified that the issue can happen with OpenShift 4.7.30, 4.8.17 and 4.9.11)

How reproducible:

Attaching a NIC to a master node will trigger the issue

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
  { "address": "10.0.178.163", "type": "InternalIP" },
  { "address": "10.0.187.247", "type": "InternalIP" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }
]

$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h

$ oc get co etcd -o json | jq ".status.conditions[0]"
{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }

~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }

~~~
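
The mismatch reported by the operator can be reproduced offline by comparing the node's InternalIP addresses with the IP SANs of the certificate. The sketch below assumes both lists have already been extracted (for example from the `jq` output and the error message above); `missing_internal_ips` is a hypothetical helper, not part of any OpenShift tooling.

```python
import ipaddress
from typing import Dict, Iterable, List


def missing_internal_ips(addresses: List[Dict[str, str]], cert_ip_sans: Iterable[str]) -> List[str]:
    """Return node InternalIPs that are not covered by the certificate's IP SANs."""
    # Parse into ip_address objects so different textual spellings compare equal.
    covered = {ipaddress.ip_address(san) for san in cert_ip_sans}
    return [
        entry["address"]
        for entry in addresses
        if entry["type"] == "InternalIP"
        and ipaddress.ip_address(entry["address"]) not in covered
    ]


# Values copied from the transcript above.
NODE_ADDRESSES = [
    {"address": "10.0.178.163", "type": "InternalIP"},
    {"address": "10.0.187.247", "type": "InternalIP"},
    {"address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname"},
    {"address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS"},
]
CERT_IP_SANS = ["::1", "10.0.178.163", "127.0.0.1"]

print(missing_internal_ips(NODE_ADDRESSES, CERT_IP_SANS))  # ['10.0.187.247']
```

An empty result would mean every InternalIP is covered, which is the expected state once the secrets have been regenerated.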

Description of problem:

When a DNS hostname is queried from a pod on a given node, the response comes from a random CoreDNS pod rather than preferring the local one. Is this the expected result?

# In OCP v4.8.13 case
// Ran dig command on the certain node which is running the following test-7cc4488d48-tqc4m pod.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results in the pod match those on the node it runs on, as shown above.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


In v4.10, in contrast, the DNS query results vary and come back seemingly at random, regardless of the local DNS results on the node, as follows.

# In OCP v4.10.23 case, pod's response from DNS services are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node running the pod keeps responding with the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78
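
The comparison above can be made mechanical by lining up the pod and node samples by timestamp. This is a rough sketch with hypothetical helper names; it assumes the `HH:MM:SS :<answer>` line format produced by the loops above. Comparing first answers is only a proxy for "served by the same resolver cache", since upstream answers for google.com rotate on their own.

```python
from typing import Dict, List, Tuple


def parse_samples(text: str) -> Dict[str, str]:
    """Map each 'HH:MM:SS' timestamp to the first answer printed on that line."""
    samples = {}
    for line in text.splitlines():
        stamp, sep, rest = line.strip().partition(" :")
        fields = rest.split()
        if sep and fields:  # skip blank lines and lines without an answer
            samples[stamp] = fields[0]
    return samples


def mismatches(pod: Dict[str, str], node: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (timestamp, pod_answer, node_answer) where the two differ."""
    shared = sorted(pod.keys() & node.keys())
    return [(s, pod[s], node[s]) for s in shared if pod[s] != node[s]]


POD_OUTPUT = """\
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78
"""
NODE_OUTPUT = "\n".join(f"07:23:0{i} :172.217.161.78" for i in range(6))

diff = mismatches(parse_samples(POD_OUTPUT), parse_samples(NODE_OUTPUT))
print(len(diff), "of 6 samples differ")  # 5 of 6 samples differ
```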

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue by running "dig google.com" both from any pod and from the node the pod is running on, as described in the "Description" details above.

Steps to Reproduce:

1. Run any usual pod, and check which node the pod is running on.
2. Run dig google.com on the pod and the node.
3. Check whether the IP returned in the pod is consistent with the one returned on the node.

Actual results:

The response IPs are not consistent; a seemingly random IP is returned.

Expected results:

The response IP is consistent, and queries prefer the local DNS pod.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

Description

All test scenarios related to add flow should get executed on CI

Acceptance Criteria

  • Update the devconsole/package.json
  • Verify the scripts on remote cluster

Additional Details:

Create README based on shared Google doc and add it to OpenShift so we have documentation for i18n work.

This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106838 for backporting purposes

 
+++ This bug was initially created as a clone of
Bug #2015023
+++

Description of problem:
Tried to uninstall the operator and it worked. However, the operator custom resource doesn't get deleted.
Strangely, it gets recreated even after issuing "oc delete operator <name>"

Version-Release number of selected component (if applicable):
4.7.31

How reproducible:
100%

Steps to Reproduce:
1. Install any operator
2. Check operator resource
3. Uninstall the operator
4. Try to delete the "operator" resource.
5. List the operator resource. The operator resource will get recreated.

Actual results:
The operator resource is not getting removed even after deleting it.

Expected results:
After issuing "oc delete operator <name>", the operator resource should get removed.

Additional info:
A similar bug [1] was fixed in 4.7.0, but it looks like the issue is still there.

[1]
https://bugzilla.redhat.com/show_bug.cgi?id=1899588
— Additional comment from Dhruv Gautam on 2021-10-18 09:14:25 UTC —

Hi Team

Let me know if you need any logs.

Regards
Dhruv Gautam

— Additional comment from Nick Hale on 2021-12-02 20:09:45 UTC —

Sorry for the slow response!

@
dgautam@redhat.com
I'm going to need the status of the Operator resource after the deletion attempt is made. That status should show any remaining components. Without that info, I can only suspect that some cluster scoped resources still exist that reference the Operator – e.g. a CRD – since I have been unable to reproduce the issue myself.

A must-gather will help too.

— Additional comment from Dhruv Gautam on 2021-12-03 12:45:05 UTC —

Hi Nick

Must-gather is available in below google drive:
[-]
https://drive.google.com/file/d/1DtGkIzZpWYjihu_kG0OR4BkLFjIsPlFB/view?usp=sharing
If required, we can try to reproduce the issue together.

Regards
Dhruv Gautam

— Additional comment from Nick Hale on 2021-12-03 15:53:47 UTC —

Looking at `cluster-scoped-resources/operators.coreos.com/operators/cloud-native-postgresql.sandbox-kokj.yaml` in the must-gather shows that there are still resources related to the Operator on the cluster:

```
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  name: postgresql-operator-controller-manager-1-9-1-service-auth-reader
  namespace: kube-system
- apiVersion: apiextensions.k8s.io/v1
  conditions:
  - lastTransitionTime: "2021-10-06T10:29:10Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2021-10-06T10:29:11Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  kind: CustomResourceDefinition
  name: clusters.postgresql.k8s.enterprisedb.io
- apiVersion: apiextensions.k8s.io/v1
  conditions:
  - lastTransitionTime: "2021-10-06T10:29:10Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2021-10-06T10:29:10Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  kind: CustomResourceDefinition
  name: backups.postgresql.k8s.enterprisedb.io
- apiVersion: apiextensions.k8s.io/v1
  conditions:
  - lastTransitionTime: "2021-10-06T10:29:11Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2021-10-06T10:29:11Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  kind: CustomResourceDefinition
  name: scheduledbackups.postgresql.k8s.enterprisedb.io
```

These must be deleted before the Operator resource can be.

I'm now fairly confident that this is not a bug. If you delete these resources, then delete the Operator resource, and it is still recreated, please reopen this BZ.
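
As an illustration of the check described above, the component references of an Operator resource can be summarized to see what still has to be removed. This is a sketch only: `describe_remaining` is a hypothetical helper, and the list is assumed to have been loaded from the must-gather YAML into plain dictionaries with the same keys as the excerpt.

```python
from typing import Dict, List


def describe_remaining(refs: List[Dict[str, str]]) -> List[str]:
    """Render one human-readable line per component the Operator still references."""
    lines = []
    for ref in refs:
        namespace = ref.get("namespace")
        scope = f"namespace {namespace}" if namespace else "cluster-scoped"
        lines.append(f"{ref['kind']} {ref['name']} ({scope})")
    return lines


# References copied from the excerpt above (conditions omitted).
REFS = [
    {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "name": "postgresql-operator-controller-manager-1-9-1-service-auth-reader",
        "namespace": "kube-system",
    },
    {"apiVersion": "apiextensions.k8s.io/v1", "kind": "CustomResourceDefinition",
     "name": "clusters.postgresql.k8s.enterprisedb.io"},
    {"apiVersion": "apiextensions.k8s.io/v1", "kind": "CustomResourceDefinition",
     "name": "backups.postgresql.k8s.enterprisedb.io"},
    {"apiVersion": "apiextensions.k8s.io/v1", "kind": "CustomResourceDefinition",
     "name": "scheduledbackups.postgresql.k8s.enterprisedb.io"},
]

for line in describe_remaining(REFS):
    print(line)
```

An empty list would correspond to the "no components" case that is reported later in this thread.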

Thanks!

— Additional comment from Dhruv Gautam on 2021-12-03 17:07:34 UTC —

Hi Nick

You got that absolutely right. The CRDs were not cleared due to which the operator resource was getting recreated.
This is not a bug.

Regards
Dhruv Gautam

— Additional comment from Nick Hale on 2021-12-14 15:57:11 UTC —

Looks like we're seeing cases with no components as well. I'm reopening this.

Just got out of a call with the related customer – they'll be posting a new must-gather soon.

— Additional comment from Dhruv Gautam on 2021-12-14 17:09:57 UTC —

Hi Nick

Thanks for assisting over the remote.
Please find latest must-gather below:
[-]
https://drive.google.com/file/d/1mNvJFmoabUTCXiZ0TMmucR8YAIEsWAk5/view?usp=sharing
Regards
Dhruv Gautam
Red Hat

— Additional comment from Dhruv Gautam on 2022-01-25 08:39:24 UTC —

Hello Nick

Any update on the bugzilla ?

Regards
Dhruv Gautam
Red Hat

— Additional comment from Dhruv Gautam on 2022-02-08 17:41:59 UTC —

Hello Team

Is there any update on this bugzilla ?

Regards
Dhruv Gautam
Red Hat

— Additional comment from Nick Hale on 2022-02-10 15:25:38 UTC —

Hi Dhruv,

Very sorry for the delayed response.

We have an upstream PR from an external contributor addressing this. I'm moving to either get that in or create a patch myself.

Hopefully, there should be something merged upstream within the week – after that, I'll focus on getting it merged downstream, although I'm not sure if we can backport all the way to 4.7.z at this point. I'll follow up with the team and see what's possible.

I'll post my findings here later today.

— Additional comment from Nick Hale on 2022-02-10 15:26:05 UTC —

Current upstream PR:
https://github.com/operator-framework/operator-lifecycle-manager/pull/2582

— Additional comment from Nick Hale on 2022-02-10 16:16:50 UTC —

Okay, so it looks like we don't backport fixes for medium issues to 4.7.z. Have the customers upgraded to a newer version of OpenShift?

We can look into changing the severity if this issue is causing significant interruptions for users.

Here are the support phase dates for current OpenShift releases:
https://access.redhat.com/support/policy/updates/openshift#dates
(that doc also has the phase SLAs)

— Additional comment from Dhruv Gautam on 2022-03-09 17:14:19 UTC —

Hi Nick

I understand your point about supportable phase and SLAs.

I would like to know:

  • In which RHOCP version the fix will be made available ?
  • Are there any tentative dates when the fix will be released ?

Regards
Dhruv Gautam
Red Hat

— Additional comment from Nick Hale on 2022-04-05 19:05:52 UTC —

Dhruv,

An upstream fix for this – different than the one mentioned in my previous comment – has merged (see https://github.com/operator-framework/operator-lifecycle-manager/pull/2697).

> - In which RHOCP version the fix will be made available ?

If we can get the change synced to our downstream repository on time, it will be released in version 4.11.0.

> - Are there any tentative dates when the fix will be released ?

As of today, 4.11.0 is planned to go GA on July 13th 2022 (see https://docs.google.com/spreadsheets/d/19bRYespPb-AvclkwkoizmJ6NZ54p9iFRn6DGD8Ugv2c/edit#gid=0).

— Additional comment from Per da Silva on 2022-04-07 09:04:37 UTC —

Hi Dhruv,

We've brought this change downstream on this PR:
https://github.com/openshift/operator-framework-olm/pull/278
I'll update this ticket to ON_QA

Cheers,

Per

— Additional comment from Jian Zhang on 2022-04-07 09:59:45 UTC —

Hi Nick,

> Current upstream PR: https://github.com/operator-framework/operator-lifecycle-manager/pull/2582

This fix PR had been rejected, could you help link the right one? Thanks!

Hi Bruno,

> We've brought this change downstream on this PR: https://github.com/openshift/operator-framework-olm/pull/278

I checked the latest payload, it contains the fixed PR. Could you help verify it? Thanks!
mac:~ jianzhang$ oc adm release info registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-04-07-053433 -a .dockerconfigjson --commits|grep olm
operator-lifecycle-manager
https://github.com/openshift/operator-framework-olm
491ea010345b42d0ffd19208124e16bc8a9d1355

— Additional comment from Bruno Andrade on 2022-04-08 00:09:44 UTC —

oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-07-053433 True False 4h25m Cluster version is 4.11.0-0.nightly-2022-04-07-053433

oc exec olm-operator-67fc464567-8wl9l -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.19.0
git commit: 491ea010345b42d0ffd19208124e16bc8a9d1355

cat og-single.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single1
  namespace: default
spec:
  targetNamespaces:
  - default

oc apply -f og-single.yaml
operatorgroup.operators.coreos.com/og-single1 created

cat teiidcatsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: teiid
  namespace: default
spec:
  displayName: "teiid Operators"
  image: quay.io/bandrade/teiid-index:1898500
  publisher: QE
  sourceType: grpc

oc create -f teiidcatsrc.yaml
catalogsource.operators.coreos.com/teiid created

cat teiidsub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: teiid
  namespace: default
spec:
  source: teiid
  sourceNamespace: default
  channel: alpha
  installPlanApproval: Automatic
  name: teiid

oc apply -f teiidsub.yaml
subscription.operators.coreos.com/teiid created

oc get sub -n default
NAME PACKAGE SOURCE CHANNEL
teiid teiid teiid alpha

oc get ip -n default
NAME CSV APPROVAL APPROVED
install-psjsf teiid.v0.3.0 Automatic true

oc get csv
NAME DISPLAY VERSION REPLACES PHASE
teiid.v0.3.0 Teiid 0.3.0 Succeeded

oc get operators -n default
NAME AGE
cluster-logging.openshift-logging 5h14m
elasticsearch-operator.openshift-operators-redhat 5h14m
teiid.default 13m

oc delete sub teiid
subscription.operators.coreos.com "teiid" deleted

oc delete csv teiid.v0.3.0
clusterserviceversion.operators.coreos.com "teiid.v0.3.0" deleted

oc get operator teiid.default -o yaml
apiVersion: operators.coreos.com/v1
kind: Operator
metadata:
  creationTimestamp: "2022-04-07T23:22:22Z"
  generation: 1
  name: teiid.default
  resourceVersion: "146694"
  uid: d74c796d-7482-4caa-96ed-fbd401a35f19
spec: {}
status:
  components:
    labelSelector:
      matchExpressions:
      - key: operators.coreos.com/teiid.default
        operator: Exists

oc delete operator teiid.default -n default
warning: deleting cluster-scoped resources, not scoped to the provided namespace
operator.operators.coreos.com "teiid.default" deleted

oc get operator teiid.default -o yaml
Error from server (NotFound): operators.operators.coreos.com "teiid.default" not found

LGTM, marking as VERIFIED

— Additional comment from Raúl Fernández on 2022-05-27 10:24:05 UTC —

Hi,

My customer Telefonica is experiencing this problem (case linked) and requesting this backport to 4.8, as they are having this issue with multiple operators.

Could we have a backport schedule for this?

Thanks.

Best regards,
Raúl Fernández

— Additional comment from errata-xmlrpc on 2022-06-15 18:25:09 UTC —

This bug has been added to advisory RHEA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

— Additional comment from Silvia Parpatekar on 2022-06-24 08:07:53 UTC —

Hello Team,

We have a customer on OCP 4.9.36 bare metal who is facing this issue with the Performance Addon Operator.
Easily reproducible: installed the operator and deleted its resources:
~~~
$ oc delete csv performance-addon-operator.v4.8.8
$ oc delete ip install-flsns
$ oc delete sub performance-addon-operator
$ oc delete job.batch/77c159445752f625e1337c92dabd6cfdfc1eebcb7accbbfd5d1c7227656cec6 -n openshift-marketplace
$ oc delete configmap/77c159445752f625e1337c92dabd6cfdfc1eebcb7accbbfd5d1c7227656cec6 -n openshift-marketplace
$ oc get crd | grep performance
$ oc delete crd performanceprofiles.performance.openshift.io
$ oc scale deployment -n openshift-operator-lifecycle-manager olm-operator --replicas=0
$ oc delete operator performance-addon-operator.openshift-operators
$ oc scale deployment -n openshift-operator-lifecycle-manager olm-operator --replicas=1
$ oc get operator
NAME AGE
performance-addon-operator.openshift-operators 7m31s
~~~

The Operator still exists in CLI but when I checked the web console I couldn't find any operator there. We tried various other ways to delete but no luck.

Can we get an update on which version this is going to be fixed in?

— Additional comment from Immanuvel on 2022-07-01 12:44:27 UTC —

Hello Team,

Any workaround available for this ?

Thanks
Immanuvel

— Additional comment from himadri on 2022-07-01 13:16:41 UTC —

Hello Team,

Accounts team has reached out to EMT on this as the Customer is frustrated with the time being consumed without a fix. The Customer temperature is high and would request your intervention in getting a permanent fix or workaround at the earliest, to avoid the situation blowing up further.

Please find the Business impact as shared by the Accounts Team for your reference.

Business Impact - This issue has made the development functionality miss 2 sprint cycles of delivery for the customer. Customer cannot afford to miss delivery of yet another cycle and delay in delivery.

Thanks,

Himadri.

— Additional comment from Red Hat Bugzilla on 2022-07-09 04:22:15 UTC —

remove performed by PnT Account Manager <pnt-expunge@redhat.com>

— Additional comment from Red Hat Bugzilla on 2022-07-09 04:22:46 UTC —

remove performed by PnT Account Manager <pnt-expunge@redhat.com>

— Additional comment from Raúl Fernández on 2022-07-11 07:41:36 UTC —

Hi,

Now that this BZ is in the verified state, I think a backport to 4.10 is needed, as this issue is affecting many customers. Is there any plan for it?

This is a clone of issue OCPBUGS-1538. The following is the description of the original issue:

Description of problem:

Tracking this for backport of https://bugzilla.redhat.com/show_bug.cgi?id=2072710

Currently we see this issue:

 
Aug 28 00:02:20.755103 ip-10-0-131-145 hyperkube[1366]: I0828 00:02:20.755067 1366 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-quorum-guard-588ff9b55d-8lhb7" podUID=5b79def2-9e56-4c93-b8ab-1d04db0f552f containerName="guard" probeResult=failure output=""
then, a few seconds later:
Aug 28 00:02:25.797258 ip-10-0-131-145 hyperkube[1366]: I0828 00:02:25.797231 1366 kubelet.go:2175] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-etcd/etcd-quorum-guard-588ff9b55d-8lhb7"
Try to improve the clustermembercontroller sync loop for health status, or at least avoid failing on the quorum guard probe during install or scale. Perhaps use metrics to track this instead of operator status.

Slack for more context https://coreos.slack.com/archives/C027U68LP/p1630506922034600

 

AC: 

  • decide which approach we want to take and present it in the team meeting
  • implement the proposed solution

+++ This bug was initially created as a clone of Bug #2103899 +++

+++ This bug was initially created as a clone of Bug #2099945 +++

Description of problem:

vSphere UPI static IP active-backup bonding using kargs

NetworkManager enters an infinite link flap loop after active-backup primary slave link is restored.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. disconnect link of primary slave (ens192) using vSphere console
2. wait for link to fail over to backup slave (ens224).
3. reboot
4. re-connect old primary slave (ens192).

Actual results:

NetworkManager enters an infinite loop of link flaps. Network connectivity to the node is lost.

Once old primary slave (ens192) is re-disconnected NetworkManager link flap stops and network recovers.

Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2206] device (bond0): assigned bond port ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2207] device (ens192): Activation: connection 'ens192' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2213] policy: auto-activating connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2214] policy: auto-activating connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2267] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2272] device (ens224): Activation: starting connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2273] device (ens256): Activation: starting connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2275] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2276] device (bond0): state change: ip-config -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2282] device (ens224): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2284] device (ens256): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2309] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2311] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2596] device (bond0): released bond slave ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <warn> [1655849684.2676] device (ens192): queue-state[activated] reason:none, id:452056]: replace previously queued state change
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3073] device (bond0): Activation: starting connection 'ovs-if-phys0' (7ac68a21-25ac-47c4-a8ca-70a83181965d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3095] device (ens192): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3099] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3243] device (bond0): set-hw-addr: set-cloned MAC address to 00:50:56:AC:59:95 (00:50:56:AC:59:95)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3245] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3294] device (ens224): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3297] device (ens256): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3300] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3304] device (bond0): Activation: connection 'ovs-if-phys0' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3306] device (ens192): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3338] device (ens224): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3639] device (bond0): assigned bond port ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3640] device (ens224): Activation: connection 'ens224-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3642] device (ens256): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (bond0): assigned bond port ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (ens256): Activation: connection 'ens256-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3906] device (bond0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3910] policy: auto-activating connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3912] device (bond0): carrier: link connected
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3931] device (ens224): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3934] device (ens256): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3939] device (ens192): Activation: starting connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3940] device (bond0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3944] device (bond0): releasing ovs interface bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3946] device (bond0): released from master device bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3948] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3953] device (ens224): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3954] device (ens224): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3957] device (ens224): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3963] device (ens256): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3965] device (ens256): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3967] device (ens256): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4014] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4349] device (bond0): released bond slave ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4779] device (bond0): released bond slave ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4781] device (bond0): set-hw-addr: set MAC address to 00:50:56:AC:59:95 (restore)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4863] device (bond0): set-hw-addr: reset MAC address to D2:09:18:BB:91:74 (deactivate)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4866] device (bond0): Activation: starting connection 'bond0' (ef21706f-9968-419c-877d-f3c80a098daf)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4925] device (ens224): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4930] device (ens256): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4935] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5079] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5097] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5099] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5101] device (ens224): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5109] device (ens256): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5120] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5380] device (bond0): assigned bond port ens192

Expected results:

No link flaps, no network disconnection.

Additional info:

Unable to reproduce yet on IPI libvirt baremetal.

— Additional comment from rbrattai@redhat.com on 2022-06-22 04:47:51 UTC —

Logs from compute-1

http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-1.tar.xz

— Additional comment from rbrattai@redhat.com on 2022-06-22 05:05:13 UTC —

Does not reproduce on RHEL8.6 worker with {ens192,ens224,bond0}.nmconnection

— Additional comment from rbrattai@redhat.com on 2022-06-22 05:18:13 UTC —

Logs with two interfaces instead of three.

http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-0.tar.xz

— Additional comment from jcaamano@redhat.com on 2022-06-22 08:21:00 UTC —

From configure-ovs perspective, this might be due to the fact that only slave active profiles are cloned. Since ens192 was down on reboot, it had no active profile, so no profile was cloned. Then when it is reconnected, the original profile activates, which then activates the original bond profile as well instead of the clone made for ovn-k, and so on...

Not sure though about the loop or if it is worth looking at it since now things are not working as we expect anyway.

This should work if there were no reboot.

I guess that we ought to clone all slave profiles, active or not. We could filter instead by autoconnect being set.

I don't consider this a regression. We probably introduced this issue when we started cloning the profiles, a while ago, and that was to solve another set of bonding issues that might have had this scenario not working either.

— Additional comment from jcaamano@redhat.com on 2022-06-22 10:56:26 UTC —

@rbrattai@redhat.com

Ross, as time allows, please try out this tentative improvement: https://github.com/openshift/machine-config-operator/pull/3203

— Additional comment from aos-team-art-private@redhat.com on 2022-07-05 01:55:51 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from rbrattai@redhat.com on 2022-07-26 18:50:58 UTC —

Tested on 4.11.0-0.ci.test-2022-07-26-135031-ci-ln-yzxxi7k-latest

vSphere UPI RHCOS static-ip active-backup fail_over_mac=0

cloned the link down slave.

Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2356]: ++ nmcli -g connection.master connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' 0aadf316-4058-441a-a11a-9ae7aa1806dd '!=' 0aadf316-4058-441a-a11a-9ae7aa1806dd ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2360]: ++ nmcli -g GENERAL.STATE connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local active_state=
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2364]: ++ nmcli -g connection.autoconnect connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local autoconnect=yes
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' '' '!=' activated ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' yes '!=' yes ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local new_uuid
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ clone_slave_connection 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local uuid=6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local old_name
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2369]: +++ nmcli -g connection.id connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ old_name=ens192
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local new_name=ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection show id ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection clone 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2 ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli -g connection.uuid connection show ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + new_uuid=d0d12ff7-6a29-4ab1-bef1-1c52404129f1
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod uuid d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.master cc20e638-a6c7-4ad4-b1fe-bda4599c3775
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect-priority 100
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect no
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + echo 'Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)
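
Reading the trace above, the ens192 profile is cloned even though its GENERAL.STATE is empty, because connection.autoconnect is yes. A minimal Python sketch of that filter follows, as an illustration only: the real logic lives in configure-ovs.sh, and the dictionary keys, sample profiles, and the `should_clone` helper are hypothetical.

```python
def should_clone(profile: dict, bond_uuid: str) -> bool:
    """Return True if a slave profile of the given bond should be cloned."""
    # Only profiles whose master is the bond being migrated are candidates.
    if profile.get("master") != bond_uuid:
        return False
    is_active = profile.get("state") == "activated"
    autoconnects = profile.get("autoconnect") == "yes"
    # Skip only when the profile is neither active nor set to autoconnect.
    return is_active or autoconnects


# Hypothetical sample data; "bond-uuid" stands in for the bond profile UUID.
profiles = [
    {"id": "ens192", "master": "bond-uuid", "state": "", "autoconnect": "yes"},
    {"id": "ens224", "master": "bond-uuid", "state": "activated", "autoconnect": "yes"},
    {"id": "ens999", "master": "bond-uuid", "state": "", "autoconnect": "no"},
    {"id": "eth0", "master": "other-uuid", "state": "activated", "autoconnect": "yes"},
]
cloned = [p["id"] for p in profiles if should_clone(p, "bond-uuid")]
print(cloned)
```

With this filter a link-down slave such as ens192 is still cloned, which matches the behaviour the trace shows.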

— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-11 21:05:07 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from rbrattai@redhat.com on 2022-08-15 17:27:18 UTC —

Fix in 4.11.0-0.nightly-2022-08-15-074436

Description of problem:

When the user opens the topology sidebar for a Deployment with a BuildConfig, the string "Build #x was complete" (and others) is not translated. The browser log also shows an error:

Missing i18n key "Build <1></1> was complete" in namespace "public" and language "en."

The strings are translated but are saved as:

Build {link} was complete
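
The mismatch between the key that is looked up and the key that was saved can be modelled with a plain dictionary lookup. This is a simplified Python sketch, not how the console's i18n library works internally; the `translate` helper and the translated text are hypothetical.

```python
missing_keys = []


def translate(key: str, catalog: dict) -> str:
    """Return the stored translation, or fall back to the key and record it."""
    if key not in catalog:
        missing_keys.append(key)
        return key
    return catalog[key]


# Catalog entry saved with a {link} placeholder (translated text is hypothetical).
catalog = {"Build {link} was complete": "Build {link} wurde abgeschlossen"}

# The UI looks the string up with a <1></1> placeholder instead, so it misses.
shown = translate("Build <1></1> was complete", catalog)
print(shown)
print(missing_keys)
```

Because the two keys differ only in their placeholder syntax, the lookup falls back to the untranslated string and the key is reported as missing.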

Prerequisites (if any, like setup, operators/versions):

None

Steps to Reproduce

  1. Open developer perspective
  2. Add page
  3. Import from Git
  4. Select the deployment in the topology graph
  5. Switch the language

Actual results:

Build #x ... string is not translated. An error is logged in the browser console.

Expected results:

String is shown in the selected language and no error is logged in the browser console.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Happens on a cluster (4.9.0-0.nightly-2021-07-07-021823)
and local development (4.9 master, tested with 0588bc0f0b838ae448a68f35c5424f9bbfc65bc9)

Additional info:

None

Description of problem:

I am seeing the "disruption_tests: [sig-network-edge] Cluster frontend ingress remain available" test fail in periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade. This is one of the failures preventing 4.9 payload releases from being accepted.

Version-Release number of selected component (if applicable):

OCP 4.9

How reproducible:

Sometimes

Additional info: Sippy failure tracker

The error seen is:

{Sep  6 15:24:54.995: Frontends were unreachable during disruption for at least 50s of 1h26m45s (1%):
Sep 06 14:44:55.180 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests over new connections
Sep 06 14:44:55.180 - 12s   E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests over new connections
Sep 06 14:44:55.180 E ns/openshift-console route/console Route stopped responding to GET requests on reused connections
Sep 06 14:44:55.180 - 11s   E ns/openshift-console route/console Route is not responding to GET requests on reused connections
Sep 06 14:44:55.180 E ns/openshift-console route/console Route stopped responding to GET requests over new connections
Sep 06 14:44:55.180 - 11s   E ns/openshift-console route/console Route is not responding to GET requests over new connections
Sep 06 14:44:55.180 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests on reused connections
Sep 06 14:44:55.180 - 14s   E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests on reused connections
Sep 06 14:45:05.680 I ns/openshift-console route/console Route started responding to GET requests on reused connections
Sep 06 14:45:05.682 I ns/openshift-console route/console Route started responding to GET requests over new connections
Sep 06 14:45:05.685 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests over new connections
Sep 06 14:45:05.689 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests on reused connections
Failure Sep  6 15:24:54.995: Frontends were unreachable during disruption for at least 50s of 1h26m45s (1%):
Sep 06 14:44:55.180 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests over new connections
Sep 06 14:44:55.180 - 12s   E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests over new connections
Sep 06 14:44:55.180 E ns/openshift-console route/console Route stopped responding to GET requests on reused connections
Sep 06 14:44:55.180 - 11s   E ns/openshift-console route/console Route is not responding to GET requests on reused connections
Sep 06 14:44:55.180 E ns/openshift-console route/console Route stopped responding to GET requests over new connections
Sep 06 14:44:55.180 - 11s   E ns/openshift-console route/console Route is not responding to GET requests over new connections
Sep 06 14:44:55.180 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests on reused connections
Sep 06 14:44:55.180 - 14s   E ns/openshift-authentication route/oauth-openshift Route is not responding to GET requests on reused connections
Sep 06 14:45:05.680 I ns/openshift-console route/console Route started responding to GET requests on reused connections
Sep 06 14:45:05.682 I ns/openshift-console route/console Route started responding to GET requests over new connections
Sep 06 14:45:05.685 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests over new connections
Sep 06 14:45:05.689 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests on reused connections
github.com/openshift/origin/test/extended/util/disruption/frontends.(*AvailableTest).Test(0xc0017492a8, 0xc001829760, 0xc000560360, 0x2)
    github.com/openshift/origin/test/extended/util/disruption/frontends/frontends.go:137 +0x8fb
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc0016f6af0, 0xc001749440)
    github.com/openshift/origin/test/extended/util/disruption/disruption.go:190 +0x3be
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc001749440, 0xc0016b2250)
    k8s.io/kubernetes@v1.22.1/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6d
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
    k8s.io/kubernetes@v1.22.1/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xc9}

Example jobs: 1, 2, 3

This is a clone of Bug 2117557 to track backport to 4.9.z
+++ This bug was initially created as a clone of
Bug #2108021
+++

+++ This bug was initially created as a clone of
Bug #2106733
+++

Description of problem:
During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".

```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map

goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```

What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.

Where are you experiencing the behavior? What environment?
Production and all envs.

When does the behavior occur? Frequency? Repeatedly? At certain times?
It appeared only once so far, but can appear in larger scaling scenarios.

Version-Release number of selected component (if applicable):
4.8.39

Actual results:

With the panicking machine-controller, no new instances could be provisioned, resulting in an unscalable cluster. The solution/workaround to the problem was to delete the offending Machines.

Expected results:
Make the cluster scalable again without deleting Machines manually.

Additional info:

— Additional comment from
gferrazs@redhat.com
on 2022-07-13 13:34:08 UTC —

Probably the issue is here:

Since issue is in machine-api, moving it to correct team.

— Additional comment from
rmanak@redhat.com
on 2022-07-14 08:20:00 UTC —

I am working on a fix for this.

— Additional comment from
aos-team-art-private@redhat.com
on 2022-07-14 19:10:14 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from
jspeed@redhat.com
on 2022-07-18 15:18:16 UTC —

Waiting for the first 4.11.z stream before we merge

— Additional comment from
jspeed@redhat.com
on 2022-08-08 15:07:16 UTC —

Waiting on 4.11 GA to move ahead here

This is a clone of issue OCPBUGS-2523. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2451. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2181. The following is the description of the original issue:

Description of problem:

The E2E test "Installs Red Hat Integration - 3scale operator" is failing because the Operator name changed.

CI Search: https://search.ci.openshift.org/?search=Installs+Red+Hat+Integration+-+3scale+operator+in+test+namespace+and+creates+3scale+Backend+Schema+operand+instance&maxAge=24h&context=1&type=bug%2Bissue%2Bjunit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

 

Description of problem:

Jenkins install-plugins.sh script does not ignore update requests for locked versions of plugins, and does not verify that the locked version was actually included in the bundle-plugins.txt file.
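
The two checks that the description says are missing can be sketched as follows. This is a minimal illustration in Python, not the logic of install-plugins.sh itself. The `name:version` line format, the function names, and the plugin names are assumptions made for the example.

```python
def parse_plugin_lines(text):
    """Parse 'name:version' lines into a dict, skipping blanks and comments.

    The 'name:version' format is an assumption for this sketch.
    """
    plugins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition(":")
        plugins[name.strip()] = version.strip()
    return plugins


def resolve_plugins(requested, locked):
    """Return the plugin set to bundle; a locked version always wins,
    so an update request for a locked plugin is ignored."""
    resolved = dict(requested)
    resolved.update(locked)
    return resolved


def missing_locked(locked, bundle):
    """Return locked plugins that are absent from the bundle
    or present at a different version."""
    return {name: version for name, version in locked.items()
            if bundle.get(name) != version}


# Hypothetical plugin names and versions, for illustration only.
locked = parse_plugin_lines("# locked versions\nexample-locked-plugin:1.0\n")
requested = parse_plugin_lines("example-locked-plugin:2.0\nexample-other-plugin:3.1\n")
bundle = resolve_plugins(requested, locked)
problems = missing_locked(locked, bundle)
```

An empty `problems` dict means every locked plugin made it into the bundle at its locked version.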

Version-Release number of selected component (if applicable):


How reproducible:

Run make plugins-list

Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


This is a clone of issue OCPBUGS-2070. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1099. The following is the description of the original issue:

This is a clone of issue OCPBUGS-224. The following is the description of the original issue:

Description of problem:
OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was there in the v4.8.29 OCP cluster. This was observed in all the nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf
# Generated by KNI resolv prepender NM dispatcher script
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx

OpenShift 4.8.29
# Generated by KNI resolv prepender NM dispatcher script
search sepia.lab.iad2.dc.paas.redhat.com
nameserver 172.xx.xx.xx
nameserver 10.xx.xx.xx
nameserver 10.xx.xx.xx
# nameserver 10.xx.xx.xx
~~~

ENV: OpenStack IAD2, IPI installation. Connected cluster.

Version-Release number of selected component (if applicable):
OCP v4.9.31

How reproducible:
Always

Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Open a debug session on any of the nodes (master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.

Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.

Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.
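
The check in step 3 can be automated with a small parser. The sketch below is an illustration in Python under stated assumptions: the function name `search_domains` is hypothetical, and the sample text mirrors the redacted output shown in the description. It follows the resolv.conf rule that the last `search` line wins when more than one is present.

```python
def search_domains(resolv_conf_text):
    """Return the domains from the last 'search' line of a resolv.conf body.

    Comment lines (starting with '#' or ';') and blank lines are ignored.
    An empty list means no search domain is configured.
    """
    domains = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "search":
            domains = fields[1:]
    return domains


# Sample bodies modeled on the two outputs in the description.
resolv_4_9 = (
    "# Generated by KNI resolv prepender NM dispatcher script\n"
    "nameserver 172.xx.xx.xx\n"
    "nameserver 10.xx.xx.xx\n"
)
resolv_4_8 = (
    "# Generated by KNI resolv prepender NM dispatcher script\n"
    "search sepia.lab.iad2.dc.paas.redhat.com\n"
    "nameserver 172.xx.xx.xx\n"
)

missing_on_4_9 = not search_domains(resolv_4_9)
```

On a node, the same function could be fed the contents of /etc/resolv.conf read from a debug shell.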

Additional info:

  • The customer was trying to deploy secure Kerberos on the CoreOS nodes, and the deployment failed when the IPA-client install command failed. This is when the customer noticed the unusual behavior. They did not manually update the resolv.conf file to include the $search domain. Instead, they added the script below to /etc/NetworkManager/dispatcher.d/ and restarted NetworkManager on the node, which fixed the issue, and the installation then succeeded.
    ~~~
    #!/bin/bash

    set -eo pipefail

    DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
    # Extract the search domains from the DOMAINS= line of the prepender script.
    DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' "$DISPATCHER_FILE" \
      | grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \
      | tr '\n' ' ')"

    >&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"

    # Drop any existing search line and add a new one after the header comment.
    sed -e "/^search/d" \
      -e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
      /etc/resolv.conf > /etc/resolv.tmp

    mv /etc/resolv.tmp /etc/resolv.conf
    ~~~

  • The customer confirms that the $search domain has been missing since the cluster was freshly installed. They also confirmed on another fresh cluster that it was missing.
  • The fresh cluster was initially installed at v4.9.31 but was updated afterward to v4.9.43 (the latest z-stream) to see if the updates fixed anything but it didn't make any difference. The cluster is currently running v4.9.43 and shows the $search domain missing in the /etc/resolv.conf file on all nodes.

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:
After an attempted upgrade to an unavailable payload (no upgrade happens, as expected), the CVO cannot start a new upgrade, even when given a correct payload repo.

=======================================
The CVO log shows the CVO struggling with the update job version--v5f88 and then failing on a timeout. After that, it did not respond to the new upgrade request.

# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'quay.io/openshift-release-dev/ocp-release@sha256\:90fabdb'|head -n1
I0310 04:52:15.072040 1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}
# ./oc -n openshift-cluster-version logs cluster-version-operator-68ccb8c4fd-p7x4r|grep 'registry.ci.openshift.org/ocp/release@sha256\:90fabdb'|head -n1
#

...
I0310 04:52:15.072040 1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4", Force:false}
...
I0310 04:52:15.225739 1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
I0310 04:52:15.225778 1 batch.go:22] Job version--v5f88 in namespace openshift-cluster-version is not ready, continuing to wait.
...
I0310 05:03:12.238308 1 batch.go:53] No active pods for job version--v5f88 in namespace openshift-cluster-version
E0310 05:03:12.238525 1 batch.go:19] deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"
.....

# ./oc get all -n openshift-cluster-version
NAME                                            READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-68ccb8c4fd-p7x4r   1/1     Running   0          61m

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/cluster-version-operator   ClusterIP   172.30.220.176   <none>        9099/TCP   62m

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-version-operator   1/1     1            1           61m

NAME                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-version-operator-68ccb8c4fd   1         1         1       61m

NAME                       COMPLETIONS   DURATION   AGE
job.batch/version--v5f88   0/1           30m        30m

Version-Release number of the following components:
4.11.0-0.nightly-2022-03-04-063157

How reproducible:
always

Steps to Reproduce:
1. Trigger an upgrade to an unavailable image(by mistake), from 4.11.0-0.nightly-2022-03-04-063157 to 4.11.0-0.nightly-2022-03-08-191358

#./oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

2. Wait several minutes (>5 mins). No upgrade happens (expected), and no failure information is reported (not expected)

# ./oc get clusterversion -ojson|jq .items[].status.conditions
{
  "lastTransitionTime": "2022-03-10T04:20:12Z",
  "message": "Payload loaded version=\"4.11.0-0.nightly-2022-03-04-063157\" image=\"registry.ci.openshift.org/ocp/release@sha256:cdeb8497920d9231ecc1ea7535e056b192f2ccf0fa6257d65be3bb876c1b9de6\"",
  "reason": "PayloadLoaded",
  "status": "True",
  "type": "ReleaseAccepted"
},
...
# ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-04-063157   True        False         27m     Cluster version is 4.11.0-0.nightly-2022-03-04-063157
# ./oc adm upgrade
Cluster version is 4.11.0-0.nightly-2022-03-04-063157

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.11
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-03-04-063157 not found in the "stable-4.11" channel

3. Continue upgrade to target payload with correct repo

# ./oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4 --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4

4. Still no upgrade happens, the same as in step 2 (not expected)

Actual results:
After the failed update to the unavailable payload, the CVO stops working: a later update to an available payload does not proceed.

Expected results:
Upgrade to the correct target payload should be triggered.

Additional info:
Running `oc adm upgrade --clear` to cancel the initial invalid upgrade before triggering the new upgrade does not help. The CVO works again only after its pod is deleted and re-deployed.
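
The stuck state in steps 2 and 4 can be spotted from the conditions output. The sketch below is a minimal Python illustration, assuming the condition fields look like the `ReleaseAccepted` entry shown in step 2; the function name `release_accepted_for` is hypothetical, and the sketch makes no claim about how the CVO itself reports a rejected or stalled payload.

```python
def release_accepted_for(conditions, desired_image):
    """Return True if a ReleaseAccepted=True condition mentions desired_image.

    'conditions' is a list of dicts shaped like the entries of
    .status.conditions in the ClusterVersion object.
    """
    for cond in conditions:
        if cond.get("type") == "ReleaseAccepted" and cond.get("status") == "True":
            return desired_image in cond.get("message", "")
    return False


# Condition copied from the step 2 output in the description.
conditions = [
    {
        "lastTransitionTime": "2022-03-10T04:20:12Z",
        "message": (
            'Payload loaded version="4.11.0-0.nightly-2022-03-04-063157" '
            'image="registry.ci.openshift.org/ocp/release@sha256:'
            'cdeb8497920d9231ecc1ea7535e056b192f2ccf0fa6257d65be3bb876c1b9de6"'
        ),
        "reason": "PayloadLoaded",
        "status": "True",
        "type": "ReleaseAccepted",
    }
]

requested = (
    "registry.ci.openshift.org/ocp/release@sha256:"
    "90fabdb570eb248f93472cc06ef28d09d5820e80b9ed578e2484f4ef526fe6d4"
)
stuck = not release_accepted_for(conditions, requested)
```

Here `stuck` is true because the only accepted payload is still the currently running image, which matches the behavior reported above.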

  • CVE-2022-36882
  • CVE-2022-29047
  • CVE-2022-30945
  • CVE-2022-30946
  • CVE-2022-30948
  • CVE-2022-30952
  • CVE-2022-30953
  • CVE-2022-30954
  • CVE-2022-34174
  • CVE-2022-36883
  • CVE-2022-36884
  • CVE-2022-36885
  • CVE-2022-34177
  • CVE-2022-34176
  • CVE-2022-36881