Back to index

4.18.0-ec.0

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.17.0-rc.2

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview (aka. Goal Summary)

Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the number of hosted clusters and NodePools.

But we are still missing other metrics needed to draw correct inferences from what we see in the data.

Goals (aka. expected user outcomes)

  • Provide metrics to highlight the number of nodes per NodePool and the number of nodes per cluster
  • Make sure the discrepancy between what appears in CMO via `install_type` and the number of hosted clusters we report is minimal.

Use Cases (Optional):

  • Understand product adoption
  • Gauge Health of deployments
  • ...

 

Overview

Today, hypershift_hostedcluster_nodepools is exposed as a metric reporting the number of NodePools used per hosted cluster.

 

Additional NodePools metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.

In addition to knowing how many NodePools there are per hosted cluster, we would like to expose the NodePool size.

 

This will help inform our decision making and provide some insights on how the product is being adopted/used.

Goals

The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules: 

  • hypershift_nodepools_size
  • hypershift_nodepools_available_replicas

Requirements

The implementation involves updates to the openshift/hypershift and openshift/cluster-monitoring-operator GitHub repositories.

Similar prior PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
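As an illustration only (not the merged implementation), a Prometheus recording rule for these metrics might look roughly like the following PrometheusRule sketch; the rule group, record names, and aggregation are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hypershift-nodepool-recording-rules   # hypothetical name
  namespace: openshift-monitoring             # hypothetical namespace
spec:
  groups:
  - name: hypershift.nodepools.rules
    rules:
    # Aggregate the desired NodePool size on the management cluster before it is sent to Telemeter
    - record: cluster:hypershift_nodepools_size:sum
      expr: sum(hypershift_nodepools_size)
    # Aggregate the currently available NodePool replicas the same way
    - record: cluster:hypershift_nodepools_available_replicas:sum
      expr: sum(hypershift_nodepools_available_replicas)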

Feature Overview (aka. Goal Summary)

 

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum supported value is 32 snapshots per volume.

Goals (aka. expected user outcomes)

Customers can override the default value of three with a custom value.

Make sure we document (or link to) the VMware recommendations in terms of performance.

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

https://kb.vmware.com/s/article/1025279

Requirements (aka. Acceptance Criteria):

The setting must be easily configurable by the OCP admin, and the configuration must be applied automatically. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.

No change in the default

Use Cases (Optional):

As an OCP admin, I would like to change the maximum number of snapshots per volume.
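For illustration, the admin-facing knob is expected to live on the ClusterCSIDriver object. A minimal sketch of such a change (the exact field name comes from the openshift/api extension tracked in STOR-1803 and should be verified against the shipped API) could look like:

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  driverConfig:
    driverType: vSphere
    vSphere:
      # Raise the per-volume snapshot limit from the default of 3 (hard limit: 32)
      globalMaxSnapshotsPerBlockVolume: 10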

Out of Scope

Anything outside of 

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

Background

The default value can't be overridden; reconciliation reverts any manual change.

Customer Considerations

Make sure the customers understand the impact of increasing the number of snapshots per volume.

https://kb.vmware.com/s/article/1025279

Documentation Considerations

Document how to change the value as well as a link to the best practices. Mention that there is a hard limit of 32 snapshots per volume. Document other limitations, if any.

Interoperability Considerations

N/A

Epic Goal*

The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.

Possible future candidates:

  • configure EFS volume size monitoring (via driver cmdline arg.) - STOR-1422
  • configure OpenStack topology - RFE-11

 
Why is this important? (mandatory)

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum supported value is 32 snapshots per volume.

https://kb.vmware.com/s/article/1025279

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin, I would like to configure the maximum number of snapshots per volume.
  2. As a user, I would like to create more than 3 snapshots per volume.

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1759)

2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)

3) Update vSphere operator to use the new snapshot options (STOR-1804)

4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)

  • prerequisite: add e2e test and demonstrate stability in CI (STOR-1838)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Enablement
  • Others -

Acceptance Criteria (optional)

Configure the maximum number of snapshots to a higher value. Check that the config has been updated and verify that the maximum number of snapshots per volume matches the new setting value.

Drawbacks or Risk (optional)

Setting this option to a high value can introduce performance issues. This needs to be documented.

https://kb.vmware.com/s/article/1025279

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

  • As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

In order to remove IPI/UPI support for Alibaba Cloud in OpenShift (currently Tech Preview, see also OCPSTRAT-1042), we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and later platform=external) to bring up their OpenShift clusters.

  • Stretch goal to do this with platform=external.
  • Note: We can TP with platform=none or platform=external, but for GA it must be with platform=external.
  • Document how to use this installation method

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Hybrid Cloud Console updated to reflect Alibaba Cloud installation with Assisted Installer (Tech Preview).
  • Documentation that tells customer how to use this install method
  • CI for this install method optional for OCP 4.16 (and will be addressed in the future)

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the configuration to be supported in the future) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Self-managed
Classic (standalone cluster) Classic
Hosted control planes N/A
Multi node, Compact (three node), or Single node (SNO), or all Multi-node
Connected / Restricted Network Connected for OCP 4.16 (Future: restricted)
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64
Operator compatibility This should be the same for any operator on platform=none
Backport needed (list applicable versions) OpenShift 4.16 onwards
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Hybrid Cloud Console changes needed
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Restricted network deployments, i.e. As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Agent-based Installer using the agnostic platform (platform=none) for restricted network deployments.

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

For OpenShift 4.16, we want to remove IPI support (currently Tech Preview) for Alibaba Cloud (OCPSTRAT-1042). Instead, we want to make it available via the Assisted Installer (Tech Preview) with the agnostic platform for Alibaba Cloud in OpenShift 4.16 (OCPSTRAT-1149).

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Previous UPI-based installation doc: Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.

Epic Goal

  • Start with the original Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide and adjust it to use the Assisted Installer with platform=none.
  • Document the steps for a successful installation using that method and feed the docs team with that information.
  • Narrow down the scope to the minimum viable to achieve Tech Preview in 4.16. We'll handle platform=external and better tools and automation in future releases.
  • Engage with the Assisted Installer team and the Solutions Architect / PTAM of Alibaba for support.
  • Provide frequent updates on work progress (at least weekly).
  • Assist QE in testing.

Why is this important?

  • In order to remove IPI/UPI support for Alibaba Cloud in OpenShift, we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and in future releases platform=external) to bring up their OpenShift clusters.

Acceptance Criteria

  • Reproducible, stable, and documented installation steps using the Assisted Installer with platform=none provided to the docs team and QE.

Out of scope

  1. CI

Previous Work (Optional):

  1. https://www.alibabacloud.com/blog/alibaba-cloud-red-hat-openshift-container-platform-4-6-deployment-guide_597599
  2. https://github.com/kwoodson/terraform-openshift-alibaba for reference, it may help
  3. Alibaba IPI for reference, it may help
  4. Using the Assisted Installer to install a cluster on Oracle Cloud Infrastructure for reference


USER STORY:


As a [type of user], I want [an action] so that [a benefit/a value].

DESCRIPTION:


Required:

...

Nice to have:

...

ACCEPTANCE CRITERIA:


ENGINEERING DETAILS:


Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The GCP IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing GCP Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Provision GCP infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing GCP Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

Installing into a Shared VPC gets stuck waiting for the network infrastructure to become ready.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-10-225505

How reproducible:

Always

Steps to Reproduce:

1. "create install-config" and then insert Shared VPC settings (see [1])
2. activate the service account which has the minimum permissions in the host project (see [2])
3. "create cluster"

FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project. 
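The attachment [1] is not reproduced here; as a purely hypothetical illustration, Shared VPC settings in an install-config typically combine the host project's network with the service project, along these lines (region, network, and subnet names are placeholders):

platform:
  gcp:
    projectID: openshift-qe                     # service project
    region: us-central1                         # placeholder region
    networkProjectID: openshift-qe-shared-vpc   # host project that owns the VPC
    network: shared-vpc                         # placeholder network name
    controlPlaneSubnet: shared-vpc-subnet-1     # placeholder subnet names
    computeSubnet: shared-vpc-subnet-2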

Actual results:

1. The installation gets stuck waiting for the network infrastructure to become ready, until Ctrl+C is pressed.
2. Two firewall rules are unexpectedly created in the service project (see [3]).

Expected results:

The installation should succeed, and no firewall rules should be created in either the service project or the host project.

Additional info:

 

Description of problem:

    A Shared VPC installation using a service account that has all required permissions failed because the ingress cluster operator degraded with the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'"

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-multi-2024-08-07-221959

How reproducible:

    Always

Steps to Reproduce:

1. "create install-config", then insert the interested settings (see [1])
2. "create cluster" (see [2])

Actual results:

    Installation failed because the ingress cluster operator degraded (see [2] and [3]).

$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
$ 

In fact, the mentioned k8s firewall rule doesn't exist in the host project (see [4]), and the given service account does have sufficient permissions (see [6]).

Expected results:

    Installation succeeds, and all cluster operators are healthy. 

Additional info:

    

Goal

The goals of this feature are:

  • Optimize and streamline the operations of the HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters.
  • Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Placeholder epic to capture all Azure tickets.

TODO: review.

User Story:

As an end user of a hypershift cluster, I want to be able to:

  • Not see internal host information when inspecting a serving certificate of the kubernetes API server

so that I can achieve

  • No knowledge of internal names for the kubernetes cluster.

From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739 

We need 4 different certs:

  • common sans
  • internal san
  • fqdn
  • svc ip

Feature Overview

Create a GCP cloud-specific spec.resourceTags entry in the Infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the Infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing in-tree/out-of-tree split of the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").

Once we are confident all components are updated, we should introduce an end-to-end test that makes sure we never create untagged resources.

 
Goals

  • Functionality on GCP GA
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement shifts, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2455 / CFE-719 work, where support for GCP tags & labels was delivered as TechPreview in 4.14, with the goal of making it GA in 4.15. It involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

The TechPreview featureSet check added in the installer for userLabels and userTags should be removed, as should the TechPreview reference in the install-config GCP schema.

Acceptance Criteria

  • Should be able to define userLabels and userTags without setting the TechPreviewNoUpgrade featureSet.
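For reference, a minimal install-config fragment exercising these fields without any featureSet could look like the sketch below (project, region, label, and tag values are placeholders):

platform:
  gcp:
    projectID: my-project        # placeholder
    region: us-central1          # placeholder
    userLabels:
    - key: team
      value: ocp-dev
    userTags:
    - parentID: my-org-id        # placeholder organization or project that owns the tag keys
      key: cost-center
      value: engineering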

The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.

The new featureGate added in openshift/api should also be removed.

Acceptance Criteria

  • Should be able to define userLabels and userTags without setting a featureSet.

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The Azure IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing Azure Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Provision Azure infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing Azure Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

In the install-config file, there is no zone or instance type setting under controlPlane or defaultMachinePlatform:
==========================
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

After "create cluster", master instances should be created across multiple zones, since the default instance type 'Standard_D8s_v3' has availability zones. Actually, master instances are not created in any zone.
$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410

How reproducible:

Always

Steps to Reproduce:

1. CAPI-based install on azure platform with default configuration
2. 
3.

Actual results:

master instances are created but not in any zone.

Expected results:

Master instances should be created per zone based on the selected instance type, keeping the same behavior as the Terraform-based install.

Additional info:

When setting zones under controlPlane in install-config, master instances can be created per zone.
install-config:
===========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]

$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3

 

Description of problem:

    CAPZ creates an empty route table during installs

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Very

Steps to Reproduce:

    1.Install IPI cluster using CAPZ
    2.
    3.
    

Actual results:

    Empty route table created and attached to worker subnet

Expected results:

    No route table created

Additional info:

    

Description of problem:

When launching a CAPI-based installation on Azure Government Cloud, the installer timed out waiting for the network infrastructure to become ready.

06-26 09:08:41.153  level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready...
...
06-26 09:09:33.455  level=debug msg=E0625 21:09:31.992170   22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=<
06-26 09:09:33.455  level=debug msg=	failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	RESPONSE 404: 404 Not Found
06-26 09:09:33.456  level=debug msg=	ERROR CODE: SubscriptionNotFound
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	{
06-26 09:09:33.456  level=debug msg=	  "error": {
06-26 09:09:33.456  level=debug msg=	    "code": "SubscriptionNotFound",
06-26 09:09:33.456  level=debug msg=	    "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
06-26 09:09:33.456  level=debug msg=	  }
06-26 09:09:33.456  level=debug msg=	}
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	. Object will not be requeued
06-26 09:09:33.456  level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl"
06-26 09:09:33.457  level=debug msg=I0625 21:09:31.992215   22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"SubscriptionNotFound\",\n    \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n  }\n}\n--------------------------------------------------------------------------------\n. Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError"
06-26 09:17:40.081  level=debug msg=I0625 21:17:36.066522   22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84"
06-26 09:23:46.611  level=debug msg=Collecting applied cluster api manifests...
06-26 09:23:46.611  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
06-26 09:23:46.611  level=info msg=Shutting down local Cluster API control plane...
06-26 09:23:46.612  level=info msg=Stopped controller: Cluster API
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azure exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azure infrastructure provider
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azureaso infrastructure provider
06-26 09:23:46.612  level=info msg=Local Cluster API system has completed operations
06-26 09:23:46.612  [ERROR] Installation failed with error code '4'. Aborting execution.

From the above log, the Azure Resource Management API endpoint is not correct; endpoint "management.azure.com" is for the Azure Public cloud, while the expected one for Azure Government is "management.usgovcloudapi.net".

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-06-23-145410

How reproducible:

    Always

Steps to Reproduce:

    1. Install cluster on Azure Government Cloud, capi-based installation 
    2.
    3.
    

Actual results:

    Installation failed because the wrong Azure Resource Management API endpoint was used.

Expected results:

    Installation succeeded.

Additional info:

    

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Requirement description:

As a VM Admin, I want to improve overall density. In our traditional VM environments, we find that we are memory bound much more than CPU bound. Even with properly sized VMs, we see a lot of memory allocated to the VMs but not actually used. Moreover, we always see people requesting VMs that are sized way too big for their workloads. It is better customer service to allow this to some degree and then recover the memory at the hypervisor level.

MVP:

  • Move SWAP to beta (OCP TP)
  • Dashboard for monitoring
  • Make sure the scheduler sees the real memory available, rather than that allocated to the VMs.

Documents:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Prometheus query for UI:
sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) *100

In human words: this approximates how much over-commitment of memory is taking place. A value of 100 means RAM+SWAP usage is 100% of system RAM capacity; 105 means RAM+SWAP usage is 105% of system RAM capacity.

Threshold: Yellow 95%, Red 105%
Based on: https://docs.google.com/document/d/1AbR1LACNMRU2QMqFpe-Se2mCEFLMqW_M9OPKh2v3yYw,

https://docs.google.com/document/d/1E1joajwxQChQiDVTsr9Qk_iIhpQkSI-VQP-o_BMx8Aw
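Purely as a sketch of how the stated thresholds could be expressed (rule and alert names are hypothetical, and this is not necessarily the dashboard implementation), a PrometheusRule built on the query above might look like:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-overcommit-thresholds   # hypothetical name
spec:
  groups:
  - name: swap-overcommit.rules
    rules:
    # "Yellow": RAM+SWAP usage reaches 95% of system RAM capacity
    - alert: MemoryOvercommitWarning
      expr: |
        sum by (instance) (((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes))
          / node_memory_MemTotal_bytes) * 100 > 95
      for: 15m
      labels:
        severity: warning
    # "Red": RAM+SWAP usage reaches 105% of system RAM capacity
    - alert: MemoryOvercommitCritical
      expr: |
        sum by (instance) (((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes))
          / node_memory_MemTotal_bytes) * 100 > 105
      for: 15m
      labels:
        severity: critical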

 

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal

Goals

  • Validating OpenShift on OCI baremetal to make it officially supported. 
  • Enable installation of OpenShift 4 on OCI bare metal using Assisted Installer.
  • Provide published installation instructions for how to install OpenShift on OCI baremetal
  • OpenShift 4 on OCI baremetal can be updated, resulting in a cluster and applications that are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI baremetal for connected OpenShift clusters (e.g. platform=external or none + some other indicator to know it's running on OCI baremetal).

Use scenarios

  • As a customer, I want to run OpenShift Virtualization on OpenShift running on OCI baremetal.
  • As a customer, I want to run Oracle BRM on OpenShift running OCI baremetal.

Why is this important

  • Customers who want to move from on-premises to Oracle cloud baremetal
  • OpenShift Virtualization is currently only supported on baremetal

Requirements

 

Requirement Notes
OCI Bare Metal Shapes must be certified with RHEL. They must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI boot (OCPSTRAT-1246)
Certified shapes: https://catalog.redhat.com/cloud/detail/249287
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. Oracle will do these tests.
Updating Oracle Terraform files  
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. Support Oracle Cloud in Assisted-Installer CI: MGMT-14039

 

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

OCI Bare Metal Shapes to be supported

Any bare metal Shape to be supported with OCP has to be certified with RHEL.

From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 tracks adding this support and removing this restriction in the future.

As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes 

Assumptions

  • Pre-requisite: RHEL certification which includes RHEL and OCI baremetal shapes (instance types) has successfully completed.

 

 

 

 
 

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team has been working on allowing booting from iSCSI. Today that is disabled by the Assisted Installer. The goal is to enable it for OCP versions >= 4.15 when using the OCI external platform.

DoD (Definition of Done)

iSCSI boot is enabled for OCP versions >= 4.15, both in the UI and the backend.

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kernel arguments during install to enable iSCSI booting.
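One possible way to pass these arguments when driving the Assisted Installer through its Kubernetes API is the InfraEnv kernelArguments field; this is only an illustrative sketch and not necessarily how the assisted-service change is implemented:

apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: oci-baremetal            # hypothetical name
  namespace: example-namespace   # hypothetical namespace
spec:
  # Append the kernel arguments required for iSCSI boot
  kernelArguments:
  - operation: append
    value: rd.iscsi.firmware=1
  - operation: append
    value: ip=ibft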

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support, and we want to allow customers to use them.

PR https://github.com/openshift/assisted-service/pull/6257 must be adapted to be used with the external platform.

Since we ensure that the iSCSI network is not the default route, the PR above will ensure that the subnet used by the default route is selected automatically.

The secondary VNIC must be configured manually in OCI; a script must be injected into the discovery ISO to configure it.

Feature Overview (aka. Goal Summary)  

Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.

Goals (aka. expected user outcomes)

  • Provide a configurable way to indicate that a pod should be connected to a unique network of a specific type via its primary interface.
  • Allow networks to have overlapping IP address space.
  • The primary network defined today will remain in place as the default network that pods attach to when no unique network is specified.
  • Support cluster ingress/egress traffic for unique networks, including secondary networks.
  • Support for ingress/egress features where possible, such as:
    • EgressQoS
    • EgressService
    • EgressIP
    • Load Balancer Services

Requirements (aka. Acceptance Criteria):

  • Support for 10,000 namespaces
  •  

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the configuration to be supported in the future) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • As an OpenStack or vSphere/vCenter user, who is migrating to OpenShift Kubernetes, I want to guarantee my OpenStack/vSphere tenant network isolation remains intact as I move into Kubernetes namespaces.
  • As an OpenShift Kubernetes user, I do not want to have to rely on Kubernetes Network Policy and prefer to have native network isolation per tenant using a layer 2 domain.
  • As an OpenShift Network Administrator with multiple identical application deployments across my cluster, I require a consistent IP-addressing subnet per deployment type. Multiple applications in different namespaces must always be accessible using the same, predictable IP address.

Questions to Answer (Optional):

  •  

Out of Scope

  • Multiple External Gateway (MEG) Support - support will remain for default primary network.
  • Pod Ingress support - support will remain for default primary network.
  • Cluster IP Service reachability across networks. Services and endpoints will be available only within the unique network.
  • Allowing different service CIDRs to be used in different networks.
  • Localnet will not be supported initially for primary networks.
  • Allowing multiple primary networks per namespace.
  • Allow connection of multiple networks via explicit router configuration. This may be handled in a future enhancement.
  • Hybrid overlay support on unique networks.

Background

OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.

As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.

Network Policy has its issues:

  • it can be cumbersome to configure and manage for a large cluster
  • it can be limiting as it only matches TCP, UDP, and SCTP traffic
  • large amounts of network policy can cause performance issues in CNIs

With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.

Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
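As a rough sketch of the direction only (the API group, kind, and field names below are assumptions based on the OVN-Kubernetes user-defined network work and may differ from the final API), a namespace-scoped primary layer 2 network could be declared like this:

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-network           # hypothetical name
  namespace: tenant-blue         # hypothetical tenant namespace
spec:
  topology: Layer2
  layer2:
    role: Primary                # becomes the primary network for pods in this namespace
    subnets:
    - 10.100.0.0/16              # may overlap with subnets used by other tenants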

Customer Considerations

  •  

Documentation Considerations

  •  

Interoperability Considerations

Test scenarios:

  • E2E upstream and downstream jobs covering supported features across multiple networks.
  • E2E tests ensuring network isolation between OVN networked and host networked pods, services, etc.
  • E2E tests covering network subnet overlap and reachability to external networks.
  • Scale testing to determine limits and impact of multiple unique networks.

Feature Overview (aka. Goal Summary)  

crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

The benefits of crun are covered here: https://github.com/containers/crun

FAQ: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

Note: making crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do so.
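Assuming the existing ContainerRuntimeConfig defaultRuntime field remains the mechanism for choosing the runtime, a cluster that wants to keep runc after crun becomes the default could apply something like the following sketch:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc-on-workers     # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    # Explicitly select the non-default runtime once crun is the default
    defaultRuntime: runc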

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

CRI-O wipe is an existing feature in OpenShift. When a node reboots, CRI-O wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to the image registry, so it takes a long time to come up.

The goal of this feature is to adjust CRI-O wipe to remove only images that have been corrupted because of the sudden reboot, not all images.

Feature Overview

Phase 2 of the enclave support for oc-mirror with the following goals

  • Incorporate feedback from the field from 4.16 TP
  • Performance improvements

Goals

  • Update the batch processing that uses `containers/image` to do the copying, so that the number of blobs (layers) to download can be set
  • Introduce a worker for concurrency that can also update the number of images to download to improve overall performance (these values can be tweaked via CLI flags). 
  • Collaborate with the UX team to improve the console output while pulling or pushing images. 

Feature Overview

Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.

Prerequisite work (goals completed in OCPSTRAT-1122): complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.

Phases 1 & 2 cover implementing base functionality for CAPI.

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has since moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

Epic Goal

  • As we prepare to move over to using Cluster API (CAPI) we need to make sure that we have the providers in place to work with this. This Epic is to track the tech preview of the provider for Azure

Why is this important?

  • What are the benefits to the customer, or to us, that make this worth
    doing? Fulfills a critical need for a customer? Improves
    supportability/debuggability? Improves efficiency/performance? This
    section is used to help justify the priority of this item vs other things
    we can do.

Drawbacks

  • Reasons we should consider NOT doing this such as: limited audience for
    the feature, feature will be superceded by other work that is planned,
    resulting feature will introduce substantial administrative complexity or
    user confusion, etc.

Scenarios

  • Detailed user scenarios that describe who will interact with this
    feature, what they will do with it, and why they want/need to do that thing.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps

Background

Once the new CAPI manifests generator tool is ready, we want to make use of it directly from the CAPI Provider repositories, so we can avoid storing the generated configuration centrally and can instead apply it independently based on the running platform.

Steps

  • Install new CAPI manifest generator as a go `tool` to all the CAPI provider repositories
  • Setup a make target under the `/openshift/Makefile` to invoke the generator. Make it output the manifests under `/openshift/manifests`
  • Make sure `/openshift/manifests` is mapped to `/manifests` in the openshift/Dockerfile, so that the files are later picked up by CVO
  • Make sure the manifest generation works by triggering a manual generation
  • Check in the newly generated transport ConfigMap + Credential Requests (to let them be applied by CVO)

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CAPI manifest generator tool is installed 
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Epic Goal*

Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.18 cannot be released without Kubernetes 1.31

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by the cluster-admin to monitor progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers often have to dig deeper to find these nodes for further debugging.
    The ask has been to bubble this up in the update progress window.
  2. oc update status ?
    From the UI we can see the progress of the update. From the oc CLI we can see this via "oc get cvo",
     but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
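For context, the progress string that the CVO already publishes can be read from the ClusterVersion Progressing condition. Below is a minimal Go sketch of sourcing it; this is illustrative only and not the actual `oc adm upgrade status` implementation.

    package upgradestatus

    import configv1 "github.com/openshift/api/config/v1"

    // progressMessage returns the human-readable progress string the CVO
    // reports while an update is in flight, e.g.
    // "Working towards 4.12.4: 9 of 829 done (1% complete)".
    func progressMessage(cv *configv1.ClusterVersion) string {
        for _, c := range cv.Status.Conditions {
            if c.Type == configv1.OperatorProgressing && c.Status == configv1.ConditionTrue {
                return c.Message
            }
        }
        return "not progressing"
    }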

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information using "oc get clusterversion", but the output is not easy to read and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Description

During an upgrade, once the control plane is successfully updated, status items related to that part of the upgrade cease to be relevant, so we can either hide them entirely or show a simplified version of them. The relevant sections are Control plane and Control plane nodes.

We utilize MCO annotations to determine whether a node is degraded or unavailable, and we solely source the Reason annotation to populate the insight. Many common cases are not covered by this, especially the unavailable ones: nodes can be cordoned, have a condition like DiskPressure, be in the process of termination, etc. It is not yet clear whether our code or something like MCO should provide this, but it is captured as a card for now.
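As an illustration of the gap, here is a hypothetical Go sketch that classifies node unavailability beyond the MCO Reason annotation. The annotation key is an assumption, and the real insight code may source this differently.

    package insights

    import corev1 "k8s.io/api/core/v1"

    // Assumed MCO annotation key; today the Reason annotation is our only source.
    const mcoReasonAnnotation = "machineconfiguration.openshift.io/reason"

    // unavailableReason returns a human-readable reason why a node is unavailable,
    // covering cordoned nodes and standard node conditions before falling back to
    // whatever MCO recorded.
    func unavailableReason(node *corev1.Node) string {
        if node.Spec.Unschedulable {
            return "Node is cordoned (unschedulable)"
        }
        for _, c := range node.Status.Conditions {
            switch {
            case c.Type == corev1.NodeDiskPressure && c.Status == corev1.ConditionTrue:
                return "Node reports DiskPressure"
            case c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue:
                return "Node is not Ready: " + c.Message
            }
        }
        if reason := node.Annotations[mcoReasonAnnotation]; reason != "" {
            return reason
        }
        return "unknown"
    }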

Feature Overview (aka. Goal Summary)  

Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.

Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.

We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.

Goals (aka. expected user outcomes)

As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.

As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

TBD
 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.

Right now, our pods are SingleReplica because having multiple replicas requires more than one zone for nodes, which translates into availability zones (AZs) in OpenStack. We need to figure that out.

HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.

We deprecated "DeploymentConfig" in-favor of "Deployment" in OCP 4.14

Now in 4.18  we want to make "Deployment " as default out of box that means customer will get Deployment when they install OCP 4.18 . 

Deployment Config will still be available in 4.18 as non default for user who still want to use it . 

FYI "DeploymentConfig" is tier 1 API in Openshift and cannot be removed from 4.x product 

Please Review this FAQ : https://docs.google.com/document/d/1OnIrGReZKpc5kzdTgqJvZYWYha4orrGMVjfP1fUpljY/edit#heading=h.oranye5nwtsy 

Epic Goal*

WRKLDS-695 was implemented to put DeploymentConfig support behind a capability in 4.14. In order to prepare customers for the migration to Deployments, the capability was enabled by default. After three releases we need to reconsider whether disabling the capability by default is feasible.

More about capabilities in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capability-sets.
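For illustration, here is a hedged Go sketch of how a component could check whether the capability is enabled on a given cluster, assuming the openshift/api capability constant named below.

    package capcheck

    import configv1 "github.com/openshift/api/config/v1"

    // deploymentConfigEnabled reports whether the DeploymentConfig capability is
    // currently enabled, as reflected in ClusterVersion status.
    func deploymentConfigEnabled(cv *configv1.ClusterVersion) bool {
        for _, c := range cv.Status.Capabilities.EnabledCapabilities {
            if c == configv1.ClusterVersionCapabilityDeploymentConfig {
                return true
            }
        }
        return false
    }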
 
Why is this important? (mandatory)

Disabling a capability by default makes an OCP installation lighter. Fewer components running by default reduces the security risk/vulnerability surface.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  Users can still enable the capability in vanilla clusters. Existing clusters will keep the DC capability enabled during a cluster upgrade.

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - Workloads team
  • Documentation - Docs team
  • QE - Workloads QE team
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • The DC capability is disabled by default in vanilla OCP installations
  • The DC capability can be enabled in a vanilla OCP installation
  • The DC capability is enabled after an upgrade in OCP clusters that have the capability already enabled before the upgrade
  • The DC capability is disabled after an upgrade in OCP clusters that have the capability disabled before the upgrade

Drawbacks or Risk (optional)

None. The DC capability can be enabled if needed.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Before the DC capability can be disabled by default, all the relevant e2e tests relying on DCs need to be migrated to Deployments to maintain the same test coverage.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Allow customers to enable EFS CSI usage metrics.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.

The EFS metrics are not enabled by default for a good reason: they can potentially impact performance. They are disabled in OCP because the CSI driver would walk through the whole volume, and that can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers would need to explicitly opt in.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Clear procedure on how to enable it as a day 2 operation. The default remains no metrics. Once enabled, the metrics should be available for visualisation.

 

We should also have a way to disable metrics.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all AWS only
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all AWS/EFS supported
Operator compatibility EFS CSI operator
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Should appear in OCP UI automatically
Other (please specify) OCP on AWS only

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user I want to be able to visualise the EFS CSI metrics

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Additional metrics

Enabling metrics by default.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Customer request as per 

https://issues.redhat.com/browse/RFE-3290

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

We need to be extra clear on the potential performance impact

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document how to enable CSI metrics + warning about the potential performance impact.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

It can benefit any cluster on AWS using EFS CSI including ROSA

Epic Goal*

The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could hurt performance (the CSI driver would walk through the whole volume), this option will not be enabled by default; admins will need to explicitly opt in.

 
Why is this important? (mandatory)

Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to turn on EFS CSI metrics 
  2. As an admin I would like to visualise how much EFS space is used by OCP.

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Yes, knowledge transfer
  • Others -

Acceptance Criteria (optional)

Enable CSI metrics via the operator - ensure the driver is started with the proper command-line options. Verify that the metrics are sent and exposed to the users.

Drawbacks or Risk (optional)

Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

When using OpenShift in a mixed, multi-architecture environment, some key details or checks are not always available. With this feature we will take a first pass at improving the UI/UX for customers as adoption of this configuration continues at pace.

Goals (aka. expected user outcomes)

The UI/UX should be improved when working with a mixed-architecture OCP cluster

Requirements (aka. Acceptance Criteria):

  • Check that only the relevant CSI drivers are deployed to the relevant architectures
  • Improve filtering/auto-detection of architectures in OperatorHub
  • Console improvements, especially node views

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Y
Classic (standalone cluster) Y
Hosted control planes Y
Multi node, Compact (three node), or Single node (SNO), or all Y
Connected / Restricted Network Y
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All architectures
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) OpenShift Console
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Console improvements, especially node views

Why is this important?

  •  

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
  • API PR to make this GA
  • Release PR to remove TechPreview from being a required job in OVNK repo

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

As a dev I want the base code to be easier to read, maintain and test

Why is this important?

If devs don't have a healthy dev environment the project will stall and the business won't make $$

Scenarios

  1. ...

Acceptance Criteria

  • 80% unit tested code
  • No file > 1000 lines of code

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • The goal of this epic is to prepare the console codebase as well as the dynamic plugins SDK. In order to do that we need to identify areas in the console that need to be updated and issues that need to be fixed.

Why is this important?

  • The console, as well as its dynamic plugins, will need to support PatternFly 6 (PF6) once it is available in a stable version

Acceptance Criteria

  • Identify all the areas of code that need to be updated or fixed
  • Create stories which will address those updates and fixes

Open questions::

  1. Should we be removing PF4 as part of 4.16?

NOTE:

Nicole Thoen has already started crafting a "technical debt impeding PF6 migrations" document, which contains a list of identified tech-debt items, deprecated components, etc.

resource-dropdown.tsx (checkbox, options have tooltips, grouped options, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)

resource-log.tsx

filter-toolbar.tsx (grouped, checkbox select)

monitoring/dashboards/index.tsx  (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead) covered by https://issues.redhat.com/browse/ODC-7655

silence-form.tsx (Currently using DropdownDeprecated, should be using a Select)

timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655

poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655

 

Note

SelectDeprecated components are replaced with the latest Select component

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of Deprecated components with PF components, based on their semantics (either Dropdown or Select components).

 

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than "audit" and "warn".
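For reference, the per-namespace Pod Security Admission labels involved are sketched below; this is only an illustration of the synchronization behaviour described above, not the actual controller code.

    package psalabels

    import corev1 "k8s.io/api/core/v1"

    // syncPSALabels applies the Pod Security Admission labels for a namespace.
    // Before the global change only "audit" and "warn" are synchronized; with
    // global "restricted" enforcement the mechanism switches to "enforce".
    func syncPSALabels(ns *corev1.Namespace, profile string, enforce bool) {
        if ns.Labels == nil {
            ns.Labels = map[string]string{}
        }
        if enforce {
            ns.Labels["pod-security.kubernetes.io/enforce"] = profile // e.g. "restricted"
            return
        }
        ns.Labels["pod-security.kubernetes.io/audit"] = profile
        ns.Labels["pod-security.kubernetes.io/warn"] = profile
    }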

Epic Goal

Get Pod Security Admission to run in "restricted" mode globally by default alongside SCC admission.

When creating a custom SCC, it is possible to assign it a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher-priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least privilege, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
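A minimal sketch of what pinning could look like for a platform workload, assuming the `openshift.io/required-scc` pod annotation is the pinning mechanism; the SCC name is only an example and each component should pin the least-privileged SCC it actually needs.

    package pinning

    import appsv1 "k8s.io/api/apps/v1"

    // Assumed annotation key used to pin a workload to a specific SCC.
    const requiredSCCAnnotation = "openshift.io/required-scc"

    // pinSCC sets the required-scc annotation on the pod template so that SCC
    // admission always applies the named SCC, e.g. "restricted-v2", or
    // "privileged" for runlevel 0 namespaces.
    func pinSCC(deploy *appsv1.Deployment, scc string) {
        if deploy.Spec.Template.Annotations == nil {
            deploy.Spec.Template.Annotations = map[string]string{}
        }
        deploy.Spec.Template.Annotations[requiredSCCAnnotation] = scc
    }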

The following tables track progress.

Progress summary

# namespaces 4.18 4.17 4.16 4.15
monitored 82 82 82 82
fix needed 69 69 69 69
fixed 34 30 30 39
remaining 35 39 39 30
~ remaining non-runlevel 15 19 19 10
~ remaining runlevel (low-prio) 20 20 20 20
~ untested 2 2 2 82

Progress breakdown

# namespace 4.18 4.17 4.16 4.15
1 oc debug node pods #1763 #1816 #1818
2 openshift-apiserver-operator #573 #581
3 openshift-authentication #656 #675
4 openshift-authentication-operator #656 #675
5 openshift-catalogd #50 #58
6 openshift-cloud-credential-operator #681 #736
7 openshift-cloud-network-config-controller #2282 #2490 #2496  
8 openshift-cluster-csi-drivers     #170 #459 #484
9 openshift-cluster-node-tuning-operator #968 #1117
10 openshift-cluster-olm-operator #54 n/a
11 openshift-cluster-samples-operator #535 #548
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211
13 openshift-cluster-version     #1038 #1068
14 openshift-config-operator #410 #420
15 openshift-console #871 #908 #924
16 openshift-console-operator #871 #908 #924
17 openshift-controller-manager #336 #361
18 openshift-controller-manager-operator #336 #361
19 openshift-e2e-loki #56579 #56579 #56579 #56579
20 openshift-image-registry     #1008 #1067
21 openshift-infra        
22 openshift-ingress #1031      
23 openshift-ingress-canary #1031      
24 openshift-ingress-operator #1031      
25 openshift-insights     #915 #967
26 openshift-kni-infra #4504 #4542 #4539 #4540
27 openshift-kube-storage-version-migrator #107 #112
28 openshift-kube-storage-version-migrator-operator #107 #112
29 openshift-machine-api   #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443
30 openshift-machine-config-operator   #4219 #4384 #4393
31 openshift-manila-csi-driver #234 #235 #236
32 openshift-marketplace #578   #561 #570
33 openshift-metallb-system #238 #240 #241  
34 openshift-monitoring     #2335 #2420
35 openshift-network-console        
36 openshift-network-diagnostics #2282 #2490 #2496  
37 openshift-network-node-identity #2282 #2490 #2496  
38 openshift-nutanix-infra #4504 #4504 #4539 #4540
39 openshift-oauth-apiserver #656 #675
40 openshift-openstack-infra #4504 #4504 #4539 #4540
41 openshift-operator-controller #100 #120
42 openshift-operator-lifecycle-manager #703 #828
43 openshift-route-controller-manager #336 #361
44 openshift-service-ca #235 #243
45 openshift-service-ca-operator #235 #243
46 openshift-sriov-network-operator #754 #995 #999 #1003
47 openshift-storage        
48 openshift-user-workload-monitoring #2335 #2420
49 openshift-vsphere-infra #4504 #4542 #4539 #4540
50 (runlevel) kube-system        
51 (runlevel) openshift-cloud-controller-manager        
52 (runlevel) openshift-cloud-controller-manager-operator        
53 (runlevel) openshift-cluster-api        
54 (runlevel) openshift-cluster-machine-approver        
55 (runlevel) openshift-dns        
56 (runlevel) openshift-dns-operator        
57 (runlevel) openshift-etcd        
58 (runlevel) openshift-etcd-operator        
59 (runlevel) openshift-kube-apiserver        
60 (runlevel) openshift-kube-apiserver-operator        
61 (runlevel) openshift-kube-controller-manager        
62 (runlevel) openshift-kube-controller-manager-operator        
63 (runlevel) openshift-kube-proxy        
64 (runlevel) openshift-kube-scheduler        
65 (runlevel) openshift-kube-scheduler-operator        
66 (runlevel) openshift-multus        
67 (runlevel) openshift-network-operator        
68 (runlevel) openshift-ovn-kubernetes        
69 (runlevel) openshift-sdn        

Feature Overview (aka. Goal Summary)  

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

for Phase-1, incorporating the assets from different repositories to simplify asset management.

Background, and strategic fit

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has since moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To add support for generating Cluster and Infrastructure Cluster resources on Cluster API based clusters

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

To enable a quick start in CAPI, we want to allow users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure either is not required to be populated or is something they should not need to care about.

To enable the quick start, we should create, and where applicable populate, the required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.
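A rough sketch of the kind of object the new controller would create, using unstructured types; the API version and the use of the `cluster.x-k8s.io/managed-by` annotation to mark the resource as externally managed are assumptions for illustration.

    package capibootstrap

    import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

    // externallyManagedVSphereCluster builds a VSphereCluster object that our own
    // controller (not the upstream provider) owns, and marks it ready so that
    // Machines can be created against it.
    func externallyManagedVSphereCluster(namespace, name string) *unstructured.Unstructured {
        u := &unstructured.Unstructured{}
        u.SetAPIVersion("infrastructure.cluster.x-k8s.io/v1beta1") // assumed version
        u.SetKind("VSphereCluster")
        u.SetNamespace(namespace)
        u.SetName(name)
        // Presence of the managed-by annotation tells Cluster API that an external
        // controller manages this infrastructure resource; the value is illustrative.
        u.SetAnnotations(map[string]string{"cluster.x-k8s.io/managed-by": ""})
        // The controller is then responsible for reporting readiness and any
        // required spec/status fields.
        _ = unstructured.SetNestedField(u.Object, true, "status", "ready")
        return u
    }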

Steps

  • Create a separate Go module in cluster-api-provider-vsphere/openshift
  • Create a controller in the above Go module to manage the VSphereCluster resource for non-CAPI-bootstrapped clusters
  • Ensure the VSphereCluster controller is only enabled for vSphere platform clusters
  • Create an "externally-managed" VSphereCluster resource and manage the status to ensure Machines can correctly be created
  • Populate any required spec/status fields in the VSphereCluster spec using the controller
  • (Refer to the OpenStack implementation)

Stakeholders

  • Cluster Infra

Definition of Done

  • VSphereCluster resource is correctly created and populated on VSphere clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

When customers use CAPI, there must be no negative effect from switching over to using CAPI: seamless migration of Machine resources, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

When the Machine and MachineSet MAPI resources are non-authoritative, the Machine and MachineSet controllers should observe this condition and exit, pausing reconciliation.

When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
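A hypothetical Go sketch of the check and condition handling described here; the field value and condition/reason names are assumptions, and the real definitions live in the vendored openshift/api types.

    package pause

    import (
        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    const pausedCondition = "Paused" // assumed condition type

    // shouldPause returns true when the resource is not authoritative. An empty
    // value means the defaulting/migration webhook has not run yet, so the
    // controller keeps reconciling as before.
    func shouldPause(authoritativeAPI string) bool {
        return authoritativeAPI != "" && authoritativeAPI != "MachineAPI"
    }

    // setPaused records the paused state, giving a reason for both true and false.
    func setPaused(conditions *[]metav1.Condition, paused bool, generation int64) {
        status, reason := metav1.ConditionFalse, "AuthoritativeAPIMachineAPI"
        if paused {
            status, reason = metav1.ConditionTrue, "AuthoritativeAPINotMachineAPI"
        }
        meta.SetStatusCondition(conditions, metav1.Condition{
            Type:               pausedCondition,
            Status:             status,
            Reason:             reason,
            ObservedGeneration: generation,
        })
    }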

Behaviours

  • Should not reconcile when .status.authoritativeAPI is not MachineAPI
  • Except when it is empty (prior to defaulting migration webhook)

Steps

  • Ensure MAO has new API fields vendored
  • Add checks in Machine/MachineSet for authoritative API in status not Machine API
  • When not machine API, set paused condition == true, otherwise paused == false (same as CAPI)
    • Condition should be giving reasons for both false and true
  • This feature must be gated on the ClusterAPIMigration feature gate

Stakeholders

  • Cluster Infra

Definition of Done

  • When the status of a Machine indicates that the Machine API is not authoritative, the Paused condition should be set and no action should be taken.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview

Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.

Goals

Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).

Use Cases

Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.

 

Done Done Done Criteria

This section contains all the test cases that we need to make sure work as part of the done^3 criteria.

  • Clean install of new cluster with multi vCenter configuration
  • Clean install of new cluster with single vCenter still working as previously
  • VMs / machines can be scaled across all vCenters / Failure Domains
  • PVs should be able to be created on all vCenters

Out-of-Scope

This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.

  • Migration of a single vCenter OCP cluster to multi vCenter (stretch goal)

Feature Overview

Add authentication to the internal components of the Agent Installer so that the cluster install is secure.

Goals

  • Day 1: Only allow agents booted from the same agent ISO to register with the assisted-service and use the agent endpoints
  • Day 2: Only allow agents booted from the same node ISO to register with the assisted-service and use the agent endpoints
  • Only allow access to write endpoints to the internal services
  • Use authentication for read endpoints

 

Epic Goal

  • This epic scope was originally to encompass both authentication and authorization but we have split the expanding scope into a separate epic.
  • We want to add authorization to the internal components of Agent Installer so that the cluster install is secure. 

Why is this important?

  • The Agent Installer API server (assisted-service) has several methods for authorization, but none of the existing methods are applicable to the Agent Installer use case.
  • During the MVP of Agent Installer we attempted to turn on the existing authorization schemes but found we didn't have access to the correct API calls.
  • Without proper authorization it is possible for an unauthorized node to be added to the cluster during install. Currently we expect this to happen by mistake rather than maliciously.

Brainstorming Notes:

Requirements

  • Allow only agents booted from the same ISO to register with the assisted-service and use the agent endpoints
  • Agents already know the InfraEnv ID, so if read access requires authentication then that is sufficient in some existing auth schemes.
  • Prevent access to write endpoints except by the internal systemd services
  • Use some kind of authentication for read endpoints
  • Ideally use existing credentials - admin-kubeconfig client cert and/or kubeadmin-password
  • (Future) Allow UI access in interactive mode only

 

Are there any requirements specific to the auth token?

  • Ephemeral
  • Limited to one cluster: Reuse the existing admin-kubeconfig client cert

 

Actors:

  • Agent Installer: example wait-for
  • Internal systemd: configurations, create cluster infraenv, etc
  • UI: interactive user
  • User: advanced automation user (not supported yet)

 

Do we need more than one auth scheme?

Agent-admin - agent-read-write

Agent-user - agent-read

Options for Implementation:

  1. New auth scheme in assisted-service
  2. Reverse proxy in front of the assisted-service API (see the rough sketch after this list)
  3. Use an existing auth scheme in assisted-service
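As a non-authoritative illustration of option 2 only: a thin reverse proxy in front of the assisted-service API could gate mutating requests on a separate token, so that only the internal systemd services can write while booted agents read with a weaker credential. The addresses, environment variables, and token scheme below are purely hypothetical.

    package main

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
        "os"
    )

    func main() {
        // Assumed local assisted-service address and proxy listen address.
        target, _ := url.Parse("http://127.0.0.1:8090")
        proxy := httputil.NewSingleHostReverseProxy(target)

        readToken := "Bearer " + os.Getenv("AGENT_READ_TOKEN")   // illustrative
        writeToken := "Bearer " + os.Getenv("AGENT_WRITE_TOKEN") // illustrative

        handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            required := readToken
            if r.Method != http.MethodGet && r.Method != http.MethodHead {
                // Write endpoints are reserved for the internal services.
                required = writeToken
            }
            if r.Header.Get("Authorization") != required {
                http.Error(w, "forbidden", http.StatusForbidden)
                return
            }
            proxy.ServeHTTP(w, r)
        })

        _ = http.ListenAndServe(":8091", handler)
    }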

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. AGENT-60 Originally we wanted to just turn on local authorization for Agent Installer workflows. It was discovered this was not sufficient for our use case.

Open questions::

  1. Which API endpoints do we need for the interactive flow?
  2. What auth scheme does the Assisted UI use if any?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, when creating node ISOs, I want to be able to:

  • See the ISO's expiration time logged when the ISO is generated using "oc adm node-image create"

so that I can achieve

  • Enhanced awareness of the ISO expiration date
  • Prevention of unexpected expiration issues
  • Improved overall user experience during node creation

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in customers' accounts (system components) would be scoped with Azure workload identities.

Epic Goal

  • Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

  • MSI authentication is required for any component that will run on the control plane side in ARO hosted control planes.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (Goal Summary)

This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.

Goal

  • Support configuring Azure diagnostics for boot diagnostics on NodePools

Why is this important?

  • When a node fails to join the cluster, serial console logs are useful in troubleshooting, especially for managed services. 

Scenarios

  1. Customer scales / creates nodepool
    1. nodes created
    2. one or more nodes fail to join the cluster
    3. cannot ssh to nodes because ssh daemon did not come online
    4. Can use diagnostics + a managed storage account to fetch serial console logs to troubleshoot

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. CAPZ already supports this, so the dependency should be on the HyperShift team implementing it: https://github.com/openshift/cluster-api-provider-azure/blob/master/api/v1beta1/azuremachine_types.go#L117

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

This feature is to track automation in ODC, related packages, upgrades, and some tech debt

Goals

  • Improve automation for Pipelines dynamic plugins
  • Improve automation for OpenShift Developer console
  • Move cypress script into frontend to make it easier to approve changes
  • Update to latest PatternFly QuickStarts

Requirements

  • TBD
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. No

 

Questions to answer…

  • Is there overlap with what other teams at RH are already planning?  No overlap

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

This won't impact documentation; this feature is mostly to enhance end-to-end tests and job runs on CI

Assumptions

  • ...

Customer Considerations

  • No direct impact to customer

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem:

Improving existing tests in CI to run more tests

Goal:

Why is it important?

Use cases:

  1. Improving test execution to get more tests run on CI

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Feature Overview

Improve onboarding experience for using Shipwright Builds in OpenShift Console

Goals

Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright 

Requirements

 

Requirements Notes IS MVP
Enable creating Shipwright Builds using a form   Yes
Allow use of Shipwright Builds for image builds during import flows   Yes
Enable access to build strategies through navigation   Yes

Use Cases

  • Use Shipwright Builds for image builds in import flows
  • Enable form-based creation of Shipwright Builds without requiring YAML expertise
  • Provide access to Shipwright resources through navigation

Out of scope

TBD

Dependencies

TBD

Background, and strategic fit

Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.

Assumptions

TBD

Customer Considerations

TBD

Documentation/QE Considerations

TBD

Impact

TBD

Related Architecture/Technical Documents

TBD

Definition of Ready

  • The objectives of the feature are clearly defined and aligned with the business strategy.
  • All feature requirements have been clearly defined by Product Owners.
  • The feature has been broken down into epics.
  • The feature has been stack ranked.
  • Definition of the business outcome is in the Outcome Jira (which must have a parent Jira).

 
 

Problem:

Creating Shipwright Builds through YAML is complex and requires Shipwright expertise, which makes it difficult for novice users to use Shipwright

Goal:

Provide a form for creating Shipwright Builds

Why is it important?

To simplify adoption of Shipwright and ease onboarding

Use cases:

Create build

Acceptance criteria:

  • User can create Shipwright Builds through a form (instead of YAML editor)
  • The Shipwright Build form asks the user for the following input (a sketch of the resulting Build resource follows this list)
    • User can provide Git repository url
    • User can choose to see the advanced options for Git url and provide additional details
      • Branch/tag/ref
      • Context dir
      • Source secret
    • User is able to create a source secret without navigating away from the form
    • User can select a build strategy from strategies that are available in the cluster (cluster-wide or in the namespace)
    • User can provide param values related to the selected build strategy
    • User can provide environment variables (text, from configmap, from secret)
    • User can provide output image url to an image registry and push secret
    • User is able to create a push secret without navigating away from the form
    • User can add volumes to the build
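For illustration only, a minimal sketch of the kind of Build resource such a form could produce; the name, strategy, repository, and registry values below are assumptions, not part of this story:

apiVersion: shipwright.io/v1beta1
kind: Build
metadata:
  name: example-build                      # hypothetical name
spec:
  source:
    type: Git
    git:
      url: https://github.com/example/app  # illustrative repository
    contextDir: .
  strategy:
    name: buildah                          # any build strategy available in the cluster
    kind: ClusterBuildStrategy
  output:
    image: image-registry.openshift-image-registry.svc:5000/example/app:latest
    pushSecret: example-push-secret        # hypothetical push secret created from the form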

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to create a Shipwright build using the form.

Acceptance Criteria

  1. Create a Form/YAML switcher on the create page
  2. Users should provide Shipwright build name
  3. Users should provide Git repository url
  4. Users should choose to see the advanced options for Git url and provide additional details
    • Branch/tag/ref
    • Context dir
    • Source secret
  5. Users should create a source secret without navigating away from the form
  6. Users should select a build strategy from strategies that are available in the cluster (cluster-wide or in the namespace)
  7. Users should provide param values related to the selected build strategy
  8. Users should provide environment variables
  9. Users should provide output image URL to an image registry and push secret
  10. Users should create a push secret without navigating away from the form
  11. Users should add volumes to the build
  12. Add e2e tests

Additional Details:

BU Priority Overview

To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.

Goals

  • Introduce ARM CPUs for the RAN DU scenario (SNO deployment) with feature parity to Intel Ice Lake/SPR-EE/SPR-SP w/o QAT for DU with:
    • STD-kernel (RT-Kernel is not supported by RHEL)
    • SR-IOV and DPDK over SR-IOV
    • PTP (OC, BC). Partner asked for LLS-C3, according to Fujitsu - ptp4l and phy2sys to work with NVIDIA Aerial SDK
  • Characterize ARM-based RAN DU solution performance and power metrics (unless performance parameters are specified by partners,  we should propose them, see Open questions)
  • Productize ARM-based RAN DU solution by 2024 (partner’s expectation).

State of the Business

Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons) it is an attractive place to make substantial improvements using economies of scale.

There are currently three main approaches to this:

  • Introducing tools that improve overall observability w.r.t. power utilization of the network.
  • Improvement of existing RAN architectures via smarter orchestration of workloads, fine tuning hardware utilization on a per site basis in response to network usage, etc.
  • Introducing alternative architectures which have been designed from the ground up with lower power utilization as a goal.

This BU priority focuses on the third of these approaches.

BoM

Out of scope

Open questions:

  • What are the latency KPIs? Do we need a RT-kernel to meet them?
  • What page size is expected?
  • What are the performance/throughput requirements?

Reference Documents:

Planning call notes from Apr 15

Epic Goal

Both the Node Tuning Operator and TuneD assume the Intel x86 architecture is used when a Performance Profile is applied. For example, they both configure Intel x86 specific kernel parameters (e.g. intel_pstate).

In order to support Telco RAN DU deployments on the ARM architecture, we will need a way to apply a performance profile to configure the server for low latency applications. This will include tuning common to both Intel/ARM and tuning specific to one of the architectures.
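For context, the performance profile applied today looks roughly like the sketch below (values are illustrative; which of these knobs make sense on ARM is exactly what this epic needs to work out):

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: ran-du-example                 # hypothetical name
spec:
  cpu:
    isolated: "2-31"                   # illustrative CPU partitioning
    reserved: "0-1"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - size: 1G
      count: 16
  realTimeKernel:
    enabled: false                     # RT kernel is not available for ARM per the goals above
  nodeSelector:
    node-role.kubernetes.io/master: ""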

The purpose of this Epic:

  • Design an NTO/TuneD solution that will support Intel, ARM and AMD specific tunings. Investigate whether the best approach will be to have a common performance profile that can apply to all architectures or separate performance profiles for each architecture.
  • Implement the NTO/TuneD changes to enable multi-architecture support. Depending on the scope of the changes, additional epics may be required.

Why is this important?

  • In order to support Telco RAN DU deployments on the ARM architecture, we need a way to apply a performance profile to configure the server for low latency applications.

Scenarios

  1. SNO configured with Telco 5G RAN DU reference configuration

Acceptance Criteria

  • Design for ARM support in NTO/TuneD has been reviewed and approved by appropriate stakeholders.
  • NTO/TuneD changes implemented to enable multi-architecture support. Not all ARM-specific tunings will be known yet, but the framework to support these tunings needs to be there.

Dependencies (internal and external)

  1. Obtaining an ARM server to do the investigation/testing.

Previous Work (Optional):

  1. Some initial prototyping on an HPE ARM server has been done and some initial tuned issues have been documented: https://docs.google.com/presentation/d/1dBQpdVXe3kIjlLjj1orKShktEr1zIqtXoBcIo6oykrs/edit#slide=id.g2ac442e1556_0_69

Open questions:

  1. TBD

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This story will serve to collect minor upstream enhancements to NTO that do not directly belong to an objective story in the greater epic

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Convert the Cluster Configuration single-page form into a multi-step wizard. The goal is to avoid overwhelming the user with all information on a single page and to provide guidance through the configuration process.

Wireframes: 

Phase1:
https://marvelapp.com/prototype/fjj6g57/screen/76442394

Future:
https://marvelapp.com/prototype/78g662d/screen/71444815
https://marvelapp.com/prototype/7ce7ib3/screen/73190117

 

Phase 1 wireframes: https://marvelapp.com/prototype/fjj6g57/screen/76442399

 

This requires UX investigation to handle the case when the base DNS is not set yet and the clusters list has several clusters with the same name.

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Description
This epic covers the changes needed to the ARO RP for

  1. Introducing workload identity support in the ARO RP
    1. Creating (or validating) user assigned identities during installation
      1. Adding role assignments to the custom roles and proper scope
      2. Validating permissions
    2. Creating OIDC blob store for the cluster
      1. Attaching a cert/pem file to the kube-serviceaccount signer
      2. Exposing the issuerURL to the customer
  2. Introduce MSI / Cluster MSI

ACCEPTANCE CRITERIA:

What is "done", and how do we measure it? You might need to duplicate this a few times.
 

  1. RP level changes
    1. Create keypair, and generate OIDC documents
      1. Generate issuerURL
      2. Validate permissions / roles on the identities
  2. Code changes are completed and merged to ARO-RP and its components to allow customers to install workload identity clusters.
    1. Clusters are configured with proper credentialsrequests
    2. The authentication configuration is configured with the correct service account issuer
    3. The pod identity webhook config is created
    4. Bound service account key is passed to the installer

NON GOALS:

  1. Release the API
  2. Support migration between tenants
  3. Hive changes
  4. Allow migration between service principal clusters and workload identity clusters
  5. Support key rotation for OIDC

CUSTOMER EXPERIENCE:

Only fill this out for Product Management / customer-driven work. Otherwise, delete it.

  • Does this feature require customer facing documentation? YES/NO
    • If yes, provide the link once available
  • Does this feature need to be communicated with the customer? YES/NO
      • How far in advance does the customer need to be notified?
      • Ensure PM signoff that communications for enabling this feature are complete
  • Does this feature require a feature enablement run (i.e. feature flags update) YES/NO
    • If YES, what feature flags need to change?
      • FLAG1=valueA
    • If YES, is it safe to bundle this feature enablement with other feature enablement tasks? YES/NO

 
BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

NOTES:

Need to determine if (in 4.14 azure workload identity functionality) we need to create secrets/secret manifests for each operator manually as part of the ARO cluster install, or if we can leverage credentialsrequests to do this automatically somehow. How will necessary secrets be created?

DESCRIPTION:

  • This story covers investigating SDK / code changes to allow auth to cluster storage accounts with RBAC / Azure AD rather than shared keys.
    • Currently, our cluster storage accounts deploy with shared key access enabled. The flow for auth is:
      • Image registry operator uses a secret in its own namespace to pull the account keys. It uses these keys to fetch an access token that then grants data plane access to the storage account.
      • Cluster storage account is accessed by the RP using a SAS token. This storage account is used to host ignition files, graph, and boot diagnostics.
        • RP accesses it for boot diagnostics when SRE executes a geneva action to view them. Graph and ignition are also stored here.
          • This storage account doesn’t appear to be accessed from inside of the cluster, only by the first party service principal
  • Image registry team has asked ARO for assistance in identifying how to best migrate away from the shared key access.

ACCEPTANCE CRITERIA:

  • Image registry uses managed identity auth instead of SAS tokens. SRE understands how to make the changes.
  • Cluster storage account uses managed identity auth instead of SAS tokens. SRE understands how to make the changes

NON GOALS:

  •  

BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable that for OCP versions >= 4.15.

DoD (Definition of Done)

iSCSI boot is enabled for OCP versions >= 4.15, both in the UI and the backend.

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` kernel argument during install to enable iSCSI booting.
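As a rough sketch of one way such a kernel argument can be carried (the exact mechanism used by this install flow may differ; the manifest name and role below are assumptions):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-iscsi-karg                       # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
  - rd.iscsi.firmware=1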

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle
    • NetApp
    • Cisco

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support and we want to allow customers to use it

In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:

  • an interface connected to the iSCSI volume
  • an interface used as default gateway that will be used by OCP

This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes the default interface impractical for iSCSI traffic because we lose the root volume, and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071

In the scope of this issue we need to:

  • report iSCSI host IP address from the assisted agent
  • check that the network interface used for the iSCSI boot volume is not the default one (the default gateway goes through one of the other interfaces) => implies 2 network interfaces
  • ensure that the network interface connected to the iSCSI network is configured with DHCP in the kernel args in order to mount the root volume over iSCSI
  • workaround https://issues.redhat.com/browse/OCPBUGS-26580 by dropping a script in a MachineConfig manifest that will reload the network interfaces on first boot
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

CMO creates a default Alertmanager configuration on cluster bootstrap. The configuration should have the following snippet when a cluster proxy is configured:

global:
  http_config:
    proxy_from_environment: true
 

The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single vs manifest-listed. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to use the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with, or upgraded to, a multi payload. So a few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload
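For illustration, the per-tag setting that already exists today looks like this (the imagestream name and image are assumptions; the epic is about changing the default, not this syntax):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example                            # hypothetical name
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/example/app:latest     # illustrative image
    importPolicy:
      importMode: PreserveOriginal         # import the full manifest list instead of a single manifest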

Some open questions:

  • What happens to existing imagestreams on upgrades
  • How do we handle CVO managed imagestreams (IMO, CVO managed imagestreams should always set importMode to preserveOriginal as the images are associated with the payload)

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Stop setting `-cloud-provider` and `-cloud-config` arguments on KAS, KCM and MCO
  • Remove `CloudControllerOwner` condition from CCM and KCM ClusterOperators
  • Remove feature gating reliance in library-go IsCloudProviderExternal
  • Remove CloudProvider feature gates from openshift/api

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

OCPCLOUD-2514 prevented feature gates from being used with the CCMs.
We have been asked not to remove the feature gates themselves until 4.18.

PR to track: https://github.com/openshift/api/pull/1780

We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.

Steps

  • Update library go to remove reliance on feature gates
  • Update callers to no longer rely on feature gate accessor (KCMO, KASO, MCO, CCMO)
  • Remove feature gates from API repo

Stakeholders

  • Cluster Infra
  • MCO team
  • Workloads team
  • API server team

Definition of Done

  • Feature gates for external cloud providers are removed from the product
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Description

This is a placeholder epic to group refactoring and maintenance work required in the monitoring plugin

Background

In order to provide customers the option to process alert data externally, we need to provide a way the data can be downloaded from the OpenShift console. The monitoring plugin uses a Virtualized table from the dynamic plugin SDK. We should include the change in this table so it is available for others.

Outcomes

  • a CSV can be downloaded from the alerts table, including the alert labels, severity and state (firing)

 

--- 

NOTE: 

There is a duplicate issue in the OpenShift console board:  https://issues.redhat.com//browse/CONSOLE-4185

This is because the console > CI/CD > prow configurations require that any PR in the openshift/console repo needs to have an associated Jira issue in the openshift console Jira board. 

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update OCP release number in OLM metadata manifests of:

  • local-storage-operator
  • aws-efs-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • secrets-store-csi-driver-operator
  • smb-csi-driver-operator

OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56

We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

    As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate.

However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100%

Steps to Reproduce:

   $ oc get featuregates.config.openshift.io cluster -oyaml 
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>     

Actual results:

    Both RouteExternalCertificate and ExternalRouteCertificate were added in the API

Expected results:

We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html

Additional info:

 Git commits

https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3

https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930

Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219

Description of problem:

On pages under "Observe"->"Alerting", it shows "Not found" when no resources found 
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-07-11-082305
    

How reproducible:


    

Steps to Reproduce:

    1. Check tabs under "Observe"->"Alerting" when there are no related resources, e.g. "Alerts", "Silences", "Alerting rules".
    2.
    3.
    

Actual results:

1. 'Not found' is shown under each tab.
    

Expected results:

1. It's better to show "No <resource> found" like other resource pages, e.g. "No Deployments found"
    

Additional info:


    

Description of problem:

    openshift-install create cluster leads to error:
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'. 

vSphere standard port group

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. openshift-install create cluster
    2. Choose vSphere
    3. Fill in the blanks
    4. Have a standard port group
    

Actual results:

    error

Expected results:

    cluster creation

Additional info:

    

Please review the following PR: https://github.com/openshift/ironic-image/pull/539

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    e980 is a valid system type for the Madrid region but it is not listed as such in the installer.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy to mad02 with SysType set to e980
    2. Fail
    3.
    

Actual results:

    Installer exits

Expected results:

    Installer should continue as it's a valid system type.

Additional info:

    

Description of problem:

Periodic jobs are failing due to a change in CoreOS.

Version-Release number of selected component (if applicable):

    4.15,4.16,4.17,4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Check any periodic conformance jobs
    2.
    3.
    

Actual results:

    periodic conformance fails with hostedcluster creation

Expected results:

    periodic conformance test succeeds

Additional info:

    

Description of problem:

Navigation:
           Storage -> StorageClasses -> Create StorageClass -> Provisioner -> kubernetes.io/gce-pd

Issue:
           "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English.
        

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-01-063526

How reproducible:

Always

Steps to Reproduce:

1. Log into web console and set language to non en_US
2. Navigate to Storage -> StorageClasses -> Create StorageClass -> Provisioner
3. Select Provisioner "kubernetes.io/gce-pd"
4. "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English

Actual results:

Content is in English

Expected results:

Content should be in set language.

Additional info:

Screenshot reference attached

Modify the import to strip or change the bootOptions.efiSecureBootEnabled

https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319

archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
	// Open the corrupt OVA file
	f, ferr := os.Open(cachedImage)
	if ferr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), ferr)
	}
	defer f.Close()

	// Get a sha256 on the corrupt OVA file
	// and the size of the file
	h := sha256.New()
	written, cerr := io.Copy(h, f)
	if cerr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), cerr)
	}

	return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
	return fmt.Errorf("failed to parse ovf: %w", err)
}

Description of problem:

When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly *after* 2024-08-09-031511

How reproducible: always

Steps to Reproduce:

  1. Enable TechPreviewNoUpgrade
  2. Add a new vCenter to infrastructure. It can be the same one as the existing one - we just need to trigger "disable CSI migration when there are 2 or more vCenters"
  3. See that vsphere-csi-config-secret changed and has `migration-datastore-url =` (i.e. empty string value)

Actual results: the controller pods are not restarted

Expected results: the controller pods are  restarted

Description of problem

The cluster-dns-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-dns-operator repository also vendors k8s.io/* v0.29.2 packages. However, OpenShift 4.17 is based on Kubernetes 1.30.

Version-Release number of selected component (if applicable)

4.17.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-dns-operator/blob/release-4.17/go.mod.

Actual results

The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/* packages are at v0.29.2.

Expected results

The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and the k8s.io/* packages are at v0.30.0 or newer.

Additional info

The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.

Description of problem:

    In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Open /settings/cluster using Firefox with Dark mode selected
    2.
    3.
    

Actual results:

    The version numbers under Update status are black

Expected results:

    The version numbers under Update status are white

Additional info:

    

Description of problem:

See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700

 On 1.8.2024, the assisted-installer-agent job started failing the subsystem test "add_multiple_servers". We need to make sure it occurs only in tests, and the fix should be backported.

Description of problem:

    When an image is referenced by tag and digest, oc-mirror skips the image

Version-Release number of selected component (if applicable):

    

How reproducible:

    Do mirror to disk and disk to mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 and the operator multiarch-tuning-operator

Steps to Reproduce:

    1 mirror to disk
    2 disk to mirror

Actual results:

    docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format

Expected results:

The image should be mirrored    

Additional info:

    

The AWS EFS CSI Operator primarily passes credentials to the CSI driver using environment variables. However, this practice is discouraged by the OCP Hardening Guide.

Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the kube update exposed or introduced some kind of bug.

 

This is a clear regression and it is only present on 4.17, not 4.16.  It is present across all platforms, though I've selected AWS for links and screenshots.

 

4.17 graph - shows the change

4.16 graph - shows no change

slack thread if there are questions

courtesy screen shot

CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819

4.18 Micro upgrade failures began with the initial payload  4.18.0-0.ci-2024-08-09-234503

CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346

The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518

In OCPBUGS-38414, a new featuregate was turned on that didn't work correctly on metal (or at least its tests didn't). Metal should have techpreview jobs to ensure new features are tested properly. I think the right matrix is:

  • e2e-metal-ovn-techpreview
  • e2e-metal-ovn-ipv6-techpreview
  • e2e-metal-ovn-dualstack-techpreview

On standard CI jobs, we incorporate this by wiring in the appropriate FEATURE_SET variable, but metal jobs don't currently have a way to do this as far as I can tell.

These should be release informers.

 

https://github.com/openshift/release/blob/5ce4d77a6317479f909af30d66bc0285ffd38dbd/ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh#L63-L68 is the relevant step
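For reference, what the FEATURE_SET wiring ultimately produces is a stanza like this in the generated install-config (a sketch; how the metal jobs should inject it is the open question here):

# install-config.yaml fragment (illustrative)
apiVersion: v1
baseDomain: example.com            # placeholder
featureSet: TechPreviewNoUpgrade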

 

Description of problem:

    If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail in bootstrap due to API validation. Versions > 4.17 will support multiple NICs; versions < 4.17 will not and will fail.

Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
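For illustration, the install-config shape that triggers this is roughly the following (names are placeholders):

platform:
  vsphere:
    failureDomains:
    - name: fd-1                   # hypothetical failure domain
      topology:
        networks:
        - segment-a                # first NIC / port group (placeholder)
        - segment-b                # second NIC / port group (placeholder)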

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

While working on the readiness probes we have discovered that the single member health check always allocates a new client. 

Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check.

This should reduce CEO's and etcd CPU consumption.

Version-Release number of selected component (if applicable):

any supported version    

How reproducible:

always, but technical detail

Steps to Reproduce:

 na    

Actual results:

CEO creates a new etcd client when it is checking a single member health

Expected results:

CEO should use the existing pooled client to check for single member health    

Additional info:

    

Description of problem:

Redfish exception occurred while provisioning a worker using HW RAID configuration on HP server with ILO 5:

step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000

spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
  online: true

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Provision an HPE worker with iLO 5 using Redfish
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented.

On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power.

Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful.

[1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371

Version-Release number of selected component (if applicable):

Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions    

How reproducible:

Always

Steps to Reproduce:

    1. Deploy SNO node using ACM and fakefish as redfish interface
    2. Check metal3-ironic pod logs    

Actual results:

We can see a soft power_off command sent to the ironic agent running on the ramdisk:

2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197
2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234

Expected results:

There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.

Additional info:

    

The story is to track i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

I talked with Gerd Oberlechner; the hack/app-sre/saas_template.yaml file is not used anymore in app-interface.

It should be safe to remove this.

Description of problem:

Unable to deploy performance profile on multi nodepool hypershift cluster

Version-Release number of selected component (if applicable):

Server Version: 4.17.0-0.nightly-2024-07-28-191830 (management cluster)
Server Version: 4.17.0-0.nightly-2024-08-08-013133 (hosted cluster)

How reproducible:

    Always

Steps to Reproduce:

    1. In a multi nodepool hypershift cluster, attach a performance profile unique to each nodepool (see the sketch after these steps).
    2. Check the configmap and nodepool status.
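For context, a performance profile is typically attached to a NodePool by wrapping it in a ConfigMap and referencing it from the NodePool, roughly as sketched below (the ConfigMap name is an assumption chosen to match the objects in the output; other NodePool fields are omitted):

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: foobar3
  namespace: clusters
spec:
  tuningConfig:
  - name: pp2-foobar3        # ConfigMap embedding the PerformanceProfile for this nodepool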

Actual results:

root@helix52:~# oc get cm -n clusters-foobar2 | grep foo
kubeletconfig-performance-foobar2            1      21h
kubeletconfig-pp2-foobar3                    1      21h
machineconfig-performance-foobar2            1      21h
machineconfig-pp2-foobar3                    1      21h
nto-mc-foobar2                               1      21h
nto-mc-foobar3                               1      21h
performance-foobar2                          1      21h
pp2-foobar3                                  1      21h
status-performance-foobar2                   1      21h
status-pp2-foobar3                           1      21h
tuned-performance-foobar2                    1      21h
tuned-pp2-foobar3                            1      21h
root@helix52:~# oc get np
NAME      CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION                         UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
foobar2   foobar2   2               2               False         False        4.17.0-0.ci-2024-08-08-225819   False             True             
foobar3   foobar2   1               1               False         False        4.17.0-0.ci-2024-08-08-225819   False             True      
Hypershift Pod logs -

{"level":"debug","ts":"2024-08-14T08:54:27Z","logger":"events","msg":"there cannot be more than one PerformanceProfile ConfigMap status per NodePool. found: 2 NodePool: foobar3","type":"Warning","object":{"kind":"NodePool","namespace":"clusters","name":"foobar3","uid":"c2ba814a-31fe-409d-88c2-b4e6b9a41b26","apiVersion":"hypershift.openshift.io/v1beta1","resourceVersion":"6411003"},"reason":"ReconcileError"}

Expected results:

   Performance profile should apply correctly on both node pools

Additional info:

    

Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/44

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Component Readiness has found a potential regression in the following test:

operator conditions control-plane-machine-set

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-09%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-03%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set

The version page in our docs is out of date and needs to be updated with the current versioning standards we expect.

 

The minimum OCP management cluster / Kubernetes version needs to be added.

Description of problem:

IHAC who is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.

ENV DETAILS: Nutanix Versions: AOS: 6.5.4, NCC: 4.6.6.3, PC: pc.2023.4.0.2, LCM: 3.0.0.1

During the installation process, after the bootstrap nodes and control planes are created, the IP addresses on the nodes shown in the Nutanix Dashboard conflict, even when Infinite DHCP leases are set. The installation works successfully only when using the Nutanix IPAM. Also, the 4.14 and 4.15 releases install successfully. The IPs of master0 and master2 are conflicting; please check the attachment. Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing

The issue was reported via the Slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699

Version-Release number of selected component (if applicable):

    

How reproducible:

Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The installation will fail. 

Expected results:

    The installation succeeds to create a Nutanix OCP cluster with the DHCP network.

Additional info:

    

Description of problem:

    We should add validation in the Installer when public-only subnets is enabled to make sure that:

	1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set
	2. If this flag is only applicable for public clusters, we could consider exiting earlier if publish: Internal
	3. If this flag is only applicable for byo-vpc configuration, we could consider exiting earlier if no subnets are provided in install-config.

Version-Release number of selected component (if applicable):

    all versions that support public-only subnets

How reproducible:

    always

Steps to Reproduce:

    1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY
    2. Do a cluster install without specifying a VPC.
    3.
    

Actual results:

    No warning about the invalid configuration.

Expected results:

    

Additional info:

    This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.

Description of problem:

When adding nodes, the status of agent-register-cluster.service and start-cluster-installation.service should not be checked; in their place, agent-import-cluster.service and agent-add-node.service should be checked.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    console message shows that the start installation service and agent register service have not started

Expected results:

    console message shows that the agent import cluster and add host services have started

Additional info:

    

arm64 is dev preview by CNV since 4.14. The installer shouldn't block installing it.

Just make sure it is shown in the UI as dev preview.

Description of problem:

Re-enable the knative and A-04-TC01 tests that are being disabled in the PR https://github.com/openshift/console/pull/13931

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/143

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.

cc Ali Mobrem 

Component Readiness has found a potential regression in the following test:

[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry

Probability of significant regression: 98.02%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=aws&Platform=aws&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=micro&Upgrade=micro&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Image%20Registry&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20unknown%20ha%20micro&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-22%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-15%2000%3A00%3A00&testId=openshift-tests-upgrade%3A10a9e2be27aa9ae799fde61bf8c992f6&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers%20for%20ns%2Fopenshift-image-registry

Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.

The problem appears to be a permissions error preventing the pods from starting:

2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied

Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489

Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:

container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch

With slightly different versions in each stream, but both were on 3-2.231.

Hits other tests too:

operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

Description of problem:

Starting from version 4.16, the installer does not support creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled anymore.

Version-Release number of selected component (if applicable):

    

How reproducible:

The installation procedure fails systematically when using a predefined VPC

Steps to Reproduce:

    1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC (a sketch of the relevant stanza follows these steps)
    2. Run `openshift-install create cluster ...'
    3. The procedure fails: `failed to create load balancer`
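For illustration, the relevant part of such an install-config supplies the existing (public-only) subnets to the installer, roughly as follows (region and subnet IDs are placeholders):

platform:
  aws:
    region: us-east-1
    subnets:
    - subnet-0aaaaaaaaaaaaaaaa
    - subnet-0bbbbbbbbbbbbbbbb
publish: External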
    

Actual results:

The installation procedure fails.

Expected results:

An OCP cluster to be provisioned in AWS, with public subnets only.    

Additional info:

    

Description of problem:

The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.    

Version-Release number of selected component (if applicable):

4.15.z and later    

How reproducible:

    Always when AlertmanagerConfig is enabled

Steps to Reproduce:

    1. Enable UWM with AlertmanagerConfig
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
    2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see the attached configuration file and the sketch after these steps)
    3. Wait for a couple of minutes.
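For illustration, an Alertmanager configuration of the kind that triggers the failure carries the newer proxy fields in its global HTTP config, roughly as below (a hypothetical sketch, not the attached file):

global:
  http_config:
    proxy_from_environment: true
    # or explicit settings such as:
    # proxy_url: http://proxy.example.com:3128
    # no_proxy: .cluster.local,.svc
route:
  receiver: default
receivers:
- name: default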
    

Actual results:

Monitoring ClusterOperator goes Degraded=True.
    

Expected results:

No error
    

Additional info:

The Prometheus operator logs show that it doesn't understand the proxy_from_environment field.
The newer proxy fields are supported since Alertmanager v0.26.0 which is equivalent to OCP 4.15 and above. 
    

Description of problem:

When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and having the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment.
    

Version-Release number of selected component (if applicable):

WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
    

How reproducible:

Always
    

Steps to Reproduce:

    1.  Setup OSUS in a reacheable  network 
    2. Cut all internet connection except for the mirror registry and OSUS service
    3. Run oc-mirror in mirror to disk mode with graph:true in the imagesetconfig
    

Actual results:


    

Expected results:

Should not fail
    

Additional info:


    

Description of problem:

When using the UPDATE_URL_OVERRIDE environment variable, the output is confusing: 

./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1 

2024/06/19 12:22:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38  [INFO]   : ⚙️  setting up the environment for you...
2024/06/19 12:22:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
I0619 12:22:38.832303   66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38  [INFO]   : 🕵️  going to discover the necessary images...

 

Version-Release number of selected component (if applicable):

./oc-mirror.latest  version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202406131541.p0.g157eb08.assembly.stream.el9-157eb08", GitCommit:"157eb085db0ca66fb689220119ab47a6dd9e1233", GitTreeState:"clean", BuildDate:"2024-06-13T17:25:46Z", GoVersion:"go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1) Set up a registry on the OCP cluster;
2) Do mirror2disk + disk2mirror with the following isc:
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  additionalImages:
   - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.10'
      maxVersion: '4.15.11'
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
      packages:
       - name: elasticsearch-operator

 3) Set ~/.config/containers/registries.conf:
[[registry]]
  location = "quay.io"
  insecure = false
  blocked = false
  mirror-by-digest-only = false
  prefix = ""
  [[registry.mirror]]
    location = "my-route-testzy.apps.yinzhou-619.qe.devcluster.openshift.com"
    insecure = false

4) Use the isc from step 2 and run mirror2disk with a different directory:
`./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1`

Actual results: 

 

./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1 
2024/06/19 12:22:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38  [INFO]   : ⚙️  setting up the environment for you...
2024/06/19 12:22:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
I0619 12:22:38.832303   66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38  [INFO]   : 🕵️  going to discover the necessary images...
2024/06/19 12:22:38  [INFO]   : 🔍 collecting release images...
 

Expected results:

Give clear information to clarify the behavior of the UPDATE_URL_OVERRIDE environment variable.


The Slack discussion is here: https://redhat-internal.slack.com/archives/C050P27C71S/p1718800641718869?thread_ts=1718175617.310629&cid=C050P27C71S

Description of problem:

To summarize, when the following conditions are met, baremetal nodes cannot boot due to a hostname resolution failure.

  • HubCluster is IPv4/IPv6 Dual Stack
  • BMC of managed baremetal hosts are IPv6 single stack
  • A hostname is used instead of an IP address in "spec.bmc.address" of the BMH resource
  • The hostname is resolved only to IPv6 address, not IPv4

According to the following update, the provisioning service checks the BMC address scheme on the target and provides a matching URL for the installation media:

When we create a BMH resource, spec.bmc.address will be a URL of the BMC.
However, when we put a hostname instead of an IP address in spec.bmc.address, as in the following example,

 

<Example BMH definition>
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
  :
spec:
  bmc:
    address: redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1

we observe the following error.

$ oc logs -n openshift-machine-api metal3-baremetal-operator-6779dff98c-9djz7

{"level":"info","ts":1721660334.9622784,"logger":"provisioner.ironic","msg":"Failed to look up the IP address for BMC hostname","host":"myenv~mybmh","hostname":"redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"} 

Because of the name resolution failure, baremetal-operator cannot determine whether the BMC is IPv4 or IPv6.
Therefore, the IP scheme falls back to IPv4 and ISO images are exposed via an IPv4 address even if the BMC is IPv6 single stack.
In this case, the IPv6 BMC cannot access the ISO image over IPv4, we observe error messages like the following example, and the baremetal host cannot boot from the ISO.

<Error message on iDRAC>
Unable to locate the ISO or IMG image file or folder in the network share location because the file or folder path or the user credentials entered are incorrect

The issue is caused by the following implementation.
The line below passes `p.bmcAddress`, which is the whole URL, and that is why the name resolution fails.
We should pass `parsedURL.Hostname()` instead, which is the hostname part of the URL (see the sketch below).

https://github.com/metal3-io/baremetal-operator/blob/main/pkg/provisioner/ironic/ironic.go#L657

		ips, err := net.LookupIP(p.bmcAddress) 
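
A minimal sketch of the proposed change (illustrative only, not the actual patch; the helper name below is hypothetical):

package main

import (
	"fmt"
	"net"
	"net/url"
)

// lookupBMCIPs parses the BMC URL first and resolves only its host part,
// instead of passing the whole URL (scheme, port and path included) to
// net.LookupIP, which fails for a full URL.
func lookupBMCIPs(bmcAddress string) ([]net.IP, error) {
	parsed, err := url.Parse(bmcAddress)
	if err != nil {
		return nil, fmt.Errorf("failed to parse BMC address %q: %w", bmcAddress, err)
	}
	// Hostname() strips scheme, port and path, e.g.
	// "redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"
	// becomes "bmc.hostname.example.com".
	return net.LookupIP(parsed.Hostname())
}

func main() {
	ips, err := lookupBMCIPs("redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, ip := range ips {
		fmt.Println(ip, "is IPv4:", ip.To4() != nil)
	}
}

With the hostname resolved, the operator can check whether the returned addresses are IPv4 or IPv6 and expose the ISO on a matching address.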

 

Version-Release number of selected component (if applicable):
We observe this issue on OCP 4.14 and 4.15, but it likely occurs in the latest releases as well.

How reproducible:

  • HubCluster is IPv4/IPv6 Dual Stack
  • BMC of managed baremetal hosts are IPv6 single stack
  • A hostname is used instead of an IP address in "spec.bmc.address" of the BMH resource
  • The hostname is resolved only to IPv6 address, not IPv4

Steps to Reproduce:

  1. Create a HubCluster with IPv4/IPv6 Dual Stack
  2. Prepare a baremetal host and BMC with IPv6 single stack
  3. Prepare a DNS server with an AAAA record entry that resolves the BMC hostname to an IPv6 address
  4. Create a BMH resource and use the hostname in the URL of "spec.bmc.address"
  5. The BMC cannot boot due to the IPv4/IPv6 mismatch

Actual results:
Name resolution fails and the baremetal host cannot boot

Expected results:
Name resolution works and the baremetal host can boot

Additional info:

 

Description of problem:

When a normal user tries to create a namespace-scoped network policy, the project selected in the project selection dropdown is not taken into account.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-07-17-183402    

How reproducible:

Always    

Steps to Reproduce:

1. As a normal user with a project, view the NetworkPolicy page
/k8s/ns/yapei1-1/networkpolicies/~new/form
2. Click 'affected pods' in the Pod selector section, OR keep everything at its default value and click 'Create'

Actual results:

2. The user sees the following error when clicking 'affected pods'
Can't preview pods
r: pods is forbidden: User "yapei1" cannot list resource "pods" in API group "" at the cluster scope  

The user sees the following error when clicking the 'Create' button
An error occurrednetworkpolicies.networking.k8s.io is forbidden: User "yapei1" cannot create resource "networkpolicies" in API group "networking.k8s.io" at the cluster scope  

Expected results:

2. Switching to 'YAML view', we can see that the selected project name was not auto-populated in the YAML

Additional info:

    

Description of problem:

    ci/prow/security is failing on google.golang.org/grpc/metadata

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

always    

Steps to Reproduce:

    1. Run the ci/prow/security job on a 4.15 PR
    2.
    3.
    

Actual results:

    Medium severity vulnerability found in google.golang.org/grpc/metadata

Expected results:

    

Additional info:

 

Description of problem:

    The single-page docs are missing the "oc adm policy add-cluster-role-to-*" and "remove-cluster-role-from-*" commands. These options exist in these docs:

https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html

but not in these docs:

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user 

Description of problem:

Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-multi-2024-08-15-212448    

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", then optionally insert interested settings (see [1])
2. "create cluster", and make sure the cluster turns healthy finally (see [2])
3. check the cluster's addresses on GCP (see [3])
4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])

Actual results:

The global address "<infra id>-apiserver" is not deleted during "destroy cluster".

Expected results:

Everything belonging to the cluster should get deleted during "destroy cluster".

Additional info:

FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306    

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535

Description of problem:

INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision...
E0819 14:17:33.676051    2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
E0819 14:17:33.708233    2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
I0819 14:17:33.708279    2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"

As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.

Example:

compute:
- name: worker
  architecture: arm64
...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}

See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631
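
For illustration, a minimal sketch of the kind of validation that could reject mixed compute architectures (the type and function below are hypothetical simplifications, not the installer's actual code):

package main

import "fmt"

// MachinePool is a simplified stand-in for the installer's compute pool type.
type MachinePool struct {
	Name         string
	Architecture string
}

// validateComputeArchitectures returns an error when the compute pools do not
// all declare the same architecture, e.g. an arm64 'worker' pool combined with
// an amd64 'edge' pool.
func validateComputeArchitectures(compute []MachinePool) error {
	if len(compute) == 0 {
		return nil
	}
	first := compute[0]
	for _, pool := range compute[1:] {
		if pool.Architecture != first.Architecture {
			return fmt.Errorf("compute pool %q architecture %q does not match pool %q architecture %q",
				pool.Name, pool.Architecture, first.Name, first.Architecture)
		}
	}
	return nil
}

func main() {
	pools := []MachinePool{
		{Name: "worker", Architecture: "arm64"},
		{Name: "edge", Architecture: "amd64"},
	}
	if err := validateComputeArchitectures(pools); err != nil {
		fmt.Println("install-config validation error:", err)
	}
}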

Description of problem:

If a release does not contain the kubevirt coreos container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
    

Version-Release number of selected component (if applicable):

     [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

     Always
    

Steps to Reproduce:

    1. use imageSetConfig.yaml as shown below
    2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2
    3.
    

Actual results:

    fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2

2024/08/03 09:24:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/03 09:24:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/08/03 09:24:38  [INFO]   : ⚙️  setting up the environment for you...
2024/08/03 09:24:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/08/03 09:24:38  [INFO]   : 🕵️  going to discover the necessary images...
2024/08/03 09:24:38  [INFO]   : 🔍 collecting release images...
2024/08/03 09:24:44  [INFO]   : kubeVirtContainer set to true [ including :  ]
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty
2024/08/03 09:24:44  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty 

    

Expected results:

    If the kubeVirt coreos container does not exist in a release, oc-mirror should skip it and continue mirroring the other content rather than fail (see the sketch below).
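
A minimal sketch of the kind of guard that would achieve this (the function and variable names are hypothetical, not oc-mirror's actual code):

package main

import "log"

// appendKubeVirtImage skips the kubevirt coreos container with a warning when
// the release payload does not reference one, instead of returning an error
// that aborts the whole mirroring run.
func appendKubeVirtImage(ref string, images []string) []string {
	if ref == "" {
		log.Println("[WARN] kubeVirtContainer is set to true but the release has no kubevirt coreos container image, skipping")
		return images
	}
	return append(images, ref)
}

func main() {
	// e.g. a 4.12 release that does not ship the image
	images := appendKubeVirtImage("", []string{"quay.io/example/release@sha256:abc"})
	log.Printf("images to mirror: %v", images)
}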
    

Additional info:

    [fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
      - name: stable-4.12
        minVersion: 4.12.61
        maxVersion: 4.12.61
    kubeVirtContainer: true
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
      minVersion: "0.26.0"
    - name: nfd
      maxVersion: "4.15.0-202402210006"
    - name: cluster-logging
      minVersion: 5.8.3
      maxVersion: 5.8.4
    - name: quay-bridge-operator
      channels:
      - name: stable-3.9
        minVersion: 3.9.5
    - name: quay-operator
      channels:
      - name: stable-3.9
        maxVersion: "3.9.1"
    - name: odf-operator
      channels:
      - name: stable-4.14
        minVersion: "4.14.5-rhodf"
        maxVersion: "4.14.5-rhodf"
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308

    

Description of problem:

Specify a long cluster name in install-config, 
==============
metadata:
  name: jima05atest123456789test123

Create the cluster; the installer exited with the error below:
08-05 09:46:12.788  level=info msg=Network infrastructure is ready
08-05 09:46:12.788  level=debug msg=Creating storage account
08-05 09:46:13.042  level=debug msg=Collecting applied cluster api manifests...
08-05 09:46:13.042  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa
08-05 09:46:13.042  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.042  level=error msg=RESPONSE 400: 400 Bad Request
08-05 09:46:13.043  level=error msg=ERROR CODE: AccountNameInvalid
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error msg={
08-05 09:46:13.043  level=error msg=  "error": {
08-05 09:46:13.043  level=error msg=    "code": "AccountNameInvalid",
08-05 09:46:13.043  level=error msg=    "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only."
08-05 09:46:13.043  level=error msg=  }
08-05 09:46:13.043  level=error msg=}
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error
08-05 09:46:13.043  level=info msg=Shutting down local Cluster API controllers...
08-05 09:46:13.298  level=info msg=Stopped controller: Cluster API
08-05 09:46:13.298  level=info msg=Stopped controller: azure infrastructure provider
08-05 09:46:13.298  level=info msg=Stopped controller: azureaso infrastructure provider
08-05 09:46:13.298  level=info msg=Shutting down local Cluster API control plane...
08-05 09:46:15.177  level=info msg=Local Cluster API system has completed operations    

See the Azure doc [1] for the naming rules on storage account names: the name must be between 3 and 24 characters in length and may contain numbers and lowercase letters only.

The prefix of the storage account created by the installer seems to have changed to use the infraID with the CAPI-based installation; it was "cluster" when installing with Terraform.

Is it possible to change back to using "cluster" as the storage account prefix, to stay consistent with Terraform? Several storage accounts are created once cluster installation is completed: one is created by the installer starting with "cluster", and others are created by image-registry starting with "imageregistry". QE has CI profiles [2] and automated test cases that rely on the installer storage account and search for the "cluster" prefix, and customers may have similar scenarios.

[1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview
[2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
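
For illustration, a minimal sketch of how a storage account name could be derived within Azure's constraints (the prefix choice and the helper are hypothetical, not the installer's actual code):

package main

import (
	"fmt"
	"regexp"
	"strings"
)

var invalidChars = regexp.MustCompile(`[^a-z0-9]`)

// storageAccountName builds a name that satisfies Azure's rules: 3-24
// characters, lowercase letters and digits only. The "cluster" prefix mirrors
// the terraform-era naming; a real implementation would also need to preserve
// a unique suffix when truncating.
func storageAccountName(infraID string) string {
	name := "cluster" + invalidChars.ReplaceAllString(strings.ToLower(infraID), "")
	if len(name) > 24 {
		name = name[:24]
	}
	return name
}

func main() {
	fmt.Println(storageAccountName("jima05atest123456789t-sh586"))
}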

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Summary

Duplicate issue of https://issues.redhat.com/browse/OU-258

To pass the CI/CD requirements of openshift/console, each PR needs to have an issue in an OCP-owned Jira board. 

This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin. 

openshift/console PR#4187: Removes the Metrics Page. 

openshift/monitoring-plugin PR#138: Adds the Metrics Page & consolidates the code to use the same components as the Administrative > Observe > Metrics Page. 

Testing

Both openshift/console PR#4187 & openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both PRs you should see a page like the screenshot attached below.  

Excerpt from OU-258 : https://issues.redhat.com/browse/OU-258 :

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.

Outcomes
  • The dev console metrics page is loaded from monitoring-plugin, and the code that is not shared with other components in the console is removed from the console codebase.
  • The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.

 

Description of problem:

 azure-disk-csi-driver doesn't use registryOverrides 

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Set the registry override on the CPO
    2. Watch that azure-disk-csi-driver continues to use the default registry
    3.
    

Actual results:

    azure-disk-csi-driver uses default registry

Expected results:

    azure-disk-csi-driver uses the mirrored registry

Additional info:

    

Description of problem:

    After branching, main branch still publishes Konflux builds to mce-2.7

Version-Release number of selected component (if applicable):

    mce-2.7

How reproducible:

    100%

Steps to Reproduce:

    1. Post a PR to main
    2. Check the jobs that run
    

Actual results:

Both mce-2.7 and main Konflux builds get triggered    

Expected results:

Only main branch Konflux builds get triggered

Additional info:

    

Description of problem:

This is a follow-up of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing.

LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not, as LDAP is a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue. 

Version-Release number of selected component (if applicable):

4.15.11     

How reproducible:

    

Steps to Reproduce:

 (From the customer)   
    1. Configure LDAP IDP
    2. Configure Proxy
    3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
    

Actual results:

    LDAP IDP communication from the control plane oauth pod goes through proxy 

Expected results:

    LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol; it should not go through the proxy settings

Additional info:

For more information, see linked tickets.    
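
For illustration only, a minimal sketch of scheme-based proxy bypass (the dialer wiring is hypothetical and not the oauth server's actual code): ldap:// and ldaps:// endpoints are dialed directly, while http(s) traffic keeps using the configured proxy.

package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// dialIdentityProvider dials LDAP endpoints directly because LDAP is not an
// HTTP protocol and should not be routed through the cluster-wide HTTP(S) proxy.
func dialIdentityProvider(rawURL string) (net.Conn, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	switch u.Scheme {
	case "ldap", "ldaps":
		port := u.Port()
		if port == "" {
			port = "389"
			if u.Scheme == "ldaps" {
				port = "636"
			}
		}
		d := net.Dialer{Timeout: 10 * time.Second}
		return d.Dial("tcp", net.JoinHostPort(u.Hostname(), port))
	default:
		return nil, fmt.Errorf("non-LDAP scheme %q: use the HTTP client and its proxy settings instead", u.Scheme)
	}
}

func main() {
	conn, err := dialIdentityProvider("ldaps://ldap.example.com")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}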

Description of problem:

    ose-aws-efs-csi-driver-operator has an invalid "tools" reference that causes the build to fail.
This issue is due to https://github.com/openshift/csi-operator/pull/252/files#r1719471717

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem

Router pods use the "hostnetwork" SCC even when they do not use the host network.

Version-Release number of selected component (if applicable)

All versions of OpenShift from 4.11 through 4.17.

How reproducible

100%.

Steps to Reproduce

1. Install a new cluster with OpenShift 4.11 or later on a cloud platform.

Actual results

The router-default pods do not use the host network, yet they use the "hostnetwork" SCC:

% oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o go-template --template='{{range .items}}{{.metadata.name}} {{with .metadata.annotations}}{{index . "openshift.io/scc"}}{{end}} {{.spec.hostNetwork}}{{"\n"}}{{end}}'
router-default-5ffd4ff7cd-mhhv6 hostnetwork <no value>
router-default-5ffd4ff7cd-wmqnj hostnetwork <no value>
% 

Expected results

The router-default pods should use the "restricted" SCC.

Additional info

We missed this change from the OCP 4.11 release notes:

The restricted SCC is no longer available to users of new clusters, unless the access is explicitly granted. In clusters originally installed in OpenShift Container Platform 4.10 or earlier, all authenticated users can use the restricted SCC when upgrading to OpenShift Container Platform 4.11 and later.

Artifacts from CI jobs confirm that router pods used "restricted" for new 4.10 clusters and for 4.10→4.11 upgraded clusters, and "hostnetwork" for new 4.11 clusters:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1790552355406614528/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1790422949342220288/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1793013806733987840/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1793013781534609408/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1793670820518694912/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1793670819998601216/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1793062832263139328/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% 

Description of problem:

Remove the extra "." from the INFO message below when running the add-nodes workflow

INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z 

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Run the oc adm node-image create command to create a node ISO
    2. See the INFO message at the end
    3.
    

Actual results:

 INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z   

Expected results:

    INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z 

Additional info:

    

Description of problem:

[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types    

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-07-16-033047   

How reproducible:

 Always

Steps to Reproduce:

1. Use instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge" to install an OpenShift cluster on AWS

2. Check the csinode allocatable volumes count 
$ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
26

g4ad.xlarge # 25 
g4dn.xlarge # 25
vt1.3xlarge # 26                                                              

$ oc get no/ip-10-0-53-225.ec2.internal -oyaml| grep 'instance-type'
    beta.kubernetes.io/instance-type: vt1.3xlarge
    node.kubernetes.io/instance-type: vt1.3xlarge
3. Create a statefulset with PVCs (which use the EBS CSI storageclass) and node affinity to the same node, and set the replicas to the maximum allocatable volumes count, to verify that the csinode allocatable volumes count is correct and all the pods become Running 

# Test data
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-10-0-53-225.ec2.internal # Make all volume attach to the same node
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      #storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi

Actual results:

In step 3 some pods are stuck in "ContainerCreating" status because their volumes are stuck in attaching status and cannot be attached to the node    

Expected results:

 In step 3 all the pods with PVCs should become "Running", and in step 2 the csinode allocatable volumes count should be correct

-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24   

Additional info:

  ...
attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition
06-25 17:51:23.680      Warning  FailedAttachVolume      4m1s (x13 over 14m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching
06-25 17:51:23.681      Warning  FailedMount             3m40s (x3 over 10m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition
...  

Description of problem:

We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.
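
A minimal sketch of the kind of change being asked for, assuming an iptables existence check similar to the linked code (the command and names here are illustrative):

package main

import (
	"log"
	"os/exec"
)

// chainExists is an illustrative stand-in for the existence check in
// baremetal-runtimecfg. Instead of silently discarding the error, it logs it
// so that permission problems on the monitor container show up in the pod logs.
func chainExists(table, chain string) bool {
	out, err := exec.Command("iptables", "-t", table, "-S", chain).CombinedOutput()
	if err != nil {
		log.Printf("iptables existence check for chain %s in table %s failed: %v (output: %q)", chain, table, err, string(out))
		return false
	}
	return true
}

func main() {
	if !chainExists("nat", "HEALTHCHECK") {
		log.Println("healthcheck chain missing or check failed; see the message above")
	}
}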

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

Description of problem:

 

Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.

Example error:

lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:

 

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

50%

Steps to Reproduce:

    1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI

Actual results:

    Flakes

Expected results:

    Shouldn't flake

Additional info:

CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB

CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations

CI Search: FAIL: TestAll/parallel/TestAWSLBSubnets

Description of problem:

    Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting.

The test is supposed to wait for up to 20 minutes after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator trigger 2 further revisions after this happens.

We need to understand whether the etcd operator is correctly rolling out or whether these changes should have rolled out prior to the final machine going away, and whether there is a way to add more stability to our checks to make sure that all of the operators stabilise and have been stable for at least some period (1 minute).

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared.

This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. 


Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:


https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)


It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared.


This bug prevents users from successfully creating instances from templates in the WebConsole.

Version-Release number of selected component (if applicable):

4.15 4.14 

How reproducible:

YES

Steps to Reproduce:

1. Log in with a non-administrator account.
2. Select a template from the developer catalog and click on Instantiate Template.
3. Enter values into the initially empty form.
4. Wait for several seconds, and the entered values will disappear.

Actual results:

Entered values disappear

Expected results:

Entered values remain in the form

Additional info:

I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.

Description of problem:

    When HO is installed without a pull secret, the shared ingress controller fails to create the router pod because the pull secret is missing 

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1.Install HO without pullsecret
    2.Watch HO report error   "error":"failed to get pull secret &Secret{ObjectMeta:{
      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [][]},Data:map[string[]byte{},Type:,StringData:map[string]string{},Immutabl:nil,}: Secret \"pull-secret\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.
    3. Observe that no Router pod is created in the hypershift sharedingress namespace 
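
For illustration, a minimal sketch of how the reconciler could tolerate a missing pull secret instead of failing (the function and secret name below are assumptions, not HyperShift's actual code):

package sharedingress

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getOptionalPullSecret returns nil (without an error) when the operator was
// installed without a pull secret, so the router Deployment can still be
// reconciled, just without an imagePullSecret attached.
func getOptionalPullSecret(ctx context.Context, c client.Client, namespace string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	key := types.NamespacedName{Namespace: namespace, Name: "pull-secret"}
	if err := c.Get(ctx, key, secret); err != nil {
		if apierrors.IsNotFound(err) {
			return nil, nil // no pull secret configured; proceed without one
		}
		return nil, err
	}
	return secret, nil
}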

 

Actual results:

    The router pod doesn't get created in the hypershift sharedingress namespace

Expected results:

    The router pod gets created in the hypershift sharedingress namespace

Additional info:

    

Description of problem:

If the folder is undefined and the datacenter exists in a datacenter-based folder,
the installer will create the entire path of folders from the root of vCenter, which is incorrect.

This does not occur if folder is defined.

An upstream bug was identified when debugging this:

https://github.com/vmware/govmomi/issues/3523

Description of problem:

On the overview page's getting started resources card, there is an "OpenShift LightSpeed" link when this operator is available on the cluster; the text should be updated to "OpenShift Lightspeed" to be consistent with the operator name.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-08-013133
4.16.0-0.nightly-2024-08-08-111530
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check overview page's getting started resources card,  
    2.
    3.
    

Actual results:

1. There is "OpenShift LightSpeed" link  in "Explore new features and capabilities"
    

Expected results:

1. The text should be "OpenShift Lightspeed" to be consistent with the operator name.
    

Additional info:


    

Description of problem:

https://issues.redhat.com//browse/OCPBUGS-31919 partially fixed an issue with consuming the test image from a custom registry.
The fix consists of consuming the pull secret of the cluster under test in the test binary.
To complete it we have to do the same for custom CAs, trusting those trusted by the cluster under test.

Without that, if the test image is exposed by a registry whose TLS cert is signed by a custom CA, the same tests will fail, as in:

{  fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error:
    <*errors.errorString | 0xc0023105c0>: 
    unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:
    StdOut>
    error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
    StdErr>
    error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
    exit status 1
    
    {
        s: "unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:\nStdOut>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nStdErr>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nexit status 1\n",
    }
occurred
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):

    release-4.16, release-4.17 and master branches in origin.

How reproducible:

Always    

Steps to Reproduce:

    1. try to run the test suite against a cluster where the OCP release (and the test image) comes from a private registry with a cert signed by a custom CA
    2.
    3.
    

Actual results:

    3 failing tests:
: [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] expand_more
: [sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] expand_more
: [sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel] expand_more

Expected results:

    No failing tests

Additional info:

    OCPBUGS-31919 partially fixed it by having the test binary download the pull secret from the cluster under test. But in order for it to work, we also have to trust the custom CAs trusted by the cluster under test (see the sketch below).
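
For illustration, a minimal sketch of loading an extra CA bundle into an HTTP client used for registry calls (the bundle path is an assumption; real code might instead write the bundle somewhere the oc wrapper can pick it up):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// httpClientWithClusterCA appends the CA bundle trusted by the cluster under
// test to the system pool, so registries whose certificates are signed by a
// custom CA can be reached, in addition to using the cluster's pull secret.
func httpClientWithClusterCA(caBundlePath string) (*http.Client, error) {
	pool, err := x509.SystemCertPool()
	if err != nil {
		pool = x509.NewCertPool()
	}
	pem, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no certificates found in %s", caBundlePath)
	}
	return &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{RootCAs: pool}},
	}, nil
}

func main() {
	client, err := httpClientWithClusterCA("/tmp/cluster-ca-bundle.crt")
	if err != nil {
		fmt.Println(err)
		return
	}
	_ = client // use for registry requests
}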

Description of problem:

    After changing the LB type from CLB to NLB, "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" is still there, but if a new NLB ingresscontroller is created, the "classicLoadBalancer" field does not appear.

// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:                   <<<< 
  connectionIdleTimeout: 0s            <<<<
networkLoadBalancer: {}
type: NLB

// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB



Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-08-08-013133

How reproducible:

    100%

Steps to Reproduce:

    1. changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'

    2. create new ingresscontroller with NLB
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService

    3. check both ingresscontrollers status
    

Actual results:

// after changing default ingresscontroller to NLB 
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
 
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
 

Expected results:

    If type=NLB, then "classicLoadBalancer" should not appear in the status, and the status should stay consistent whether an existing ingresscontroller is changed to NLB or a new one is created with NLB. 

Additional info:

    

Description of problem:

When creating a tuned profile with the annotation tuned.openshift.io/deferred: "update" before labeling the target node, and then labeling the node with profile=, the value of kernel.shmmni is applied immediately, but the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] is shown. After rebooting the node, kernel.shmmni is restored to its default value instead of being set to the expected value.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create an OCP cluster with the latest 4.18 nightly version
    2. Create the tuned profile before labeling the node
       please refer to issue 1 if you want to reproduce the issue in the doc https://docs.google.com/document/d/1h-7AIyqf7sHa5Et2XF7a-RuuejwVkrjhiFFzqZnNfvg/edit
    

Actual results:

   It should show the message [TuneD profile applied], and the sysctl value should remain as expected after node reboot

Expected results:

    It shouldn't show the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] when executing oc get profile, and the sysctl value shouldn't revert after node reboot

Additional info:

    

Description of problem:

    See https://github.com/prometheus/prometheus/issues/14503 for more details
    

Version-Release number of selected component (if applicable):

    4.16
    

How reproducible:


    

Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:

# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF

2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps alert will fire.
Actual results:


    

Expected results: all the samples should be considered (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).


    

Additional info:

     Regression introduced in Prometheus 2.52.
    Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685 
    

Description of problem:

Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

1. apply CRD yaml file
2. check the NetworkAttachmentDefinition status

Actual results:

status with error 

Expected results:

NetworkAttachmentDefinition has been created 

 

 

Description of problem:

   Specifying additionalTrustBundle in the HC doesn't propagate down to the worker nodes

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Create a CM with additionalTrustBundle
    2. Specify the CM in HC.Spec.AdditionalTrustBundle
    3. Debug the worker nodes and check whether additionalTrustBundle has been updated
    

Actual results:

    additionalTrustBundle hasn't propagated down to the nodes

Expected results:

     additionalTrustBundle is propagated down to the nodes

Additional info:

    

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/313

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through install-config.yaml:

additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

However, the installation will fail with ambiguous error messages:

ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused

The actual error hides in the bootstrap VM's System Log:

Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17

SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)

SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)

SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)

ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac

Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)

Ignition: user-provided config was applied

[0;33mIgnition: warning at $.kernelArguments: Unused key kernelArguments[0m



[1;31mRelease image arch amd64 does not match host arch arm64[0m

ip-10-29-3-15 login: [   89.141099] Warning: Unmaintained driver is detected: nft_compat

    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

Use amd64 installer to install a cluster with aarch64 nodes
    

Steps to Reproduce:

    1. download amd64 installer
    2. generate the install-config.yaml
    3. edit install-config.yaml to use aarch64 nodes
    4. invoke the installer
    

Actual results:

installation timed out after ~30mins
    

Expected results:

The installation should fail immediately with a proper error message indicating that the installation is not possible
    

Additional info:

https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
    

Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:

  1. warn users when they are on a GCP cluster that supports GCP's Workload Identity Management and the operator they will be installing supports it
  2. Subscribing to an operator that supports it can be customized in the UI by adding fields to the subscription config field that need to be provided to the operator at install time.

CONSOLE-3776 added filtering for the GCP WIF case for the operator-hub tile view. Part of the change was also checking for the annotation which indicates that the operator supports GCP's WIF:

features.operators.openshift.io/token-auth-gcp: "true"

 

AC:

  • Add warning alert to the operator-hub-item-details component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • Add warning alert to the operator-hub-subscribe component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • If the cluster is in GCP WIF mode and the operator claims support for it, the subscription page provides configuration of additional fields, which will be set on the Subscription's spec.config.env field:
    • POOL_ID
    • PROVIDER_ID
    • SERVICE_ACCOUNT_EMAIL
  • Default subscription to manual for installs on WIF mode clusters for operators that support it.

 

Design docs

Description of problem:

The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.

Version-Release number of selected component (if applicable):
4.18

How reproducible:
Always

Steps to Reproduce:

  1. Open the network section

Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section

Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section

Additional info:

Component Readiness has found a potential regression in the following test:

[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.18
Start Time: 2024-08-14T00:00:00Z
End Time: 2024-08-21T23:59:59Z
Success Rate: 94.89%
Successes: 128
Failures: 7
Flakes: 2

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 647
Failures: 0
Flakes: 15

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=azure&Platform=azure&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=micro&Upgrade=micro&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Node%20%2F%20Kubelet&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20unknown%20ha%20micro&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-08-14%2000%3A00%3A00&testId=openshift-tests%3A9292c0072700a528a33e44338d37a514&testName=%5Bsig-node%5D%5Bapigroup%3Aconfig.openshift.io%5D%20CPU%20Partitioning%20node%20validation%20should%20have%20correct%20cpuset%20and%20cpushare%20set%20in%20crio%20containers%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

The test is permafailing on the latest payloads on multiple platforms, not just Azure. It seems to coincide with the arrival of the 4.18 RHCOS images.

{  fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-z5sh003f-431b2-r2nm4-master-0
Unexpected error:
    <*errors.errorString | 0xc001e80190>: 
    err execing command jq: error (at <stdin>:1): Cannot index array with string "info"
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    {
        s: "err execing command jq: error (at <stdin>:1): Cannot index array with string \"info\"\njq: error (at <stdin>:1): Cannot iterate over null (null)",
    }
occurred
Ginkgo exit error 1: exit with code 1}

The script involved is likely in: https://github.com/openshift/origin/blob/a365380cb3a39cfc26b9f28f04b66418c993a879/test/extended/cpu_partitioning/crio.go#L4

Nightly payloads are fully blocked as multiple blocking aggregated jobs are permafailing this test.

Compile errors when building an ironic image look like this:

2024-08-14 09:07:21 + python3 -m compileall --invalidation-mode=timestamp /usr
2024-08-14 09:07:21 Listing '/usr'...
2024-08-14 09:07:21 Listing '/usr/bin'...
...
Listing '/usr/share/zsh/site-functions'...
Listing '/usr/src'...
Listing '/usr/src/debug'...
Listing '/usr/src/kernels'...
Error: building at STEP "RUN prepare-image.sh && rm -f /bin/prepare-image.sh && /bin/prepare-ipxe.sh && rm -f /tmp/prepare-ipxe.sh": while running runtime: exit status 1

With the actual error lost in 3000+ lines of output, we should suppress the file listings.

Description of problem:

Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13.

Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs.

The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28

We have reproduced the issue and we found an ordering cycle error in the journal log

Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.

    

Version-Release number of selected component (if applicable):

    Using IPI on Azure, these are the versions involved in the current issue upgrading from 4.9 to 4.13:
    
      version: 4.13.0-0.nightly-2024-07-23-154444
      version: 4.12.0-0.nightly-2024-07-23-230744
      version: 4.11.59
      version: 4.10.67
      version: 4.9.59

    

How reproducible:

    Always
    

Steps to Reproduce:

    1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.

    

Actual results:


    Nodes become NotReady:
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e

On the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
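
A few hedged commands for inspecting the cycle on an affected node (unit names are taken from the journal excerpt above; output format may vary by RHCOS version):

# report ordering cycles involving the unit
systemd-analyze verify machine-config-daemon-pull.service
# show the ordering edges of the units in the cycle
systemctl show -p After,Wants,Requires machine-config-daemon-pull.service node-valid-hostname.service ovs-configuration.service
# confirm which job systemd deleted to break the cycle
journalctl -b | grep -i "ordering cycle"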
    

    

Expected results:

No ordering cycle error should happen and the upgrade should be executed without problems.
    

Additional info:


    

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/99

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Occasional machine-config daemon panics in tech-preview jobs. For example, this run has:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736

And the referenced logs include a full stack trace, the crux of which appears to be:

E0801 19:23:55.012345    2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
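
From the trace, the second argument to GetPoolsForNode is nil, which suggests the PinnedImageSetManager's node event handler ran before its MachineConfigPool lister was initialized. A purely illustrative guard, not the actual MCO fix (field names are hypothetical):

func (p *PinnedImageSetManager) handleNodeEvent(obj interface{}) {
    node, ok := obj.(*corev1.Node)
    if !ok || node == nil {
        return
    }
    // Hypothetical guard: skip events until the informer caches are ready,
    // so helpers.GetPoolsForNode is never called with a nil lister.
    if p.mcpLister == nil {
        return
    }
    // ... proceed with helpers.GetPoolsForNode(p.mcpLister, node) as before
}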

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact

How reproducible:

Looks like ~15% impact in the CI runs that CI Search turns up.

Steps to Reproduce:

Run lots of CI. Look for MCD panics.

Actual results:

CI Search results above.

Expected results:

No hits.

After looking at this test run we need to validate the following scenarios:

  1. Monitor test for nodes should fail when nodes go ready=false unexpectedly.
  2. Monitor test for nodes should fail when the unreachable taint is placed on them.
  3. Monitor test for node leases should create timeline entries when leases are not renewed “on time”.  This could also fail after N failed renewal cycles.

 

Do the monitor tests in openshift/origin accurately test these scenarios?
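
A hedged sketch of the three checks using client-go (this is not the origin monitortest code; thresholds and the unreachable-taint key follow upstream Kubernetes conventions, and error handling is elided):

// imports assumed: context, time, corev1 "k8s.io/api/core/v1",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", and a kubernetes.Interface named client
node, _ := client.CoreV1().Nodes().Get(ctx, name, metav1.GetOptions{})
for _, c := range node.Status.Conditions {
    if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
        // scenario 1: node went ready=false unexpectedly
    }
}
for _, t := range node.Spec.Taints {
    if t.Key == "node.kubernetes.io/unreachable" {
        // scenario 2: the unreachable taint was placed on the node
    }
}
lease, _ := client.CoordinationV1().Leases("kube-node-lease").Get(ctx, name, metav1.GetOptions{})
if lease.Spec.RenewTime != nil && lease.Spec.LeaseDurationSeconds != nil &&
    time.Since(lease.Spec.RenewTime.Time) > 2*time.Duration(*lease.Spec.LeaseDurationSeconds)*time.Second {
    // scenario 3: lease not renewed "on time" (here: more than two lease durations late)
}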

Description of problem:

We need to bump the Kubernetes version to the latest API version OCP is using.

This is what was done last time:

https://github.com/openshift/cluster-samples-operator/pull/409

Find latest stable version from here: https://github.com/kubernetes/api

This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
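
A hedged sketch of what the bump usually amounts to (the target version is illustrative; pick the k8s.io/api release that matches the Kubernetes level OCP is using):

go get k8s.io/api@v0.31.1 k8s.io/apimachinery@v0.31.1 k8s.io/client-go@v0.31.1
go mod tidy
go mod vendor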

    

Version-Release number of selected component (if applicable):


    

How reproducible:

Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
    

Description of problem:

The cluster-wide proxy URL is automatically injected into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.

Version-Release number of selected component (if applicable):

RHOCP 4.16.4

How reproducible:

100%

Steps to Reproduce:

1. Configure proxy custom resource in RHOCP 4.16.4 cluster
2. Create cluster-monitoring-config configmap in openshift-monitoring project
3. Inject remote-write config (without specifically configuring proxy for remote-write)
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. It now contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
==============
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
[...]
  name: k8s
  namespace: openshift-monitoring
spec:
[...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080     <<<<<====== Injected Automatically but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
    

Actual results:

The proxy URL from the proxy CR is injected into the Prometheus k8s CR automatically when configuring remoteWrite, but noProxy is not inherited from the cluster proxy resource.

Expected results:

The noProxy URL should get injected in Prometheus k8s CR as well.
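
For illustration, assuming the prometheus-operator version in use exposes the noProxy field of its remote-write proxy configuration, the expected result would look roughly like:

  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080
    noProxy: .cluster.local,.svc    <<<<<====== inherited from the cluster proxy (value illustrative)
    url: http://test-remotewrite.test.svc.cluster.local:9090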

Additional info:

 

Description of problem:

In hostedcluster installations, when the following OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, which follows the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~  

On the other hand, if a custom hostname parameter is configured, the oauth route is created in the management cluster with the following label:

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT             LABELS
oauth oauth.<custom-domain>  hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~

The configured label makes the ingresscontroller not admit the route, because the following configuration is added by the HyperShift operator to the default ingresscontroller resource:

~~~
$ oc get ingresscontroller -n openshift-ingress-operator default -oyaml
    routeSelector:
      matchExpressions:
      - key: hypershift.openshift.io/hosted-control-plane <---
        operator: DoesNotExist <---
~~~

This configuration should be allowed, as there are use cases where the route should have a customized hostname. Currently the HCP platform does not allow this configuration and the oauth route does not work.
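
As an illustration of the routeSelector mechanics only (not the requested fix), a dedicated IngressController shard that explicitly selects routes carrying the hosted-control-plane label would admit such a route; the name and domain below are placeholders:

~~~
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: hcp-oauth            # illustrative
  namespace: openshift-ingress-operator
spec:
  domain: <custom-domain>    # must cover oauth.<custom-domain>
  routeSelector:
    matchLabels:
      hypershift.openshift.io/hosted-control-plane: hcp-ns
~~~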

Version-Release number of selected component (if applicable):

   4.15

How reproducible:

    Easily

Steps to Reproduce:

    1. Install HCP cluster 
    2. Configure OAuthServer with type Route 
    3. Add a custom hostname different than default wildcard ingress URL from management cluster
    

Actual results:

    Oauth route is not admitted

Expected results:

    Oauth route should be admitted by Ingresscontroller

Additional info:

    

Description of the problem:

 

FYI - OCP 4.12 has reached end of maintenance support; it is now on extended support.

 

Looks like OCP 4.12 installations started failing lately due to hosts not discovering. For example - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_assisted-service/6628/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-12/1817416612257468416 

 

How reproducible:

 

Seems like every CI run; haven't tested locally.

 

Steps to reproduce:

 

Trigger OCP 4.12 installation in the CI

 

Actual results:

 

failure, hosts not discovering

 

Expected results:

 

Successful cluster installation

Description of problem:

    Creating a faulty configmap for UWM results in cluster_operator_up=0 with the reason InvalidConfiguration. With https://issues.redhat.com/browse/MON-3421 we're expecting the reason to match UserWorkload.*

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    100%

Steps to Reproduce:

apply the following CM to a cluster with UWM enabled:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    hah helo! :)     

Actual results:

    cluster_operator_up=0 with reason InvalidConfiguration

Expected results:

    cluster_operator_up=0 with reason matching pattern UserWorkload.*

Additional info:

https://issues.redhat.com/browse/MON-3421 streamlined reasons to allow separation between UWM and cluster monitoring. The above is a leftover that should be updated to match the same pattern.    

Description of problem:

Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15.
During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". 
During node provisioning while scaling a MachineSet, the Machine can successfully be created at the cloud provider, but the Node is never added to the cluster.
The CSRs remain pending and do not get auto-approved.

This issue is possibly related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

   CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually.
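
For reference, the manual approval mentioned above is typically done with something like:

$ oc get csr -o name | xargs oc adm certificate approve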

Expected results:

    CSR should get approved automatically and domain name scheme should not change.

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/559

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Our e2e jobs fail with:

pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError" 

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/55652/rehearse-55652-periodic-ci-openshift-csi-operator-release-4.19-periodic-e2e-aws-efs-csi/1824483696548253696

The jobs should succeed.
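
The fix is usually to set the policy explicitly on each container and init container in the operator/driver manifests, roughly (images and unrelated fields elided):

      containers:
      - name: csi-driver
        terminationMessagePolicy: FallbackToLogsOnError
      initContainers:
      - name: init-aws-credentials-file
        terminationMessagePolicy: FallbackToLogsOnError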

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:

    certrotation controller is using applySecret/applyConfigmap functions from library-go to update secret/configmap. This controller has several replicas running in parallel, so it may overwrite changes applied by a different replica, which leads to unexpected signer updates and corrupted CA bundles.

applySecret/applyConfigmap does an initial Get and then calls Update, which overwrites the changes made to the copy received from the informer.
Instead, it should issue .Update calls directly on a copy received from the informer, so that etcd rejects the change if it was made after the resourceVersion was updated in parallel.
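
A minimal sketch of that pattern, assuming a typed client-go clientset and a secret lister are already wired up (names are illustrative, not the actual library-go helpers; apierrors = k8s.io/apimachinery/pkg/api/errors):

existing, err := secretLister.Secrets(ns).Get(name)
if err != nil {
    return err
}
updated := existing.DeepCopy()
rotateSigner(updated) // hypothetical mutation of the informer copy
// Because the resourceVersion from the informer copy is kept, a concurrent
// writer causes a Conflict error instead of being silently overwritten.
_, err = client.CoreV1().Secrets(ns).Update(ctx, updated, metav1.UpdateOptions{})
if apierrors.IsConflict(err) {
    return err // requeue and retry against a fresher copy from the informer
}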

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:

As a (user persona), I want to be able to:

  • install Hypershift with the minimum set of required CAPI/CAPx CRDs

so that I can achieve

  • CRDs not utilized by Hypershift shouldn't be installed 

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • Today, the `hypershift install` command installs ALL CAPI providers' CRDs, which includes, for example, `ROSACluster` & `ROSAMachinePool`, which are not needed by HyperShift.
  • We need to review and remove any CRD that is not required.
     

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:
There are two enhancements we could make to cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 07:59:34.884908     131 logger.go:28] logging successfully to vcenter
I0806 07:59:36.078911     131 logger.go:28] ----------- Migration Summary ------------
I0806 07:59:36.078944     131 logger.go:28] Migrated 0 volumes
I0806 07:59:36.078960     131 logger.go:28] Failed to migrate 0 volumes
I0806 07:59:36.078968     131 logger.go:28] Volumes not found 0    

For comparison, see the source datastore checking, which does report the error:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 08:02:08.719657     138 logger.go:28] logging successfully to vcenter
E0806 08:02:08.749709     138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter

 

 

2. If the volume-file contains an invalid PV name which is not found, for example at the beginning of the file, the tool exits immediately and all the remaining PVs are skipped; it could instead continue checking the other PVs, as sketched below.
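
A hedged sketch of that behaviour (names are illustrative, not the actual cns-migration code): record the missing PVs and keep going instead of aborting.

var notFound []string
for _, pvName := range pvNames {
    pv, err := kubeClient.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        notFound = append(notFound, pvName) // report at the end in the migration summary
        continue                            // do not abort the whole run for one missing PV
    }
    if err != nil {
        return err
    }
    migrate(pv) // hypothetical per-volume migration step
}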

 

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Always

Steps to Reproduce:

    See Description     

Description of problem:

    IDMS is set on the HostedCluster and reflected in its respective CR in-cluster. Customers can create, update, and delete these today, but in-cluster IDMS changes have no impact.

Version-Release number of selected component (if applicable):

    4.14+

How reproducible:

    100%

Steps to Reproduce:

    1. Create HCP
    2. Create IDMS
    3. Observe it does nothing
    

Actual results:

    IDMS doesn't change anything if manipulated in the data plane.

Expected results:

    IDMS either allows updates OR IDMS updates are blocked.

Additional info:

    

The cluster-baremetal-operator sets up a number of watches for resources using Owns(), which have no effect because the Provisioning CR does not (and should not) own any resources of the given types, or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace differ from those of the Provisioning CR.

The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.

The correct way to trigger a reconcile of the Provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form, as sketched below.
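
A hedged sketch of that watch form with controller-runtime's builder (package aliases are illustrative, and the exact Watches signature depends on the controller-runtime version vendored by the operator):

err := ctrl.NewControllerManagedBy(mgr).
    For(&metal3iov1alpha1.Provisioning{}).
    Watches(&configv1.ClusterOperator{}, handler.EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret)).
    Watches(&configv1.Proxy{}, handler.EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret)).
    Watches(&machinev1beta1.Machine{}, handler.EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret)).
    Complete(r)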

See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.

 

Background

In order for customers to easily access the troubleshooting panel in the console, we need to add a button that can be accessed globally.

Outcomes

  • The troubleshooting panel can be triggered from the application launcher menu, present in the OpenShift console masthead

 

 

Component Readiness has found a potential regression in the following test:

[sig-storage] [Serial] Volume metrics Ephemeral should create volume metrics with the correct BlockMode PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]

Probability of significant regression: 100.00%

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=aws&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Storage&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-12%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-06%2000%3A00%3A00&testId=openshift-tests%3Acf25df8798e0307458a9892a76dd7a4a&testName=%5Bsig-storage%5D%20%5BSerial%5D%20Volume%20metrics%20Ephemeral%20should%20create%20volume%20metrics%20with%20the%20correct%20BlockMode%20PVC%20ref%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D%20%5BSuite%3Ak8s%5D

This feature conditionally creates a button within the VirtualizedTable component that allows clients to download the data within the table as comma-separated values (.csv). 

 

Both PRs are needed to test the feature.

The PRs are 

https://github.com/openshift/console/pull/14050

and 

https://github.com/openshift/monitoring-plugin/pull/133

 

The monitoring-plugin passes a string called 'csvData', which contains metrics data formatted as comma-separated values. The console then consumes 'csvData' in the 'VirtualizedTable' component. 'VirtualizedTable' renders the 'Export as CSV' button only if this 'csvData' property is present; without the property, the 'Export as CSV' button will not render.
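
A hedged TypeScript sketch of the conditional rendering (the csvData prop name comes from the description above; everything else, including the download helper, is illustrative):

import * as React from 'react';

type VirtualizedTableProps = {
  csvData?: string; // CSV payload passed by the monitoring-plugin
  // ...existing props elided
};

const downloadCSV = (csvData: string) => {
  const url = URL.createObjectURL(new Blob([csvData], { type: 'text/csv' }));
  const a = document.createElement('a');
  a.href = url;
  a.download = 'export.csv';
  a.click();
  URL.revokeObjectURL(url);
};

// Inside VirtualizedTable's render: the button only appears when csvData is set.
export const ExportCSVButton: React.FC<{ csvData?: string }> = ({ csvData }) =>
  csvData ? <button onClick={() => downloadCSV(csvData)}>Export as CSV</button> : null;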

 

The console's CI/CD pipeline > tide requires that issues have a valid Jira reference, presumably in this (OpenShift Console) board. This ticket is a duplicate of

https://issues.redhat.com/browse/OU-431

 

 

User Story:

As a user of HyperShift, I want to be able to:

  • pull the api server network proxy image from the release image

so that I can achieve

  • the HCCO will be using the image from the release image the HC is using.

Acceptance Criteria:

Description of criteria:

  • The api server network proxy image is no longer hardcoded.
  • All required tests are passing on the PR.

Out of Scope:

N/A

Engineering Details:

% oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.33-multi -a ~/all-the-pull-secrets.json --pullspecs | grep apiserver
  apiserver-network-proxy 

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

Create a cluster with publish: Mixed using CAPZ:
1. publish: Mixed + apiserver: Internal
install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External

In this case, the api DNS record should not be created in the public DNS zone, but it was created.
==================
$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}

2. publish: Mixed + ingress: Internal
install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal

In this case, a load balancer rule on port 6443 should be created in the external load balancer, but it could not be found.
================
$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]

Version-Release number of selected component (if applicable):

    4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Specify publish: Mixed + mixed External/Internal for api/ingress 
    2. Create cluster
    3. Check that public DNS records and load balancer rules in the internal/external load balancers are created as expected
    

Actual results:

    See description; some resources are unexpectedly created or missing.

Expected results:

    Public DNS records and load balancer rules in the internal/external load balancers should be created as expected based on the settings in install-config.

Additional info:

    

User Story:

As a (user persona), I want to be able to:

  • Run a capg install and select a pd-balanced disk type
  • Run a capg install and select a hyperdisk-balanced disk type

so that I can achieve

  • Installations with capg with no regressions introduced.
  • Support N4 and Metal Machine Types in GCP

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.
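
For reference, a hedged install-config sketch of where the disk type is selected (field layout follows the GCP machine-pool schema; hyperdisk-balanced is the value this story aims to support, not necessarily accepted today):

controlPlane:
  platform:
    gcp:
      osDisk:
        diskType: hyperdisk-balanced
compute:
- platform:
    gcp:
      osDisk:
        diskType: pd-balanced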

Description of problem:

It's not possible to either
- create a RWOP PVC,
- create a RWOP clone, or
- restore to a RWOP PVC
using the 4.16.0/4.17.0 UI with ODF StorageClasses.

Please see the attached screenshot.
The RWOP access mode should be added to all the relevant screens in the UI.

Version-Release number of selected component (if applicable):

    OCP 4.16.0 & 4.17.0
    ODF (OpenShift Data Foundation) 4.16.0 & 4.17.0

How reproducible:

    

Steps to Reproduce:

    1. Open UI, go to OperatorHub
    2. Install ODF, once installed refresh for ConsolePlugin to get populated
    3. Go to operand "StorageSystem" and create the CR using the custom UI (you can just keep on clicking "Next" with the default selected options, it will work well on AWS cluster)
    5. Wait for "ocs-storagecluster-cephfs" and "ocs-storagecluster-ceph-rbd" StorageClasses to get created by ODF operator
    6. Go to PVC creation page, try to create new PVC (using StorageClasses mentioned in step 5)
    7. Try to create clone
    8. Try to restore PVC to RWOP pvc from existing snapshot 
    

Actual results:

It's not possible to create a RWOP PVC, create a RWOP clone, or restore to a RWOP PVC from a snapshot using the 4.16.0 & 4.17.0 UI.

Expected result: 

 

It should be possible to create a RWOP PVC, to create a RWOP clone, and to restore a snapshot to a RWOP PVC.

 

Additional info:

https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L111-L119 >> these need to be updated
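
A hypothetical sketch only (the actual structure of shared.ts may differ); the idea is to include ReadWriteOncePod wherever the access-mode options are defined, e.g.:

export const initialAccessModes = [
  'ReadWriteOnce',
  'ReadWriteMany',
  'ReadOnlyMany',
  'ReadWriteOncePod', // RWOP
];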