Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the number of hosted clusters and nodepools.
But we are still missing other metrics needed to draw correct inferences from what we see in the data.
Today we expose hypershift_hostedcluster_nodepools as a metric that reports the number of nodepools used per cluster.
Additional NodePool metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.
In addition to knowing how many nodepools exist per hosted cluster, we would like to expose the nodepool size.
This will help inform our decision making and provide some insight into how the product is being adopted and used.
The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules:
The implementation involves creating updates to the following GitHub repositories:
similar PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
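For illustration only, the metrics above could be shipped to Telemeter as recording rules roughly shaped like the sketch below; the rule names and aggregation labels are assumptions, not what the PRs above actually merged.

# Sketch of Prometheus recording rules for Telemeter ingestion (names/labels are illustrative).
groups:
- name: hypershift.nodepools.rules
  rules:
  - record: cluster:hypershift_nodepools_size:sum
    expr: sum by (name, exported_namespace) (hypershift_nodepools_size)
  - record: cluster:hypershift_nodepools_available_replicas:sum
    expr: sum by (name, exported_namespace) (hypershift_nodepools_available_replicas)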
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
Customers can override the default value (3) and set it to a custom value.
Make sure we document (or link) the VMware recommendations in terms of performance.
https://kb.vmware.com/s/article/1025279
The setting should be easily configurable by the OCP admin, and the configuration should be applied automatically. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.
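A minimal sketch of what the day-2 configuration could look like, assuming the knob is exposed through the vSphere ClusterCSIDriver's driverConfig (the exact field name is an assumption):

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  driverConfig:
    driverType: vSphere
    vSphere:
      # Assumed field; when unset, the default of 3 snapshots per volume still applies.
      globalMaxSnapshotsPerBlockVolume: 10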
No change in the default
As an OCP admin I would like to change the maximum number of snapshots per volume.
Anything outside of
The default value can't be overwritten, reconciliation prevents it.
Make sure the customers understand the impact of increasing the number of snapshots per volume.
https://kb.vmware.com/s/article/1025279
Document how to change the value as well as a link to the best practice. Mention that there is a 32 hard limit. Document other limitations if any.
N/A
Epic Goal*
The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.
Possible future candidates:
Why is this important? (mandatory)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
https://kb.vmware.com/s/article/1025279
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1759)
2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)
3) Update vSphere operator to use the new snapshot options (STOR-1804)
4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Configure the maximum number of snapshot to a higher value. Check the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
Drawbacks or Risk (optional)
Setting this option to a high value can introduce performance issues. This needs to be documented.
https://kb.vmware.com/s/article/1025279
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
In order to remove IPI/UPI support for Alibaba Cloud in OpenShift (currently Tech Preview, see also OCPSTRAT-1042), we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and later platform=external) to bring up their OpenShift clusters.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | Multi-node |
Connected / Restricted Network | Connected for OCP 4.16 (Future: restricted) |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 |
Operator compatibility | This should be the same for any operator on platform=none |
Backport needed (list applicable versions) | OpenShift 4.16 onwards |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Hybrid Cloud Console changes needed |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
For OpenShift 4.16, we want to remove IPI support (currently Tech Preview) for Alibaba Cloud (OCPSTRAT-1042). Instead, we want to offer the Assisted Installer (Tech Preview) with the agnostic platform for Alibaba Cloud in OpenShift 4.16 (OCPSTRAT-1149).
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Previous UPI-based installation doc: Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.
<!--
Please make sure to fill all story details here with enough information so
that it can be properly sized and is immediately actionable. Our Definition
of Ready for user stories is detailed in the link below:
https://docs.google.com/document/d/1Ps9hWl6ymuLOAhX_-usLmZIP4pQ8PWO15tMksh0Lb_A/
As much as possible, make sure this story represents a small chunk of work
that could be delivered within a sprint. If not, consider the possibility
of splitting it or turning it into an epic with smaller related stories.
Before submitting it, please make sure to remove all comments like this one.
-->
USER STORY:
<!--
One sentence describing this story from an end-user perspective.
-->
As a [type of user], I want [an action] so that [a benefit/a value].
DESCRIPTION:
<!--
Provide as many details as possible, so that any team member can pick it up
and start to work on it immediately without having to reach out to you.
-->
Required:
...
Nice to have:
...
ACCEPTANCE CRITERIA:
<!--
Describe the goals that need to be achieved so that this story can be
considered complete. Note this will also help QE to write their acceptance
tests.
-->
ENGINEERING DETAILS:
<!--
Any additional information that might be useful for engineers: related
repositories or pull requests, related email threads, GitHub issues or
other online discussions, how to set up any required accounts and/or
environments if applicable, and so on.
-->
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
installing into Shared VPC stuck in waiting for network infrastructure ready
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-10-225505
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert Shared VPC settings (see [1]) 2. activate the service account which has the minimum permissions in the host project (see [2]) 3. "create cluster" FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project.
Actual results:
1. Getting stuck in waiting for network infrastructure to become ready, until Ctrl+C is pressed.
2. Two firewall-rules are created in the service project unexpectedly (see [3]).
Expected results:
The installation should succeed, and no firewall-rules should be created in either the service project or the host project.
Additional info:
Description of problem:
Shared VPC installation using a service account having all required permissions failed due to the ingress cluster operator being degraded, with the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-07-221959
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then insert the interested settings (see [1]) 2. "create cluster" (see [2])
Actual results:
Installation failed because the ingress cluster operator is degraded (see [2] and [3]).

$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
$

In fact the mentioned k8s firewall-rule doesn't exist in the host project (see [4]), and the given service account does have enough permissions (see [6]).
Expected results:
Installation succeeds, and all cluster operators are healthy.
Additional info:
Placeholder epic to capture all Azure tickets.
TODO: review.
As an end user of a hypershift cluster, I want to be able to:
so that I can achieve
From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739
We need 4 different certs:
Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split of the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
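As a sketch only (the card describes a spec.resourceTags entry, while in current OpenShift the GCP labels/tags surface under the infrastructure status, so the exact location and values here are assumptions), the shape could look like:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: GCP
    gcp:
      resourceLabels:            # applied as GCP labels on cluster-managed resources
      - key: environment
        value: production
      resourceTags:              # applied as GCP tags; parentID is the owning org or project ID
      - parentID: "1234567890"
        key: cost-center
        value: engineering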
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
The TechPreview featureSet check added in the installer for userLabels and userTags should be removed, and the TechPreview reference made in the install-config GCP schema should be removed.
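For reference, once the TechPreview gate is dropped, the userLabels/userTags fields in question are set in the install-config roughly as in this fragment (project, region, and values are placeholders):

platform:
  gcp:
    projectID: example-project   # placeholder
    region: us-central1
    userLabels:
    - key: environment
      value: production
    userTags:
    - parentID: "1234567890"     # org or project that owns the tag key
      key: cost-center
      value: engineering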
Acceptance Criteria
The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.
The new featureGate added in openshift/api should also be removed.
Acceptance Criteria
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
In the install-config file, there is no zone/instance type setting under controlPlane or defaultMachinePlatform:

featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

After "create cluster", master instances should be created in multiple zones, since the default instance type 'Standard_D8s_v3' has availability zones. Actually, master instances are not created in any zone.

$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. CAPI-based install on the Azure platform with default configuration
Actual results:
master instances are created but not in any zone.
Expected results:
Master instances should be created per zone based on the selected instance type, keeping the same behavior as the Terraform-based install.
Additional info:
When setting zones under controlPlane in install-config, master instances can be created per zone.

install-config:
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]

$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3
Description of problem:
CAPZ creates an empty route table during installs
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Very
Steps to Reproduce:
1. Install an IPI cluster using CAPZ
Actual results:
Empty route table created and attached to worker subnet
Expected results:
No route table created
Additional info:
Description of problem:
Launch CAPI based installation on Azure Government Cloud, installer was timeout when waiting for network infrastructure to become ready. 06-26 09:08:41.153 level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready... ... 06-26 09:09:33.455 level=debug msg=E0625 21:09:31.992170 22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=< 06-26 09:09:33.455 level=debug msg= failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= RESPONSE 404: 404 Not Found 06-26 09:09:33.456 level=debug msg= ERROR CODE: SubscriptionNotFound 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= { 06-26 09:09:33.456 level=debug msg= "error": { 06-26 09:09:33.456 level=debug msg= "code": "SubscriptionNotFound", 06-26 09:09:33.456 level=debug msg= "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found." 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= . Object will not be requeued 06-26 09:09:33.456 level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl" 06-26 09:09:33.457 level=debug msg=I0625 21:09:31.992215 22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"SubscriptionNotFound\",\n \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n }\n}\n--------------------------------------------------------------------------------\n. 
Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError" 06-26 09:17:40.081 level=debug msg=I0625 21:17:36.066522 22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84" 06-26 09:23:46.611 level=debug msg=Collecting applied cluster api manifests... 06-26 09:23:46.611 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline 06-26 09:23:46.611 level=info msg=Shutting down local Cluster API control plane... 06-26 09:23:46.612 level=info msg=Stopped controller: Cluster API 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azure exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azure infrastructure provider 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azureaso infrastructure provider 06-26 09:23:46.612 level=info msg=Local Cluster API system has completed operations 06-26 09:23:46.612 [[1;31mERROR[0;39m] Installation failed with error code '4'. Aborting execution. From above log, Azure Resource Management API endpoint is not correct, endpoint "management.azure.com" is for Azure Public cloud, the expected one for Azure Government should be "management.usgovcloudapi.net".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on Azure Government Cloud using CAPI-based installation
Actual results:
Installation failed because the wrong Azure Resource Management API endpoint was used.
Expected results:
Installation succeeded.
Additional info:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Requirement description:
As a VM Admin, I want to improve overall density. In our traditional VM environments, we find that we are memory bound much more than CPU bound. Even with properly sized VMs, we see a lot of memory just sitting around allocated to the VM but not actually used. Moreover, we always see people requesting VMs that are sized far too big for their workloads. It is better customer service to allow this to some degree and then recover the memory at the hypervisor level.
MVP:
Documents:
Prometheus query for UI:
sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) *100
In human words: this approximates how much overcommitment of memory is taking place. A value of 100 means RAM+SWAP usage is 100% of system RAM capacity; 105 means RAM+SWAP usage is 105% of system RAM capacity.
Threshold: Yellow 95%, Red 105%
Based on: https://docs.google.com/document/d/1AbR1LACNMRU2QMqFpe-Se2mCEFLMqW_M9OPKh2v3yYw,
https://docs.google.com/document/d/1E1joajwxQChQiDVTsr9Qk_iIhpQkSI-VQP-o_BMx8Aw
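To make the thresholds concrete, the query above could be wrapped in a PrometheusRule roughly like the following sketch; the rule and alert names are placeholders, and only the red (105%) threshold is shown.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-overcommit          # placeholder name
  namespace: openshift-monitoring
spec:
  groups:
  - name: memory-overcommit.rules
    rules:
    - alert: MemoryOvercommitHigh  # placeholder name; the "red" threshold from above
      expr: |
        sum by (instance) (((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes))
          / node_memory_MemTotal_bytes) * 100 > 105
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: RAM+SWAP usage on {{ $labels.instance }} exceeds 105% of system RAM capacity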
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | They must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI to boot. Certified shapes: https://catalog.redhat.com/cloud/detail/249287 |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that is disabled by the assisted installer. The goal is to enable it for OCP versions >= 4.15 when using the OCI external platform.
iSCSI boot is enabled for OCP versions >= 4.15, both in the UI and in the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kargs during install to enable iSCSI booting.
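When the assisted installer is driven through its Kubernetes API, the kargs could be appended roughly as in this sketch (all names are placeholders; the SaaS/REST flow differs, and exact field support should be verified):

apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: oci-baremetal              # placeholder
  namespace: example-ns            # placeholder
spec:
  clusterRef:
    name: example-cluster          # placeholder
    namespace: example-ns
  pullSecretRef:
    name: pull-secret
  kernelArguments:
  - operation: append
    value: rd.iscsi.firmware=1
  - operation: append
    value: ip=ibft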
yes
PR https://github.com/openshift/assisted-service/pull/6257 must be adapted to be used with the external platform.
Since we ensure that the iSCSI network is not the default route, the PR above will automatically select the subnet used by the default route.
The secondary VNIC must be configured manually in OCI; a script must be injected into the discovery ISO to configure it.
Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.
As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.
Network Policy has its issues:
With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.
Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
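One possible shape of such a per-namespace primary network, sketched against the emerging user-defined network API (the group, kind, and fields below are assumptions and may differ from the final API):

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-net                 # placeholder
  namespace: tenant-a              # placeholder
spec:
  topology: Layer2
  layer2:
    role: Primary                  # this network becomes the primary network for pods in the namespace
    subnets:
    - 10.100.0.0/16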
Test scenarios:
crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
Note: making crun the default does not mean we will remove support for runc, nor do we have any plans to do so in the foreseeable future.
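Today's opt-in (and, once crun becomes the default, the way to opt a pool back to runc by setting defaultRuntime: runc) is a ContainerRuntimeConfig; a sketch:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun-worker         # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: crun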
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Check with ACS team; see if there are external repercussions.
crio-wipe is an existing feature in OpenShift. When a node reboots, crio-wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations, the node might not have access to an image registry and therefore takes a long time to come up.
The goal of this feature is to adjust crio-wipe to remove only the images that have been corrupted by the sudden reboot, not all images.
Phase 2 of the enclave support for oc-mirror with the following goals
For 4.17 timeframe
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Prerequisite work: goals completed in OCPSTRAT-1122.
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phase 1 & 2 covers implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.
Epic Goal*
Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.18 cannot be released without Kubernetes 1.31
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the update problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new `oc adm upgrade status` command, backed by an API. Please find the mock output of the command attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
During an upgrade, once the control plane is successfully updated, status items related to that part of the upgrade cease to be relevant, and therefore we can either hide them entirely or show a simplified version of them. The relevant sections are Control plane and Control plane nodes.
We utilize MCO annotations to determine whether a node is degraded or unavailable, and we solely source the Reason annotation to put into the insight. Many common cases are not covered by this, especially the unavailable ones: nodes can be cordoned, have a condition like DiskPressure, be in the process of termination, etc. It is not yet clear whether our code or something like the MCO should provide this, but it is captured as a card for now.
Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.
Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.
We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.
As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.
As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
TBD
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.
Right now, our pods are SingleReplica because, to have multiple replicas, we need more than one zone for nodes, which translates into availability zones (AZs) in OpenStack. We need to figure that out.
HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.
We deprecated "DeploymentConfig" in-favor of "Deployment" in OCP 4.14
Now in 4.18 we want to make "Deployment " as default out of box that means customer will get Deployment when they install OCP 4.18 .
Deployment Config will still be available in 4.18 as non default for user who still want to use it .
FYI "DeploymentConfig" is tier 1 API in Openshift and cannot be removed from 4.x product
Please Review this FAQ : https://docs.google.com/document/d/1OnIrGReZKpc5kzdTgqJvZYWYha4orrGMVjfP1fUpljY/edit#heading=h.oranye5nwtsy
Epic Goal*
WRKLDS-695 was implemented to make DC enablement controlled through a capability in 4.14. In order to prepare customers for the migration to Deployments, the capability was enabled by default. After three releases we need to reconsider whether disabling the capability by default is feasible.
More about capabilities in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capability-sets.
Why is this important? (mandatory)
Disabling a capability by default makes an OCP installation lighter. Fewer components running by default reduces the security risk/vulnerability surface.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
None. The DC capability can be enabled if needed.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Before DCs can be disabled by default, all the relevant e2e tests relying on DCs need to be migrated to Deployments to maintain the same testing coverage.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Allow customers to enable EFS CSI usage metrics.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.
The EFS metrics are not enabled by default for a good reason: they can potentially impact performance. They are disabled in OCP because the CSI driver would walk through the whole volume, which can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers need to explicitly opt in.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Clear procedure on how to enable it as a day 2 operation. Default remains no metrics. Once enabled the metrics should be available for visualisation.
We should also have a way to disable metrics.
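A sketch of what the opt-in could look like on the EFS ClusterCSIDriver; the driverConfig field names below are assumptions based on the behaviour described above:

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  driverConfig:
    driverType: AWS
    aws:
      efsVolumeMetrics:
        state: RecursiveWalk       # assumed opt-in value; a "Disabled" state would restore the default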
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | AWS only |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all AWS/EFS supported |
Operator compatibility | EFS CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Should appear in OCP UI automatically |
Other (please specify) | OCP on AWS only |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user, I want to be able to visualise the EFS CSI metrics.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Additional metrics
Enabling metrics by default.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Customer request as per
https://issues.redhat.com/browse/RFE-3290
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
We need to be extra clear on the potential performance impact
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document how to enable CSI metrics + warning about the potential performance impact.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
It can benefit any cluster on AWS using EFS CSI including ROSA
Epic Goal*
The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could lead to performance issues (the CSI driver would walk through the whole volume), this option will not be enabled by default; the admin will need to explicitly opt in.
Why is this important? (mandatory)
Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Enable CSI metrics via the operator - ensure the driver is started with the proper cmdline options. Verify that the metrics are sent and exposed to the users.
Drawbacks or Risk (optional)
Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
When using OpenShift in a mixed, multi-architecture environment, some key details or checks are not always available. With this feature we will take a first pass at improving the UI/UX for customers as adoption of this configuration continues at pace.
The UI/UX experience should be improved when OpenShift is used in a mixed-architecture OCP cluster.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Y |
Classic (standalone cluster) | Y |
Hosted control planes | Y |
Multi node, Compact (three node), or Single node (SNO), or all | Y |
Connected / Restricted Network | Y |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All architectures |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OpenShift Console |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and saving engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
As a dev I want the base code to be easier to read, maintain and test
If devs don't have a healthy dev environment, the project will stagnate and the business won't make $$.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Nicole Thoen has already started crafting a "technical debt impeding PF6 migrations" document, which contains a list of identified tech-debt items, deprecated components, etc.
Replace DropdownDeprecated
Replace SelectDeprecated
Acceptance Criteria
Note:
DropdownDeprecated and KebabToggleDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
resource-dropdown.tsx (checkbox, options have tooltips, grouped options, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)
filter-toolbar.tsx (grouped, checkbox select)
monitoring/dashboards/index.tsx (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead) covered by https://issues.redhat.com/browse/ODC-7655
silence-form.tsx (Currently using DropdownDeprecated, should be using a Select)
timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
Note
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of the deprecated components with PF components, based on their semantics (either Dropdown or Select components).
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
Epic Goal
Get Pod Security admission to run in "restricted" mode globally by default alongside SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the least-privileged SCC it requires, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
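As a minimal illustration of the pinning mechanism, a workload is pinned by annotating its pod template with the openshift.io/required-scc annotation. The Go sketch below is illustrative only (it is not taken from any of the PRs tracked in the tables that follow):

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
)

// pinRequiredSCC sets the openshift.io/required-scc annotation on a Deployment's
// pod template so SCC admission always applies the named SCC to its pods.
func pinRequiredSCC(deploy *appsv1.Deployment, scc string) {
    if deploy.Spec.Template.Annotations == nil {
        deploy.Spec.Template.Annotations = map[string]string{}
    }
    deploy.Spec.Template.Annotations["openshift.io/required-scc"] = scc
}

func main() {
    d := &appsv1.Deployment{}
    // Non-runlevel-0 platform workloads pin the least-privileged SCC they need,
    // e.g. restricted-v2; runlevel 0 namespaces pin "privileged" for tracking.
    pinRequiredSCC(d, "restricted-v2")
    fmt.Println(d.Spec.Template.Annotations)
}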
The following tables track progress.
# namespaces | 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 |
fix needed | 69 | 69 | 69 | 69 |
fixed | 34 | 30 | 30 | 39 |
remaining | 35 | 39 | 39 | 30 |
~ remaining non-runlevel | 15 | 19 | 19 | 10 |
~ remaining runlevel (low-prio) | 20 | 20 | 20 | 20 |
~ untested | 2 | 2 | 2 | 82 |
# | namespace | 4.18 | 4.17 | 4.16 | 4.15 |
---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |
2 | openshift-apiserver-operator | #573 | #581 | ||
3 | openshift-authentication | #656 | #675 | ||
4 | openshift-authentication-operator | #656 | #675 | ||
5 | openshift-catalogd | #50 | #58 | ||
6 | openshift-cloud-credential-operator | #681 | #736 | ||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |
8 | openshift-cluster-csi-drivers | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||
10 | openshift-cluster-olm-operator | #54 | n/a | ||
11 | openshift-cluster-samples-operator | #535 | #548 | ||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |
13 | openshift-cluster-version | #1038 | #1068 | ||
14 | openshift-config-operator | #410 | #420 | ||
15 | openshift-console | #871 | #908 | #924 | |
16 | openshift-console-operator | #871 | #908 | #924 | |
17 | openshift-controller-manager | #336 | #361 | ||
18 | openshift-controller-manager-operator | #336 | #361 | ||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 |
20 | openshift-image-registry | #1008 | #1067 | ||
21 | openshift-infra | ||||
22 | openshift-ingress | #1031 | |||
23 | openshift-ingress-canary | #1031 | |||
24 | openshift-ingress-operator | #1031 | |||
25 | openshift-insights | #915 | #967 | ||
26 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 |
27 | openshift-kube-storage-version-migrator | #107 | #112 | ||
28 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||
29 | openshift-machine-api | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
30 | openshift-machine-config-operator | #4219 | #4384 | #4393 | |
31 | openshift-manila-csi-driver | #234 | #235 | #236 | |
32 | openshift-marketplace | #578 | #561 | #570 | |
33 | openshift-metallb-system | #238 | #240 | #241 | |
34 | openshift-monitoring | #2335 | #2420 | ||
35 | openshift-network-console | ||||
36 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |
37 | openshift-network-node-identity | #2282 | #2490 | #2496 | |
38 | openshift-nutanix-infra | #4504 | #4504 | #4539 | #4540 |
39 | openshift-oauth-apiserver | #656 | #675 | ||
40 | openshift-openstack-infra | #4504 | #4504 | #4539 | #4540 |
41 | openshift-operator-controller | #100 | #120 | ||
42 | openshift-operator-lifecycle-manager | #703 | #828 | ||
43 | openshift-route-controller-manager | #336 | #361 | ||
44 | openshift-service-ca | #235 | #243 | ||
45 | openshift-service-ca-operator | #235 | #243 | ||
46 | openshift-sriov-network-operator | #754 #995 | #999 | #1003 | |
47 | openshift-storage | ||||
48 | openshift-user-workload-monitoring | #2335 | #2420 | ||
49 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 |
50 | (runlevel) kube-system | ||||
51 | (runlevel) openshift-cloud-controller-manager | ||||
52 | (runlevel) openshift-cloud-controller-manager-operator | ||||
53 | (runlevel) openshift-cluster-api | ||||
54 | (runlevel) openshift-cluster-machine-approver | ||||
55 | (runlevel) openshift-dns | ||||
56 | (runlevel) openshift-dns-operator | ||||
57 | (runlevel) openshift-etcd | ||||
58 | (runlevel) openshift-etcd-operator | ||||
59 | (runlevel) openshift-kube-apiserver | ||||
60 | (runlevel) openshift-kube-apiserver-operator | ||||
61 | (runlevel) openshift-kube-controller-manager | ||||
62 | (runlevel) openshift-kube-controller-manager-operator | ||||
63 | (runlevel) openshift-kube-proxy | ||||
64 | (runlevel) openshift-kube-scheduler | ||||
65 | (runlevel) openshift-kube-scheduler-operator | ||||
66 | (runlevel) openshift-multus | ||||
67 | (runlevel) openshift-network-operator | ||||
68 | (runlevel) openshift-ovn-kubernetes | ||||
69 | (runlevel) openshift-sdn |
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of MAPI: the API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is left open.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.
To enable the quick start, we should create the infrastructure cluster and, where applicable, populate its required fields.
This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.
Implement Migration core for MAPI to CAPI for AWS
When customers switch over to using CAPI, there must be no negative effect: migration of Machine resources should be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
When the Machine and MachineSet MAPI resource are non-authoritative, the Machine and MachineSet controllers should observe this condition and should exit, pausing the reconciliation.
When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
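A minimal sketch of that pause behaviour, using assumed field, condition, and reason names rather than the actual machine.openshift.io API types:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// machineSetStatus is a stripped-down stand-in for the real MAPI status; the
// actual controllers operate on machine.openshift.io/v1beta1 objects.
type machineSetStatus struct {
    AuthoritativeAPI string
    Conditions       []metav1.Condition
}

// pauseIfNotAuthoritative acknowledges the pause by setting a Paused=True
// condition and reports whether the reconciler should exit early.
func pauseIfNotAuthoritative(status *machineSetStatus) bool {
    if status.AuthoritativeAPI == "MachineAPI" {
        return false // MAPI is authoritative: reconcile as usual
    }
    meta.SetStatusCondition(&status.Conditions, metav1.Condition{
        Type:    "Paused",
        Status:  metav1.ConditionTrue,
        Reason:  "AuthoritativeAPINotMachineAPI",
        Message: "reconciliation is paused while the resource is not MAPI-authoritative",
    })
    return true
}

func main() {
    s := &machineSetStatus{AuthoritativeAPI: "ClusterAPI"}
    fmt.Println("paused:", pauseIfNotAuthoritative(s))
}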
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
Feature Overview
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
This section contains all the test cases that we need to make sure work as part of the done^3 criteria.
This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.
For this task, we need to create a new periodic job that will test the multi-vCenter feature.
Add authentication to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
As a user, when creating node ISOs, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Goal Summary
This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This feature is to track automation in ODC, related packages, upgrades, and some tech debt
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | No |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
This won't impact documentation; this feature mostly enhances end-to-end tests and job runs on CI
Questions to be addressed:
Improve onboarding experience for using Shipwright Builds in OpenShift Console
Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright
Requirements | Notes | IS MVP |
Enable creating Shipwright Builds using a form | | Yes |
Allow use of Shipwright Builds for image builds during import flows | | Yes |
Enable access to build strategies through navigation | | Yes |
TBD
TBD
Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.
TBD
TBD
TBD
TBD
TBD
Creating Shipwright Builds through YAML is complex and requires Shipwright expertise, which makes it difficult for novice users to use Shipwright
Provide a form for creating Shipwright Builds
To simplify adoption of Shipwright and ease onboarding
Create build
As a user, I want to create a Shipwright build using the form,
To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.
Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons), it is an attractive place to make substantial improvements using economies of scale.
There are currently three main obvious thrusts to how to go about this:
This BU priority focuses on the third of these approaches.
Reference Documents:
Both the Node Tuning Operator and TuneD assume the Intel x86 architecture is used when a Performance Profile is applied. For example, they both configure Intel x86 specific kernel parameters (e.g. intel_pstate).
In order to support Telco RAN DU deployments on the ARM architecture, we will need a way to apply a performance profile to configure the server for low latency applications. This will include tuning common to both Intel/ARM and tuning specific to one of the architectures.
The purpose of this Epic:
This story will serve to collect minor upstream enhancements to NTO that do not directly belong to an objective story in the greater epic
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
Convert the Cluster Configuration single-page form into a multi-step wizard. The goal is to avoid overwhelming the user with all the information on a single page and to provide guidance through the configuration process.
Wireframes:
Phase1:
https://marvelapp.com/prototype/fjj6g57/screen/76442394
Future:
https://marvelapp.com/prototype/78g662d/screen/71444815
https://marvelapp.com/prototype/7ce7ib3/screen/73190117
Phase 1 wireframes: https://marvelapp.com/prototype/fjj6g57/screen/76442399
This requires UX investigation to handle the case when the base DNS is not set yet and the clusters list has several clusters with the same name.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Description
This epic covers the changes needed to the ARO RP for
ACCEPTANCE CRITERIA:
What is "done", and how do we measure it? You might need to duplicate this a few times.
NON GOALS:
Only fill this out for Product Management / customer-driven work. Otherwise, delete it.
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
NOTES:
Need to determine if (in 4.14 azure workload identity functionality) we need to create secrets/secret manifests for each operator manually as part of the ARO cluster install, or if we can leverage credentialsrequests to do this automatically somehow. How will necessary secrets be created?
DESCRIPTION:
ACCEPTANCE CRITERIA:
NON GOALS:
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
During 4.15, the OCP team is working on allowing booting from iscsi. Today that's disabled by the assisted installer. The goal is to enable that for ocp version >= 4.15.
iscsi boot is enabled for ocp version >= 4.15 both in the UI and the backend.
When booting from iscsi, we need to make sure to add the `rd.iscsi.firmware=1` kargs during install to enable iSCSI booting.
yes
In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:
This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes the usage of the default interface impractical for the iSCSI traffic because we lose the root volume, and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071
In the scope of this issue we need to:
CMO creates a default Alertmanager configuration on cluster bootstrap. The configuration should have the following snippet when a cluster proxy is configured:
global:
  http_config:
    proxy_from_environment: true
The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestreams images as single vs manifestlisted. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single arch clusters deployed with a single arch payload, but when users migrate to use the multi payload, more often than not, their intent is to add nodes of other architecture types. When this happens - it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality of existing users who just want to create an imagestream and use it with existing commands.
There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:
Some open questions:
This change enables the setting of import mode through the image config API, which is then synced to the apiserver's observed config, which in turn enables the apiserver to set the import mode based on this value. The import mode in the observed config is also populated by default based on the payload type
poc: https://github.com/Prashanth684/api/commit/c660fba709b71a884d0fc96dd007581a25d2d17a
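A hedged sketch of the defaulting rule described above; the function and its handling of an explicit setting are illustrative assumptions, only the Legacy/PreserveOriginal values come from the text:

package main

import "fmt"

// defaultImportMode sketches how the observed config could default the
// imagestream import mode from the payload type, while still honouring an
// explicit value set through the image config API.
func defaultImportMode(payloadIsMulti bool, configured string) string {
    if configured != "" {
        return configured // explicit admin setting wins
    }
    if payloadIsMulti {
        return "PreserveOriginal" // multi payload: keep manifest lists intact
    }
    return "Legacy" // single-arch payload: keep today's single-manifest behaviour
}

func main() {
    fmt.Println(defaultImportMode(true, ""))  // PreserveOriginal
    fmt.Println(defaultImportMode(false, "")) // Legacy
}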
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
OCPCLOUD-2514 prevented feature gates from being used with the CCMs.
We have been asked not to remove the feature gates themselves until 4.18.
PR to track: https://github.com/openshift/api/pull/1780
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
In order to provide customers the option to process alert data externally, we need to provide a way for the data to be downloaded from the OpenShift console. The monitoring plugin uses a Virtualized table from the dynamic plugin SDK. We should include the change in this table so it is available for others.
---
NOTE:
There is a duplicate issue in the OpenShift console board: https://issues.redhat.com//browse/CONSOLE-4185
This is because the console > CI/CD > prow configurations require that any PR in the openshift/console repo needs to have an associated Jira issue in the openshift console Jira board.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate. However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
$ oc get featuregates.config.openshift.io cluster -oyaml
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>
Actual results:
Both RouteExternalCertificate and ExternalRouteCertificate were added in the API
Expected results:
We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
Additional info:
Git commits https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3 https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930 Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219
Description of problem:
On pages under "Observe"->"Alerting", it shows "Not found" when no resources found
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-11-082305
How reproducible:
Steps to Reproduce:
1. Check the tabs under "Observe"->"Alerting" when there are no related resources, e.g. "Alerts", "Silences", "Alerting rules". 2. 3.
Actual results:
1. 'Not found' is shown under each tab.
Expected results:
1. It's better to show "No <resource> found" like other resources pages. eg: "No Deployments found"
Additional info:
Description of problem:
openshift-install create cluster leads to error: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'. Vsphere standard port group
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. openshift-install create cluster
2. Choose vSphere
3. Fill in the blanks
4. Have a standard port group
Actual results:
error
Expected results:
cluster creation
Additional info:
https://github.com/openshift/origin/pull/28945 are permafailing on metal
https://github.com/openshift/api/pull/1988 maybe needs to be reverted?
Please review the following PR: https://github.com/openshift/ironic-image/pull/539
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
e980 is a valid system type for the madrid region but it is not listed as such in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to mad02 with SysType set to e980 2. Fail 3.
Actual results:
Installer exits
Expected results:
Installer should continue as it's a valid system type.
Additional info:
Description of problem:
periodics are failing due to a change in coreos.
Version-Release number of selected component (if applicable):
4.15,4.16,4.17,4.18
How reproducible:
100%
Steps to Reproduce:
1. Check any periodic conformance jobs 2. 3.
Actual results:
periodic conformance fails with hostedcluster creation
Expected results:
periodic conformance test succeeds
Additional info:
Description of problem:
Navigation: Storage -> StorageClasses -> Create StorageClass -> Provisioner -> kubernetes.io/gce-pd Issue: "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into web console and set language to non en_US
2. Navigate to Storage -> StorageClasses -> Create StorageClass -> Provisioner
3. Select Provisioner "kubernetes.io/gce-pd"
4. "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English
Actual results:
Content is in English
Expected results:
Content should be in set language.
Additional info:
Screenshot reference attached
Modify the import to strip or change the bootOptions.efiSecureBootEnabled
https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319
// Requires the os, io, crypto/sha256 and fmt packages plus govmomi's importx package.
archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

// Read the OVF descriptor from the OVA archive.
ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
    // Open the corrupt OVA file so its checksum and size can be reported.
    f, ferr := os.Open(cachedImage)
    if ferr != nil {
        return ferr
    }
    defer f.Close()

    // Get a sha256 of the corrupt OVA file and the size of the file.
    h := sha256.New()
    written, cerr := io.Copy(h, f)
    if cerr != nil {
        return cerr
    }
    return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

// Parse the OVF envelope so bootOptions.efiSecureBootEnabled can be inspected or stripped.
ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
    return fmt.Errorf("failed to parse ovf descriptor: %w", err)
}
Description of problem:
When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly *after* 2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: the controller pods are not restarted
Expected results: the controller pods are restarted
The cluster-dns-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-dns-operator repository also vendors k8s.io/* v0.29.2 packages. However, OpenShift 4.17 is based on Kubernetes 1.30.
4.17.
Always.
Check https://github.com/openshift/cluster-dns-operator/blob/release-4.17/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/* packages are at v0.29.2.
The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and the k8s.io/* packages are at v0.30.0 or newer.
The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
Description of problem:
In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open /settings/cluster using Firefox with Dark mode selected 2. 3.
Actual results:
The version numbers under Update status are black
Expected results:
The version numbers under Update status are white
Additional info:
Description of problem:
See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700
On 1.8.2024, the assisted-installer-agent job started failing the subsystem test "add_multiple_servers". We need to make sure it occurs only in tests, and the fix should be backported.
Description of problem:
When an image is referenced by tag and digest, oc-mirror skips the image
Version-Release number of selected component (if applicable):
How reproducible:
Do mirror to disk and disk to mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 and the operator multiarch-tuning-operator
Steps to Reproduce:
1 mirror to disk 2 disk to mirror
Actual results:
docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format
Expected results:
The image should be mirrored
Additional info:
The AWS EFS CSI Operator primarily passes credentials to the CSI driver using environment variables. However, this practice is discouraged by the OCP Hardening Guide.
Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the kube update exposed or produced some kind of bug.
This is a clear regression and it is only present on 4.17, not 4.16. It is present across all platforms, though I've selected AWS for links and screenshots.
slack thread if there are questions
courtesy screen shot
CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819
4.18 Micro upgrade failures began with the initial payload 4.18.0-0.ci-2024-08-09-234503
CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346
The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518
In OCPBUGS-38414, a new featuregate was turned on that didn't work correctly on metal (or at least its tests didn't). Metal should have techpreview jobs to ensure new features are tested properly. I think the right matrix is:
On standard CI jobs, we incorporate this by wiring in the appropriate FEATURE_SET variable, but metal jobs don't currently have a way to do this as far as I can tell.
These should be release informers.
https://github.com/openshift/release/blob/5ce4d77a6317479f909af30d66bc0285ffd38dbd/ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh#L63-L68 is the relevant step
Description of problem:
If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail in bootstrap due to API validation. > 4.17 will support multiple NICs, < 4.17 will not and will fail.
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
While working on the readiness probes we have discovered that the single member health check always allocates a new client. Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check. This should reduce CEO's and etcd CPU consumption.
Version-Release number of selected component (if applicable):
any supported version
How reproducible:
always, but technical detail
Steps to Reproduce:
na
Actual results:
CEO creates a new etcd client when it is checking a single member health
Expected results:
CEO should use the existing pooled client to check for single member health
Additional info:
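A rough sketch of the proposed approach using an etcd clientv3 pooled client; the helper and the probe key are illustrative, not CEO's actual code:

package health

import (
    "context"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// checkSingleMemberHealth reuses the already-pooled client instead of building a
// new one per probe: it briefly points the client at the single member, runs a
// short read, and then restores the original endpoint list.
func checkSingleMemberHealth(ctx context.Context, pooled *clientv3.Client, memberURL string) error {
    original := pooled.Endpoints()
    pooled.SetEndpoints(memberURL)
    defer pooled.SetEndpoints(original...)

    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    // Any cheap request over the existing connection pool works as the probe.
    _, err := pooled.Get(ctx, "health")
    return err
}

Note that switching endpoints on a shared client has to be serialized with other users of that client, which is presumably the "brief period of time" caveat in the description.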
Description of problem:
Redfish exception occurred while provisioning a worker using HW RAID configuration on HP server with ILO 5:
step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000
spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
      online: true
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Provision an HPE worker with ILO 5 using redfish 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented. On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power. Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful. [1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371
Version-Release number of selected component (if applicable):
Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO node using ACM and fakefish as redfish interface 2. Check metal3-ironic pod logs
Actual results:
We can see a soft power_off command sent to the ironic agent running on the ramdisk: 2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197 2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234
Expected results:
There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.
Additional info:
Refactor name to Dockerfile.ocp as a better, version independent alternative
The story is to track i18n upload/download routine tasks which are perform every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
I talked with Gerd Oberlechner; hack/app-sre/saas_template.yaml is not used anymore in app-interface.
It should be safe to remove this.
Description of problem:
Unable to deploy performance profile on multi nodepool hypershift cluster
Version-Release number of selected component (if applicable):
Server Version: 4.17.0-0.nightly-2024-07-28-191830 (management cluster) Server Version: 4.17.0-0.nightly-2024-08-08-013133 (hosted cluster)
How reproducible:
Always
Steps to Reproduce:
1. In a multi nodepool hypershift cluster, attach performance profile unique to each nodepool. 2. Check the configmap and nodepool status.
Actual results:
root@helix52:~# oc get cm -n clusters-foobar2 | grep foo
kubeletconfig-performance-foobar2   1   21h
kubeletconfig-pp2-foobar3           1   21h
machineconfig-performance-foobar2   1   21h
machineconfig-pp2-foobar3           1   21h
nto-mc-foobar2                      1   21h
nto-mc-foobar3                      1   21h
performance-foobar2                 1   21h
pp2-foobar3                         1   21h
status-performance-foobar2          1   21h
status-pp2-foobar3                  1   21h
tuned-performance-foobar2           1   21h
tuned-pp2-foobar3                   1   21h
root@helix52:~# oc get np
NAME      CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION                         UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
foobar2   foobar2   2               2               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
foobar3   foobar2   1               1               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
Hypershift Pod logs - {"level":"debug","ts":"2024-08-14T08:54:27Z","logger":"events","msg":"there cannot be more than one PerformanceProfile ConfigMap status per NodePool. found: 2 NodePool: foobar3","type":"Warning","object":{"kind":"NodePool","namespace":"clusters","name":"foobar3","uid":"c2ba814a-31fe-409d-88c2-b4e6b9a41b26","apiVersion":"hypershift.openshift.io/v1beta1","resourceVersion":"6411003"},"reason":"ReconcileError"}
Expected results:
Performance profile should apply correctly on both node pools
Additional info:
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
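For illustration, a small sketch of the distinction the control loop should make (this is not the actual keepalived monitor code):

package main

import (
    "errors"
    "fmt"
    "os"
)

// sentinelExists reports whether the sentinel file exists; any Stat error other
// than "not exist" (e.g. permission denied) is surfaced instead of being treated
// as "the file is already there".
func sentinelExists(path string) (bool, error) {
    _, err := os.Stat(path)
    if err == nil {
        return true, nil
    }
    if errors.Is(err, os.ErrNotExist) {
        return false, nil
    }
    return false, err
}

func main() {
    exists, err := sentinelExists("/var/run/keepalived/iptables-rule-exists")
    fmt.Println(exists, err)
}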
Component Readiness has found a potential regression in the following test:
operator conditions control-plane-machine-set
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0
The version page in our docs is out of date and needs to be updated with the current versioning standards we expect.
The minimum version of the OCP management cluster/k8s needs to be added.
Description of problem:
IHAC is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.
ENV DETAILS: Nutanix Versions: AOS: 6.5.4 NCC: 4.6.6.3 PC: pc.2023.4.0.2 LCM: 3.0.0.1
During the installation process, after the bootstrap nodes and control planes are created, the IP addresses on the nodes shown in the Nutanix Dashboard conflict, even when Infinite DHCP leases are set. The installation will work successfully only when using the Nutanix IPAM. Also, 4.14 and 4.15 releases install successfully. IPs of master0 and master2 are conflicting, please check the attachment.
Sos-report of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing
The issue was reported via the slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699
Version-Release number of selected component (if applicable):
How reproducible:
Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installation will fail.
Expected results:
The installation succeeds to create a Nutanix OCP cluster with the DHCP network.
Additional info:
Refactor name to Dockerfile.ocp as a better, version independent alternative
Description of problem:
We should add validation in the Installer when public-only subnets is enabled to make sure that:
1. We print a warning if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set.
2. If this flag is only applicable for public clusters, we could consider exiting earlier if publish: Internal.
3. If this flag is only applicable for byo-vpc configurations, we could consider exiting earlier if no subnets are provided in install-config.
Version-Release number of selected component (if applicable):
all versions that support public-only subnets
How reproducible:
always
Steps to Reproduce:
1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY 2. Do a cluster install without specifying a VPC. 3.
Actual results:
No warning about the invalid configuration.
Expected results:
Additional info:
This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.
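A hedged sketch of the kind of validation being requested; the function signature and messages are assumptions, not the installer's actual validation code:

package main

import (
    "errors"
    "fmt"
    "os"
)

// validatePublicOnly mirrors the three checks listed above; it is illustrative
// only and does not reflect the installer's real install-config types.
func validatePublicOnly(publish string, byoSubnets []string) error {
    if os.Getenv("OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY") == "" {
        return nil
    }
    fmt.Println("WARNING: OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set; this is an unsupported, internal-only mode")
    if publish == "Internal" {
        return errors.New("public-only subnets is only applicable to publish: External clusters")
    }
    if len(byoSubnets) == 0 {
        return errors.New("public-only subnets requires an existing VPC: provide subnets in the install-config")
    }
    return nil
}

func main() {
    fmt.Println(validatePublicOnly("External", nil))
}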
Description of problem:
When adding nodes, the agent-register-cluster.service and start-cluster-installation.service service statuses should not be checked; agent-import-cluster.service and agent-add-node.service should be checked in their place.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
console message shows start installation service and agent register service has not started
Expected results:
console message shows agent import cluster and add host services has started
Additional info:
arm64 has been dev preview in CNV since 4.14. The installer shouldn't block installing it.
Just make sure it is shown in the UI as dev preview.
Update our CPO and HO dockerfiles to use appropriate base image versions.
Description of problem:
Re-enable the knative and A-04-TC01 tests that were disabled in the PR https://github.com/openshift/console/pull/13931
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
gophercloud is outdated, we need to update it to get the latest dependencies and avoid CVEs.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/143
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.
cc Ali Mobrem
Component Readiness has found a potential regression in the following test:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry
Probability of significant regression: 98.02%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0
Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.
The problem appears to be a permissions error preventing the pods from starting:
2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489
Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:
container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch
With slightly different versions in each stream, but both were on 3-2.231.
Hits other tests too:
operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
Description of problem:
Starting from version 4.16, the installer does not support creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled anymore.
Version-Release number of selected component (if applicable):
How reproducible:
The installation procedure fails systematically when using a predefined VPC
Steps to Reproduce:
1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC
2. Run `openshift-install create cluster ...`
3. The procedure fails: `failed to create load balancer`
Actual results:
The installation procedure fails.
Expected results:
An OCP cluster to be provisioned in AWS, with public subnets only.
Additional info:
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig:
   enableUserWorkload: true
   alertmanagerMain:
     enableUserAlertmanagerConfig: true
2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file)
3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field. The newer proxy fields are supported since Alertmanager v0.26.0 which is equivalent to OCP 4.15 and above.
Description of problem:
When running oc-mirror in mirror to disk mode in an air-gapped environment with `graph: true`, and having the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror is still reaching out to api.openshift.com to get the graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Set up OSUS in a reachable network
2. Cut all internet connections except for the mirror registry and the OSUS service
3. Run oc-mirror in mirror to disk mode with graph: true in the imagesetconfig
Actual results:
Expected results:
Should not fail
Additional info:
Description of problem:
When using the UPDATE_URL_OVERRIDE env var, the information is confusing:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
Version-Release number of selected component (if applicable):
./oc-mirror.latest version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202406131541.p0.g157eb08.assembly.stream.el9-157eb08", GitCommit:"157eb085db0ca66fb689220119ab47a6dd9e1233", GitTreeState:"clean", BuildDate:"2024-06-13T17:25:46Z", GoVersion:"go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Set registry on the ocp cluster;
2) do mirror2disk + disk2mirror with following isc:
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  additionalImages:
  - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.10'
      maxVersion: '4.15.11'
    graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: elasticsearch-operator
3) set ~/.config/containers/registries.conf
[[registry]]
location = "quay.io"
insecure = false
blocked = false
mirror-by-digest-only = false
prefix = ""
[[registry.mirror]]
location = "my-route-testzy.apps.yinzhou-619.qe.devcluster.openshift.com"
insecure = false
4) use the isc from step 2 and mirror2disk with different dir: `./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1`
Actual results:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
2024/06/19 12:22:38 [INFO] : 🔍 collecting release images...
Expected results:
Give clear information to clarify the UPDATE_URL_OVERRIDE environment variable. Slack discussion is here: https://redhat-internal.slack.com/archives/C050P27C71S/p1718800641718869?thread_ts=1718175617.310629&cid=C050P27C71S
Description of problem:
To summarize, when we meet the following three conditions, baremetal nodes cannot boot due to a hostname resolution failure.
According to the following update, the provisioning service checks the BMC address scheme on the target and provides a matching URL for the installation media:
When we create a BMH resource, spec.bmc.address will be an URL of the BMC.
However, when we put a hostname instead of an IP address in the spec.bmc.address like the following example,
<Example BMH definition>
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
:
spec:
bmc:
address: redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1
we observe the following error.
$ oc logs -n openshift-machine-api metal3-baremetal-operator-6779dff98c-9djz7 {"level":"info","ts":1721660334.9622784,"logger":"provisioner.ironic","msg":"Failed to look up the IP address for BMC hostname","host":"myenv~mybmh","hostname":"redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"}
Because of name resolution failure, baremetal-operator cannot determine if the BMC is IPv4 or IPv6.
Therefore, the IP scheme is fall-back to IPv4 and ISO images are exposed via IPv4 address even if the BMC is IPv6 single stack.
In this case, the IPv6 BMC cannot access to the ISO image on IPv4, we observe error messages like the following example, and the baremetal host cannot boot from the ISO.
<Error message on iDRAC> Unable to locate the ISO or IMG image file or folder in the network share location because the file or folder path or the user credentials entered are incorrect
The issue is caused by the following implementation.
The following line passes `p.bmcAddress`, which is the whole URL; that's why the name resolution fails.
I think we should pass `parsedURL.Hostname()` instead, which is the hostname part of the URL.
https://github.com/metal3-io/baremetal-operator/blob/main/pkg/provisioner/ironic/ironic.go#L657
ips, err := net.LookupIP(p.bmcAddress)
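A minimal sketch of the suggested change, resolving only the host portion of the BMC URL (illustrative only; the real fix belongs in the baremetal-operator ironic provisioner):

package main

import (
    "fmt"
    "net"
    "net/url"
)

// lookupBMCHost resolves the host part of the BMC address rather than the whole URL.
func lookupBMCHost(bmcAddress string) ([]net.IP, error) {
    parsedURL, err := url.Parse(bmcAddress)
    if err != nil {
        return nil, err
    }
    // Hostname() strips the scheme, port and path, e.g.
    // "redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1" -> "bmc.hostname.example.com".
    return net.LookupIP(parsedURL.Hostname())
}

func main() {
    ips, err := lookupBMCHost("redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1")
    fmt.Println(ips, err)
}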
Version-Release number of selected component (if applicable):
We observe this issue on OCP 4.14 and 4.15. But I think this issue occurs even in the latest releases.
How reproducible:
Steps to Reproduce:
Actual results:
Name resolution fails and the baremetal host cannot boot
Expected results:
Name resolution works and the baremetal host can boot
Additional info:
Description of problem:
When a normal user tries to create a namespace-scoped network policy, the project selected in the project selection dropdown is not taken into account
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-17-183402
How reproducible:
Always
Steps to Reproduce:
1. As a normal user with a project, view the networkpolicy page /k8s/ns/yapei1-1/networkpolicies/~new/form
2. Hit 'affected pods' in the Pod selector section OR keep everything at the default value and click 'Create'
Actual results:
2. The user will see the following error when clicking on 'affected pods':
Can't preview pods r: pods is forbidden: User "yapei1" cannot list resource "pods" in API group "" at the cluster scope
The user will see the following error when clicking the 'Create' button:
An error occurred: networkpolicies.networking.k8s.io is forbidden: User "yapei1" cannot create resource "networkpolicies" in API group "networking.k8s.io" at the cluster scope
Expected results:
2. switching to 'YAML view' we can see that the selected project name was not auto populated in YAML
Additional info:
Description of problem:
ci/prow/security is failing on google.golang.org/grpc/metadata
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. Run the ci/prow/security job on a 4.15 PR 2. 3.
Actual results:
Medium severity vulnerability found in google.golang.org/grpc/metadata
Expected results:
Additional info:
Description of problem:
The single-page docs are missing the "oc adm policy add-cluster-role-to-*" and "remove-cluster-role-from-*" commands. These options exist in these docs: https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html but not in these docs: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user
Description of problem:
Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-08-15-212448
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then optionally insert interested settings (see [1]) 2. "create cluster", and make sure the cluster turns healthy finally (see [2]) 3. check the cluster's addresses on GCP (see [3]) 4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])
Actual results:
The global address "<infra id>-apiserver" is not deleted during "destroy cluster".
Expected results:
Everything belonging to the cluster should get deleted during "destroy cluster".
Additional info:
FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535
Description of problem:
INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision...
E0819 14:17:33.676051 2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
E0819 14:17:33.708233 2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
I0819 14:17:33.708279 2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.
Example:
compute:
- name: worker
  architecture: arm64
  ...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}
See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631
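A minimal sketch of the kind of cross-pool consistency check the linked validation could enforce (simplified, hypothetical types; the installer's real types live under its pkg/types packages): reject an install-config whose compute pools declare different architectures.

package main

import "fmt"

// MachinePool is a simplified stand-in for the installer's compute pool type.
type MachinePool struct {
	Name         string
	Architecture string
}

// validateComputeArchitectures returns an error when the compute pools
// (e.g. "worker" and "edge") do not all use the same architecture.
func validateComputeArchitectures(pools []MachinePool) error {
	if len(pools) == 0 {
		return nil
	}
	want := pools[0].Architecture
	for _, p := range pools[1:] {
		if p.Architecture != want {
			return fmt.Errorf("compute pool %q uses architecture %q, expected %q to match the other compute pools",
				p.Name, p.Architecture, want)
		}
	}
	return nil
}

func main() {
	pools := []MachinePool{
		{Name: "worker", Architecture: "arm64"},
		{Name: "edge", Architecture: "amd64"},
	}
	fmt.Println(validateComputeArchitectures(pools))
}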
Description of problem:
I see that if a release does not contain the kubevirt coreos container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. use imageSetConfig.yaml as shown below 2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2 3.
Actual results:
fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2
2024/08/03 09:24:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/03 09:24:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/08/03 09:24:38 [INFO] : ⚙️ setting up the environment for you...
2024/08/03 09:24:38 [INFO] : 🔀 workflow mode: mirrorToDisk
2024/08/03 09:24:38 [INFO] : 🕵️ going to discover the necessary images...
2024/08/03 09:24:38 [INFO] : 🔍 collecting release images...
2024/08/03 09:24:44 [INFO] : kubeVirtContainer set to true [ including : ]
2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty
2024/08/03 09:24:44 [INFO] : 👋 Goodbye, thank you for using oc-mirror
2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty
Expected results:
If the kubeVirt coreos container does not exist in a release, oc-mirror should skip it and continue mirroring the other operators, but it should not fail. A sketch of this skip-and-continue behavior follows.
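A minimal sketch of the skip-and-continue behavior described above (hypothetical names, not the actual oc-mirror code): when the release carries no kubevirt coreos container reference, log a warning and move on instead of returning an error.

package main

import (
	"fmt"
	"log"
)

// collectReleaseImages gathers the image references to mirror for a release.
// kubeVirtRef is a hypothetical stand-in for the kubevirt coreos container
// reference found in the release metadata; it is empty when the release does
// not ship that image.
func collectReleaseImages(kubeVirtRef string, kubeVirtContainer bool) ([]string, error) {
	var images []string
	if kubeVirtContainer {
		if kubeVirtRef == "" {
			// Skip instead of failing the whole mirror run.
			log.Println("warning: release has no kubevirt coreos container image, skipping it")
		} else {
			images = append(images, kubeVirtRef)
		}
	}
	// ... continue collecting the remaining release, operator and additional images.
	return images, nil
}

func main() {
	imgs, err := collectReleaseImages("", true)
	fmt.Println(imgs, err)
}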
Additional info:
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.12
      minVersion: 4.12.61
      maxVersion: 4.12.61
    kubeVirtContainer: true
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
      minVersion: "0.26.0"
    - name: nfd
      maxVersion: "4.15.0-202402210006"
    - name: cluster-logging
      minVersion: 5.8.3
      maxVersion: 5.8.4
    - name: quay-bridge-operator
      channels:
      - name: stable-3.9
        minVersion: 3.9.5
    - name: quay-operator
      channels:
      - name: stable-3.9
        maxVersion: "3.9.1"
    - name: odf-operator
      channels:
      - name: stable-4.14
        minVersion: "4.14.5-rhodf"
        maxVersion: "4.14.5-rhodf"
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308
Description of problem:
Specify a long cluster name in install-config:
==============
metadata:
  name: jima05atest123456789test123
Create the cluster; the installer exited with the below error:
08-05 09:46:12.788 level=info msg=Network infrastructure is ready
08-05 09:46:12.788 level=debug msg=Creating storage account
08-05 09:46:13.042 level=debug msg=Collecting applied cluster api manifests...
08-05 09:46:13.042 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa
08-05 09:46:13.042 level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.042 level=error msg=RESPONSE 400: 400 Bad Request
08-05 09:46:13.043 level=error msg=ERROR CODE: AccountNameInvalid
08-05 09:46:13.043 level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043 level=error msg={
08-05 09:46:13.043 level=error msg= "error": {
08-05 09:46:13.043 level=error msg= "code": "AccountNameInvalid",
08-05 09:46:13.043 level=error msg= "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only."
08-05 09:46:13.043 level=error msg= }
08-05 09:46:13.043 level=error msg=}
08-05 09:46:13.043 level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043 level=error
08-05 09:46:13.043 level=info msg=Shutting down local Cluster API controllers...
08-05 09:46:13.298 level=info msg=Stopped controller: Cluster API
08-05 09:46:13.298 level=info msg=Stopped controller: azure infrastructure provider
08-05 09:46:13.298 level=info msg=Stopped controller: azureaso infrastructure provider
08-05 09:46:13.298 level=info msg=Shutting down local Cluster API control plane...
08-05 09:46:15.177 level=info msg=Local Cluster API system has completed operations
See azure doc[1] on the naming rules for storage account names: the name must be between 3 and 24 characters in length and may contain numbers and lowercase letters only. The prefix of the storage account created by the installer seems to have changed to use the infraID with the CAPI-based installation; it was "cluster" when installing with terraform. Is it possible to change back to using "cluster" as the storage account prefix to keep consistent with terraform? There are several storage accounts created once cluster installation is completed: one is created by the installer starting with "cluster", and others are created by image-registry starting with "imageregistry". QE also has some CI profiles[2] and automated test cases relying on the installer storage account, which need to search for the prefix "cluster", and we are not sure whether customers have similar scenarios.
[1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview
[2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Summary
Duplicate issue of https://issues.redhat.com/browse/OU-258.
To pass the CI/CD requirements of openshift/console, each PR needs to have an issue in an OCP-owned Jira board.
This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from the openshift/console to openshift/monitoring-plugin.
openshift/console PR#4187: Removes the Metrics Page.
openshift/monitoring-plugin PR#138: Adds the Metrics Page & consolidates the code to use the same components as the Administrative > Observe > Metrics Page.
—
Testing
Both openshift/console PR#4187 & openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both the PRs you should see a page like the screenshot attached below.
—
Excerpt from OU-258 (https://issues.redhat.com/browse/OU-258):
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.
Description of problem:
azure-disk-csi-driver doesn't use registryOverrides
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Set the registry override on the CPO 2. Watch that azure-disk-csi-driver continues to use the default registry 3.
Actual results:
azure-disk-csi-driver uses the default registry
Expected results:
azure-disk-csi-driver uses the mirrored registry
Additional info:
Description of problem:
After branching, main branch still publishes Konflux builds to mce-2.7
Version-Release number of selected component (if applicable):
mce-2.7
How reproducible:
100%
Steps to Reproduce:
1. Post a PR to main
2. Check the jobs that run
Actual results:
Both mce-2.7 and main Konflux builds get triggered
Expected results:
Only main branch Konflux builds gets triggered
Additional info:
Description of problem:
This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer) 1. Configure LDAP IDP 2. Configure Proxy 3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
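A minimal sketch of the expected behavior (hypothetical helper, not the actual oauth-server code): ldap:// and ldaps:// identity providers are a different protocol and should be dialed directly, while only HTTP(S) traffic should consider the configured proxy.

package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// dialIDP connects to an identity provider endpoint. HTTP(S) endpoints may
// honor the configured proxy; LDAP endpoints are a different protocol and
// should always be dialed directly.
func dialIDP(rawURL string) (net.Conn, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	switch u.Scheme {
	case "ldap", "ldaps":
		// Direct TCP connection, ignoring HTTP_PROXY/HTTPS_PROXY settings.
		return net.DialTimeout("tcp", u.Host, 10*time.Second)
	default:
		// HTTP(S) traffic goes through the usual transport, which may use the proxy.
		return nil, fmt.Errorf("non-LDAP scheme %q: use the HTTP client/transport instead", u.Scheme)
	}
}

func main() {
	conn, err := dialIDP("ldaps://ldap.example.com:636")
	fmt.Println(conn, err)
}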
Description of problem:
ose-aws-efs-csi-driver-operator has an invalid reference to tools that causes the build to fail. This issue is due to https://github.com/openshift/csi-operator/pull/252/files#r1719471717
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Router pods use the "hostnetwork" SCC even when they do not use the host network.
All versions of OpenShift from 4.11 through 4.17.
100%.
1. Install a new cluster with OpenShift 4.11 or later on a cloud platform.
The router-default pods do not use the host network, yet they use the "hostnetwork" SCC:
% oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o go-template --template='{{range .items}}{{.metadata.name}} {{with .metadata.annotations}}{{index . "openshift.io/scc"}}{{end}} {{.spec.hostNetwork}}{{"\n"}}{{end}}'
router-default-5ffd4ff7cd-mhhv6 hostnetwork <no value>
router-default-5ffd4ff7cd-wmqnj hostnetwork <no value>
%
The router-default pods should use the "restricted" SCC.
We missed this change from the OCP 4.11 release notes:
The restricted SCC is no longer available to users of new clusters, unless the access is explicitly granted. In clusters originally installed in OpenShift Container Platform 4.10 or earlier, all authenticated users can use the restricted SCC when upgrading to OpenShift Container Platform 4.11 and later.
Artifacts from CI jobs confirm that router pods used "restricted" for new 4.10 clusters and for 4.10→4.11 upgraded clusters, and "hostnetwork" for new 4.11 clusters:
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1790552355406614528/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1790422949342220288/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1793013806733987840/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1793013781534609408/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1793670820518694912/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1793670819998601216/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1793062832263139328/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
%
Description of problem:
Remove the extra '.' from the below INFO message when running the add-nodes workflow: INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run oc adm node-image create command to create a node iso 2. See the INFO message at the end 3.
Actual results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Expected results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z
Additional info:
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-16-033047
How reproducible:
Always
Steps to Reproduce:
1. Use instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge" to install an Openshift cluster on AWS
2. Check the csinode allocatable volumes count
$ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
26
g4ad.xlarge # 25
g4dn.xlarge # 25
vt1.3xlarge # 26
$ oc get no/ip-10-0-53-225.ec2.internal -oyaml| grep 'instance-type'
beta.kubernetes.io/instance-type: vt1.3xlarge
node.kubernetes.io/instance-type: vt1.3xlarge
3. Create a statefulset with a pvc (which uses the ebs csi storageclass), nodeAffinity to the same node, and set the replicas to the max allocatable volumes count to verify that the csinode allocatable volumes count is correct and all the pods become Running
# Test data
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-10-0-53-225.ec2.internal # Make all volumes attach to the same node
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
      - key: "node-role.kubernetes.io/master"
        effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      #storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi
Actual results:
In step 3 some pods are stuck in "ContainerCreating" status because their volumes are stuck in attaching status and could not be attached to the node
Expected results:
In step 3 all the pods with a pvc should become "Running", and in step 2 the csinode allocatable volumes count should be correct:
-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24
Additional info:
... attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition 06-25 17:51:23.680 Warning FailedAttachVolume 4m1s (x13 over 14m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching 06-25 17:51:23.681 Warning FailedMount 3m40s (x3 over 10m) kubelet Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition ...
Description of problem:
We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.
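A minimal sketch of the change being asked for (hypothetical function names; the real code is the linked iptables.go): check and log the error from the existence check instead of discarding it, so that permission failures show up in the monitor logs.

package main

import (
	"errors"
	"fmt"
	"log"
)

// ruleExists stands in for the iptables existence check that can fail, for
// example when the monitor container lacks the needed permissions.
func ruleExists() (bool, error) {
	return false, errors.New("permission denied")
}

func ensureRule() error {
	exists, err := ruleExists()
	if err != nil {
		// Previously this error was silently ignored, which hid permission problems.
		log.Printf("failed to check whether the healthcheck rule exists: %v", err)
		return fmt.Errorf("existence check failed: %w", err)
	}
	if !exists {
		// ... add the rule here.
	}
	return nil
}

func main() {
	fmt.Println(ensureRule())
}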
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.
lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
50%
Steps to Reproduce:
1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI
Actual results:
Flakes
Expected results:
Shouldn't flake
Additional info:
CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB
CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations
Description of problem:
Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting. The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens. We need to understand if the etcd operator is correctly rolling out vs whether these changes should have rolled out prior to the final machine going away, and, understand if there's a way to add more stability to our checks to make sure that all of the operators stabilise, and, that they have been stable for at least some period (1 minute)
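One way to add the kind of stability window described above, sketched with a hypothetical isSettled predicate rather than the actual test code: only declare the operators settled once the settled state has held continuously for a full minute.

package main

import (
	"fmt"
	"time"
)

// waitForStableOperators polls an isSettled check (e.g. "no ClusterOperator is
// Progressing and no new etcd revision is pending") and only returns once the
// settled state has held continuously for the given stability window.
func waitForStableOperators(isSettled func() bool, window, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	var stableSince time.Time
	for time.Now().Before(deadline) {
		if isSettled() {
			if stableSince.IsZero() {
				stableSince = time.Now()
			}
			if time.Since(stableSince) >= window {
				return nil
			}
		} else {
			stableSince = time.Time{} // reset the window on any new change
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("operators did not stay settled for %s within %s", window, timeout)
}

func main() {
	err := waitForStableOperators(func() bool { return true }, time.Minute, 20*time.Minute, 10*time.Second)
	fmt.Println(err)
}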
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared. This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role.
Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)
It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared. This bug prevents users from successfully creating instances from templates in the WebConsole.
Version-Release number of selected component (if applicable):
4.15 4.14
How reproducible:
YES
Steps to Reproduce:
1. Log in with a non-administrator account. 2. Select a template from the developer catalog and click on Instantiate Template. 3. Enter values into the initially empty form. 4. Wait for several seconds, and the entered values will disappear.
Actual results:
Entered values disappear
Expected results:
Entered values remain
Additional info:
I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.
Description of problem:
When HO is installed without a pullsecret, the shared ingress controller fails to create the router pod because the pullsecret is missing
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1. Install HO without a pullsecret
2. Watch HO report the error "error":"failed to get pull secret &Secret{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [][]},Data:map[string[]byte{},Type:,StringData:map[string]string{},Immutabl:nil,}: Secret \"pull-secret\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.
3. Observe that no Router pod is created in the hypershift sharedingress namespace
Actual results:
The router pod doesn't get created in the hypershift sharedingress namespace
Expected results:
The router pod gets created in the hypershift sharedingress namespace
Additional info:
Description of problem:
If the folder is undefined and the datacenter exists in a datacenter-based folder, the installer will create the entire path of folders from the root of vCenter, which is incorrect.
This does not occur if the folder is defined.
An upstream bug was identified when debugging this:
Description of problem:
On the overview page's getting started resources card, there is an "OpenShift LightSpeed" link when this operator is available on the cluster; the text should be updated to "OpenShift Lightspeed" to keep it consistent with the operator name.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133 4.16.0-0.nightly-2024-08-08-111530
How reproducible:
Always
Steps to Reproduce:
1. Check overview page's getting started resources card, 2. 3.
Actual results:
1. There is "OpenShift LightSpeed" link in "Explore new features and capabilities"
Expected results:
1. The text should be "OpenShift Lightspeed" to keep it consistent with the operator name.
Additional info:
Description of problem:
https://issues.redhat.com//browse/OCPBUGS-31919 partially fixed an issue consuming the test image from a custom registry. The fix is about consuming in the test binary the pull-secret of the cluster under tests. To complete it we have to do the same trusting custom CA as the cluster under test. Without that, if the test image is exposed by a registry where the TLS cert is signed by a custom CA, the same tests will fail as for: { fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error: <*errors.errorString | 0xc0023105c0>: unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342: StdOut> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority StdErr> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority exit status 1 { s: "unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:\nStdOut>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nStdErr>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nexit status 1\n", } occurred Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
release-4.16, release-4.17 and master branchs in origin.
How reproducible:
Always
Steps to Reproduce:
1. try to run the test suite against a cluster where the OCP release (and the test image) comes from a private registry with a cert signed by a custom CA 2. 3.
Actual results:
3 failing tests:
[sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel]
[sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel]
Expected results:
No failing tests
Additional info:
OCPBUGS-31919 partially fixed it by having the test binary download the pull secret from the cluster under test. But in order for it to work, we also have to trust the custom CAs trusted by the cluster under test.
Description of problem:
After changing the LB type from CLB to NLB, the "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" field is still there, but if a new NLB ingresscontroller is created the "classicLoadBalancer" field does not appear.
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:           <<<<
  connectionIdleTimeout: 0s    <<<<
networkLoadBalancer: {}
type: NLB
// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133
How reproducible:
100%
Steps to Reproduce:
1. Change the default ingresscontroller to NLB:
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'
2. Create a new ingresscontroller with NLB:
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
3. Check both ingresscontrollers' status
Actual results:
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Expected results:
If type=NLB, then "classicLoadBalancer" should not appear in the status. and the status part should keep consistent whatever changing ingresscontroller to NLB or creating new one with NLB.
Additional info:
Description of problem:
When creating a tuned profile with the annotation tuned.openshift.io/deferred: "update" before labeling the target node, and then labeling the node with profile=, the value of kernel.shmmni is applied immediately, but it shows the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile]. After rebooting the node, kernel.shmmni is restored to its default value instead of being set to the expected value.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an OCP cluster with the latest 4.18 nightly version 2. Create the tuned profile before labeling the node; please refer to issue 1 in the doc https://docs.google.com/document/d/1h-7AIyqf7sHa5Et2XF7a-RuuejwVkrjhiFFzqZnNfvg/edit if you want to reproduce the issue
Actual results:
It should show the message [TuneD profile applied]. The sysctl value should remain as expected after the node reboot.
Expected results:
It shouldn't show the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] when executing oc get profile, and the sysctl value shouldn't revert after the node reboot.
Additional info:
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps will fire.
Actual results:
Expected results: all the samples should be ingested (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
Description of problem:
Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. apply CRD yaml file 2. check the NetworkAttachmentDefinition status
Actual results:
status with error
Expected results:
NetworkAttachmentDefinition has been created
Description of problem:
Specifying additionalTrustBundle in the HC doesn't propagate down to the worker nodes
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a CM with the additionalTrustBundle 2. Specify the CM in HC.Spec.AdditionalTrustBundle 3. Debug the worker nodes and check whether the additionalTrustBundle has been updated
Actual results:
The additionalTrustBundle hasn't propagated down to the nodes
Expected results:
The additionalTrustBundle is propagated down to the nodes
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/313
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through the install-config.yaml:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
However, the installation will fail with ambiguous error messages:
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused
The actual error hides in the bootstrap VM's System Log:
Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17
SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)
SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)
SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)
ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac
Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)
Ignition: user-provided config was applied
Ignition: warning at $.kernelArguments: Unused key kernelArguments
Release image arch amd64 does not match host arch arm64
ip-10-29-3-15 login: [ 89.141099] Warning: Unmaintained driver is detected: nft_compat
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Use amd64 installer to install a cluster with aarch64 nodes
Steps to Reproduce:
1. download amd64 installer 2. generate the install-config.yaml 3. edit install-config.yaml to use aarch64 nodes 4. invoke the installer
Actual results:
The installation timed out after ~30 minutes
Expected results:
The installation should fail immediately with a proper error message indicating that the installation is not possible
Additional info:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:
CONSOLE-3776 added filtering for the GCP WIF case for the OperatorHub tile view. Part of the change was also to check for the annotation which indicates that the operator supports GCP's WIF:
features.operators.openshift.io/token-auth-gcp: "true"
AC:
Description of problem:
The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section
Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section
Additional info:
Sync downstream with upstream
Starting around the beginning of June, `-bm` (real baremetal) jobs started exhibiting a high failure rate. OCPBUGS-33255 was mentioned as a potential cause, but this was filed much earlier.
The start date for this is pretty clear in Sippy, chart here:
Example job run:
More job runs
Slack thread:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1722871253737309
Affecting these tests:
install should succeed: overall
install should succeed: cluster creation
install should succeed: bootstrap
Component Readiness has found a potential regression in the following test:
[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.18
Start Time: 2024-08-14T00:00:00Z
End Time: 2024-08-21T23:59:59Z
Success Rate: 94.89%
Successes: 128
Failures: 7
Flakes: 2
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 647
Failures: 0
Flakes: 15
The test is permafailing on the latest payloads on multiple platforms, not just Azure. It does seem to coincide with the arrival of the 4.18 RHCOS images.
{ fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-z5sh003f-431b2-r2nm4-master-0 Unexpected error: <*errors.errorString | 0xc001e80190>: err execing command jq: error (at <stdin>:1): Cannot index array with string "info" jq: error (at <stdin>:1): Cannot iterate over null (null) { s: "err execing command jq: error (at <stdin>:1): Cannot index array with string \"info\"\njq: error (at <stdin>:1): Cannot iterate over null (null)", } occurred Ginkgo exit error 1: exit with code 1}
The script involved is likely in: https://github.com/openshift/origin/blob/a365380cb3a39cfc26b9f28f04b66418c993a879/test/extended/cpu_partitioning/crio.go#L4
Nightly payloads are fully blocked as multiple blocking aggregated jobs are permafailing this test.
Compile errors when building an ironic image look like this:
2024-08-14 09:07:21 + python3 -m compileall --invalidation-mode=timestamp /usr
2024-08-14 09:07:21 Listing '/usr'...
2024-08-14 09:07:21 Listing '/usr/bin'...
...
Listing '/usr/share/zsh/site-functions'...
Listing '/usr/src'...
Listing '/usr/src/debug'...
Listing '/usr/src/kernels'...
Error: building at STEP "RUN prepare-image.sh && rm -f /bin/prepare-image.sh && /bin/prepare-ipxe.sh && rm -f /tmp/prepare-ipxe.sh": while running runtime: exit status 1
With the actual error lost in 3000+ lines of output, we should suppress the file listings (for example by passing compileall's -q flag so that only error messages are printed).
Description of problem:
Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13. Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs. The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28
We have reproduced the issue and we found an ordering cycle error in the journal log:
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
Version-Release number of selected component (if applicable):
Using IPI on Azure, these are the versions involved in the current issue upgrading from 4.9 to 4.13:
version: 4.13.0-0.nightly-2024-07-23-154444
version: 4.12.0-0.nightly-2024-07-23-230744
version: 4.11.59
version: 4.10.67
version: 4.9.59
How reproducible:
Always
Steps to Reproduce:
1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.
Actual results:
Nodes become not ready:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ci-op-g94jvswm-cc71e-998q8-master-0 Ready master 6h14m v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1 Ready master 6h13m v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2 NotReady,SchedulingDisabled master 6h13m v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb NotReady,SchedulingDisabled worker 6h2m v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6 Ready worker 6h4m v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj Ready worker 6h6m v1.25.16+306a47e
And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
Expected results:
No ordering cycle error should happen and the upgrade should be executed without problems.
Additional info:
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Occasional machine-config daemon panics in techpreview jobs. For example this run has:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736
And the referenced logs include a full stack trace, the crux of which appears to be:
E0801 19:23:55.012345 2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
  /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
  /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
  /usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
  /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
  /go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
  /go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
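The trace suggests GetPoolsForNode is being invoked from the pinned-image-set node event handler with a nil machine-config-pool lister. A minimal sketch of the kind of nil guard that would avoid the panic (simplified, hypothetical types rather than the actual MCO helpers):

package main

import (
	"errors"
	"fmt"
)

// PoolLister is a stand-in for the MachineConfigPool lister used by the helpers.
type PoolLister interface {
	List() ([]string, error)
}

// listPools guards against being called before the lister is initialized,
// instead of dereferencing a nil lister and panicking.
func listPools(lister PoolLister) ([]string, error) {
	if lister == nil {
		return nil, errors.New("machine-config-pool lister is not initialized yet")
	}
	return lister.List()
}

func main() {
	pools, err := listPools(nil)
	fmt.Println(pools, err)
}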
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
Looks like ~15% impact in the CI runs that CI Search turns up.
Run lots of CI. Look for MCD panics.
CI Search results above.
No hits.
After looking at this test run we need to validate the following scenarios:
Do the monitor tests in openshift/origin accurately test these scenarios?
Description of problem:
We need to bump the Kubernetes version to the latest API version OCP is using. This is what was done last time: https://github.com/openshift/cluster-samples-operator/pull/409 Find the latest stable version here: https://github.com/kubernetes/api This is described in the wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Version-Release number of selected component (if applicable):
How reproducible:
Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
Description of problem:
The cluster-wide proxy is injected automatically into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.
Version-Release number of selected component (if applicable):
RHOCP 4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure the proxy custom resource in the RHOCP 4.16.4 cluster
2. Create the cluster-monitoring-config configmap in the openshift-monitoring project
3. Inject remote-write config (without specifically configuring a proxy for remote-write)
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. Now it contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
==============
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  [...]
  name: k8s
  namespace: openshift-monitoring
spec:
  [...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080 <<<<<====== Injected automatically but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
Actual results:
The proxy URL from the proxy CR is injected into the Prometheus k8s CR automatically when configuring remoteWrite, but it does not have noProxy inherited from the cluster proxy resource.
Expected results:
The noProxy URL should get injected in Prometheus k8s CR as well.
Additional info:
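A minimal sketch of how the noProxy decision could be honored when deciding whether a remote-write endpoint should go through the cluster proxy, using golang.org/x/net/http/httpproxy for the standard noProxy matching rules (the surrounding wiring is an assumption, not the actual cluster-monitoring-operator code):

package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/http/httpproxy"
)

// proxyURLFor returns the proxy URL to inject for a remote-write endpoint,
// or nil when the endpoint is covered by the cluster's noProxy list.
func proxyURLFor(httpProxy, httpsProxy, noProxy, endpoint string) (*url.URL, error) {
	cfg := httpproxy.Config{
		HTTPProxy:  httpProxy,
		HTTPSProxy: httpsProxy,
		NoProxy:    noProxy,
	}
	u, err := url.Parse(endpoint)
	if err != nil {
		return nil, err
	}
	// ProxyFunc applies the standard noProxy matching rules.
	return cfg.ProxyFunc()(u)
}

func main() {
	p, err := proxyURLFor(
		"http://proxy.abc.com:8080",
		"http://proxy.abc.com:8080",
		".cluster.local,.svc",
		"http://test-remotewrite.test.svc.cluster.local:9090",
	)
	fmt.Println(p, err) // nil proxy expected: the endpoint is covered by noProxy
}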
add a new monitor test: api unreachable interval from client perspectives
Description of problem:
In hostedcluster installations, when the following OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, which follows the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):
~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~
On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels:
~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT               LABELS
oauth   oauth.<custom-domain>   hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~
The configured label means the ingresscontroller does not admit the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource:
~~~
$ oc get ingresscontroller -n openshift-ingress-default default -oyaml
  routeSelector:
    matchExpressions:
    - key: hypershift.openshift.io/hosted-control-plane <---
      operator: DoesNotExist <---
~~~
This configuration should be allowed, as there are use cases where the route should have a customized hostname. Currently the HCP platform is not allowing this configuration and the oauth route does not work.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Install HCP cluster 2. Configure OAuthServer with type Route 3. Add a custom hostname different than default wildcard ingress URL from management cluster
Actual results:
Oauth route is not admitted
Expected results:
Oauth route should be admitted by Ingresscontroller
Additional info:
Description of the problem:
FYI - OCP 4.12 has reached the end of maintenance support; it is now on extended support.
OCP 4.12 installations appear to have started failing lately because hosts are not being discovered. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_assisted-service/6628/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-12/1817416612257468416
How reproducible:
Seems to happen on every CI run; haven't tested locally.
Steps to reproduce:
Trigger OCP 4.12 installation in the CI
Actual results:
failure, hosts not discovering
Expected results:
Successful cluster installation
Description of problem:
Creating a faulty configmap for UWM results in cluster_operator_up=0 with the reason InvalidConfiguration. With https://issues.redhat.com/browse/MON-3421 we're expecting the reason to match UserWorkload.*
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
100%
Steps to Reproduce:
Apply the following CM to a cluster with UWM enabled:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    hah helo! :)
Actual results:
cluster_operator_up=0 with reason InvalidConfiguration
Expected results:
cluster_operator_up=0 with reason matching pattern UserWorkload.*
Additional info:
https://issues.redhat.com/browse/MON-3421 streamlined reasons to allow separation between UWM and cluster monitoring. The above is a leftover that should be updated to match the same pattern.
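For contrast, a hedged example of a well-formed user-workload-monitoring-config (the retention setting is just an illustrative, valid option) that should avoid the invalid-configuration failure path entirely:
~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h
~~~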
Description of problem:
Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15. During the lifetime of the cluster they changed the DHCP option set in AWS to use a custom domain name. During node provisioning triggered by MachineSet scaling, the Machine is successfully created at the cloud provider but the Node is never added to the cluster: the CSRs remain pending and do not get auto-approved. This issue is likely related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually.
Expected results:
CSR should get approved automatically and domain name scheme should not change.
Additional info:
openshift/api was bumped in CNO without running codegen; codegen needs to be run.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/559
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Our e2e jobs fail with:
pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError"
The jobs should succeed.
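A minimal sketch of the kind of change implied by the failures above, applied to the pod templates of the affected Deployments/DaemonSet (container names come from the error messages; the rest of the manifests is omitted):
~~~
spec:
  template:
    spec:
      initContainers:
      - name: init-aws-credentials-file
        terminationMessagePolicy: FallbackToLogsOnError
      containers:
      - name: csi-driver
        terminationMessagePolicy: FallbackToLogsOnError
~~~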
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Description of problem:
The certrotation controller uses the applySecret/applyConfigmap functions from library-go to update secrets/configmaps. This controller has several replicas running in parallel, so it may overwrite changes applied by a different replica, which leads to unexpected signer updates and corrupted CA bundles. applySecret/applyConfigmap does an initial Get and then calls Update, which overwrites the changes made on top of the copy received from the informer. Instead, it should issue .Update calls directly on the copy received from the informer, so that the update is rejected with a conflict if the resourceVersion was changed in parallel.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
There are two enhancements we could have for cns-migration:
1. We should print an error message when the target datastore is not found; currently the tool exits as if nothing happened:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 07:59:34.884908 131 logger.go:28] logging successfully to vcenter
I0806 07:59:36.078911 131 logger.go:28] ----------- Migration Summary ------------
I0806 07:59:36.078944 131 logger.go:28] Migrated 0 volumes
I0806 07:59:36.078960 131 logger.go:28] Failed to migrate 0 volumes
I0806 07:59:36.078968 131 logger.go:28] Volumes not found 0
Compare with the source datastore check, which does print an error:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 08:02:08.719657 138 logger.go:28] logging successfully to vcenter
E0806 08:02:08.749709 138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter
2. If the volume-file contains an invalid PV name that is not found (e.g. at the beginning of the file), the tool exits immediately and all remaining PVs are skipped; it should continue checking the other PVs.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
See Description
Description of problem:
IDMS is set on the HostedCluster and reflected in the corresponding in-cluster CRs. Customers can create, update, and delete these in-cluster resources today, but in-cluster IDMS changes have no effect.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
100%
Steps to Reproduce:
1. Create an HCP cluster.
2. Create an IDMS in the data plane.
3. Observe that it does nothing.
Actual results:
IDMS doesn't change anything if manipulated in data plane
Expected results:
In-cluster IDMS updates should either take effect OR be blocked.
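For reference, the kind of in-cluster object customers manipulate today — a minimal ImageDigestMirrorSet sketch with illustrative registry names:
~~~
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-idms
spec:
  imageDigestMirrors:
  - source: registry.example.com/team
    mirrors:
    - mirror.example.com/team
~~~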
Additional info:
slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1722890745089339?thread_ts=1722872764.429919&cid=C058TF9K37Z
Investigate what happens when machines are deleted while the cluster is paused.
The cluster-baremetal-operator sets up a number of watches that have no effect: some use Owns() even though the Provisioning CR does not (and should not) own any resources of the given type, and others use EnqueueRequestForObject{}, which likewise has no effect because the watched resource's name and namespace differ from those of the Provisioning CR.
The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.
The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.
See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
Component Readiness has found a potential regression in the following test:
[sig-storage] [Serial] Volume metrics Ephemeral should create volume metrics with the correct BlockMode PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]
Probability of significant regression: 100.00%
This feature conditionally creates a button within the VirtualizedTable component that allows clients to download the data within the table as comma-separated values (.csv).
Both PRs are needed to test the feature.
The PRs are
https://github.com/openshift/console/pull/14050
and
https://github.com/openshift/monitoring-plugin/pull/133
The monitoring-plugin passes a string called 'csvData', which contains metrics data formatted as comma-separated values. The console then consumes 'csvData' in the 'VirtualizedTable' component. 'VirtualizedTable' renders the 'Export as CSV' button only if the 'csvData' property is present; without it, the button is not rendered.
The console's CI/CD pipeline (tide) requires that issues have a valid Jira reference, presumably in this (OpenShift Console) board. This ticket is a duplicate of
https://issues.redhat.com/browse/OU-431
As a user of HyperShift, I want to be able to:
so that I can achieve
Description of criteria:
N/A
% oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.33-multi -a ~/all-the-pull-secrets.json --pullspecs | grep apiserver apiserver-network-proxy
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Create a cluster with publish: Mixed by using CAPZ.

1. publish: Mixed + apiserver: Internal
install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External
=================
In this case, the api DNS record should not be created in the public DNS zone, but it was created:
==================
$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}
==================

2. publish: Mixed + ingress: Internal
install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal
=============
In this case, the load balancer rule on port 6443 should be created on the external load balancer, but it could not be found:
================
$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Specify publish: Mixed with mixed External/Internal settings for api/ingress.
2. Create the cluster.
3. Check that public DNS records and load balancer rules on the internal/external load balancers are created as expected.
Actual results:
See the description: some resources are created unexpectedly while others are missing.
Expected results:
Public DNS records and load balancer rules on the internal/external load balancers should be created as expected, based on the settings in install-config.
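To make the expected mapping explicit, a sketch of the first scenario's install-config stanza annotated with the resources that should (not) exist (the annotations are an interpretation of the expected behaviour, not installer output):
~~~
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal   # expected: no api record in the public DNS zone,
                        # port 6443 rule only on the internal load balancer
  ingress: External     # expected: public ingress DNS record and rules on the external load balancer
~~~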
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Using the 4.16.0/4.17.0 UI with ODF StorageClasses, it is not possible to:
- create an RWOP PVC
- create an RWOP clone
- restore to an RWOP PVC
Please see the attached screenshot. The RWOP access mode should be added to all the relevant screens in the UI.
Version-Release number of selected component (if applicable):
OCP 4.16.0 & 4.17.0 ODF (OpenShift Data Foundation) 4.16.0 & 4.17.0
How reproducible:
Steps to Reproduce:
1. Open the UI, go to OperatorHub.
2. Install ODF; once installed, refresh for the ConsolePlugin to get populated.
3. Go to the operand "StorageSystem" and create the CR using the custom UI (you can just keep clicking "Next" with the default selected options; it works well on an AWS cluster).
5. Wait for the "ocs-storagecluster-cephfs" and "ocs-storagecluster-ceph-rbd" StorageClasses to get created by the ODF operator.
6. Go to the PVC creation page and try to create a new PVC (using the StorageClasses mentioned in step 5).
7. Try to create a clone.
8. Try to restore a PVC to an RWOP PVC from an existing snapshot.
Actual results:
It's not possible to create an RWOP PVC, create an RWOP clone, or restore to an RWOP PVC from a snapshot using the 4.16.0 & 4.17.0 UI.
Expected result:
It should be possible to create an RWOP PVC, create an RWOP clone, and restore a snapshot to an RWOP PVC.
Additional info:
https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L111-L119 >> these need to be updated.
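For illustration, the kind of PVC the UI should be able to produce — a minimal sketch using one of the StorageClasses from the reproduction steps (name and size are arbitrary):
~~~
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwop-example
spec:
  accessModes:
  - ReadWriteOncePod
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
~~~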